pyhyperminhash 0.1.3

Creator: bradpython12

Last updated:

Add to Cart

Description:

pyhyperminhash 0.1.3

Hyperminhash for Python


Very fast, constant memory-footprint cardinality approximation, including intersection
and union operation.
The class Sketch can be used to count unique elements that were encountered during
the instance's lifetime. Unlike e.g. when using a set, the memory consumed by the
Sketch-instance does not grow as elements are added; each Sketch-instance
consumes approximately 32kb of memory, independent of the number of elements.
Adding new elements to a Sketch is preferably done using .add_bytes(), akin
to .digest() for hashlib-objects. It is also possible to use .add() if a bytes-like
object can't be provided; the object's hash() is then used to provide a unique
identifier for that object.
# Construct an empty Sketch
sk = pyhyperminhash.Sketch()
assert not bool(sk) # The Sketch is empty
sk.add_bytes(b'Foo') # Add elements
sk.add_bytes(b'Bar')
sk.add(42)
assert bool(sk) # The Sketch is not empty
assert float(sk) == sk.cardinality() # Approximately 3.0
assert int(sk) == len(sk) # Approximately 3

sk2 = pyhyperminhash.Sketch()
sk2.add_bytes(b'Foobar')
sk2 &= sk1 # sk2 now holds all elements that were in `sk` or `sk2`

sk3 = pyhyperminhash.Sketch()
sk3.add(42)
sk3.add_bytes(b'Foo')
assert sk3.intersection(sk1) > 0.0 # Approximately 1.0, due to `Foo`

buf = sk.save() # Serialize the Sketch into a bytes-buffer
new_sk = pyhyperminhash.Sketch.load(buf) # Deserialize a Sketch
assert len(new_sk) == len(sk)

# Construct a Sketch from any iterator directly.
# This is faster than adding elements one by one.
sk = pyhyperminhash.Sketch.from_iter(iter(range(100)))

with open('complaints.txt', 'rb') as f:
sk = pyhyperminhash.Sketch.from_iter_bytes(f)

Although convenient, using .add() over .add_bytes() has two downsides:

It may be slower, as the object's own hash() may have to be computed
The Python interpreter probably randomizes hash-values at interpreter-startup,
in order to prevent certain DOS-attacks. The randomness introduced here means that
the count-approximation provided by this package may not be stable from run to run.

See the documentation for the underlying implementation for additional information.
Performance examples
Reading a file of 55 million lines, 725.940 unique elements, 100 characters per line on average.



Method
Wall-clock time
Memory consumed
Count




no work
9.5 seconds
nil
nil


Sketch.from_iter_bytes()
10.13 seconds
~32 kilobytes
734,628


set()
14.99 seconds
~164 megabytes
725,940



FAQ

Can I extract an element that was previously added from a Sketch?

No.


Can I check if an element has previously been added to a Sketch?

No.


Wow, why use this, then?

It's very fast at counting unique things. And it does not use much memory.

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.

Customer Reviews

There are no reviews.