noahong 0.11.2
Python Aho-Corasick implementation
noahong is a Python implementation of the Aho-Corasick algorithm for string matching, based on a fork of the NoAho C++ implementation.
noahong supports macOS, Linux, Windows on Python 3.6+.
API
The first thing to do is to instantiate a NoAho object and add some keys to it (optionally with different payloads for each).
from noahong import NoAho
trie = NoAho()
# fill with .add()
trie.add("foo", "id_foo")
trie.add("foobar", "id_foobar")
# or fill with __setitem__
trie["bar"] = "id_bar"
Once you have added the different keys and their payloads, the NoAho object needs to be compiled:
trie.compile()
Once it is compiled, keys can no longer be added to the NoAho object.
noahong then exposes four functions to find matching substrings in text:
find_short
trie.find_short(text) finds the first substring of text that is matched by a key added to the trie.
It returns a tuple (start, stop, payload) such that:
payload is the object inserted with trie.add()
start and stop are indices of the match in the text: text[start:stop] == key
For example, using the above trie:
trie.find_short("something foo")
# returns (10, 13, 'id_foo')
# "something foo"[10:13] == "foo"
and returns the first match even though a longer match may start at the same position:
trie.find_short("something foobar")
# returns (10, 13, 'id_foo')
find_long
trie.find_long(text) finds the first longest substring of text that is matched by a key added to the trie.
For example, using the above trie:
trie.find_long("something foobar")
# returns (10, 16, 'id_foobar')
findall_*
Both find_short and find_long have a findall_short and findall_long counterparts that allow you to iterate on all non-overlapping matches found
in the text:
for x in trie.findall_long("something foo bar foobar"):
print(x)
# prints
# (10, 13, 'id_foo')
# (14, 17, 'id_bar')
# (18, 24, 'id_foobar')
Because matches are non-overlapping:
list(trie.findall_short("foobar")) == [(0, 3, "id_foo"), (3, 6, "id_bar")]
whereas:
list(trie.findall_long("foobar")) == [(0, 6, "id_foobar")]
Payloads
NoAho tries accept any Python object as a payload:
trie = NoAho()
trie.add("foo", 0)
trie.add("bar", CustomClass())
trie.add("baz", lambda x: x)
The same payload can be associated with different keys.
Notice that the non-pickable lambda x: x payload works because
there is no serialization involved here.
Length and inclusion
NoAho trie objects also expose the number of keys with len:
len(trie)
And, when they are compiled, they can be used to test for key inclusion:
"foo" in trie
The number of nodes in the underlying Trie can be recovered with
trie.nodes_count()
Mapped NoAho
In order to save memory, noahong exposes a Mapped matching object which can be written to disk and later loaded directly to memory to perform matches with a smaller memory footprint.
The Mapped object exposes different finding methods and only supports integer payloads.
Construct it by adding keys and payloads to a NoAho object:
from noahong import NoAho, Mapped
trie = NoAho()
trie.add("baz", "id_baz")
trie.compile()
trie.write("./test.matcher")
mapped_trie = Mapped("./test.matcher")
The mapped_trie object exposes a findall_anchored function that iterates over anchored matches, matches that can be found within boundaries set with a special "anchor" character \u001F.
This is useful to restrict matches to be found only between, say, spaces:
trie = NoAho()
trie.add("foo", 0)
trie.add("bar", 1)
trie.compile()
trie.write("./test.matcher")
mapped_trie = Mapped("./test.matcher")
mapped_trie.findall_anchored("\u001Fbar\u001F\u001Ffoo\u001F\u001Ffoobar\u001F")
# returns [(1, 4, 1), (6, 9, 0), (11, 14, 0)]
Notice how "bar" is not found in the final "foobar" because it is not present between "anchor" characters.
It is possible to place anchor characters in the keys:
trie = NoAho()
trie.add("foo\u001F\u001Fbar", 0)
trie.add("foo", 1)
trie.add("bar", 2)
trie.compile()
trie.write("./test.matcher")
mapped_trie = Mapped("./test.matcher")
mapped_trie.findall_anchored("\u001Ffoo\u001F\u001Fbar\u001F")
# returns [(1, 9, 0)]
In this case, the longest key found between anchors is returned.
Installation
Devpi
noahong is available on devpi:
pip install noahong
Python 3
noahong can be installed manually:
python3 setup.py install
Legacy README
You can find more information on the package and C++ implementation by reading the
legacy README found here.
For personal and professional use. You cannot resell or redistribute these repositories in their original state.
There are no reviews.