Last updated:
0 purchases
boyermoore 1.0.0
Table Of Contents
Boyer-Moore in pure python: search for unicode strings quickly in large files
Installing
Searching for all occurences of a substring in a file
Searching for the first occurence of a substring in a file
Performance / Speed test
Test environment
Test methodology
Test results
Contributions
Boyer-Moore in pure python: search for unicode strings quickly in large files
This is an implementation of the Boyer-Moore substring search algorithm in pure python.
It is a shameless copy-paste of the python reference code provided here,
with modifications to support the following additional features:
Searching in files without reading the whole file into memory, allowing handling of large files
Full unicode support
See the API documentation for more details.
Installing
Install from pip.
pip install boyermoore
Searching for all occurences of a substring in a file
>>> from boyermoore import search_file
>>>
>>> offsets = search_file("pattern!", "file.txt") # Find all occurrences of "pattern!" in file "file.txt"
>>> offsets # Display found occurrences
[12, 456, 10422] # Pattern occurs at byte offsets 12, 456, and 104222
Searching for the first occurence of a substring in a file
>>> from boyermoore import search_file
>>>
>>> offsets = search_file("pattern!", "file.txt", greedy=False) # Find the first occurrence of "pattern!" in file "file.txt"
>>> offsets # Display found occurrences
[12] # First occurrence of pattern is at byte offset 12
Performance / Speed test
The following section illustrates the average speed of the boyermoore.search_file
function when searching for a unicode string in files of sizes ranging from 32MB to 4GB.
Test environment
The test was executed using Python 3.7.6 on a Windows 10 system with an Intel(R) Core(TM) i7-8700K CPU @ 3.70GHz
and 32 GB of RAM.
Test methodology
The test searches for all occurrences of a fixed unicode string in a series of test files.
The unicode string is:
Hello नमस्ते Привет こんにちは
(“Hello” in English, followed by the Hindi translation, followed by the Russian translation,
followed by the Japanese translation)
Each test file has 2 occurrences of the unicode string, one at the very beginning (byte offset of 0)
and one at the very end (byte offset of [file_length - pattern_length]).
Test results
The following table shows the times taken to search for all occurences of the unicode
string “Hello नमस्ते Привет こんにちは” inside test files of various sizes.
File size
Time (seconds)
32 MB
0.31
64 MB
0.55
128 MB
1.07
256 MB
1.96
512 MB
3.87
1 GB
7.56
4 GB
31.01
Contributions
Contributions are welcome, please open a pull request at https://github.com/eriknyquist/boyermoore and ensure that:
All existing unit tests pass (run tests via python setup.py test)
New unit tests are added to cover any modified/new functionality (run python code_coverage.py
to ensure that coverage is above 98%)
You will need to install packages required for development, these are listed in dev_requirements.txt:
pip install -r dev_requirements.txt
If you have any questions about / need help with contributions or unit tests, please
contact Erik at [email protected].
For personal and professional use. You cannot resell or redistribute these repositories in their original state.
There are no reviews.