adblockparser 0.7

Creator: bradpython12

Last updated:

0 purchases

TODO
Add to Cart

Description:

adblockparser 0.7

adblockparser




adblockparser is a package for working with Adblock Plus filter rules.
It can parse Adblock Plus filters and match URLs against them.

Installation
pip install adblockparser
Python 2.7 and Python 3.3+ are supported.
If you plan to use this library with a large number of filters
installing pyre2 library is highly recommended: the speedup
for a list of default EasyList filters can be greater than 1000x.

pip install ‘re2 >= 0.2.21’

Note that pyre2 library requires C++ re2 library installed.
On OS X you can get it using homebrew (brew install re2).


Usage
To learn about Adblock Plus filter syntax check these links:

https://adblockplus.org/en/filter-cheatsheet
https://adblockplus.org/en/filters


Get filter rules somewhere: write them manually, read lines from a file
downloaded from EasyList, etc.:
>>> raw_rules = [
... "||ads.example.com^",
... "@@||ads.example.com/notbanner^$~script",
... ]

Create AdblockRules instance from rule strings:
>>> from adblockparser import AdblockRules
>>> rules = AdblockRules(raw_rules)

Use this instance to check if an URL should be blocked or not:
>>> rules.should_block("http://ads.example.com")
True
Rules with options are ignored unless you pass a dict with options values:
>>> rules.should_block("http://ads.example.com/notbanner")
True
>>> rules.should_block("http://ads.example.com/notbanner", {'script': False})
False
>>> rules.should_block("http://ads.example.com/notbanner", {'script': True})
True


Consult with Adblock Plus docs
for options description. These options allow to write filters that depend
on some external information not available in URL itself.


Performance

Regex engines
AdblockRules class creates a huge regex to match filters that
don’t use options. pyre2 library works better than stdlib’s re
with such regexes. If you have pyre2 installed then AdblockRules
should work faster, and the speedup can be dramatic - more than 1000x
in some cases.
Sometimes pyre2 prints something like
re2/dfa.cc:459: DFA out of memory: prog size 270515 mem 1713850 to stderr.
Give re2 library more memory to fix that:
>>> rules = AdblockRules(raw_rules, use_re2=True, max_mem=512*1024*1024) # doctest: +SKIP
Make sure you are using re2 0.2.20 installed from PyPI, it doesn’t work.


Parsing rules with options
Rules that have options are currently matched in a loop, one-by-one.
Also, they are checked for compatibility with options passed by user:
for example, if user didn’t pass ‘script’ option (with a True or False
value), all rules involving script are discarded.
This is slow if you have thousands of such rules. To make it work faster,
explicitly list all options you want to support in AdblockRules constructor,
disable skipping of unsupported rules, and always pass a dict with all options
to should_block method:
>>> rules = AdblockRules(
... raw_rules,
... supported_options=['script', 'domain'],
... skip_unsupported_rules=False
... )
>>> options = {'script': False, 'domain': 'www.mystartpage.com'}
>>> rules.should_block("http://ads.example.com/notbanner", options)
False
This way rules with unsupported options will be filtered once, when
AdblockRules instance is created.



Limitations
There are some known limitations of the current implementation:

element hiding rules are ignored;
matching URLs against a large number of filters can be slow-ish,
especially if pyre2 is not installed and many filter options are enabled;
match-case filter option is not properly supported (it is ignored);
document filter option is not properly supported;
rules are not validated before parsing, so invalid rules may raise
inconsistent exceptions or silently work incorrectly.

It is possible to remove all these limitations. Pull requests are welcome
if you want to make it happen sooner!


Contributing

source code: https://github.com/scrapinghub/adblockparser
issue tracker: https://github.com/scrapinghub/adblockparser/issues

In order to run tests, install tox and type
tox
from the source checkout.
The license is MIT.



Changes

0.7 (2016-10-17)

Fixed parsing issue with recent easylist.txt;
fixed a link to easylist (thanks https://github.com/limonte).



0.6 (2016-09-10)

Added support for regex rules (thanks https://github.com/mlyko).



0.5 (2016-03-04)

Fixed an issue with blank lines in filter files
(thanks https://github.com/skrypka);
fixed an issue with applying rules with ‘domain’ option
when domain doesn’t have a dot (e.g. ‘localhost’);
Python 2.6 and Python 3.2 support is dropped;
adblockparser likely still work in these interpreters,
but this is no longer checked by tests.



0.4 (2015-03-29)

AdblockRule now caches the compiled regexes (thanks
https://github.com/mozbugbox);
Fixed an issue with “domain” option handling
(thanks https://github.com/nbraem for the bug report and a test case);
cleanups and test improvements.



0.3 (2014-07-11)

Switch to setuptools;
better __repr__ for AdblockRule;
Python 3.4 support is confirmed;
testing improvements.



0.2 (2014-03-20)
This release provides much faster AdblockRules.should_block() method
for rules without options and rules with ‘domain’ option.

better combined regex for option-less rules that makes re2 library
always use DFA without falling back to NFA;
an index for rules with domains;
params method arguments are renamed to options for consistency.



0.1.1 (2014-03-11)
By default AdblockRules autodetects re2 library and uses
it if a compatible version is detected.


0.1 (2014-03-03)
Initial release.

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.

Customer Reviews

There are no reviews.