pydomainextractor 0.13.9
A blazingly fast domain extraction library written in Rust
Table of Contents
About The Project
Built With
Performance
Extract From Domain
Extract From URL
Installation
Usage
Extraction
URL Extraction
Validation
TLDs List
License
Contact
About The Project
PyDomainExtractor is a Python library designed to parse domain names quickly.
To achieve the highest possible performance, the library is written in Rust.
Built With
AHash
idna
memchr
once_cell
Public Suffix List
Performance
Extract From Domain
Tests were run on a file containing 10 million random domains from various top-level domains (Mar. 13th, 2022).
| Library | Function | Time |
| --- | --- | --- |
| PyDomainExtractor | pydomainextractor.extract | 1.50s |
| publicsuffix2 | publicsuffix2.get_sld | 9.92s |
| tldextract | \_\_call\_\_ | 29.23s |
| tld | tld.parse_tld | 34.48s |
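The benchmark script behind these numbers is not reproduced here. As a rough sketch of how a comparable measurement could be taken, the hypothetical helper `time_extraction` below times any callable over an iterable of domains. The helper name, the stand-in extractor, and the sample data are assumptions made for illustration, not part of the library.

```python
import time
from typing import Callable, Iterable


def time_extraction(extract: Callable[[str], object], domains: Iterable[str]) -> float:
    """Return the wall-clock seconds spent calling `extract` on every domain."""
    start = time.perf_counter()
    for domain in domains:
        extract(domain)
    return time.perf_counter() - start


if __name__ == "__main__":
    # Stand-in extractor so the sketch runs without any third-party package.
    # Swap in the extraction function of the library under test.
    sample = ["google.com", "www.example.org"] * 1000
    elapsed = time_extraction(lambda d: d.rsplit(".", 1), sample)
    print(f"{elapsed:.4f}s for {len(sample)} domains")
```

Passing each library's extraction function to the same helper keeps the loop overhead identical across candidates.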
Extract From URL
The test was conducted on a file containing 1 million random URLs (Mar. 13th, 2022).
| Library | Function | Time |
| --- | --- | --- |
| PyDomainExtractor | pydomainextractor.extract_from_url | 2.24s |
| publicsuffix2 | publicsuffix2.get_sld | 10.84s |
| tldextract | \_\_call\_\_ | 36.04s |
| tld | tld.parse_tld | 57.87s |
Installation
pip3 install PyDomainExtractor
Usage
Extraction
import pydomainextractor
# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()
domain_extractor.extract('google.com')
>>> {
>>> 'subdomain': '',
>>> 'domain': 'google',
>>> 'suffix': 'com'
>>> }
# Loads custom suffix list data. The data should follow the Public Suffix List format.
domain_extractor = pydomainextractor.DomainExtractor(
'tld\n'
'custom.tld\n'
)
domain_extractor.extract('google.com')
>>> {
>>> 'subdomain': 'google',
>>> 'domain': 'com',
>>> 'suffix': ''
>>> }
domain_extractor.extract('google.custom.tld')
>>> {
>>> 'subdomain': '',
>>> 'domain': 'google',
>>> 'suffix': 'custom.tld'
>>> }
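To see why `google.com` ends up with an empty suffix under the custom list above, it can help to look at a plain longest-suffix match. The sketch below is a pure-Python illustration under simplifying assumptions (no wildcard or exception rules, no IDNA handling). It is not the library's Rust implementation, and `split_domain` is a hypothetical name.

```python
from typing import Dict, Set


def split_domain(domain: str, suffixes: Set[str]) -> Dict[str, str]:
    """Split a domain using the longest suffix found in `suffixes`."""
    labels = domain.split(".")
    # Try the longest candidate suffix first, always leaving one label for the domain.
    for start in range(1, len(labels)):
        candidate = ".".join(labels[start:])
        if candidate in suffixes:
            return {
                "subdomain": ".".join(labels[: start - 1]),
                "domain": labels[start - 1],
                "suffix": candidate,
            }
    # No known suffix matched: treat the last label as the domain.
    return {
        "subdomain": ".".join(labels[:-1]),
        "domain": labels[-1],
        "suffix": "",
    }


# Same custom list as in the example above, one suffix per line.
custom_suffixes = {line for line in "tld\ncustom.tld\n".splitlines() if line}

print(split_domain("google.com", custom_suffixes))
print(split_domain("google.custom.tld", custom_suffixes))
```

Because `com` is not in the custom list, nothing matches and the last label is reported as the domain, which mirrors the output shown above.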
URL Extraction
import pydomainextractor
# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()
domain_extractor.extract_from_url('http://google.com/')
>>> {
>>> 'subdomain': '',
>>> 'domain': 'google',
>>> 'suffix': 'com'
>>> }
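For comparison, a domain-only extractor needs the host isolated from the URL first. The sketch below does that step with the standard library's `urllib.parse.urlsplit`. It only illustrates the work `extract_from_url` saves the caller and makes no claim about how the library parses URLs internally. `host_from_url` is a hypothetical helper.

```python
from typing import Optional
from urllib.parse import urlsplit


def host_from_url(url: str) -> Optional[str]:
    """Return the lowercased host of `url`, or None when no host is present."""
    # urlsplit only recognises a host when the URL has a scheme or starts with "//".
    return urlsplit(url).hostname


print(host_from_url("http://google.com/"))
print(host_from_url("https://user@www.Example.co.uk:8443/path?q=1"))
```

The returned host could then be passed to a domain-only function such as `extract`.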
Validation
import pydomainextractor
# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()
domain_extractor.is_valid_domain('google.com')
>>> True
domain_extractor.is_valid_domain('domain.اتصالات')
>>> True
domain_extractor.is_valid_domain('xn--mgbaakc7dvf.xn--mgbaakc7dvf')
>>> True
domain_extractor.is_valid_domain('domain-.com')
>>> False
domain_extractor.is_valid_domain('-sub.domain.com')
>>> False
domain_extractor.is_valid_domain('\xF0\x9F\x98\x81nonalphanum.com')
>>> False
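The hyphen cases above follow the usual hostname label rule: a label may not start or end with a hyphen. A rough pure-Python approximation of that rule for ASCII input is sketched below. It is an assumption-laden illustration, not the library's validation logic: it does no IDNA conversion, so a Unicode name such as `domain.اتصالات` is rejected here even though `is_valid_domain` accepts it. `looks_like_valid_ascii_domain` is a hypothetical name.

```python
import re

# One label: 1 to 63 characters, letters, digits, and hyphens,
# with no hyphen at either end.
_LABEL = re.compile(r"[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?")


def looks_like_valid_ascii_domain(domain: str) -> bool:
    """Approximate check for ASCII (including punycode) domain names."""
    if not domain or len(domain) > 253:
        return False
    return all(
        _LABEL.fullmatch(label) is not None
        for label in domain.lower().split(".")
    )


for name in ("google.com", "xn--mgbaakc7dvf.xn--mgbaakc7dvf", "domain-.com", "-sub.domain.com"):
    print(name, looks_like_valid_ascii_domain(name))
```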
TLDs List
import pydomainextractor
# Loads the current supplied version of PublicSuffixList from the repository. Does not download any data.
domain_extractor = pydomainextractor.DomainExtractor()
domain_extractor.get_tld_list()
>>> [
>>> 'bostik',
>>> 'backyards.banzaicloud.io',
>>> 'biz.bb',
>>> ...
>>> ]
License
Distributed under the MIT License. See LICENSE for more information.
Contact
Gal Ben David - [email protected]
Project Link: https://github.com/Intsights/PyDomainExtractor