sitecrawl 1.0.5
Simple Python module to crawl a website and extract URLs.
Installation
Using pip:
pip3 install sitecrawl
sitecrawl --help
Or build from source:
# Clone project
git clone https://github.com/gabfl/sitecrawl && cd sitecrawl
# Installation
pip3 install .
Usage
CLI
sitecrawl --url https://www.yahoo.com/ --depth 2 --max 4 --verbose
->
* Found 4 internal URLs
https://www.yahoo.com
https://www.yahoo.com/entertainment
https://www.yahoo.com/lifestyle
https://www.yahoo.com/plus
* Found 5 external URLs
https://mail.yahoo.com/
https://news.yahoo.com/
https://finance.yahoo.com/
https://sports.yahoo.com/
https://shopping.yahoo.com/
* Skipped 0 URLs
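In the output above, URLs on www.yahoo.com are reported as internal, while other hosts such as mail.yahoo.com are reported as external. The sketch below shows one plausible way to make that split by comparing hostnames. It is an illustration only: is_internal and split_urls are hypothetical helpers, not part of sitecrawl, and sitecrawl's actual rule may differ.

```python
from urllib.parse import urlsplit


def is_internal(url, base_url):
    """Return True when url has the same hostname as base_url.

    Hypothetical rule for illustration; not taken from sitecrawl.
    """
    return urlsplit(url).hostname == urlsplit(base_url).hostname


def split_urls(urls, base_url):
    """Partition urls into (internal, external) lists, preserving order."""
    internal = [u for u in urls if is_internal(u, base_url)]
    external = [u for u in urls if not is_internal(u, base_url)]
    return internal, external


if __name__ == '__main__':
    found = [
        'https://www.yahoo.com/entertainment',
        'https://mail.yahoo.com/',
        'https://www.yahoo.com/plus',
    ]
    internal, external = split_urls(found, 'https://www.yahoo.com')
    print('Internal URLs:', internal)
    print('External URLs:', external)
```

Under this rule a subdomain counts as external, which matches the sample output but is still an assumption about the tool's behavior.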
As a module
Basic example:
from sitecrawl import crawl
crawl.base_url = 'https://www.yahoo.com'
crawl.deep_crawl(depth=2)
print('Internal URLs:', crawl.get_internal_urls())
print('External URLs:', crawl.get_external_urls())
print('Skipped URLs:', crawl.get_skipped_urls())
A more detailed example is available in example.py.
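To get a feel for what a depth limit and a URL cap mean in a crawl, here is a minimal sketch of a breadth-first walk over an in-memory link graph. It does not use sitecrawl and makes no network requests. crawl_graph, its parameters, and the sample graph are hypothetical, and sitecrawl's exact semantics for --depth and --max may differ.

```python
from collections import deque


def crawl_graph(links, start, depth, max_urls):
    """Breadth-first walk over a dict mapping each page to the pages it links to.

    Follows links from `start` up to `depth` levels away and stops once
    `max_urls` pages have been collected. Illustration only, not sitecrawl code.
    """
    seen = [start]
    queue = deque([(start, 0)])
    while queue:
        url, level = queue.popleft()
        if level >= depth:
            # Pages at the depth limit are kept, but their links are not followed
            continue
        for target in links.get(url, []):
            if len(seen) >= max_urls:
                return seen
            if target not in seen:
                seen.append(target)
                queue.append((target, level + 1))
    return seen


if __name__ == '__main__':
    # Hypothetical site structure
    site = {
        '/': ['/a', '/b'],
        '/a': ['/c'],
        '/c': ['/d'],
    }
    print(crawl_graph(site, '/', depth=2, max_urls=10))
```

Raising depth lets the walk reach pages further from the start page, while max_urls bounds the total number of pages collected regardless of depth.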
For personal and professional use. You cannot resell or redistribute these repositories in their original state.