doc_crawler 1.2

Creator: bradpython12

Description:

doc_crawler - explore a website recursively and download all the wanted documents (PDF, ODT…).

== Synopsis

doc_crawler.py [--accept=jpe?g$] [--download] [--single-page] [--verbose] http://… +
doc_crawler.py [--wait=3] [--no-random-wait] --download-files url.lst +
doc_crawler.py [--wait=0] --download-file http://… +
or python3 -m doc_crawler […] http://…

== Description

doc_crawler can explore a website recursively from a given URL and retrieve, in the descendant pages, the document files it encounters (by default: PDF, ODT, DOC, XLS, ZIP…) based on regular expression matching (typically against their extension). Documents can be listed on the standard output or downloaded (with the --download argument). (A minimal illustration of this crawl loop is sketched at the end of this description.)

To address real-life situations, activities can be logged (with --verbose). +
Also, the search can be limited to one page (with the --single-page argument).

Documents can be downloaded from a given list of URLs, which you may have previously produced using the default options of doc_crawler and an output redirection such as: +
`./doc_crawler.py http://… > url.lst`

Documents can also be downloaded one by one if necessary (to finish the work), using the --download-file argument, which makes doc_crawler a tool sufficient by itself to assist you at every step.

By default, the program waits a randomly picked amount of seconds, between 1 and 5, before each download, to avoid being rude toward the web server it interacts with (and so avoid being black-listed). This behavior can be disabled (with a --no-random-wait and/or a --wait=0 argument).

doc_crawler.py works great with Tor: `torsocks doc_crawler.py http://…`

== Options

*--accept*=_jpe?g$_::
Optional regular expression (case insensitive) to keep matching document names.
Example: _--accept=jpe?g$_ will keep all of: .JPG, .JPEG, .jpg, .jpeg
*--download*::
Directly downloads found documents if set; outputs their URLs if not.
*--single-page*::
Limits the search for documents to download to the given URL.
*--verbose*::
Creates a log file to keep a trace of what was done.
*--wait*=x::
Changes the default waiting time before each download (page or document).
Example: _--wait=3_ will wait between 1 and 3s before each download. Default is 5.
*--no-random-wait*::
Stops the random picking of waiting times. The _--wait=_ value, or the default, is used.
*--download-files* url.lst::
Downloads each document whose URL is listed in the given file.
Example: _--download-files url.lst_
*--download-file* http://…::
Directly saves the URL-pointed document in the current folder.

== Tests

Around 30 _doctests_ are included in _doc_crawler.py_. You can run them with the following command in the cloned repository root: +
`python3 -m doctest doc_crawler.py`

Tests can also be launched one by one using the _--test=XXX_ argument: +
`python3 -m doc_crawler --test=download_file`

Tests pass successfully if nothing is output.

== Requirements

- requests
- yaml

One can install them under Debian using the following command: `apt install python3-requests python3-yaml`

== Author

Simon Descarpentries - https://s.d12s.fr

== Resources

Github repository: https://github.com/Siltaar/doc_crawler.py +
Pypi repository: https://pypi.python.org/pypi/doc_crawler

== Support

To support this project, you may consider (even a symbolic) donation via: https://liberapay.com/Siltaar

== Licence

GNU General Public License v3.0. See the LICENCE file for more information.
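To make the crawl-and-match loop described above concrete, here is a minimal illustrative sketch, not doc_crawler's actual source: the `crawl` function, the `DOC_RE`/`HREF_RE` patterns and the starting URL are all invented for this example, and only the `requests` dependency from the Requirements section is assumed.

[source,python]
----
# Illustrative sketch of the behaviour described above -- NOT the
# project's real implementation. All names here are hypothetical.
import re
import time
import random
import requests
from urllib.parse import urljoin, urlparse

# Rough equivalent of the default document filter (PDF, ODT, DOC, XLS, ZIP…).
DOC_RE = re.compile(r'\.(pdf|odt|docx?|xlsx?|zip)$', re.IGNORECASE)
# Crude href extractor; the real tool's matching rules may differ.
HREF_RE = re.compile(r'href=["\']([^"\']+)["\']', re.IGNORECASE)

def crawl(start_url, accept=DOC_RE, wait_max=5):
    """Recursively print document URLs found under start_url's site."""
    domain = urlparse(start_url).netloc
    seen, queue = {start_url}, [start_url]
    while queue:
        page = queue.pop()
        # Polite random pause before each request, as the README describes.
        time.sleep(random.uniform(1, wait_max))
        try:
            html = requests.get(page, timeout=10).text
        except requests.RequestException:
            continue
        for link in HREF_RE.findall(html):
            url = urljoin(page, link)
            if url in seen:
                continue
            seen.add(url)
            if accept.search(url):
                print(url)            # document found: list its URL
            elif urlparse(url).netloc == domain:
                queue.append(url)     # same-site page: descend into it

crawl("http://example.com/")          # hypothetical starting URL
----

Redirecting this kind of output to a file yields exactly the sort of `url.lst` that the --download-files mode then consumes.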

License

For personal and professional use. You cannot resell or redistribute these repositories in their original state.
