= doc_crawler 1.2
doc_crawler - explore a website recursively and download all the wanted documents (PDF, ODT…).

== Synopsis

  doc_crawler.py [--accept=jpe?g] [--download] [--single-page] [--verbose] http://…
  doc_crawler.py [--wait=3] [--no-random-wait] --download-files url.lst
  doc_crawler.py [--wait=0] --download-file http://…
or
  python3 -m doc_crawler […] http://…

== Description

doc_crawler can explore a website recursively from a given URL and retrieve, in the descendant pages, the document files it encounters (by default: PDF, ODT, DOC, XLS, ZIP…) based on regular expression matching, typically against their extension. Documents can be listed on the standard output or downloaded (with the --download argument). A sketch of this crawl-and-filter logic is given in the appendix at the end of this document.

To address real-life situations, activities can be logged (with --verbose). +
Also, the search can be limited to one page (with the --single-page argument).

Documents can be downloaded from a given list of URLs, which you may have previously produced using the default options of doc_crawler and an output redirection such as: +
`./doc_crawler.py http://… > url.lst`

Documents can also be downloaded one by one if necessary (to finish the work), using the --download-file argument, which makes doc_crawler a tool sufficient by itself to assist you at every step.

By default, the program waits a randomly picked amount of seconds, between 1 and 5, before each download, to avoid being rude toward the web server it interacts with (and so to avoid being black-listed). This behavior can be disabled (with a --no-random-wait and/or a --wait=0 argument).

doc_crawler.py works great with Tor: `torsocks doc_crawler.py http://…`

== Options

*--accept*=_jpe?g_::
	Optional regular expression (case insensitive) to keep matching document names.
	Example: _--accept=jpe?g$_ will keep all of: .JPG, .JPEG, .jpg, .jpeg

*--download*::
	Directly downloads the found documents if set; outputs their URLs if not.

*--single-page*::
	Limits the search for documents to download to the given URL.

*--verbose*::
	Creates a log file to keep track of what was done.

*--wait*=x::
	Changes the default waiting time before each download (page or document).
	Example: _--wait=3_ will wait between 1 and 3s before each download. Default is 5.

*--no-random-wait*::
	Stops the random picking of waiting times; the _--wait=_ value or the default is used.

*--download-files* url.lst::
	Downloads each document whose URL is listed in the given file.
	Example: _--download-files url.lst_

*--download-file* http://…::
	Directly saves the URL-pointed document in the current folder.

== Tests

Around 30 _doctests_ are included in _doc_crawler.py_. You can run them with the following command in the cloned repository root: +
`python3 -m doctest doc_crawler.py`

Tests can also be launched one by one using the _--test=XXX_ argument: +
`python3 -m doc_crawler --test=download_file`

The tests passed successfully if nothing is output.

== Requirements

- requests
- yaml

One can install them under Debian using the following command: `apt install python3-requests python3-yaml`

== Author

Simon Descarpentries - https://s.d12s.fr

== Resources

GitHub repository: https://github.com/Siltaar/doc_crawler.py +
PyPI repository: https://pypi.python.org/pypi/doc_crawler

== Support

To support this project, you may consider a donation (even a symbolic one) via https://liberapay.com/Siltaar

== Licence

GNU General Public License v3.0. See the LICENCE file for more information.
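== Appendix: illustrative sketches

To make the Description above more concrete, here is a minimal sketch of the recursive crawl-and-filter idea, written with the _requests_ library the program already requires. It is not doc_crawler's actual code: the function name, the patterns and the same-site heuristic are assumptions made for illustration only.

[source,python]
----
import re
import requests

# Hypothetical default pattern: common document extensions (PDF, ODT, DOC, XLS, ZIP…).
DOC_PATTERN = re.compile(r'\.(pdf|odt|docx?|xlsx?|zip)$', re.IGNORECASE)
LINK_PATTERN = re.compile(r'href="([^"]+)"', re.IGNORECASE)

def crawl(url, seen=None):
    """Recursively visit pages under `url`, printing matching document URLs."""
    if seen is None:
        seen = set()
    if url in seen:
        return
    seen.add(url)
    try:
        page = requests.get(url, timeout=10)
    except requests.RequestException:
        return                                     # skip unreachable pages
    for link in LINK_PATTERN.findall(page.text):
        link = requests.compat.urljoin(url, link)  # resolve relative links
        if DOC_PATTERN.search(link):
            print(link)                            # listed on stdout by default
        elif link.startswith(url) and link not in seen:
            crawl(link, seen)                      # descend into child pages

crawl('http://example.com/')                       # placeholder start URL
----

Redirecting the printed URLs to a file reproduces the `./doc_crawler.py http://… > url.lst` workflow shown above.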
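The politeness delay described above (a random pause between 1 and the --wait bound, a fixed pause with --no-random-wait, no pause at all with --wait=0) boils down to a few lines. This is a sketch with a hypothetical function name, not the program's actual code:

[source,python]
----
import random
import time

def polite_wait(wait=5, random_wait=True):
    """Pause before each download, mirroring --wait and --no-random-wait."""
    if wait <= 0:
        return                                       # --wait=0: no pause at all
    delay = random.randint(1, wait) if random_wait else wait
    time.sleep(delay)
----

With the defaults this sleeps between 1 and 5 seconds; _--wait=3_ narrows the range to 1 to 3 seconds, and _--no-random-wait_ makes the full _--wait_ value be used every time.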
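Finally, since the test suite is made of plain _doctests_, a hypothetical example shows why a successful run prints nothing (the function and its pattern below are invented for illustration; doc_crawler's real doctests live in _doc_crawler.py_):

[source,python]
----
import doctest
import re

def keep_document(name, pattern=r'\.(pdf|odt)$'):
    """Return True if `name` matches the accepted pattern (case insensitive).

    >>> keep_document('report.PDF')
    True
    >>> keep_document('picture.jpg')
    False
    """
    return re.search(pattern, name, re.IGNORECASE) is not None

if __name__ == '__main__':
    doctest.testmod()    # silent when every embedded test passes
----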