doc_crawler 1.2

Explore a website recursively and download all the wanted documents (PDF, ODT…)

doc_crawler - explore a website recursively and download all the wanted documents (PDF, ODT…).

== Synopsis
doc_crawler.py [--accept=jpe?g$] [--download] [--single-page] [--verbose] http://… +
doc_crawler.py [--wait=3] [--no-random-wait] --download-files url.lst +
doc_crawler.py [--wait=0] --download-file http://… +
python3 -m doc_crawler […] http://…

== Description
_doc_crawler_ can explore a website recursively from a given URL and retrieve, in the
descendant pages, the encountered document files (by default: PDF, ODT, DOC, XLS, ZIP…)
based on regular expression matching (typically against their extension). Documents can be
listed on the standard output or downloaded (with the _--download_ argument).
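
The extension matching described above can be pictured with a few lines of Python. This is only an illustrative sketch, not the code of _doc_crawler_ itself: the function name `keep_matching` and the sample URLs are hypothetical, and only the `jpe?g$` pattern comes from this document.

```python
import re

def keep_matching(urls, pattern=r"jpe?g$"):
    """Keep the URLs whose end matches the pattern, ignoring case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [url for url in urls if regex.search(url)]

# Hypothetical sample links, for illustration only.
links = [
    "http://example.org/a.JPG",
    "http://example.org/b.jpeg",
    "http://example.org/c.pdf",
    "http://example.org/index.html",
]
print(keep_matching(links))
```

Anchoring the pattern with `$` is what ties the match to the end of the URL, and so typically to the file extension.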

To address real life situations, activities can be logged (with _--verbose_). +
Also, the search can be limited to one page (with the _--single-page_ argument).
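
The recursive exploration and the single-page limit can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual implementation: `LinkParser`, `find_documents`, the injected `fetch` function and the default pattern `\.(pdf|odt)$` are all hypothetical names and values chosen for the example.

```python
import re
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

class LinkParser(HTMLParser):
    """Collect the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def find_documents(start_url, fetch, accept=r"\.(pdf|odt)$", single_page=False):
    """Return the document URLs reachable from start_url.

    fetch is a caller-supplied function returning the HTML of a URL
    (an assumption of this sketch, so that no network access is needed).
    """
    wanted = re.compile(accept, re.IGNORECASE)
    site = urlparse(start_url).netloc
    to_visit, seen, found = [start_url], {start_url}, []
    while to_visit:
        page = to_visit.pop(0)
        parser = LinkParser()
        parser.feed(fetch(page))
        for href in parser.links:
            url = urljoin(page, href)  # resolve relative links
            if url in seen:
                continue
            seen.add(url)
            if wanted.search(url):
                found.append(url)  # a wanted document
            elif not single_page and urlparse(url).netloc == site:
                to_visit.append(url)  # a page of the same site to explore
    return found
```

With `single_page=True`, links to other pages are never queued, so only the documents linked from the starting page are reported.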

Documents can be downloaded from a given list of URLs, which you may have previously
produced using the default options of _doc_crawler_ and an output redirection such as: +
`./doc_crawler.py http://… > url.lst`
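
Downloading from such a list could look like the sketch below. It assumes one URL per line in the list file. The helper names `local_name` and `download_list`, the `index.html` fallback and the injected `fetch` function are hypothetical, chosen for the example and not taken from _doc_crawler_.

```python
import os
from urllib.parse import urlparse

def local_name(url):
    """Derive a local file name from the last path segment of a URL."""
    name = os.path.basename(urlparse(url).path)
    return name or "index.html"  # fallback name, an assumption of this sketch

def download_list(list_path, fetch, folder="."):
    """Save every URL listed (one per line) in list_path into folder.

    fetch is a caller-supplied function returning the bytes of a URL.
    """
    saved = []
    with open(list_path, encoding="utf-8") as url_list:
        for line in url_list:
            url = line.strip()
            if not url:
                continue  # skip blank lines
            target = os.path.join(folder, local_name(url))
            with open(target, "wb") as out:
                out.write(fetch(url))
            saved.append(target)
    return saved
```

With the _requests_ library installed, `fetch` could for instance be `lambda url: requests.get(url).content`.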

Documents can also be downloaded one by one if necessary (to finish the work), using the
_--download-file_ argument, which makes _doc_crawler_ a self-sufficient tool that assists you
at every step.

By default, the program waits a randomly picked number of seconds, between 1 and 5, before each
download to avoid being rude toward the web server it interacts with (and so avoid being
blacklisted). This behavior can be disabled (with a _--no-random-wait_ and/or a _--wait=0_
argument).
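
The waiting rule can be summarised by the small sketch below. It assumes whole seconds and inclusive bounds. The function name `pick_wait` and its parameters are hypothetical, and the real program may compute its delay differently.

```python
import random

def pick_wait(wait=5, random_wait=True):
    """Return the number of seconds to pause before a download."""
    if wait <= 0:
        return 0  # waiting disabled, as with --wait=0
    if random_wait:
        return random.randint(1, wait)  # both bounds included
    return wait  # fixed delay, as with --no-random-wait

print(pick_wait(3, random_wait=False))
```

A caller would then pause with `time.sleep(pick_wait(3))` before each request.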

_doc_crawler.py_ works great with Tor : `torsocks doc_crawler.py http://…`

== Options
*--accept*=_jpe?g$_::
Optional regular expression (case insensitive) selecting the document names to keep.
Example : _--accept=jpe?g$_ will keep all : .JPG, .JPEG, .jpg, .jpeg
*--download*::
Downloads the found documents directly if set, outputs their URLs if not.
*--single-page*::
Limits the search for documents to download to the given URL.
*--verbose*::
Creates a log file to keep a trace of what was done.
*--wait*=_x_::
Changes the default waiting time before each download (page or document).
Example : _--wait=3_ will wait between 1 and 3s before each download. Default is 5.
*--no-random-wait*::
Stops the random picking of waiting times. The _--wait=_ value or the default is used.
*--download-files* url.lst::
Downloads each document whose URL is listed in the given file.
Example : _--download-files url.lst_
*--download-file* http://…::
Directly saves the document at the given URL in the current folder.
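
For readers who want to see how these options fit together, here is a minimal `argparse` sketch that mirrors the option list above. It is not the parser used by _doc_crawler_: the function name `build_parser`, the help texts and the optional positional `url` are assumptions made for the example.

```python
import argparse

def build_parser():
    """Build a parser mirroring the documented options (illustration only)."""
    parser = argparse.ArgumentParser(prog="doc_crawler")
    parser.add_argument("--accept", metavar="REGEX",
                        help="case-insensitive regular expression selecting documents")
    parser.add_argument("--download", action="store_true",
                        help="download the found documents instead of listing them")
    parser.add_argument("--single-page", action="store_true",
                        help="limit the search to the given URL")
    parser.add_argument("--verbose", action="store_true",
                        help="log what is done")
    parser.add_argument("--wait", type=int, default=5,
                        help="maximum waiting time in seconds before each download")
    parser.add_argument("--no-random-wait", action="store_true",
                        help="always wait the full --wait value")
    parser.add_argument("--download-files", metavar="url.lst",
                        help="download every URL listed in the given file")
    parser.add_argument("--download-file", metavar="URL",
                        help="download a single document")
    parser.add_argument("url", nargs="?", help="starting URL of the exploration")
    return parser

# Hypothetical command line, for illustration only.
args = build_parser().parse_args(["--accept=jpe?g$", "--wait=3", "http://example.org/"])
print(args.accept, args.wait, args.single_page, args.url)
```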

== Tests
Around 30 _doctests_ are included in _doc_crawler.py_. You can run them with the following
command in the cloned repository root: +
`python3 -m doctest`

Tests can also be launched one by one using the _--test=XXX_ argument: +
`python3 -m doc_crawler --test=download_file`

Tests pass if nothing is output.
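
As a reminder of how doctests work, the sketch below embeds one in a docstring and runs it programmatically. The helper `file_name_of` and the wrapper `run_doctests` are hypothetical and are not part of _doc_crawler.py_. Only the standard `doctest` module is used.

```python
import doctest

def file_name_of(url):
    """Return the last path segment of a URL (hypothetical helper).

    >>> file_name_of("http://example.org/docs/report.pdf")
    'report.pdf'
    """
    return url.rstrip("/").rsplit("/", 1)[-1]

def run_doctests(func):
    """Run the doctests of one function and return (failures, tries)."""
    finder = doctest.DocTestFinder()
    runner = doctest.DocTestRunner(verbose=False)
    globs = {func.__name__: func}  # names visible to the examples
    for test in finder.find(func, name=func.__name__, module=False, globs=globs):
        runner.run(test)
    return runner.failures, runner.tries

print(run_doctests(file_name_of))
```

A passing example produces no report of its own, which is why a silent run means success.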

== Requirements
- requests
- yaml

One can install them under Debian using the following command : `apt install python3-requests python3-yaml`

== Author
Simon Descarpentries -

== Resources
GitHub repository : +
PyPI repository :

== Support
To support this project, you may consider a donation (even a symbolic one) via :

== Licence
GNU General Public License v3.0. See LICENCE file for more information.

|===
|File |Type |Py Version |Uploaded on |Size

|doc_crawler-1.2.tar.gz (md5) |Source | |2018-03-07 |6KB
|===