
doc_crawler - explore a website recursively and download all the documents you want (PDF, ODT…).

== Synopsis
doc_crawler.py [--accept=jpe?g$] [--download] [--single-page] [--verbose] http://…
doc_crawler.py [--wait=3] [--no-random-wait] --download-files url.lst
doc_crawler.py [--wait=0] --download-file http://…
or
python3 -m doc_crawler […] http://…

== Description
_doc_crawler_ can explore a website recursively from a given URL and retrieve, from the
descendant pages, the document files it encounters (by default: PDF, ODT, DOC, XLS, ZIP…),
selected by regular expression matching (typically against their extension). Documents can be
listed on the standard output or downloaded (with the _--download_ argument).

To address real-life situations, activities can be logged (with _--verbose_). +
The search can also be limited to a single page (with the _--single-page_ argument).

Documents can be downloaded from a given list of URLs, which you may have previously
produced using the default options of _doc_crawler_ and an output redirection such as: +
`./doc_crawler.py http://… > url.lst`
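
Such a list can then be fed back to _doc_crawler_ with the _--download-files_ argument
described below, for instance: +
`./doc_crawler.py --download-files url.lst`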

Documents can also be downloaded one by one if necessary (to finish the work), using the
_--download-file_ argument, which makes _doc_crawler_ a self-sufficient tool to assist you
at every step.

By default, the program waits a randomly picked number of seconds, between 1 and 5, before each
download to avoid being rude toward the web server it interacts with (and so to avoid being
blacklisted). This behavior can be disabled (with the _--no-random-wait_ and/or _--wait=0_
arguments).
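
For example, based on the synopsis above, downloading a previously produced list with no
pause between requests could look like: +
`./doc_crawler.py --no-random-wait --wait=0 --download-files url.lst`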

_doc_crawler.py_ works great with Tor: `torsocks doc_crawler.py http://…`

== Options
*--accept*=_jpe?g$_::
Optional case-insensitive regular expression; only document names matching it are kept.
Example: _--accept=jpe?g$_ will keep: .JPG, .JPEG, .jpg, .jpeg (see the combined example after this list)
*--download*::
Directly downloads the found documents if set, outputs their URLs if not.
*--single-page*::
Limits the search for downloadable documents to the given URL only.
*--verbose*::
Creates a log file to keep track of what was done.
*--wait*=x::
Changes the default maximum waiting time before each download (page or document).
Example: _--wait=3_ will wait between 1 and 3 s before each download. The default is 5.
*--no-random-wait*::
Disables the random picking of waiting times; the _--wait_ value (or the default) is used as-is.
*--download-files* url.lst::
Downloads each document whose URL is listed in the given file.
Example: _--download-files url.lst_
*--download-file* http://…::
Directly saves the document pointed to by the given URL in the current folder.
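
Putting it together, a typical invocation (mirroring the synopsis above) to fetch every JPEG
linked from a single page, with logging enabled, could be: +
`./doc_crawler.py --accept=jpe?g$ --download --single-page --verbose http://…`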

== Tests
Around 30 _doctests_ are included in _doc_crawler.py_. You can run them from the root of the
cloned repository with the following command: +
`python3 -m doctest doc_crawler.py`

Tests can also be launched one by one using the _--test=XXX_ argument: +
`python3 -m doc_crawler --test=download_file`

Tests pass successfully if nothing is output.

== Requirements
- requests
- yaml

They can be installed under Debian using the following command: `apt install python3-requests python3-yaml`
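
On other systems, the same dependencies can presumably be installed with pip (the _yaml_
module is provided by the PyYAML package): +
`python3 -m pip install requests PyYAML`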

== Author
Simon Descarpentries - https://s.d12s.fr

== Resources
GitHub repository: https://github.com/Siltaar/doc_crawler.py +
PyPI repository: https://pypi.python.org/pypi/doc_crawler

== Support
To support this project, you may consider a donation (even a symbolic one) via: https://liberapay.com/Siltaar

== Licence
GNU General Public License v3.0. See the LICENCE file for more information.
