# doc_crawler 1.1

Explore a website recursively and download all the documents you want (PDF, ODT…).

`doc_crawler` can explore a website recursively from a given URL and retrieve, from the
pages it visits, the document files it encounters (by default: PDF, ODT, CSV, RTF, DOC and XLS),
selected by regular-expression matching (typically against their extension).
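
As a minimal sketch of that filtering step (an illustration only, assuming the default extensions listed above; the `DEFAULT_ACCEPT` and `is_wanted` names are not part of `doc_crawler` itself), keeping a document boils down to a case-insensitive regular-expression search against each candidate URL:

```python
import re

# Sketch of the default filter: keep a URL when its name ends with one of the
# default document extensions (case-insensitive). Not doc_crawler's real code.
DEFAULT_ACCEPT = re.compile(r"\.(pdf|odt|csv|rtf|doc|xls)$", re.IGNORECASE)

def is_wanted(url):
    """Return True when the URL looks like a document worth keeping."""
    return bool(DEFAULT_ACCEPT.search(url))

print(is_wanted("http://example.org/report.PDF"))  # True
print(is_wanted("http://example.org/index.html"))  # False
```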

## Features
Documents can be listed on the standard output or downloaded (with the `--download` argument).

To address real-life situations, activity can be logged (with `--verbose`). \
The search can also be limited to a single page (with the `--single-page` argument).

Alternatively, documents can be downloaded from a given list of URLs, which you may have
previously produced using the default options of `doc_crawler` and an output redirection such as: \
`./doc_crawler.py http://… > url.lst`.

To finish the job, documents can be downloaded one by one if necessary, using the
`--download-file` argument, which makes `doc_crawler` a self-sufficient tool to assist you
at every step.

By default, the program waits a randomly picked number of seconds, between 1 and 5, before each
download, to avoid being rude toward the web server it interacts with (and thus avoid being
blacklisted). This behavior can be disabled (with the `--no-random-wait` and/or `--wait=0`
arguments).
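
The described pacing can be pictured with a minimal sketch like the following (a hypothetical `polite_pause` helper illustrating the behaviour above, not `doc_crawler`'s actual code):

```python
import random
import time

def polite_pause(wait=5, no_random_wait=False):
    """Sleep before a download: a random 1..wait seconds, or exactly `wait` seconds.

    Hypothetical helper mirroring the described behaviour; wait=0 (as with
    --wait=0) disables the pause entirely.
    """
    if wait <= 0:
        return
    delay = wait if no_random_wait else random.uniform(1, wait)
    time.sleep(delay)

polite_pause()                              # default: between 1 and 5 seconds
polite_pause(wait=3)                        # --wait=3: between 1 and 3 seconds
polite_pause(wait=3, no_random_wait=True)   # --wait=3 --no-random-wait: exactly 3 s
```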

## Options
`--accept` optional case-insensitive regular expression; only matching document names are kept. \
Example: `--accept=jpe?g$` will keep all of: .JPG, .JPEG, .jpg, .jpeg \
`--download` directly downloads the found documents if set; outputs their URLs if not. \
`--single-page` limits the search for documents to download to the given URL. \
`--verbose` creates a log file to keep track of what was done. \
`--wait=x` changes the default waiting time before each download (page or document). \
Example: `--wait=3` will wait between 1 and 3s before each download. Default is 5. \
`--no-random-wait` disables the random pick of waiting times; the `--wait=` value (or the default) is used. \
`--download-files` downloads each document whose URL is listed in the given file. \
Example: `--download-files url.lst` \
`--download-file` directly saves the document pointed to by the given URL in the current folder. \
Example: `--download-file http://…`

## Usage
`doc_crawler.py [--accept=jpe?g] [--download] [--single-page] [--verbose] http://…` \
`doc_crawler.py [--wait=3] [--no-random-wait] --download-files url.lst` \
`doc_crawler.py [--wait=0] --download-file http://…`

The `--wait=` and `--no-random-wait` arguments can be used with every form. \
`doc_crawler.py` works great with Tor: `torsocks doc_crawler.py http://…` \
From a `pip`-installed `doc_crawler` package, the command can be invoked as follows:
`python3 -m doc_crawler […] http://…`
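
If the package is not installed yet, it can presumably be fetched straight from PyPI under the same name as this page: `pip3 install doc_crawler`.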

## Tests
Around 20 doctests are included in `doc_crawler.py`. You can run them with the following
command in the cloned repository root: \
`python3 -m doctest doc_crawler.py`
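
To see each test reported as it runs (rather than a silent pass), `doctest`'s standard verbose flag should also work: \
`python3 -m doctest -v doc_crawler.py`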

Tests can also be launched one by one using the `--test=XXX` argument:\
`python3 -m doc_crawler --test=download_file`

The tests pass if nothing is output.

## Requirements
* requests
* yaml

Under Debian, they can be installed with the following command: `apt install python3-requests python3-yaml`
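
Outside Debian, a `pip` install should do the same job (the `yaml` module is published on PyPI as `PyYAML`): `pip3 install requests PyYAML`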

## Licence
GNU General Public License v3.0. See LICENCE file for more information.
 