Scrapes the main text of web pages while preserving some structure. Seamlessly downloads, parses and converts web documents.

Code:

https://github.com/adbar/trafilatura

Documentation:

see README file

Issue tracker:

https://github.com/adbar/trafilatura/issues

Trafilatura downloads web pages, scrapes main text and comments while preserving some structure, and converts to TXT, XML & TEI-XML. All the operations needed are handled seamlessly.

In a nutshell, with Python:

>>> import trafilatura
>>> downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
>>> trafilatura.extract(downloaded)
# outputs main content and comments as plain text ...

On the command-line:

$ trafilatura -u "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
# outputs main content and comments as plain text ...

Description

This library performs a robust extraction which focuses on the main content, which is usually the part displayed centrally, without the left or right bars, the header or the footer, but including potential titles and comments. Trafilatura can seamlessly download, parse and convert web documents. It scrapes the main body text while preserving part of the text formatting and page structure, a task also known as web scraping, boilerplate removal, DOM-based content extraction, main content identification, or web page cleaning.

Distinguishing between the whole page and its essential parts helps to alleviate many quality problems related to web texts, as it reduces the noise caused by recurring elements (headers and footers, ads, links/blogroll, etc.). The extraction has to be precise enough not to miss texts or discard valid documents. It also has to be reasonably fast, as it is expected to run in production on millions of documents.

Features

  • Seamless download and extraction: URLs, HTML files or parsed HTML trees as input

  • Focus on main text and/or comments

  • Formatting and structural elements preserved: paragraphs, titles, lists, quotes, code, line breaks

  • Extraction of metadata (currently title and date)

  • Output in plain text (minimal formatting) or XML format (for metadata and structure)

  • Computationally efficient (relies on lxml)

  • Robust extraction and generic jusText algorithm used as fallback

Roadmap

  • [-] Duplicate detection at sentence, paragraph and document level using a least recently used (LRU) cache

  • [-] XML output compatible with the recommendations of the Text Encoding Initiative

  • [-] Metadata integration

  • [-] Language detection on the extracted content

  • [ ] Preservation of in-line text formatting (bold, italic, etc.)
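
The duplicate detection item on the roadmap can be illustrated with a minimal sketch of a least recently used (LRU) cache built on the standard library. This is not the package's implementation: the names LRUSeen and drop_repeated_paragraphs and the capacity values are assumptions made for the example.

```python
from collections import OrderedDict


class LRUSeen:
    """Remember the most recently seen text segments, up to a fixed capacity.

    Hypothetical helper for illustration, not part of trafilatura.
    """

    def __init__(self, capacity=1000):
        self.capacity = capacity
        self._store = OrderedDict()

    def is_duplicate(self, segment):
        """Return True if the segment was seen recently; record it either way."""
        key = segment.strip()
        if key in self._store:
            self._store.move_to_end(key)  # mark as most recently used
            return True
        self._store[key] = True
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)  # evict the least recently used entry
        return False


def drop_repeated_paragraphs(paragraphs, cache):
    """Keep only the paragraphs that the cache has not seen recently."""
    return [p for p in paragraphs if not cache.is_duplicate(p)]
```

The same cache object could be shared across documents so that recurring elements such as footers are filtered after their first occurrence, while the fixed capacity keeps memory use bounded.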

Installation

trafilatura is a package compatible with Python 3.5 upwards which is currently tested on Linux and macOS and to some extent on Windows. It is available on the package repository PyPI:

$ pip install trafilatura # pip3 install on systems where both Python 2 and 3 are installed
$ pip install -U trafilatura # to make sure you have the latest version
$ pip install git+https://github.com/adbar/trafilatura.git # latest available code (see build status above)

Additional functions are available with the following extensions:

$ pip install trafilatura[metadata] # metadata extraction
$ pip install trafilatura[all] # all experimental functionality

Experimental functions: language detection, faster processing of downloads, and more efficient deduplication. The cchardet package currently does not work on some macOS versions, and lru_dict might not work out of the box on Windows.

(For information on dependency management of Python packages, see this discussion thread.)

Usage with Python

>>> import trafilatura
>>> downloaded = trafilatura.fetch_url('https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/')
>>> downloaded is None # assuming the download was successful
False
>>> result = trafilatura.extract(downloaded) # trafilatura.process_record is deprecated but works
>>> print(result)
# newlines preserved, TXT output ...
>>> result = trafilatura.extract(downloaded, xml_output=True)
>>> print(result)
# some formatting preserved in basic XML structure ...

The only required argument is the input document (here a downloaded HTML file), the rest is optional.
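
The check on the download result shown above can be folded into a small helper. The following is a sketch only: extract_from_url is a hypothetical name, and the fetch and extract parameters are an assumption added so that the helper can be tried with stand-in functions; by default it falls back to trafilatura.fetch_url and trafilatura.extract.

```python
def extract_from_url(url, fetch=None, extract=None):
    """Download a page and return its main text, or None if the download failed.

    Hypothetical wrapper for illustration; fetch and extract default to
    trafilatura.fetch_url and trafilatura.extract.
    """
    if fetch is None or extract is None:
        import trafilatura  # imported lazily so stand-ins can be used instead

        fetch = fetch or trafilatura.fetch_url
        extract = extract or trafilatura.extract
    downloaded = fetch(url)
    if downloaded is None:  # download failed, nothing to extract
        return None
    return extract(downloaded)
```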

The inclusion of tables and comments can be deactivated at a function call. The use of a fallback algorithm (currently jusText) can also be bypassed in fast mode:

>>> result = trafilatura.extract(downloaded, include_comments=False) # no comments in output
>>> result = trafilatura.extract(downloaded, include_tables=False) # skip tables examination
>>> result = trafilatura.extract(downloaded, no_fallback=True) # skip justext algorithm used as fallback

These values combined probably provide the fastest execution times:

>>> result = trafilatura.extract(downloaded, include_comments=False, include_tables=False, no_fallback=True)
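
When the same combination is needed in several places, the options can be kept in one dictionary and passed with keyword unpacking. This is a minimal sketch: FAST_OPTIONS and extract_fast are hypothetical names, and the extract parameter is an assumption that allows a stand-in function to be used instead of trafilatura.extract.

```python
# Option names are taken from the calls shown above.
FAST_OPTIONS = {
    "include_comments": False,
    "include_tables": False,
    "no_fallback": True,
}


def extract_fast(document, extract=None, **overrides):
    """Call the extractor with the fast options, optionally overriding some."""
    options = {**FAST_OPTIONS, **overrides}
    if extract is None:
        import trafilatura  # imported lazily so a stand-in can be used instead

        extract = trafilatura.extract
    return extract(document, **options)
```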

The input can consist of a previously parsed tree (i.e. an lxml.html object), which is then handled seamlessly:

>>> from lxml import html
>>> mytree = html.fromstring('<html><body><article><p>Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.</p></article></body></html>')
>>> trafilatura.extract(mytree)
'Here is the main text. It has to be long enough in order to bypass the safety checks. Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua.\n'

Experimental feature: the target language can also be set using 2-letter codes (ISO 639-1). There will be no output if the detected language of the result does not match. No such filtering takes place if the identification component has not been installed (see above for installation instructions).

>>> result = trafilatura.extract(downloaded, target_language='de')
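
The filtering behavior described above can be summarized in a few lines. This is an illustration of the described logic under stated assumptions, not the package's internal code: filter_by_language is a hypothetical name, and detect stands for any callable that returns an ISO 639-1 code for a text.

```python
def filter_by_language(text, target_language=None, detect=None):
    """Return the text unless a detector is available and reports another language.

    Hypothetical helper: detect is an assumed callable mapping a text to a
    2-letter language code; None means no identification component is installed.
    """
    if text is None or target_language is None or detect is None:
        return text  # no filtering requested, or no way to identify the language
    return text if detect(text) == target_language else None
```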

All currently available options, along with their default values:

>>> trafilatura.extract(downloaded, url=None, record_id='0001', no_fallback=False, include_comments=True, xml_output=False, tei_output=False, tei_validation=False, target_language=None, include_tables=True)

For further configuration see the variables in settings.py and re-compile the package locally.

On the command-line

A command-line interface is included. For general instructions, see Command Prompt (tutorial for Windows systems), How to use the Terminal command line in macOS, or An introduction to the Linux Terminal.

URLs can be used directly (-u/--URL):

$ trafilatura -u https://de.creativecommons.org/index.php/was-ist-cc/
$ # outputs main content in plain text format ...
$ trafilatura --xml --URL "https://github.blog/2019-03-29-leader-spotlight-erin-spiceland/"
$ # outputs main text with basic XML structure ...

You can also pipe an HTML document (or a response body) to trafilatura:

$ cat myfile.html | trafilatura # use the contents of an already existing file
$ wget -qO- "https://de.creativecommons.org/index.php/was-ist-cc/" | trafilatura # use a custom download

The -i/--inputfile option allows for bulk download and processing of a list of URLs from a file listing one link per line. Beware that there is a tacit scraping etiquette and that a server may block you after you download a certain number of pages from the same website/domain in a short period of time. In addition, some websites may block the user agent of the requests. Thus, trafilatura waits a few seconds by default between requests.
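
The same kind of polite batch processing can be sketched in Python. This is a simplified illustration, not the code behind the -i option: read_url_list and process_urls are hypothetical names, the 5-second default delay is an assumption, and fetch and extract are passed in as parameters (for instance trafilatura.fetch_url and trafilatura.extract).

```python
import time


def read_url_list(path):
    """Read one URL per line from a text file, skipping empty lines."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


def process_urls(urls, fetch, extract, delay=5.0, sleep=time.sleep):
    """Download and extract each URL in turn, pausing between requests.

    Returns a dict mapping each URL to its extracted text, or to None
    when the download failed.
    """
    results = {}
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay)  # wait before every request except the first one
        downloaded = fetch(url)
        results[url] = extract(downloaded) if downloaded is not None else None
    return results
```

A call such as process_urls(read_url_list("urls.txt"), trafilatura.fetch_url, trafilatura.extract) would then process a list file, where urls.txt is a placeholder file name.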

For all usage instructions see trafilatura -h:

usage: trafilatura [-h] [-f] [-i INPUTFILE] [--nocomments] [--notables] [--xml] [--xmltei] [-u URL] [-v]

optional arguments:
  -h, --help            show this help message and exit
  -f, --fast            fast (without fallback detection)
  -i INPUTFILE, --inputfile INPUTFILE
                        name of input file for batch processing
  --nocomments          don’t output any comments
  --notables            don’t output any table elements
  --xml                 XML output
  --xmltei              XML TEI output
  --validate            validate TEI output
  -u URL, --URL URL     custom URL download
  -v, --verbose         increase output verbosity
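
For readers who want to experiment with a similar interface, the options listed above can be reproduced with the standard argparse module. This is a reconstruction from the help text, not the package's actual source: build_parser is a hypothetical name, and treating every switch (including -v) as a simple on/off flag is an assumption.

```python
import argparse


def build_parser():
    """Build a parser mirroring the options in the help text above (a sketch)."""
    parser = argparse.ArgumentParser(prog="trafilatura")
    parser.add_argument("-f", "--fast", action="store_true",
                        help="fast (without fallback detection)")
    parser.add_argument("-i", "--inputfile",
                        help="name of input file for batch processing")
    parser.add_argument("--nocomments", action="store_true",
                        help="don't output any comments")
    parser.add_argument("--notables", action="store_true",
                        help="don't output any table elements")
    parser.add_argument("--xml", action="store_true", help="XML output")
    parser.add_argument("--xmltei", action="store_true", help="XML TEI output")
    parser.add_argument("--validate", action="store_true",
                        help="validate TEI output")
    parser.add_argument("-u", "--URL", help="custom URL download")
    parser.add_argument("-v", "--verbose", action="store_true",
                        help="increase output verbosity")
    return parser
```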

Additional information

Trafilatura: Italian word for wire drawing.

Scientific context

This module is part of a set of methods to derive information from web documents in order to build text databases for research (chiefly linguistic analysis and natural language processing). A significant challenge resides in the ability to extract and pre-process web texts to meet scientific expectations: web corpus construction involves numerous design decisions, and this software package can help facilitate collection and enhance corpus quality.

DOI: 10.5281/zenodo.3460969

Further documentation

To be released soon.

Tutorial video in German by Simon Meier-Vieracker: Content von Webseiten laden mit Trafilatura.

Kudos to…

Alternatives

Most comparable Python packages are not actively maintained. The following alternatives exist:

  • dragnet features combined and machine-learning approaches, but requires many dependencies as well as extensive tuning

  • goose can extract information for embedded content but doesn’t preserve markup and is not maintained

  • html2text converts HTML pages to Markdown and thus keeps the structure, though it doesn’t focus on main text extraction

  • newspaper is mostly geared towards newspaper texts, provides additional functions but no structured text or comment extraction

  • python-readability cleans the page and preserves some markup but is mostly geared towards news texts

Contact

Pull requests are welcome.

See this contact page for additional details.

Download files

Source Distribution

trafilatura-0.2.1.tar.gz (1.6 MB)

Built Distribution

trafilatura-0.2.1-py3-none-any.whl (29.7 kB)