Skip to main content

Seamlessly extract the creation or modification date of web pages by scraping the HTML code or performing content guesses.

Project description

https://img.shields.io/pypi/v/htmldate.svg https://img.shields.io/pypi/l/htmldate.svg https://img.shields.io/pypi/pyversions/htmldate.svg https://img.shields.io/travis/adbar/htmldate.svg https://codecov.io/gh/adbar/htmldate/branch/master/graph/badge.svg

Description

Seamless extraction of the creation or modification date of web pages. htmldate provides following ways to date documents, based on HTML parsing and scraping functions:

  1. Starting from the header of the page, it uses common patterns to identify date fields.

  2. If this is not successful, it scans the whole document looking for structural markers.

  3. If no date cue could be found, it finally runs a series of heuristics on the content (text and markup).

The module takes the HTML document as input (string format) and returns a date if a valid cue could be found in the document. The output string defaults to ISO 8601 YMD format.

It should be compatible with all common versions of Python (see tests and coverage).

Installation

Install from package repository: pip install htmldate

Direct installation of the latest version over pip is possible (see build status):

pip install git+https://github.com/adbar/htmldate.git

On the command-line

A basic command-line interface is included:

$ wget -qO- "http://blog.python.org/2016/12/python-360-is-now-available.html" | htmldate
2016-12-23

For usage instructions see:

$ htmldate --help
htmldate [-h] [-v] [-s]
optional arguments:
    -h, --help     show this help message and exit
    -v, --verbose  increase output verbosity
    -s, --safe     safe mode: markup search only
    -i INPUTFILE, --inputfile INPUTFILE
                   name of input file for batch processing

The batch mode -i is similar to wget -i, it takes one URL per line as input and returns one result per line in tab-separated format.

Within Python

All the functions of the module are currently bundled in htmldate.

In case the web page features easily readable metadata in the header, the extraction is straightforward. A more advanced analysis of the document structure is sometimes needed:

>>> htmldate.find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
'# DEBUG analyzing: <h2 class="date-header"><span>Friday, December 23, 2016</span></h2>'
'# DEBUG result: 2016-12-23'
'2016-12-23'

In the worst case, the module resorts to a guess based on an extensive search, which can be deactivated:

>>> htmldate.find_date('https://creativecommons.org/about/')
'2017-08-11' # has been updated since
>>> htmldate.find_date(r.text, extensive_search=False)
>>>

Input format

The module expects strings as shown above, it is also possible to use already parsed HTML (i.e. a LXML tree object):

>>> from lxml import html
>>> mytree = html.fromstring('<html><body><span class="entry-date">July 12th, 2016</span></body></html>')
>>> htmldate.find_date(mytree)
'2016-07-12'

An external module can be used for download, as described in versions anterior to 0.3. This example uses the legacy mode with requests as external module.

>>> import requests
>>> import htmldate
>>> r = requests.get('https://www.theguardian.com/politics/2016/feb/17/merkel-eu-uk-germany-national-interest-cameron-justified')
>>> htmldate.find_date(r.text)

Date format

The output format of the dates found can be set in a format known to Python’s datetime module, the default being %Y-%m-%d:

>>> htmldate.find_date('https://www.gnu.org/licenses/gpl-3.0.en.html', outputformat='%d %B %Y')
'18 November 2016'

Language-specific

The expected date format can be tweaked to suit particular needs, especially language-specific date expressions:

>>> htmldate.find_date(r.text, dparser=dateparser_object) # like dateparser.DateDataParser(settings={'PREFER_DAY_OF_MONTH': 'first', 'PREFER_DATES_FROM': 'past', 'DATE_ORDER': 'DMY'}

See the init part of core.py as well as the dateparser docs for more information.

Known caveats

The granularity may not always match the desired output format. If only information about the year could be found and the chosen date format requires to output a month and a day, the result is ‘padded’ to be located at the middle of the year, in that case the 1st of July.

Besides, there are pages for which no date can be found, ever:

>>> r = requests.get('https://example.com')
>>> htmldate.find_date(r.text)
>>>

Tests

A series of webpages triggering different structural and content patterns is included for testing purposes:

$ python tests/unit_tests.py

For more comprehensive tests tox is also an option (see tox.ini).

Additional information

Context

There are web pages for which neither the URL nor the server response provide a reliable way to date the document, i.e. find when it was first published and/or last modified.

This module is part of methods to derive metadata from web documents in order to build text corpora for (computational) linguistic analysis. For more information:

Kudos to…

Further analyses

If the date is nowhere to be found, it might be worth considering carbon dating the web page, however this is computationally expensive.

Pull requests are welcome.

Contact

See my contact page for details.

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

htmldate-0.3.0.tar.gz (735.5 kB view hashes)

Uploaded Source

Built Distribution

htmldate-0.3.0-py2.py3-none-any.whl (17.9 kB view hashes)

Uploaded Python 2 Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page