htmldate

Find the creation date of web pages using a combination of tree traversal, common structural patterns, text-based heuristics and robust date extraction.

Project description

This library finds the creation date of web pages using a combination of tree traversal, common structural patterns, text-based heuristics and robust date extraction. It can handle all the steps needed from web page download to HTML parsing, including scraping and textual analysis. It takes URLs, HTML files or HTML trees as input and outputs a date.

Features

Seamless extraction of the creation or modification date of web pages: given an HTML document, htmldate provides the following ways to date it, based on HTML parsing, scraping functions, and robust date parsing:

Starting from the header of the page, it uses common patterns to identify date fields (e.g. link and meta elements), including Open Graph protocol attributes and a large number of CMS idiosyncrasies
If this is not successful, it scans the whole document looking for structural markers: abbr/time elements and a series of attributes (e.g. postmetadata)
If no date cue could be found, it finally runs a series of heuristics on the content (text and markup):

in “safe” mode, the HTML page is cleaned and precise expressions are searched for

in the more opportunistic default setting, date expressions are collected and the best one is chosen based on a disambiguation algorithm

The module then returns a date if a valid cue could be found in the document. The output string defaults to ISO 8601 YMD format.
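
The three-stage fallback described above can be sketched in a few lines. This is only an illustration built on the standard library, not htmldate's actual implementation: the names sketch_find_date and META_DATE_KEYS, as well as the short list of attribute names, are assumptions, and only ISO-style dates (YYYY-MM-DD) are recognized.

```python
import re
from html.parser import HTMLParser

# Hypothetical attribute names chosen for illustration only;
# a real extractor would need a much longer list.
META_DATE_KEYS = {"article:published_time", "date", "dc.date", "og:updated_time"}
# ISO-style date that is not embedded in a longer run of digits
ISO_DATE = re.compile(r"(?<!\d)\d{4}-\d{2}-\d{2}(?!\d)")


class CueCollector(HTMLParser):
    """Collect date cues from meta elements, time elements and plain text."""

    def __init__(self):
        super().__init__()
        self.meta_dates = []
        self.time_dates = []
        self.text_chunks = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "meta":
            key = (attrs.get("property") or attrs.get("name") or "").lower()
            if key in META_DATE_KEYS and attrs.get("content"):
                self.meta_dates.append(attrs["content"])
        elif tag == "time" and attrs.get("datetime"):
            self.time_dates.append(attrs["datetime"])

    def handle_data(self, data):
        self.text_chunks.append(data)


def sketch_find_date(html_string):
    """Return the first ISO date found, trying the most reliable cues first."""
    collector = CueCollector()
    collector.feed(html_string)
    stages = (
        collector.meta_dates,              # 1. metadata in the header
        collector.time_dates,              # 2. structural markers
        [" ".join(collector.text_chunks)]  # 3. last resort: the text itself
    )
    for candidates in stages:
        for candidate in candidates:
            match = ISO_DATE.search(candidate)
            if match:
                return match.group(0)
    return None
```

The point of the ordering is that explicit metadata is usually more trustworthy than a date that merely appears somewhere in the text, so the text is only searched when nothing else is available.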

Should be compatible with all common versions of Python 3 (see tests and coverage)
Safety belt included: the output is thoroughly verified with respect to its plausibility and adequacy
Designed to be computationally efficient; used in production on millions of documents
Handles batch processing of a list of URLs
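
As an illustration of what a plausibility check can look like, here is a minimal sketch. The function name is_plausible and the lower bound of 1995 are assumptions made for this example; they are not taken from htmldate itself.

```python
from datetime import date, datetime


def is_plausible(date_string, earliest=date(1995, 1, 1), latest=None):
    """Accept a YYYY-MM-DD string only if it is a real calendar date
    within the expected time frame (hypothetical bounds)."""
    latest = latest or date.today()
    try:
        candidate = datetime.strptime(date_string, "%Y-%m-%d").date()
    except ValueError:
        # not a valid calendar date, e.g. month 13
        return False
    return earliest <= candidate <= latest
```

A check of this kind rejects dates in the future as well as dates that predate the web documents being processed.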

The library currently focuses on texts in written English or German.

Installation

Install from package repository: pip install htmldate

Direct installation of the latest version over pip is possible (see build status):

pip install git+https://github.com/adbar/htmldate.git

On the command-line

A basic command-line interface is included:

$ htmldate -u http://blog.python.org/2016/12/python-360-is-now-available.html
'2016-12-23'
$ wget -qO- "http://blog.python.org/2016/12/python-360-is-now-available.html" | htmldate
'2016-12-23'

For usage instructions see htmldate -h:

$ htmldate --help
htmldate [-h] [-v] [-s] [-i INPUTFILE] [-u URL]
optional arguments:
    -h, --help            show this help message and exit
    -v, --verbose         increase output verbosity
    -s, --safe            safe mode: disable extensive search
    -i INPUTFILE, --inputfile INPUTFILE
                          name of input file for batch processing (similar to
                          wget -i)
    -u URL, --URL URL     custom URL download

The batch mode -i takes one URL per line as input and returns one result per line in tab-separated format:

$ htmldate -sv -i list-of-urls.txt
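
A comparable loop can be written in Python. The sketch below is an assumption about how such a batch could be organized, not the code behind the -i option: batch_find_dates is a hypothetical helper, the column order (URL first, then date) is assumed, and the date finder is passed in as an argument so that the function can be tried without network access.

```python
def batch_find_dates(lines, finder):
    """Return one 'URL<TAB>result' row per non-empty input line.

    `finder` is any callable taking a URL and returning a date string
    or None; a missing date shows up as the string 'None'.
    """
    rows = []
    for line in lines:
        url = line.strip()
        if not url:
            # skip blank lines in the input file
            continue
        rows.append(f"{url}\t{finder(url)}")
    return rows
```

With the library installed, htmldate.find_date can be supplied as the finder, for example batch_find_dates(open('list-of-urls.txt'), htmldate.find_date).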

With Python

All the functions of the module are currently bundled in htmldate.

In case the web page features easily readable metadata in the header, the extraction is straightforward. A more advanced analysis of the document structure is sometimes needed:

>>> import htmldate
>>> htmldate.find_date('http://blog.python.org/2016/12/python-360-is-now-available.html')
'# DEBUG analyzing: <h2 class="date-header"><span>Friday, December 23, 2016</span></h2>'
'# DEBUG result: 2016-12-23'
'2016-12-23'

In the worst case, the module resorts to a guess based on a complete screening of the document (extensive_search parameter), which can be deactivated:

>>> htmldate.find_date('https://creativecommons.org/about/')
'2017-08-11' # has been updated since
>>> htmldate.find_date('https://creativecommons.org/about/', extensive_search=False)
>>>

Input format

The module expects strings as shown above. It is also possible to pass already parsed HTML (i.e. an lxml tree object):

>>> from lxml import html
>>> mytree = html.fromstring('<html><body><span class="entry-date">July 12th, 2016</span></body></html>')
>>> htmldate.find_date(mytree)
'2016-07-12'

An external module can be used for the download, as described in versions prior to 0.3. This example uses the legacy mode with requests as the external module:

>>> import htmldate, requests
>>> r = requests.get('https://creativecommons.org/about/')
>>> htmldate.find_date(r.text)
'2017-11-28' # may have changed since

Date format

The output format of the dates can be set with any format string known to Python’s datetime module, the default being %Y-%m-%d:

>>> htmldate.find_date('https://www.gnu.org/licenses/gpl-3.0.en.html', outputformat='%d %B %Y')
'18 November 2016' # may have changed since
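
The format strings are the usual strftime directives. The following sketch shows their effect with the standard library alone; reformat_date is a hypothetical helper written for this example and is not part of htmldate.

```python
from datetime import datetime


def reformat_date(iso_date, outputformat="%Y-%m-%d"):
    """Convert a YYYY-MM-DD string to another strftime format."""
    parsed = datetime.strptime(iso_date, "%Y-%m-%d")
    return parsed.strftime(outputformat)
```

For instance, reformat_date('2016-11-18', '%d/%m/%Y') gives '18/11/2016'. Directives such as %B depend on the locale, so month names are only written in English under an English or default locale.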

Language-specific analysis

The expected date format can be tweaked to suit particular needs, especially language-specific date expressions beyond the current scope (English and German): see the init part of core.py as well as the dateparser docs for more information (example setting: dateparser.DateDataParser(settings={'PREFER_DAY_OF_MONTH': 'first', 'PREFER_DATES_FROM': 'past', 'DATE_ORDER': 'DMY'})).

Known caveats

The granularity may not always match the desired output format. If only the year could be found and the chosen date format requires a month and a day, the result is ‘padded’ to the start of the year, i.e. the 1st of January.
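
The padding can be pictured as follows. This is a sketch under the assumption that missing components default to 1; pad_partial_date is a hypothetical name and does not exist in htmldate.

```python
from datetime import date


def pad_partial_date(year, month=None, day=None, outputformat="%Y-%m-%d"):
    """Fill in missing month and day with 1 before formatting."""
    return date(year, month or 1, day or 1).strftime(outputformat)
```

A page for which only the year 2016 is known would thus be reported as 2016-01-01, which looks more precise than the underlying evidence actually is.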

Besides, there are pages for which no date can be found, ever:

>>> r = requests.get('https://example.com')
>>> htmldate.find_date(r.text)
>>>

Tests

A series of webpages triggering different structural and content patterns is included for testing purposes:

$ python tests/unit_tests.py

For more comprehensive tests tox is also an option (see tox.ini).

Additional information

Context

This module is part of a set of methods to derive metadata from web documents in order to build text corpora for computational linguistics and NLP analysis. The original problem is that there are web pages for which neither the URL nor the server response provides a reliable way to date the document, i.e. to find when it was first published and/or last modified. For more information:

Barbaresi, Adrien. “Efficient construction of metadata-enhanced web corpora”, Proceedings of the 10th Web as Corpus Workshop (WAC-X), 2016.

Kudos to…

lxml
ciso8601
dateparser (although it is still a bit slow)
A few patterns are derived from python-goose, metascraper, newspaper and articleDateExtractor. This module extends their coverage and robustness significantly.

Going further

If the date is nowhere to be found, it might be worth considering carbon dating the web page; however, this is computationally expensive.

Pull requests are welcome.

Contact

See my contact page for details.

Project details

Release history

Version	Release date
1.8.1	Apr 11, 2024
1.8.0	Mar 19, 2024
1.7.0	Jan 17, 2024
1.6.1	Jan 2, 2024
1.6.0	Nov 21, 2023
1.5.2	Oct 9, 2023
1.5.1	Sep 5, 2023
1.5.0	Aug 28, 2023
1.4.3	May 3, 2023
1.4.2	Mar 20, 2023
1.4.1	Jan 9, 2023
1.4.0	Nov 28, 2022
1.3.2	Oct 14, 2022
1.3.1	Aug 26, 2022
1.3.0	Jul 20, 2022
1.2.3	Jun 16, 2022
1.2.2	Jun 13, 2022
1.2.1	Mar 23, 2022
1.2.0	Mar 16, 2022
1.1.1	Mar 3, 2022
1.1.0	Feb 18, 2022
1.0.1	Feb 14, 2022
1.0.0	Nov 9, 2021
0.9.1	Sep 24, 2021
0.9.0	Jun 9, 2021
0.8.1	Mar 9, 2021
0.8.0	Feb 11, 2021
0.7.3	Jan 4, 2021
0.7.2	Oct 20, 2020
0.7.1	Sep 14, 2020
0.7.0	Jul 29, 2020
0.6.3	May 26, 2020
0.6.2	Mar 19, 2020
0.6.1	Jan 16, 2020
0.6.0	Jan 3, 2020
0.5.6	Sep 24, 2019
0.5.5	Sep 16, 2019
0.5.3	Aug 9, 2019
0.5.2	Jul 17, 2019
0.5.1 (this version)	Jun 5, 2019
0.5.0	May 6, 2019
0.4.1	Feb 15, 2019
0.4.0	Feb 12, 2019
0.3.4	Feb 4, 2019
0.3.3	Jun 26, 2018
0.3.2	Jun 22, 2018
0.3.1	Dec 13, 2017
0.3.0	Nov 6, 2017
0.2.2	Oct 9, 2017
0.2.1	Sep 11, 2017
0.2.0	Sep 7, 2017
0.1.2	Sep 4, 2017
0.1.1	Sep 1, 2017
0.1.0	Aug 25, 2017

Download files

Source Distributions

No source distribution files are available for this release.

Built Distribution

htmldate-0.5.1-py2.py3-none-any.whl (22.0 kB)

Uploaded Jun 5, 2019 Python 2 Python 3

Hashes for htmldate-0.5.1-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`544663a25e70734b4613f6a9800a3d8670adcd0b126779071c6b3f4fa4d88ca9`
MD5	`79b7a8f256bd088c866f7973d3c03830`
BLAKE2b-256	`881eb5795c368dc661d7d81d8eca9ba7f360e540859a15a96c9b6e559a26f241`