date-guesser

Extract publication dates from web pages

These details have been verified by PyPI

Maintainers

Colin.Carroll hroberts mediacloud-travis pypt rahulbot

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3
- Python :: 3.6

Project description

A library to extract the publication date from a web page, along with a measure of how accurate that date is. It was produced as part of the mediacloud project, in order to extract dates from content accurately.

Quickstart

The date guesser uses both the URL and the HTML of a page, and applies heuristics to decide which of the many candidate dates is most likely the publication date.

from date_guesser import DateGuesser, Accuracy

guesser = DateGuesser()

# Uses url slugs when available
guess = guesser.guess_date(url='https://www.nytimes.com/2017/10/13/some_news.html',
                           html='<could be anything></could>')

# Returns a namedtuple with three fields
guess.date      # datetime.datetime(2017, 10, 13, 0, 0, tzinfo=<UTC>)
guess.accuracy  # Accuracy.DATE
guess.method    # 'Found /2017/10/13/ in url'
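
To illustrate the kind of URL-slug match reported in `guess.method` above, here is a minimal sketch using only the standard library. It is not the library's implementation: the pattern and the helper name `date_from_url` are assumptions made for illustration.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern: a /YYYY/MM/DD/ slug anywhere in the URL.
_SLUG = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})/")


def date_from_url(url):
    """Return a UTC datetime for the first /YYYY/MM/DD/ slug, or None."""
    match = _SLUG.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return datetime(year, month, day, tzinfo=timezone.utc)
    except ValueError:
        # The digits matched but are not a real calendar date, e.g. /2017/13/45/.
        return None


print(date_from_url("https://www.nytimes.com/2017/10/13/some_news.html"))
```

A URL with only a year and month, such as `/2017/10/some_news.html`, does not match this pattern and yields `None` in the sketch.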

When there are two trustworthy sources of dates, date_guesser prefers the more accurate one:

html = '''
    <html><head>
    <meta property="article:published" itemprop="datePublished" content="2017-10-13T04:56:54-04:00" />
    </head></html>'''
guess = guesser.guess_date(url='https://www.nytimes.com/2017/10/some_news.html',
                           html=html)
guess.date  # datetime.datetime(2017, 10, 13, 4, 56, 54, tzinfo=tzoffset(None, -14400))
guess.accuracy is Accuracy.DATETIME  # True
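
The metadata lookup in this example can be approximated with the standard library alone. The sketch below is an assumption about one way to read a `datePublished` meta tag, not the code date_guesser runs; the names `PublishedMetaParser` and `date_from_meta` are hypothetical, and `datetime.fromisoformat` requires Python 3.7 or later.

```python
from datetime import datetime
from html.parser import HTMLParser


class PublishedMetaParser(HTMLParser):
    """Collect the content of <meta itemprop="datePublished"> tags."""

    def __init__(self):
        super().__init__()
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        # Self-closing <meta ... /> tags are also routed here by HTMLParser.
        if tag != "meta":
            return
        attributes = dict(attrs)
        if attributes.get("itemprop") == "datePublished" and attributes.get("content"):
            self.candidates.append(attributes["content"])


def date_from_meta(html):
    """Return the first parseable datePublished value, or None."""
    parser = PublishedMetaParser()
    parser.feed(html)
    for raw in parser.candidates:
        try:
            return datetime.fromisoformat(raw)
        except ValueError:
            continue  # Not an ISO 8601 timestamp; try the next candidate.
    return None


html = '''
    <html><head>
    <meta property="article:published" itemprop="datePublished" content="2017-10-13T04:56:54-04:00" />
    </head></html>'''
print(date_from_meta(html))
```

Because the timestamp carries a time and a UTC offset, a result like this is more precise than a date read from a URL slug.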

But date_guesser is not led astray by more accurate but less trustworthy sources of information:

html = '''
    <html><head>
    <meta property="og:image" content="foo.com/2016/7/4/whatever.jpg"/>
    </head></html>'''
guess = guesser.guess_date(url='https://www.nytimes.com/2017/10/some_news.html',
                           html=html)
guess.date  # datetime.datetime(2017, 10, 15, 0, 0, tzinfo=<UTC>)
guess.accuracy is Accuracy.PARTIAL  # True
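
One way to picture this preference is to filter out untrusted candidates first and only then compare accuracy. The sketch below is a hypothetical model of that idea, not date_guesser's internals: the local `Accuracy` enum is a stand-in whose ordering is assumed, and `Candidate`, `pick_best`, the `trusted` flag, and the method strings are invented for illustration.

```python
from collections import namedtuple
from datetime import datetime, timezone
from enum import IntEnum


class Accuracy(IntEnum):
    # Local stand-in for illustration; the ordering is an assumption.
    PARTIAL = 1
    DATE = 2
    DATETIME = 3


# Hypothetical candidate record; `trusted` marks sources considered reliable.
Candidate = namedtuple("Candidate", ["date", "accuracy", "trusted", "method"])


def pick_best(candidates):
    """Return the most accurate trusted candidate, or None if none is trusted."""
    trusted = [c for c in candidates if c.trusted]
    if not trusted:
        return None
    return max(trusted, key=lambda c: c.accuracy)


url_guess = Candidate(
    datetime(2017, 10, 15, tzinfo=timezone.utc),  # mid-month stand-in for /2017/10/
    Accuracy.PARTIAL, True, "Found /2017/10/ in url")
image_guess = Candidate(
    datetime(2016, 7, 4, tzinfo=timezone.utc),
    Accuracy.DATE, False, "Found /2016/7/4/ in image url")

best = pick_best([url_guess, image_guess])
print(best.accuracy.name, best.date)
```

Here the image path is more precise but untrusted, so the partial date from the page URL wins, matching the result shown above.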

Installation

This release is published on PyPI, so it can be installed with pip:

pip install date-guesser

To install the latest development version from GitHub instead:

pip install git+https://github.com/mitmedialab/date_guesser

Performance

We benchmarked the accuracy against the wonderful newspaper library, using one hundred URLs gathered from each of four very different topics in the mediacloud system. The sample includes blogs and news articles, as well as many URLs that have no date (in which case a guess is marked correct only if it returns None). Each row below counts the guesses, out of 100, that fall within the stated number of days of the true date.

Vaccines

Correct within	date_guesser	newspaper
1 day	57	48
7 days	61	51
15 days	66	53

Aadhar Card in India

Correct within	date_guesser	newspaper
1 day	73	44
7 days	74	44
15 days	74	44

Donald Trump in 2017

Correct within	date_guesser	newspaper
1 day	79	60
7 days	83	61
15 days	85	61

Recipes for desserts and chocolate

Correct within	date_guesser	newspaper
1 day	83	65
7 days	85	69
15 days	87	69
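
The scoring rule behind these tables can be sketched as follows. This is an assumption about how such a benchmark might be tallied, not the script that produced the numbers above: `is_correct` and `score` are hypothetical names, and treating "within N days" as an inclusive bound is a guess.

```python
from datetime import datetime, timedelta


def is_correct(guessed, actual, tolerance_days):
    """A guess counts if it is within the tolerance of the true date.

    For pages with no date, only a guess of None counts as correct.
    Both datetimes must be either naive or timezone-aware, not mixed.
    """
    if guessed is None or actual is None:
        return guessed is None and actual is None
    return abs(guessed - actual) <= timedelta(days=tolerance_days)


def score(pairs, tolerances=(1, 7, 15)):
    """Count correct guesses at each tolerance for (guessed, actual) pairs."""
    return {
        days: sum(is_correct(guessed, actual, days) for guessed, actual in pairs)
        for days in tolerances
    }


# Made-up sample data, for illustration only.
pairs = [
    (datetime(2017, 10, 13), datetime(2017, 10, 13)),  # exact match
    (datetime(2017, 10, 15), datetime(2017, 10, 10)),  # 5 days off
    (None, None),                                      # undated page, no guess
    (datetime(2017, 10, 1), None),                     # guess on an undated page
    (datetime(2017, 10, 1), datetime(2017, 10, 13)),   # 12 days off
]
print(score(pairs))  # {1: 2, 7: 3, 15: 4}
```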

Release history

Version	Released
2.1.4	Aug 13, 2019
2.1.3	Aug 2, 2019
2.1.2	Aug 2, 2019
2.1.1	Jan 27, 2018
2.1.0	Jan 27, 2018
2.0.0	Jan 25, 2018
1.1.0	Jan 16, 2018
1.0.0 (this version)	Jan 16, 2018
0.0.1	Jan 16, 2018

Download files

Source Distribution

date_guesser-1.0.0.tar.gz (10.6 kB)

Uploaded Jan 16, 2018 Source

Built Distribution

date_guesser-1.0.0-py3-none-any.whl (11.0 kB)

Uploaded Jan 16, 2018 Python 3

Hashes for date_guesser-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`f6389bf9b218871605a00ccfd70c43727a8564b5f3dea90058c28a48be0cb602`
MD5	`bfde2bcac714eb69ccad069457b67fd3`
BLAKE2b-256	`84abe3b2e1fae0e9cbca0e4809b4678177322a6f24b9780ed4b743cfc65c3efc`

Hashes for date_guesser-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`d3d830e2a7ef0ada8d9ef4f1746a69560117d80abf004ae71c302d965202240e`
MD5	`55b7db4e538fa6a66063c50d394d63ba`
BLAKE2b-256	`d1cd5f2fd6e601b48b52ba2a1715afe2c410d1e0c3479acb98cbb8c8ea2ad352`
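
A downloaded file can be checked against the SHA256 digests listed above with the standard library's `hashlib`. The helper names `sha256_of` and `matches_published_digest` are hypothetical, and the sketch assumes the file has already been saved locally.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published_digest(path, expected_hex):
    """Compare a file's SHA256 digest with a published hex string."""
    return sha256_of(path) == expected_hex.strip().lower()
```

For example, `matches_published_digest("date_guesser-1.0.0-py3-none-any.whl", "d3d830e2a7ef0ada8d9ef4f1746a69560117d80abf004ae71c302d965202240e")` should return `True` for an intact copy of the wheel.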