Extract publication dates from web pages

A library to extract a publication date from a web page, along with a measure of the accuracy of that date. It was produced as part of the mediacloud project to extract dates from content accurately.

Quickstart

The date guesser works from both the URL and the HTML of a page, and uses heuristics to decide which of the many possible dates is the best one.

from date_guesser import DateGuesser, Accuracy

guesser = DateGuesser()

# Uses url slugs when available
guess = guesser.guess_date(url='https://www.nytimes.com/2017/10/13/some_news.html',
                           html='<could be anything></could>')

# Returns a namedtuple with three fields
guess.date      # datetime.datetime(2017, 10, 13, 0, 0, tzinfo=<UTC>)
guess.accuracy  # Accuracy.DATE
guess.method    # 'Found /2017/10/13/ in url'
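
To make the URL slug heuristic concrete, here is a minimal sketch of how a /YYYY/MM/DD/ slug could be pulled out of a URL. The function name `date_from_url_slug` and the regular expression are assumptions made for illustration; they are not taken from date_guesser's source, which may match more patterns.

```python
import re
from datetime import datetime, timezone

# Hypothetical pattern: a /YYYY/MM/DD/ run anywhere in the URL.
_SLUG_DATE = re.compile(r"/(\d{4})/(\d{1,2})/(\d{1,2})/")


def date_from_url_slug(url):
    """Return a UTC datetime for the first /YYYY/MM/DD/ slug, or None."""
    match = _SLUG_DATE.search(url)
    if match is None:
        return None
    year, month, day = (int(part) for part in match.groups())
    try:
        return datetime(year, month, day, tzinfo=timezone.utc)
    except ValueError:
        # The digits matched but are not a real calendar date, e.g. /2017/13/45/.
        return None
```

A URL that carries only a year and month, such as the one in the next example, does not match this pattern, so the sketch returns None and other sources would have to be consulted.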

When there are two trustworthy sources of dates, date_guesser prefers the more accurate one:

html = '''
    <html><head>
    <meta property="article:published" itemprop="datePublished" content="2017-10-13T04:56:54-04:00" />
    </head></html>'''
guess = guesser.guess_date(url='https://www.nytimes.com/2017/10/some_news.html',
                           html=html)
guess.date  # datetime.datetime(2017, 10, 13, 4, 56, 54, tzinfo=tzoffset(None, -14400))
guess.accuracy is Accuracy.DATETIME  # True

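But date_guesser is not led astray by more accurate but less trustworthy sources of information. Here the image path contains a full date, yet the guess relies on the year and month in the URL: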

html = '''
    <html><head>
    <meta property="og:image" content="foo.com/2016/7/4/whatever.jpg"/>
    </head></html>'''
guess = guesser.guess_date(url='https://www.nytimes.com/2017/10/some_news.html',
                           html=html)
guess.date  # datetime.datetime(2017, 10, 15, 0, 0, tzinfo=<UTC>)
guess.accuracy is Accuracy.PARTIAL  # True
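
The two behaviours above (prefer the more accurate of the trustworthy sources, ignore untrustworthy ones) can be sketched as a small selection rule. Everything here is hypothetical: `Precision`, `Candidate`, the `trusted` flag, and `pick_best` are illustrative stand-ins, and the actual heuristics inside date_guesser may differ.

```python
from collections import namedtuple
from datetime import datetime, timezone
from enum import IntEnum


class Precision(IntEnum):
    """Stand-in for an accuracy scale; the ordering is an assumption."""
    PARTIAL = 1   # only year and month are known
    DATE = 2      # year, month and day
    DATETIME = 3  # full timestamp


# Hypothetical record for one possible date found in the URL or HTML.
Candidate = namedtuple("Candidate", ["date", "precision", "method", "trusted"])


def pick_best(candidates):
    """Return the most precise trusted candidate, or None if there is none."""
    trusted = [c for c in candidates if c.trusted]
    if not trusted:
        return None
    # Untrusted candidates never win, however precise they are.
    return max(trusted, key=lambda c: c.precision)
```

Under these assumptions, a trusted year-and-month guess from the URL beats an untrusted full date from an image path, while a trusted meta tag with a full timestamp beats the same URL guess.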

Installation

The library is not yet available on PyPI, so for now installation is via GitHub only:

pip install git+https://github.com/mitmedialab/date_guesser

Performance

We benchmarked the accuracy against the wonderful newspaper library, using one hundred URLs gathered from each of four very different topics in the mediacloud system. The sample includes blogs and news articles, as well as many URLs that have no date (in which case a guess is marked correct only if it returns None).

Vaccines                             date_guesser   newspaper
1 day                                57             48
7 days                               61             51
15 days                              66             53

Aadhar Card in India                 date_guesser   newspaper
1 day                                73             44
7 days                               74             44
15 days                              74             44

Donald Trump in 2017                 date_guesser   newspaper
1 day                                79             60
7 days                               83             61
15 days                              85             61

Recipes for desserts and chocolate   date_guesser   newspaper
1 day                                83             65
7 days                               85             69
15 days                              87             69
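
The scoring rule described above can be sketched as follows. This assumes the row labels (1, 7, and 15 days) are tolerances, so that a guess counts as correct when it falls within that many days of the true date; the function names `is_correct` and `count_correct` are hypothetical and this is not the benchmark code itself.

```python
from datetime import datetime, timedelta, timezone


def is_correct(guessed, actual, tolerance_days):
    """Score one guess against the known publication date.

    Both dates must be timezone-aware, or both naive, to be comparable.
    """
    if guessed is None or actual is None:
        # A page with no date is only scored correct when the guess is None.
        return guessed is None and actual is None
    return abs(guessed - actual) <= timedelta(days=tolerance_days)


def count_correct(pairs, tolerance_days):
    """Count correct guesses in an iterable of (guessed, actual) pairs."""
    return sum(is_correct(g, a, tolerance_days) for g, a in pairs)
```

With one hundred URLs per topic, such a count would read directly as a percentage.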

Download files

Source Distribution

date_guesser-1.0.0.tar.gz (10.6 kB)

Built Distribution

date_guesser-1.0.0-py3-none-any.whl (11.0 kB, Python 3)
