newspaper

Simplified Python article discovery & extraction.

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

Inspired by requests for its simplicity and powered by lxml for its speed, newspaper is a Python 2 library for extracting and curating articles from the web.

Newspaper wants to change the way people handle article extraction with a new, more precise layer of abstraction. It caches whatever it can for speed, and all text is handled as Unicode.

Please refer to The Documentation for a quickstart tutorial!

A Glance:

>>> import newspaper

>>> cnn_paper = newspaper.build('http://cnn.com')

>>> for article in cnn_paper.articles:
...     print article.url
http://www.cnn.com/2013/11/27/justice/tucson-arizona-captive-girls/
http://www.cnn.com/2013/12/11/us/texas-teen-dwi-wreck/index.html
...

>>> for category in cnn_paper.category_urls():
...     print category
http://lifestyle.cnn.com
http://cnn.com/world
http://tech.cnn.com
...

>>> article = cnn_paper.articles[0]

>>> article.download()

>>> article.html
u'<!DOCTYPE HTML><html itemscope itemtype="http://...'

>>> article.parse()

>>> article.authors
[u'Leigh Ann Caldwell', u'John Honway']

>>> article.text
u"Washington (CNN) -- Not everyone subscribes to a New Year's resolution..."

>>> article.nlp()

>>> article.keywords
[u'New Years', u'resolution', ...]

>>> article.summary
u'The study shows that 93% of people ...'
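
The session above calls download(), parse(), and nlp() in that order, reading authors and text after parse() and keywords and summary after nlp(). A minimal sketch of wrapping that sequence in one helper follows. The names extract_fields and StubArticle are hypothetical and not part of newspaper; the stub only stands in for a real article so the example runs without network access, and the code is written for Python 3 rather than the Python 2 shown in the session.

```python
def extract_fields(article):
    """Run the three steps from the session above and collect the results.

    `article` is any object exposing download(), parse(), nlp() and the
    attributes read below, such as an entry of cnn_paper.articles.
    """
    article.download()   # fetch the raw HTML
    article.parse()      # the session reads authors and text after this call
    article.nlp()        # the session reads keywords and summary after this call
    return {
        "url": article.url,
        "authors": list(article.authors),
        "text": article.text,
        "keywords": list(article.keywords),
        "summary": article.summary,
    }


class StubArticle:
    """Hypothetical stand-in for an article; records the order of calls."""

    def __init__(self, url):
        self.url = url
        self.calls = []
        self.authors, self.text = [], ""
        self.keywords, self.summary = [], ""

    def download(self):
        self.calls.append("download")

    def parse(self):
        self.calls.append("parse")
        self.authors = ["Example Author"]
        self.text = "Example body text."

    def nlp(self):
        self.calls.append("nlp")
        self.keywords = ["example"]
        self.summary = "Example summary."


fields = extract_fields(StubArticle("http://example.com/story"))
print(fields["authors"], fields["keywords"])
```

Because the helper only relies on the method and attribute names shown in the session, it can be exercised with the stub first and then handed a real article object.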

Documentation

Check out The Documentation for full and detailed guides using newspaper.

Features

News URL identification
Text extraction from HTML
Keyword extraction from text
Summary extraction from text
Author extraction from text
Top image extraction from HTML
All image extraction from HTML
Multi-threaded article download framework
Google trending terms extraction
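
The multi-threaded download feature listed above can be pictured with a small thread-pool sketch. This is not newspaper's implementation: download_all and fake_fetch are hypothetical names, the fetcher returns canned HTML instead of making an HTTP request, and the code targets Python 3 using only the standard library.

```python
from concurrent.futures import ThreadPoolExecutor


def download_all(urls, fetch, workers=4):
    """Call fetch(url) for every URL on a pool of worker threads.

    Returns a dict mapping each URL to whatever fetch returned, or to the
    exception it raised, so one failing download does not stop the rest.
    """
    def safe_fetch(url):
        try:
            return url, fetch(url)
        except Exception as exc:  # keep going; report the failure per URL
            return url, exc

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(safe_fetch, urls))


def fake_fetch(url):
    """Hypothetical fetcher used here instead of a real HTTP request."""
    if url.endswith("/broken"):
        raise IOError("could not download " + url)
    return "<html>" + url + "</html>"


results = download_all(
    ["http://example.com/a", "http://example.com/b", "http://example.com/broken"],
    fake_fetch,
)
for url in sorted(results):
    print(url, type(results[url]).__name__)
```

Passing the fetcher in as an argument keeps the pool logic separate from the download logic, so the same helper could wrap any callable that takes a URL.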

Get it now

$ pip install newspaper

IMPORTANT
If you plan to use the natural language features, nlp(), you must first
download some separate nltk corpora with the command below.
You must download everything using Python 2.6 - 2.7!

$ curl https://raw.github.com/codelucas/newspaper/master/download_corpora.py | python2.7

Todo List

Add a “follow_robots.txt” option in the config object.
Bake in the CSSSelect and BeautifulSoup dependencies.
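
As an illustration of what a "follow_robots.txt" option could look like, here is a minimal sketch built on the standard library's robots.txt parser. It is an assumption about one possible design, not newspaper's API: CrawlConfig, follow_robots_txt, user_agent, and is_allowed are hypothetical names, and the code is written for Python 3.

```python
from urllib.robotparser import RobotFileParser


class CrawlConfig:
    """Hypothetical config object; not newspaper's Configuration class."""

    def __init__(self, follow_robots_txt=True, user_agent="example-bot"):
        self.follow_robots_txt = follow_robots_txt
        self.user_agent = user_agent


def is_allowed(url, robots_txt, config):
    """Return True if `url` may be fetched under the given config.

    `robots_txt` is the text of the site's robots.txt, passed in directly
    so the example needs no network access.
    """
    if not config.follow_robots_txt:
        return True  # option switched off: never consult robots.txt
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(config.user_agent, url)


ROBOTS = "User-agent: *\nDisallow: /private/\n"

print(is_allowed("http://example.com/world", ROBOTS, CrawlConfig()))
print(is_allowed("http://example.com/private/x", ROBOTS, CrawlConfig()))
print(is_allowed("http://example.com/private/x", ROBOTS,
                 CrawlConfig(follow_robots_txt=False)))
```

Keeping the switch on the config object means the check can be turned off in one place without touching the download code.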

0.0.4 - Fully integrated the python-goose library into newspaper. Article objects now have many more options. All configuration is now based on Configuration() objects, which can be passed into Source or Article objects. Default configuration setups make this easy. Added a simple multithreaded article download framework.

Project details

Release history

Version	Release date
0.1.0.7	Jun 14, 2017
0.1.0.6	Jun 14, 2017
0.1.0.5	Jun 14, 2017
0.1.0.4	Jun 14, 2017
0.1.0.3	Jun 14, 2017
0.1.0.2	Jun 13, 2017
0.1.0.1	Jun 13, 2017
0.1.0.0	Jun 11, 2017
0.0.9.9	Jun 11, 2017
0.0.9.8	Mar 4, 2015
0.0.9.6	Feb 6, 2015
0.0.9.5	Feb 4, 2015
0.0.9.2	Jan 22, 2015
0.0.9.1	Dec 29, 2014
0.0.9	Dec 17, 2014
0.0.8	Oct 13, 2014
0.0.7	Jun 17, 2014
0.0.6	Jan 18, 2014
0.0.5	Jan 9, 2014
0.0.4 (this version)	Dec 31, 2013
0.0.3	Dec 22, 2013
0.0.2	Dec 21, 2013

Download files

Source Distribution

newspaper-0.0.4.tar.gz (6.7 MB)

Uploaded Dec 31, 2013 Source

Built Distribution

newspaper-0.0.4.macosx-10.8-intel.exe (6.9 MB)

Uploaded Dec 31, 2013 Source

Hashes for newspaper-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`359934ee0c47015687ac3b71d51c7d1a87e8b95ff96135bdbe5c4d2e2c20c735`
MD5	`89f2dc44324b9838cf4923446849d447`
BLAKE2b-256	`4410cc8abed3de450ea2925601e29951eec9658a19f18572429cc29380ec7ac8`

Hashes for newspaper-0.0.4.macosx-10.8-intel.exe
Algorithm	Hash digest
SHA256	`0e5e1c47863c23c4992d5365b1bce4c57fdd134d12ad260a36e81586dc78979e`
MD5	`ccf7a795cd9af87a23ea95a002131ec2`
BLAKE2b-256	`9afe35192071bf02cab3db681d08b83b365bf35a98035b06d640a5f4082b4cf8`
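
The digests above can be used to check a downloaded file before installing it. A minimal sketch using Python 3's hashlib follows; sha256_of and matches_published are hypothetical helper names, and the file name in the usage comment assumes the archive was saved under its original name in the current directory.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Return the hex SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so large archives are not loaded into memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_published(path, published_hex):
    """Compare a file's digest with a published one, ignoring case."""
    return sha256_of(path) == published_hex.strip().lower()


# Published SHA256 for newspaper-0.0.4.tar.gz, copied from the table above.
TARBALL_SHA256 = "359934ee0c47015687ac3b71d51c7d1a87e8b95ff96135bdbe5c4d2e2c20c735"

# Usage, assuming the archive sits in the current directory:
# matches_published("newspaper-0.0.4.tar.gz", TARBALL_SHA256)
```

A mismatch means the file differs from the one that was uploaded and should not be installed.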