pageone 0.1.0

a module for polling urls and stats from homepages

Latest Version: 0.1.7

Install

pip install pageone

Test

Requires nose

nosetests

Usage

pageone does two things: it extracts article urls from a site’s homepage, and it uses selenium and phantomjs to find the relative positions of those urls on the page.

To get stats about the positions of links, use link_stats:

from pageone import PageOne

p = PageOne(url='http://www.propublica.org/')

# get stats about link positions
for link in p.link_stats():
    print(link)

This will return a list of dictionaries that look like this:

{
 'bucket': 4,
 'datetime': datetime.datetime(2014, 6, 7, 16, 6, 3, 533818),
 'font_size': 13,
 'has_img': 1,
 'headline': u'',
 'homepage': 'http://www.propublica.org/',
 'img_area': 3969,
 'img_height': 63,
 'img_src': u'http://www.propublica.org/images/ngen/gypsy_image_medium/mpmh_victory_drive_140x140_130514_1.jpg',
 'img_width': 63,
 'url': u'http://www.propublica.org/article/protect-service-members-defense-department-plans-broad-ban-high-cost-loans',
 'x': 61,
 'x_bucket': 1,
 'y': 730,
 'y_bucket': 4
}
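
Because each record is a plain dictionary, the results can be filtered and sorted with ordinary Python. The sketch below does not call pageone; it uses hand-written sample records shaped like the output above (the second record is made up for illustration), and the helpers links_by_position and links_with_images are hypothetical names, not part of the library.

```python
# Sample records shaped like the dictionaries returned by link_stats().
# Only a few keys are included; the second record is invented for illustration.
sample_links = [
    {'url': 'http://www.propublica.org/article/protect-service-members-defense-department-plans-broad-ban-high-cost-loans',
     'has_img': 1, 'x': 61, 'y': 730},
    {'url': 'http://www.propublica.org/article/example-story',
     'has_img': 0, 'x': 300, 'y': 120},
]


def links_by_position(links):
    """Order link records top-to-bottom, then left-to-right."""
    return sorted(links, key=lambda link: (link['y'], link['x']))


def links_with_images(links):
    """Keep only the records whose has_img flag is set."""
    return [link for link in links if link['has_img']]


for link in links_by_position(sample_links):
    print('%d,%d %s' % (link['x'], link['y'], link['url']))
```

In real use, sample_links would be replaced by the list returned from p.link_stats().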

Here the bucket variables represent where a link falls in a grid of 200x200 pixel cells. x_bucket counts cells from left to right, y_bucket counts them from top to bottom, and bucket runs from top-left to bottom-right. You can customize the size of this grid by passing bucket_pixels to link_stats, e.g.:

from pageone import PageOne

p = PageOne(url='http://www.propublica.org/')

# get stats about link positions on a 100x100 pixel grid
for link in p.link_stats(bucket_pixels=100):
    print(link)
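
The per-axis buckets in the sample record (x=61 gives x_bucket 1, y=730 gives y_bucket 4) are consistent with a 1-based floor division by the grid size. The sketch below illustrates that arithmetic only; axis_bucket is a hypothetical helper, pageone's actual implementation is not shown in this README, and the way the combined bucket value is derived is not covered here.

```python
def axis_bucket(position, bucket_pixels=200):
    """Return the 1-based grid cell index for a pixel offset along one axis.

    Assumption: cells are numbered from 1 and each spans bucket_pixels pixels.
    """
    return position // bucket_pixels + 1


# Values from the sample record on the default 200-pixel grid.
print(axis_bucket(61))   # 1, matching 'x_bucket'
print(axis_bucket(730))  # 4, matching 'y_bucket'

# The same point on a finer 100-pixel grid lands in a higher-numbered row.
print(axis_bucket(730, bucket_pixels=100))
```

A smaller bucket_pixels value gives a finer grid, so the same link position maps to a larger bucket index.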

To simply get all of the article urls on a homepage, use articles:

from pageone import PageOne
p = PageOne(url='http://www.propublica.org/')

for article in p.articles():
    print(article)

To include article urls that point to other sites, use incl_external:

from pageone import PageOne
p = PageOne(url='http://www.propublica.org/')

for article in p.articles(incl_external=True):
    print(article)

How do I know which urls are articles?

pageone uses siegfried for url parsing and validation. To apply a custom regex for article url validation, pass a compiled pattern to either link_stats or articles, e.g.:

from pageone import PageOne
import re

pattern = re.compile(r'.*propublica.org/[a-z]+/[a-z0-9/-]+')

p = PageOne(url='http://www.propublica.org/')

for article in p.articles(pattern=pattern):
    print(article)
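
It can help to check a pattern against a few known urls before handing it to pageone. The sketch below uses only the standard re module. It assumes the pattern is applied with match(), which this README does not confirm, and is_article is a hypothetical helper. The /about url is an invented example of a non-article page. Note that the unescaped dot in propublica.org matches any character; writing it as \. would make the pattern stricter.

```python
import re

# Same pattern as in the example above.
pattern = re.compile(r'.*propublica.org/[a-z]+/[a-z0-9/-]+')

candidates = [
    # Article url taken from the sample link_stats record.
    'http://www.propublica.org/article/protect-service-members-defense-department-plans-broad-ban-high-cost-loans',
    # The homepage itself: nothing follows the domain.
    'http://www.propublica.org/',
    # Hypothetical single-segment page, invented for illustration.
    'http://www.propublica.org/about',
]


def is_article(url):
    """Return True if the url has a section segment followed by a slug."""
    return pattern.match(url) is not None


for url in candidates:
    print(is_article(url), url)
```

Only the first url has both a section (article) and a slug after it, so it is the only one the pattern accepts.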
 
File                                  Type                  Py Version  Uploaded on  Size
pageone-0.1.0.macosx-10.9-intel.exe   MS Windows installer  any         2014-06-08   69KB
pageone-0.1.0.tar.gz                  Source                            2014-06-08   5KB