skip to navigation
skip to content

Not Logged In

scrapelib 0.7.3

a library for scraping things

Latest Version: 1.0.0

=========
scrapelib
=========

scrapelib is a library for making requests to websites, particularly those
that may be less-than-reliable.

scrapelib originated as part of the `Open States <http: openstates.org="" `_="" project="" to="" scrape="" the="" websites="" of="" all="" 50="" state="" legislatures="" and="" as="" a="" result="" was="" therefore="" designed="" with="" features="" desirable="" when="" dealing="" with="" sites="" that="" have="" intermittent="" errors="" or="" require="" rate-limiting.="" as="" of="" version="" 0.7="" scrapelib="" has="" been="" retooled="" to="" take="" advantage="" of="" the="" superb="" `requests="" <http:="" python-requests.org="">`_ library.

Advantages of using scrapelib over alternatives like httplib2 simply using
requests as-is:

* All of the power of the suberb `requests <http: python-requests.org="">`_ library.
* HTTP, HTTPS, and FTP requests via an identical API
* support for simple caching with pluggable cache backends
* request throtting
* configurable retries for non-permanent site failures
* optional robots.txt compliance

scrapelib is a project of Sunlight Labs (c) 2012.
All code is released under a BSD-style license, see LICENSE for details.

Written by James Turk <jturk@sunlightfoundation.com>

Contributors:
* Michael Stephens - initial urllib2/httplib2 version
* Joe Germuska - fix for IPython embedding
* Alex Chiang - fix to test suite


Requirements
============

* python 2.6, 2.7, or 3.2
* requests

Installation
============

scrapelib is available on PyPI and can be installed via ``pip install scrapelib``

PyPI package: http://pypi.python.org/pypi/scrapelib

Source: http://github.com/sunlightlabs/scrapelib

Documentation: http://scrapelib.readthedocs.org/en/latest/

Example Usage
=============

::

import scrapelib
s = scrapelib.Scraper(requests_per_minute=10, allow_cookies=True,
follow_robots=True)

# Grab Google front page
s.urlopen('http://google.com')

# Will raise RobotExclusionError
s.urlopen('http://google.com/search')

# Will be throttled to 10 HTTP requests per minute
while True:
s.urlopen('http://example.com')  
File Type Py Version Uploaded on Size
scrapelib-0.7.3.tar.gz (md5) Source 2012-06-21 12KB
  • Downloads (All Versions):
  • 108 downloads in the last day
  • 649 downloads in the last week
  • 2138 downloads in the last month