
picrawler 0.1.1

A distributed web crawler using PiCloud.


PiCrawler is a distributed web crawler using PiCloud.

With PiCrawler, you can implement a distributed web crawler in a few lines of code.

>>> from picrawler import PiCloudConnection
>>>
>>> with PiCloudConnection() as conn:
...     response = conn.send(['http://en.wikipedia.org/wiki/Star_Wars',
...                           'http://en.wikipedia.org/wiki/Darth_Vader'])
...     print 'status code:', response[0].status_code
...     print 'content:', response[0].content[:15]
status code: 200
content: <!DOCTYPE html>
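The call pattern above — send a batch of URLs, get back a list of responses in the same order — can be imitated locally with a thread pool. This is only a stdlib sketch of the fan-out idea; `fetch` and `send` here are hypothetical stand-ins, not part of PiCrawler:

```python
from multiprocessing.dummy import Pool  # thread-backed Pool with the multiprocessing API


def fetch(url):
    """Stand-in for an HTTP fetch; a real crawler would return status and body."""
    return (url, 200)


def send(urls, workers=4):
    """Fetch a batch of URLs concurrently; Pool.map preserves input order."""
    pool = Pool(workers)
    try:
        return pool.map(fetch, urls)
    finally:
        pool.close()
        pool.join()


responses = send(['http://en.wikipedia.org/wiki/Star_Wars',
                  'http://en.wikipedia.org/wiki/Darth_Vader'])
assert responses[0][0] == 'http://en.wikipedia.org/wiki/Star_Wars'
```

Because `Pool.map` returns results in input order, `responses[0]` always corresponds to the first URL you passed in, just as in the PiCrawler example above.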

Installation

To install PiCrawler, simply:

$ pip install picrawler

Alternatively,

$ easy_install picrawler

PiCloud Setup

Before using PiCrawler, you need to configure your PiCloud API key.

>>> import cloud
>>> cloud.setkey(API_KEY, API_SECRETKEY)

You can obtain an API key by signing up on PiCloud.

Using Real-time Cores

PiCloud lets you reserve exclusive computational resources by requesting real-time cores.

PiCrawler provides a thin wrapper class for requesting these cores.

NOTE: The s1 core type is best suited to crawling tasks, because PiCloud ensures that each s1 core has a unique IP address.

>>> from picrawler import RTCoreRequest
>>>
>>> with RTCoreRequest(core_type='s1', num_cores=10):
...     pass
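Under the hood this is the standard Python context-manager protocol: cores are requested on entry and released on exit, even if the block raises. A minimal stdlib-only illustration of that pattern (`CoreRequest` is a hypothetical name, not PiCrawler's actual implementation):

```python
class CoreRequest(object):
    """Illustrative context manager: acquire cores on entry, release on exit."""

    def __init__(self, core_type, num_cores):
        self.core_type = core_type
        self.num_cores = num_cores
        self.active = False

    def __enter__(self):
        # A real wrapper would call the cloud API here to reserve the cores.
        self.active = True
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Runs on both normal exit and exceptions, so cores are always released.
        self.active = False
        return False  # do not swallow exceptions


with CoreRequest('s1', 10) as req:
    assert req.active

assert not req.active
```

The `with` statement guarantees `__exit__` runs no matter how the block ends, which is why the wrapper style is safer than paired acquire/release calls.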

Customizing Requests

You can customize request headers and other behaviors by passing Request instances instead of raw URL strings. Because PiCrawler uses the Python requests library internally, it accepts all arguments that requests supports.

>>> from picrawler import PiCloudConnection
>>> from picrawler.request import Request
>>>
>>> req = Request('http://en.wikipedia.org/wiki/Star_Wars',
...               'GET',
...               headers={'User-Agent': 'MyCrawler'},
...               args={'timeout': 5})
>>>
>>> with PiCloudConnection() as conn:
...     response = conn.send([req])

Defining Callbacks

You can also attach callbacks to a request.

>>> import logging
>>>
>>> from picrawler import PiCloudConnection
>>> from picrawler.request import Request
>>>
>>> req = Request('http://en.wikipedia.org/wiki/Star_Wars', 'GET',
...               success_callback=lambda resp: logging.info(resp.content),
...               error_callback=lambda resp: logging.exception(resp.exception))
>>>
>>> with PiCloudConnection() as conn:
...     response = conn.send([req])
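The dispatch rule illustrated above is simple: a response that completed normally goes to the success callback, and one that failed goes to the error callback with the exception attached. A rough stdlib-only sketch of that routing (`dispatch` and `FakeResponse` are hypothetical names for illustration, not PiCrawler's internals):

```python
def dispatch(response, success_callback=None, error_callback=None):
    """Route a finished response to the matching callback, if one was given."""
    if getattr(response, 'exception', None) is None:
        if success_callback is not None:
            success_callback(response)
    elif error_callback is not None:
        error_callback(response)


class FakeResponse(object):
    """Minimal stand-in carrying just the fields the callbacks look at."""

    def __init__(self, content=None, exception=None):
        self.content = content
        self.exception = exception


seen = []
dispatch(FakeResponse(content='<html>'),
         success_callback=lambda r: seen.append(('ok', r.content)))
dispatch(FakeResponse(exception=IOError('timeout')),
         error_callback=lambda r: seen.append(('err', r.exception)))
assert seen[0] == ('ok', '<html>')
assert seen[1][0] == 'err'
```

Keeping the callbacks small (log, enqueue, store) is a good habit: in a distributed setting they may run far from the code that created the request.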

Documentation

Documentation is available at http://picrawler.readthedocs.org/.

 
File Type Py Version Uploaded on Size
picrawler-0.1.1-py2.7.egg (md5) Python Egg 2.7 2013-10-08 21KB
picrawler-0.1.1.tar.gz (md5) Source 2013-10-08 7KB
  • Author: Studio Ousia
  • Maintainer: Ikuya Yamada
  • Home Page: http://github.com/studio-ousia/picrawler
  • License:
    Copyright 2013 Studio Ousia
    
       Licensed under the Apache License, Version 2.0 (the "License");
       you may not use this file except in compliance with the License.
       You may obtain a copy of the License at
    
           http://www.apache.org/licenses/LICENSE-2.0
    
       Unless required by applicable law or agreed to in writing, software
       distributed under the License is distributed on an "AS IS" BASIS,
       WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
       See the License for the specific language governing permissions and
       limitations under the License.
  • Package Index Owner: ousia, ikuyamada