cabu is a simple REST microservice to scrape content from anywhere.

Cabu

Cabu is a simple microservice framework for crawling websites remotely. It is built on Flask and Selenium, and includes a virtual display wrapper and a few crawling methods.

Full documentation here

Usage

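from flask import jsonify

# `app` is assumed to be the Cabu application instance created elsewhere.
@app.route('/gizmodo_last_articles_links')
def gizmodo_last_articles():
    # Load the page in the Selenium webdriver attached to the app.
    app.webdriver.get('http://www.gizmodo.com')
    # Collect the href of every headline link on the page.
    headlines = app.webdriver.find_elements_by_css_selector('h1.headline>a')
    articles_links = [i.get_attribute('href') for i in headlines]

    return jsonify({'articles': articles_links})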
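
Since the route above returns JSON of the form {'articles': [...]}, a client could consume it roughly as follows. This is only a sketch: the base URL (Flask's default development address, http://localhost:5000) and the helper names parse_articles and fetch_article_links are assumptions for illustration, not part of Cabu.

```python
import json
from urllib.request import urlopen

# Assumption: the service runs locally on Flask's default development port.
BASE_URL = "http://localhost:5000"


def parse_articles(body):
    """Return the list of links from the endpoint's JSON body (str or bytes)."""
    return json.loads(body)["articles"]


def fetch_article_links(base_url=BASE_URL):
    """Call the hypothetical running service and return the article links."""
    with urlopen(base_url + "/gizmodo_last_articles_links") as response:
        return parse_articles(response.read())
```

With the service running, calling fetch_article_links() should return the same list of links that the route passes to jsonify.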

Installing

$ pip install cabu

Features

  • Selenium configuration out of the box

  • Flask wrapping

  • Crawling methods included

  • AWS S3 Export

  • FTP / FTPS

  • Cookies persistence

  • Link extractor

  • Proxy configuration

  • Headless mode can be disabled for local debugging

  • Docker pre-configured distributed environment

  • Database handler

  • Compatible with most Flask extensions (Flask-Admin, Flask-Mail, Flask-OAuth, …)

  • Twelve-Factor App compliance
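
To give a rough idea of what a link extractor such as the one listed above does, here is a minimal sketch using only the Python standard library. It is not Cabu's own implementation or API; the names LinkExtractor and extract_links are hypothetical and chosen only for this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkExtractor(HTMLParser):
    """Collect the absolute URL of every <a href="..."> in a document."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the page URL.
                self.links.append(urljoin(self.base_url, value))


def extract_links(html, base_url):
    """Return all links found in `html`, resolved against `base_url`."""
    parser = LinkExtractor(base_url)
    parser.feed(html)
    return parser.links
```

In a Cabu route, the HTML could come from the webdriver's page_source attribute, with the page URL passed as base_url so that relative links are resolved.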

(Likely to come soon)

  • CouchDB support

  • Couchbase support

  • Mobile drivers

  • SFTP

  • HtmlUnit web driver

  • Remote webdriver wrapper

  • Parallelization

  • Neural Network plugins

Testing

All tests run against Docker services rather than mocks. Mock-based alternatives will be added soon ;)

$ pip install -r requirements-dev.txt
$ py.test cabu/tests

Contributing

Please see the Contribute page.
