A stream processing framework modeled after Yahoo! Pipes.

Index

Introduction | Requirements | Word Count | Motivation | Usage | Installation | Project Structure | Design Principles | Scripts | Contributing | Credits | More Info | License

Introduction

riko is a Python framework for analyzing and processing streams of structured data. It has synchronous and asynchronous APIs, supports parallel execution, and is well suited for processing rss feeds [1].

With riko, you can

  • Read csv/xml/json/html files

  • Create text and data processing workflows via modular pipes

  • Parse, extract, and process rss feeds

  • Create awesome mashups [2], APIs, and maps

  • Perform parallel processing via cpus/processors or threads

  • and much more…

Requirements

riko has been tested and is known to work on Python 2.7 and PyPy 4.0.

Optional Dependencies

=======================  ==========  =======================
Feature                  Dependency  Installation
=======================  ==========  =======================
Async API                Twisted     pip install riko[async]
Accelerated xml parsing  lxml [3]    pip install lxml
=======================  ==========  =======================

Word Count

In this example, we use several pipes to count the words on a webpage.

>>> ### Create a SyncPipe workflow ###
>>> #
>>> # `SyncPipe` is a workflow convenience class that enables method
>>> # chaining and parallel processing
>>> from riko import get_path
>>> from riko.lib.collections import SyncPipe
>>>
>>> ### Set the pipe configurations ###
>>> #
>>> # Notes:
>>> #   1. `get_path` just looks up a file in the `data` directory
>>> #   2. the `detag` option will strip all html tags from the result
>>> url = get_path('users.jyu.fi.html')                                            # 1
>>> fetch_conf = {'url': url, 'start': '<body>', 'end': '</body>', 'detag': True}  # 2
>>> replace_conf = {'rule': {'find': '\n', 'replace': ' '}}
>>>
>>> counts = (SyncPipe('fetchpage', conf=fetch_conf)
...     .strreplace(conf=replace_conf, assign='content')
...     .stringtokenizer(conf={'delimiter': ' '}, emit=True)
...     .count()
...     .output)
>>>
>>> next(counts)
{'count': 70}

Motivation

Why I built riko

Yahoo! Pipes [4] was a user-friendly web application used to “aggregate, manipulate, and mashup content from around the web.” Wanting to create custom pipes, I came across pipe2py, which translated a Yahoo! Pipes workflow into python code. pipe2py suited my needs at the time but was unmaintained and lacked asynchronous and parallel processing APIs.

riko addresses the shortcomings of pipe2py but breaks compatibility with Yahoo! Pipes workflows. riko contains ~40 built-in modules, aka pipes, that allow you to programmatically recreate much of what you could previously do in Yahoo! Pipes.

Why you should use riko

riko provides a number of benefits and differences compared to other stream processing applications such as Huginn, Flink, Spark, and Storm [5]. Namely:

  • a small footprint (CPU and memory usage)

  • native RSS support

  • simple installation and usage

  • pypy support

  • modular pipes to filter, sort, and modify feeds

The tradeoffs riko makes are:

  • not distributed (i.e., it can’t run across a cluster of servers)

  • no GUI for creating workflows

  • doesn’t continually monitor feeds for new data

  • can’t react to specific events

  • iterator (pull) based, so it only supports a single consumer (see the sketch below)
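
Since a feed is just an iterator, it is exhausted after a single pass and a second consumer sees nothing. A minimal sketch using a hypothetical, hand-built feed:

>>> # a feed is an iterator, so it can only be consumed once
>>> feed = iter([{'title': 'riko pt. 1'}, {'title': 'riko pt. 2'}])
>>> len(list(feed))  # the first consumer drains the feed
2
>>> len(list(feed))  # nothing is left for a second consumer
0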

The following table summarizes these observations:

=========  ===========  =========  ===  ========================  =======  ===========
Framework  Stream Type  Footprint  RSS  no outside dependencies   CEP [6]  distributed
=========  ===========  =========  ===  ========================  =======  ===========
riko       pull         small      √    √
Huginn     push         med        √                              √
Others     push         large                                     √        √
=========  ===========  =========  ===  ========================  =======  ===========

For more detailed information, please check out the FAQ.

Usage

riko is intended to be used directly as a Python library.

Usage Index

Fetching feeds | Synchronous processing | Parallel processing | Asynchronous processing | Cookbook

Fetching feeds

riko can read both local and remote filepaths via source pipes. All source pipes return an equivalently structured feed, i.e., an iterator of dictionaries, aka items.

>>> from itertools import chain
>>> from riko import get_path
>>> from riko.modules.pipefetch import pipe as fetch
>>> from riko.modules.pipefetchpage import pipe as fetchpage
>>> from riko.modules.pipefetchdata import pipe as fetchdata
>>> from riko.modules.pipefetchsitefeed import pipe as fetchsitefeed
>>> from riko.modules.pipefeedautodiscovery import pipe as autodiscovery
>>>
>>> ### Fetch a url ###
>>> feed = fetchpage(conf={'url': 'https://news.ycombinator.com'})
>>>
>>> ### Fetch a filepath ###
>>> #
>>> # Note: `get_path` just looks up a file in the `data` directory
>>> feed = fetchdata(conf={'url': get_path('quote.json')})
>>>
>>> ### View the fetched data ###
>>> item = next(feed)
>>> item['list']['resources'][0]['resource']['fields']['symbol']
'KRW=X'

>>> ### Fetch an rss feed ###
>>> feed = fetch(conf={'url': 'https://news.ycombinator.com/rss'})
>>>
>>> ### Fetch the first rss feed found ###
>>> feed = fetchsitefeed(conf={'url': 'http://www.bbc.com/news'})
>>>
>>> ### Find all rss links and fetch the feeds ###
>>> entries = autodiscovery(conf={'url': 'http://edition.cnn.com/services/rss/'})
>>> urls = (e['link'] for e in entries)
>>> feed = chain.from_iterable(fetch(conf={'url': url}) for url in urls)
>>>
>>> ### Alternatively, create a SyncCollection ###
>>> #
>>> # `SyncCollection` is a url fetching convenience class with support for
>>> # parallel processing
>>> from riko.lib.collections import SyncCollection
>>>
>>> sources = [{'url': url} for url in urls]
>>> feed = SyncCollection(sources).fetch()
>>>
>>> ### View the fetched rss feed(s) ###
>>> #
>>> # Note: regardless of how you fetch an rss feed, it will have the same
>>> # structure
>>> item = next(feed)
>>> sorted(item.keys())
[
    'author', 'author.name', 'author.uri', 'comments', 'content',
    'dc:creator', 'id', 'link', 'pubDate', 'summary', 'title',
    'updated', 'updated_parsed', 'y:id', 'y:published', 'y:title']
>>> item['title'], item['author'], item['link']
(
    u'Using NFC tags in the car', u'Liam Green-Hughes',
    u'http://www.greenhughes.com/content/using-nfc-tags-car')

Please see the FAQ for a complete list of supported file types and protocols.

Synchronous processing

riko can modify feeds by combining any of its ~40 built-in pipes.

>>> from itertools import chain
>>> from riko import get_path
>>> from riko.modules.pipefetch import pipe as fetch
>>> from riko.modules.pipefilter import pipe as pfilter
>>> from riko.modules.pipesubelement import pipe as subelement
>>> from riko.modules.piperegex import pipe as regex
>>> from riko.modules.pipesort import pipe as sort
>>>
>>> ### Set the pipe configurations ###
>>> #
>>> # Notes:
>>> #   1. `get_path` just looks up a file in the `data` directory
>>> #   2. the `dotall` option is used to match `.*` across newlines
>>> fetch_conf = {'url': get_path('feed.xml')}                                          # 1
>>> filter_rule = {'field': 'y:published', 'op': 'before', 'value': '2/5/09'}
>>> sub_conf = {'path': 'content.value'}
>>> match = r'(.*href=")([\w:/.@]+)(".*)'
>>> regex_rule = {'field': 'content', 'match': match, 'replace': '$2', 'dotall': True}  # 2
>>> sort_conf = {'rule': {'sort_key': 'content', 'sort_dir': 'desc'}}
>>>
>>> ### Create a SyncPipe workflow ###
>>> #
>>> # `SyncPipe` is a workflow convenience class that enables method
>>> # chaining and parallel processing.
>>> #
>>> # The following workflow will:
>>> #   1. fetch the rss feed
>>> #   2. filter for items published before 2/5/2009
>>> #   3. extract the path `content.value` from each feed item
>>> #   4. replace the extracted text with the last href url contained
>>> #      within it
>>> #   5. reverse sort the items by the replaced url
>>> #   6. return the raw feed iterator
>>> #
>>> # Note: sorting is not lazy so take caution when using this pipe
>>> from riko.lib.collections import SyncPipe
>>>
>>> output = (SyncPipe('fetch', conf=fetch_conf)  # 1
...     .filter(conf={'rule': filter_rule})       # 2
...     .subelement(conf=sub_conf, emit=True)     # 3
...     .regex(conf={'rule': regex_rule})         # 4
...     .sort(conf=sort_conf)                     # 5
...     .output)                                  # 6
>>>
>>> next(output)
{'content': 'mailto:mail@writetoreply.org'}

Please see Design Principles for an alternative (function based) workflow. Please see pipes for a complete list of available pipes.

Parallel processing

An example using riko’s ThreadPool-based parallel API:

>>> from riko import get_path
>>> from riko.lib.collections import SyncPipe
>>>
>>> ### Set the pipe configurations ###
>>> #
>>> # Notes:
>>> #   1. `get_path` just looks up a file in the `data` directory
>>> #   2. the `dotall` option is used to match `.*` across newlines
>>> url = get_path('feed.xml')                                                          # 1
>>> filter_rule1 = {'field': 'y:published', 'op': 'before', 'value': '2/5/09'}
>>> sub_conf = {'path': 'content.value'}
>>> match = r'(.*href=")([\w:/.@]+)(".*)'
>>> regex_rule = {'field': 'content', 'match': match, 'replace': '$2', 'dotall': True}  # 2
>>> filter_rule2 = {'field': 'content', 'op': 'contains', 'value': 'file'}
>>> strtransform_conf = {'rule': {'transform': 'rstrip', 'args': '/'}}
>>>
>>> ### Create a parallel SyncPipe workflow ###
>>> #
>>> # The following workflow will:
>>> #   1. fetch the rss feed
>>> #   2. filter for items published before 2/5/2009
>>> #   3. extract the path `content.value` from each feed item
>>> #   4. replace the extracted text with the last href url contained
>>> #      within it
>>> #   5. filter for items with local file urls (which happen to be rss
>>> #      feeds)
>>> #   6. strip any trailing `/` from the url
>>> #   7. remove duplicate urls
>>> #   8. fetch each rss feed
>>> #   9. Merge the rss feeds into a list
>>> feed = (SyncPipe('fetch', conf={'url': url}, parallel=True)  # 1
...     .filter(conf={'rule': filter_rule1})                     # 2
...     .subelement(conf=sub_conf, emit=True)                    # 3
...     .regex(conf={'rule': regex_rule})                        # 4
...     .filter(conf={'rule': filter_rule2})                     # 5
...     .strtransform(conf=strtransform_conf)                    # 6
...     .uniq(conf={'uniq_key': 'strtransform'})                 # 7
...     .fetch(conf={'url': {'subkey': 'strtransform'}})         # 8
...     .list)                                                   # 9
>>>
>>> len(feed)
25

Asynchronous processing

To enable asynchronous processing, you must install the async module.

pip install riko[async]

An example using riko’s optional Twisted-powered asynchronous API:

>>> from twisted.internet.task import react
>>> from twisted.internet.defer import inlineCallbacks
>>> from riko import get_path
>>> from riko.twisted.collections import AsyncPipe
>>>
>>> ### Set the pipe configurations ###
>>> #
>>> # Notes:
>>> #   1. `get_path` just looks up a file in the `data` directory
>>> #   2. the `dotall` option is used to match `.*` across newlines
>>> url = get_path('feed.xml')                                                          # 1
>>> filter_rule1 = {'field': 'y:published', 'op': 'before', 'value': '2/5/09'}
>>> sub_conf = {'path': 'content.value'}
>>> match = r'(.*href=")([\w:/.@]+)(".*)'
>>> regex_rule = {'field': 'content', 'match': match, 'replace': '$2', 'dotall': True}  # 2
>>> filter_rule2 = {'field': 'content', 'op': 'contains', 'value': 'file'}
>>> strtransform_conf = {'rule': {'transform': 'rstrip', 'args': '/'}}
>>>
>>> ### Create an AsyncPipe workflow ###
>>> #
>>> # See `Parallel processing` above for an explanation of the steps this
>>> # performs
>>> @inlineCallbacks
... def run(reactor):
...     feed = yield (AsyncPipe('fetch', conf={'url': url})
...         .filter(conf={'rule': filter_rule1})
...         .subelement(conf=sub_conf, emit=True)
...         .regex(conf={'rule': regex_rule})
...         .filter(conf={'rule': filter_rule2})
...         .strtransform(conf=strtransform_conf)
...         .uniq(conf={'uniq_key': 'strtransform'})
...         .fetch(conf={'url': {'subkey': 'strtransform'}})
...         .list)
...
...     print(len(feed))
...
>>> react(run)
25

Cookbook

Please see the cookbook or ipython notebook for more examples.

Installation

(You are using a virtualenv, right?)

At the command line, install riko using either pip (recommended)

pip install riko

or easy_install

easy_install riko

Please see the installation doc for more details.

Project Structure

┌── CONTRIBUTING.rst
├── LICENSE
├── MANIFEST.in
├── Makefile
├── README.rst
├── bin
│   └── run
├── data/*
├── dev-requirements.txt
├── docs
│   ├── AUTHORS.rst
│   ├── CHANGES.rst
│   ├── COOKBOOK.rst
│   ├── FAQ.rst
│   ├── INSTALLATION.rst
│   └── TODO.rst
├── examples
│   ├── __init__.py
│   ├── pipe_base.py
│   ├── pipe_gigs.py
│   ├── pipe_test.py
│   ├── usage.ipynb
│   └── usage.py
├── helpers/*
├── manage.py
├── py2-requirements.txt
├── requirements.txt
├── riko
│   ├── __init__.py
│   ├── lib
│   │   ├── __init__.py
│   │   ├── autorss.py
│   │   ├── collections.py
│   │   ├── dotdict.py
│   │   ├── log.py
│   │   └── utils.py
│   ├── modules/*
│   └── twisted
│       ├── __init__.py
│       ├── collections.py
│       └── utils.py
├── setup.cfg
├── setup.py
├── tests
│   ├── __init__.py
│   └── standard.rc
└── tox.ini

Design Principles

The primary data structures in riko are the item and the feed. An item is a simple dictionary, and a feed is an iterator of items. You can create a feed manually with something as simple as [{'content': 'hello world'}]. The primary way to manipulate a feed in riko is via a pipe. A pipe is simply a function that accepts either a feed or an item, and returns an iterator of items. You can create a workflow by using the output of one pipe as the input to another pipe.
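
A minimal sketch of such a chained workflow, wiring the output of one pipe into another (it reuses the pipestringtokenizer and pipecount modules demonstrated below; the configuration keys and the exact output are assumed to match those examples):

>>> # hypothetical sketch: tokenize an item's `content` field, then count the
>>> # resulting tokens by feeding one pipe's output feed into the next pipe
>>> from riko.modules.pipestringtokenizer import pipe as tokenizer
>>> from riko.modules.pipecount import pipe as counter
>>>
>>> item = {'content': 'hello world'}
>>> tokens = tokenizer(item, conf={'delimiter': ' '}, field='content', emit=True)
>>> next(counter(tokens))
{'count': 2}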

riko pipes come in two flavors: operators and processors. operators operate on an entire feed at once and are unable to handle individual items. Example operators include pipecount, pipefilter, and pipereverse.

>>> from riko.modules.pipereverse import pipe
>>>
>>> feed = [{'title': 'riko pt. 1'}, {'title': 'riko pt. 2'}]
>>> next(pipe(feed))
{'title': 'riko pt. 2'}

processors process individual feed items and can be parallelized across threads or processes. Example processors include pipefetchsitefeed, pipehash, pipeitembuilder, and piperegex.

>>> from riko.modules.pipehash import pipe
>>>
>>> item = {'title': 'riko pt. 1'}
>>> feed = pipe(item, field='title')
>>> next(feed)
{'title': 'riko pt. 1', 'hash': 2853617420}

Some processors, e.g., pipestringtokenizer, return multiple results.

>>> from riko.modules.pipestringtokenizer import pipe
>>>
>>> item = {'title': 'riko pt. 1'}
>>> tokenizer_conf = {'delimiter': ' '}
>>> feed = pipe(item, conf=tokenizer_conf, field='title')
>>> next(feed)
{
    'title': 'riko pt. 1',
    'stringtokenizer': [
        {'content': 'riko'},
        {'content': 'pt.'},
        {'content': '1'}]}

>>> # In this case, if we just want the result, we can `emit` it instead
>>> feed = pipe(item, conf=tokenizer_conf, field='title', emit=True)
>>> next(feed)
{'content': 'riko'}

operators are split into sub-types of aggregator and composer. aggregators, e.g., pipecount, aggregate all items of a feed into a single value, while composers, e.g., pipefilter, compose a new feed from a subset or all of the available items.

>>> from riko.modules.pipecount import pipe
>>>
>>> feed = [{'title': 'riko pt. 1'}, {'title': 'riko pt. 2'}]
>>> next(pipe(feed))
{'count': 2}
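
For comparison, here is a composer sketch using pipefilter; the rule format is assumed to match the filter rules shown in the Usage section above, and the default filter mode is assumed to permit matching items:

>>> # sketch: compose a new feed containing only the items whose `title`
>>> # contains 'pt. 2' (rule format borrowed from the Usage examples above)
>>> from riko.modules.pipefilter import pipe
>>>
>>> feed = [{'title': 'riko pt. 1'}, {'title': 'riko pt. 2'}]
>>> rule = {'field': 'title', 'op': 'contains', 'value': 'pt. 2'}
>>> next(pipe(feed, conf={'rule': rule}))
{'title': 'riko pt. 2'}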

processors are split into sub-types of source and transformer. sources, e.g., pipeitembuilder, can create a feed, while transformers, e.g., pipehash, can only transform items in a feed.

>>> from riko.modules.pipeitembuilder import pipe
>>>
>>> attrs = {'key': 'title', 'value': 'riko pt. 1'}
>>> next(pipe(conf={'attrs': attrs}))
{'title': 'riko pt. 1'}

The following table summarizes these observations:

=========  ===========  =====  ===========  ==================  ================
type       sub-type     input  output       is parallelizable?  is feed creator?
=========  ===========  =====  ===========  ==================  ================
operator   aggregator   feed   aggregation
operator   composer     feed   feed
processor  source       item   feed         √                   √
processor  transformer  item   feed         √
=========  ===========  =====  ===========  ==================  ================

If you are unsure of the type of pipe you have, check its metadata.

>>> from riko.modules.pipefetchpage import asyncPipe
>>> from riko.modules.pipecount import pipe
>>>
>>> asyncPipe.__dict__
{'type': 'processor', 'name': 'fetchpage', 'sub_type': 'source'}
>>> pipe.__dict__
{'type': 'operator', 'name': 'count', 'sub_type': 'aggregator'}

The SyncPipe and AsyncPipe classes (among other things) perform this check for you to allow for convenient method chaining and transparent parallelization.

>>> from riko.lib.collections import SyncPipe
>>>
>>> attrs = [
...     {'key': 'title', 'value': 'riko pt. 1'},
...     {'key': 'content', 'value': "Let's talk about riko!"}]
>>> sync_pipe = SyncPipe('itembuilder', conf={'attrs': attrs})
>>> sync_pipe.hash().list[0]
{
    'title': 'riko pt. 1',
    'content': "Let's talk about riko!",
    'hash': 1346301218}

Please see the cookbook for advanced examples, including how to wire in values from other pipes or accept user input.

Scripts

riko comes with a built-in task manager, manage.py.

Setup

pip install -r dev-requirements.txt

Examples

Run the python linter and nose tests:

manage lint
manage test

Contributing

Please mimic the coding style/conventions used in this repo. If you add new classes or functions, please add the appropriate doc blocks with examples. Also, make sure the python linter and nose tests pass.

Please see the contributing doc for more details.

Credits

Shoutout to pipe2py for heavily inspiring riko. riko started out as a fork of pipe2py, but has since diverged so much that little (if any) of the original code-base remains.

More Info

License

riko is distributed under the MIT License.
