skip to navigation
skip to content

Not Logged In

pyrallel 0.2.1

Experimental tools for parallel machine learning

# Pyrallel - Parallel Data Analytics in Python

Overview: experimental project to investigate distributed computation patterns for machine learning and other semi-interactive data analytics tasks.

Scope:

  • focus on small to medium dataset that fits in memory on a small (10+ nodes) to medium cluster (100+ nodes).
  • focus on small to medium data (with data locality when possible).
  • focus on CPU bound tasks (e.g. training Random Forests) while trying to limit disk / network access to a minimum.
  • do not focus on HA / Fault Tolerance (yet).
  • do not try to invent new set of high level programming abstractions (yet): use a low level programming model (IPython.parallel) to finely control the cluster elements and messages transfered and help identify what are the practical underlying constraints in distributed machine learning setting.

Disclaimer: the public API of this library will probably not be stable soon as the current goal of this project is to experiment.

## Dependencies

The usual suspects: Python 2.7, NumPy, SciPy.

Fetch the development version (master branch) from:

StarCluster develop branch and its IPCluster plugin is also required to easily startup a bunch of nodes with IPython.parallel setup.

## Patterns currently under investigation

  • Asynchronous & randomized hyper-parameters search (a.k.a. Randomized Grid Search) for machine learning models
  • Share numerical arrays efficiently over the nodes and make them available to concurrently running Python processes without making copies in memory using memory-mapped files.
  • Distributed Random Forests fitting.
  • Ensembling heterogeneous library models.
  • Parallel implementation of online averaged models using a MPI AllReduce, for instance using MiniBatchKMeans on partitioned data.

See the content of the examples/ folder for more details.

## License

Simplified BSD.

## History

This project started at the [PyCon 2012 PyData sprint](http://wiki.ipython.org/PyCon12Sprint) as a set of proof of concept [IPython.parallel scripts](https://github.com/ogrisel/pycon-pydata-sprint).

 
File Type Py Version Uploaded on Size
pyrallel-0.2.1.tar.gz (md5) Source 2013-06-20 7KB
  • Downloads (All Versions):
  • 1 downloads in the last day
  • 35 downloads in the last week
  • 234 downloads in the last month