mrjob

Python MapReduce framework

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Natural Language
- English
Operating System
- OS Independent
Programming Language
Topic
- System :: Distributed Computing

mrjob
=====

.. image:: http://github.com/yelp/mrjob/raw/master/docs/logos/logo_medium.png

mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming
jobs.

`v0.3.4.1 documentation <http://packages.python.org/mrjob/>`_

`v0.4-dev documentation <http://mrjob.readthedocs.org/en/latest/>`_

mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows you
to buy time on a Hadoop cluster on an hourly basis. It also works with your own
Hadoop cluster.

Some important features:

* Run jobs on EMR, your own Hadoop cluster, or locally (for testing)
* Write multi-step jobs (one map-reduce step feeds into the next)
* Duplicate your production environment inside Hadoop

  * Upload your source tree and put it in your job's ``$PYTHONPATH``
  * Run ``make`` and other setup scripts
  * Set environment variables (e.g. ``$TZ``)
  * Easily install Python packages from tarballs (EMR only)
  * Setup handled transparently by the ``mrjob.conf`` config file

* Automatically interpret error logs from EMR
* SSH tunnel to the Hadoop job tracker on EMR
* Minimal setup

  * To run on EMR, set ``$AWS_ACCESS_KEY_ID`` and ``$AWS_SECRET_ACCESS_KEY``
  * To run on your Hadoop cluster, install ``simplejson`` and make sure
    ``$HADOOP_HOME`` is set

Installation
------------

From PyPI:

``pip install mrjob``

From source:

``python setup.py install``

A Simple Map Reduce Job
-----------------------

The code for this example, and more, lives in ``mrjob/examples``.

.. code:: python

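    """The classic MapReduce job: count the frequency of words.
    """
    from mrjob.job import MRJob
    import re

    # Matches runs of word characters and apostrophes, e.g. "don't".
    WORD_RE = re.compile(r"[\w']+")


    class MRWordFreqCount(MRJob):

        def mapper(self, _, line):
            # Emit (word, 1) for each word; the input key is unused.
            for word in WORD_RE.findall(line):
                yield (word.lower(), 1)

        def combiner(self, word, counts):
            # Pre-sum counts on the mapper side to cut data sent to reducers.
            yield (word, sum(counts))

        def reducer(self, word, counts):
            # Sum all counts for a word to get its total frequency.
            yield (word, sum(counts))


    if __name__ == '__main__':
        MRWordFreqCount.run()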
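
To see what the mapper and reducer do without a Hadoop cluster, here is a
minimal pure-Python sketch of the data flow. It is not how mrjob runs jobs. It
assumes all input fits in memory, and the helper names ``map_words``,
``sum_counts`` and ``run_word_count`` are hypothetical. The combiner stage,
which applies the same summation to each mapper's output before the shuffle,
is left out for brevity.

.. code:: python

    import re
    from collections import defaultdict

    WORD_RE = re.compile(r"[\w']+")


    def map_words(line):
        # Same logic as MRWordFreqCount.mapper: one (word, 1) pair per word.
        for word in WORD_RE.findall(line):
            yield (word.lower(), 1)


    def sum_counts(word, counts):
        # Same logic as MRWordFreqCount.reducer: total the counts for a word.
        yield (word, sum(counts))


    def run_word_count(lines):
        # "Shuffle" stage: group every mapped value under its key.
        grouped = defaultdict(list)
        for line in lines:
            for word, count in map_words(line):
                grouped[word].append(count)
        # Reduce stage: call the reducer once per key, in sorted key order.
        results = {}
        for word in sorted(grouped):
            for key, total in sum_counts(word, grouped[word]):
                results[key] = total
        return results


    if __name__ == '__main__':
        print(run_word_count(["the quick brown fox", "The lazy dog's nap"]))

Because the mapper lowercases each word, ``the`` and ``The`` end up under the
same key and are summed together.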

Try It Out!
-----------

::

    # locally
    python mrjob/examples/mr_word_freq_count.py README.rst > counts
    # on EMR
    python mrjob/examples/mr_word_freq_count.py README.rst -r emr > counts
    # on your Hadoop cluster
    python mrjob/examples/mr_word_freq_count.py README.rst -r hadoop > counts
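
The ``counts`` file can then be read back with ordinary Python. The sketch
below assumes the job used mrjob's default JSON output, where each line holds
a JSON-encoded key and a JSON-encoded value separated by a tab. If your job
sets a different output protocol, the parsing will differ. The helper names
``parse_count_line`` and ``top_words`` are hypothetical, and the sample lines
are made up for illustration.

.. code:: python

    import json


    def parse_count_line(line):
        # Split on the first tab only; JSON escapes tabs inside strings.
        key, value = line.rstrip("\n").split("\t", 1)
        return json.loads(key), json.loads(value)


    def top_words(lines, n=3):
        # Sort by descending count, then alphabetically to break ties.
        pairs = [parse_count_line(line) for line in lines if line.strip()]
        return sorted(pairs, key=lambda kv: (-kv[1], kv[0]))[:n]


    if __name__ == '__main__':
        # Made-up sample lines in the assumed "key<TAB>value" layout.
        sample = ['"mrjob"\t12\n', '"hadoop"\t9\n', '"emr"\t12\n', '"the"\t3\n']
        for word, count in top_words(sample):
            print("%s\t%d" % (word, count))

To use it on real output, pass an open file instead of ``sample``, for example
``top_words(open('counts'))``.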

Setting up EMR on Amazon
------------------------

* Create an `Amazon Web Services account <http://aws.amazon.com/>`_
* Sign up for `Elastic MapReduce <http://aws.amazon.com/elasticmapreduce/>`_
* Get your access and secret keys (click "Security Credentials" on
  `your account page <http://aws.amazon.com/account/>`_)
* Set the environment variables ``$AWS_ACCESS_KEY_ID`` and
  ``$AWS_SECRET_ACCESS_KEY`` accordingly
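
Before launching a job with ``-r emr``, it can help to confirm that both
variables are visible to Python. The following is a small sketch using only
the standard library. The helper name ``missing_aws_vars`` is hypothetical and
is not part of mrjob.

.. code:: python

    import os

    REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


    def missing_aws_vars(environ=None):
        # Return the names of required variables that are unset or empty.
        if environ is None:
            environ = os.environ
        return [name for name in REQUIRED_VARS if not environ.get(name)]


    if __name__ == '__main__':
        missing = missing_aws_vars()
        if missing:
            print("Set these before running with -r emr: " + ", ".join(missing))
        else:
            print("AWS credentials found in the environment.")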

Advanced Configuration
----------------------

To run in other AWS regions, upload your source tree, run ``make``, and use
other advanced mrjob features, you'll need to set up ``mrjob.conf``. mrjob looks
for its conf file in:

* The location specified by ``$MRJOB_CONF``
* ``~/.mrjob.conf``
* ``/etc/mrjob.conf``

See `the mrjob.conf documentation
<http://packages.python.org/mrjob/configs-conf.html>`_ for more information.
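
To check which of these files would be picked up on your machine, the lookup
can be sketched in plain Python. This is an illustration, not mrjob's own
code. It assumes the locations are tried in the order listed and that the
first existing file wins. The helper names ``candidate_conf_paths`` and
``find_conf`` are hypothetical.

.. code:: python

    import os


    def candidate_conf_paths(environ=None):
        # Build the candidate list in the order given in the list above.
        if environ is None:
            environ = os.environ
        paths = []
        if environ.get("MRJOB_CONF"):
            paths.append(environ["MRJOB_CONF"])
        paths.append(os.path.expanduser("~/.mrjob.conf"))
        paths.append("/etc/mrjob.conf")
        return paths


    def find_conf(environ=None, exists=os.path.exists):
        # Assumption: the first candidate that exists is the one used.
        for path in candidate_conf_paths(environ):
            if exists(path):
                return path
        return None


    if __name__ == '__main__':
        print(find_conf())

The ``exists`` parameter is there only so the function can be exercised
without touching the real filesystem.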

Project Links
-------------

* `Source code <http://github.com/Yelp/mrjob>`_
* `Documentation <http://packages.python.org/mrjob/>`_
* `Discussion group <http://groups.google.com/group/mrjob>`_

Reference
---------

* `Hadoop MapReduce <http://hadoop.apache.org/mapreduce/>`_
* `Elastic MapReduce <http://aws.amazon.com/documentation/elasticmapreduce/>`_

More Information
----------------

* `PyCon 2011 mrjob overview <http://blip.tv/pycon-us-videos-2009-2010-2011/pycon-2011-mrjob-distributed-computing-for-everyone-4898987/>`_
* `Introduction to Recommendations and MapReduce with mrjob <http://aimotion.blogspot.com/2012/08/introduction-to-recommendations-with.html>`_
  (`source code <https://github.com/marcelcaraciolo/recsys-mapreduce-mrjob>`_)
* `Social Graph Analysis Using Elastic MapReduce and PyPy <http://postneo.com/2011/05/04/social-graph-analysis-using-elastic-mapreduce-and-pypy>`_

Thanks to `Greg Killion <mailto:greg@blind-works.net>`_
(`blind-works.net <http://www.blind-works.net/>`_) for the logo.

Download files

Source Distribution

mrjob-0.4.tar.gz (152.1 kB view hashes)

Uploaded May 1, 2013 Source

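Hashes for mrjob-0.4.tar.gz

=========== ====================================================================
Algorithm   Hash digest
=========== ====================================================================
SHA256      ``63652a456ed4aeff6a8d164adc9a642935133746d3ddf16d595c2ef70e518ba0``
MD5         ``48e50fb60c8463b4f59ac0add5ecceee``
BLAKE2b-256 ``99313fee08a6d62de477e7bd27aadab7f70bebe0860f5c3462168a8e9f58da29``
=========== ====================================================================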