mrjob 0.4

Python MapReduce framework

Latest Version: 0.4.2


mrjob is a Python 2.5+ package that helps you write and run Hadoop Streaming
jobs.

`v0.3.4.1 documentation <>`_

`v0.4-dev documentation <>`_

mrjob fully supports Amazon's Elastic MapReduce (EMR) service, which allows you
to buy time on a Hadoop cluster on an hourly basis. It also works with your own
Hadoop cluster.

Some important features:

* Run jobs on EMR, your own Hadoop cluster, or locally (for testing).
* Write multi-step jobs (one map-reduce step feeds into the next)
* Duplicate your production environment inside Hadoop
    * Upload your source tree and put it in your job's ``$PYTHONPATH``
    * Run make and other setup scripts
    * Set environment variables (e.g. ``$TZ``)
    * Easily install Python packages from tarballs (EMR only)
    * Setup handled transparently by ``mrjob.conf`` config file
* Automatically interpret error logs from EMR
* SSH tunnel to Hadoop job tracker on EMR
* Minimal setup
    * To run on EMR, set ``$AWS_ACCESS_KEY_ID`` and ``$AWS_SECRET_ACCESS_KEY``
    * To run on your Hadoop cluster, install ``simplejson`` and make sure
      ``$HADOOP_HOME`` is set.
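
To make the multi-step feature concrete, here is a minimal plain-Python sketch of one map-reduce step feeding into the next. It does not use mrjob: ``run_step`` and the mapper and reducer functions are hypothetical helpers invented for this illustration, and the in-memory grouping only stands in for the shuffle that Hadoop performs between map and reduce.

```python
from collections import defaultdict


def run_step(records, mapper, reducer):
    """Run one map-reduce step in memory (illustration only, not mrjob)."""
    grouped = defaultdict(list)
    for key, value in records:
        for out_key, out_value in mapper(key, value):
            grouped[out_key].append(out_value)
    output = []
    # Visit keys in sorted order, mimicking how a reducer sees its input.
    for out_key in sorted(grouped):
        output.extend(reducer(out_key, grouped[out_key]))
    return output


# Step 1: count how often each word occurs.
def count_mapper(_, line):
    for word in line.split():
        yield word.lower(), 1


def count_reducer(word, counts):
    yield word, sum(counts)


# Step 2: send every (word, count) pair to a single key and keep the largest.
def max_mapper(word, count):
    yield None, (count, word)


def max_reducer(_, count_word_pairs):
    yield max(count_word_pairs)


lines = [(None, "the cat saw the dog"), (None, "The dog ran")]
word_counts = run_step(lines, count_mapper, count_reducer)
most_common = run_step(word_counts, max_mapper, max_reducer)
```

The output of the first step is passed unchanged as the input of the second, which is the sense in which one step "feeds into" the next.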


Installation

From PyPI:

``pip install mrjob``

From source:

``python install``

A Simple Map Reduce Job

Code for this example and more lives in ``mrjob/examples``.

.. code:: python

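   """The classic MapReduce job: count the frequency of words."""
   from mrjob.job import MRJob
   import re

   # Match runs of word characters and apostrophes (e.g. "don't").
   WORD_RE = re.compile(r"[\w']+")


   class MRWordFreqCount(MRJob):

       def mapper(self, _, line):
           # Emit (word, 1) for every word on the input line.
           for word in WORD_RE.findall(line):
               yield (word.lower(), 1)

       def combiner(self, word, counts):
           # Pre-sum counts on the map side to reduce shuffled data.
           yield (word, sum(counts))

       def reducer(self, word, counts):
           # Sum the partial counts into the final total per word.
           yield (word, sum(counts))


   if __name__ == '__main__':
       MRWordFreqCount.run()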
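
The data flow behind this job can be sketched in plain Python, without mrjob or Hadoop. The sketch assumes two input splits, applies the mapper and the combiner logic to each split separately, then merges the partial counts as the reducer would. ``simulate`` and ``sum_by_key`` are hypothetical names used only here, and a real Hadoop cluster may run the combiner zero or more times.

```python
import re
from collections import defaultdict
from itertools import chain

WORD_RE = re.compile(r"[\w']+")


def mapper(line):
    # Same logic as the job's mapper: one (word, 1) pair per word.
    for word in WORD_RE.findall(line):
        yield word.lower(), 1


def sum_by_key(pairs):
    """Group (word, count) pairs by word and sum the counts.

    The job uses this same summing logic for both combiner and reducer.
    """
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return sorted(totals.items())


def simulate(splits):
    # Map and combine each split on its own, as each map task would.
    combined = [
        sum_by_key(chain.from_iterable(mapper(line) for line in split))
        for split in splits
    ]
    # Reduce: merge the partial counts coming from every split.
    return sum_by_key(chain.from_iterable(combined))


splits = [["It's a dog", "a DOG"], ["dog days"]]
counts = simulate(splits)
```

Because summing is associative, combining per split before reducing gives the same totals as reducing the raw ``(word, 1)`` pairs directly. That is why the combiner and the reducer can share one body.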

Try It Out!

::

    # locally
    python mrjob/examples/ README.rst > counts
    # on EMR
    python mrjob/examples/ README.rst -r emr > counts
    # on your Hadoop cluster
    python mrjob/examples/ README.rst -r hadoop > counts

Setting up EMR on Amazon

* Create an `Amazon Web Services account <>`_
* Sign up for `Elastic MapReduce <>`_
* Get your access and secret keys (click "Security Credentials" on
  `your account page <>`_)
* Set the environment variables ``$AWS_ACCESS_KEY_ID`` and
  ``$AWS_SECRET_ACCESS_KEY`` accordingly
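
Before launching a job with ``-r emr``, it can help to confirm that both variables are actually visible to Python. The following is a minimal sketch using only the standard library; ``missing_aws_vars`` is a hypothetical helper, not part of mrjob, and it only checks that the variables are non-empty, not that the keys are valid.

```python
import os

REQUIRED_VARS = ("AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY")


def missing_aws_vars(environ=os.environ):
    """Return the required variable names that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


if __name__ == "__main__":
    missing = missing_aws_vars()
    if missing:
        print("Set these before running with -r emr: " + ", ".join(missing))
```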

Advanced Configuration

To run in other AWS regions, upload your source tree, run ``make``, and use
other advanced mrjob features, you'll need to set up ``mrjob.conf``. mrjob looks
for its conf file in:

* The contents of ``$MRJOB_CONF``
* ``~/.mrjob.conf``
* ``/etc/mrjob.conf``

See `the mrjob.conf documentation
<>`_ for more information.
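
The lookup described above can be sketched as follows. This is an illustration, not mrjob's own code: it assumes the three locations are tried in the order listed and that the first path that exists is used. ``find_conf_path`` and its parameters are hypothetical names.

```python
import os


def find_conf_path(environ=os.environ, exists=os.path.exists):
    """Return the first candidate conf path that exists, or None.

    Assumption: candidates are tried in the order the list above gives.
    """
    candidates = []
    if environ.get("MRJOB_CONF"):
        candidates.append(environ["MRJOB_CONF"])
    candidates.append(os.path.expanduser("~/.mrjob.conf"))
    candidates.append("/etc/mrjob.conf")
    for path in candidates:
        if exists(path):
            return path
    return None
```

Passing ``environ`` and ``exists`` as parameters lets the lookup be exercised without touching the real environment or filesystem, for example ``find_conf_path({}, lambda p: p == "/etc/mrjob.conf")``.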

Project Links

* `Source code <>`_
* `Documentation <>`_
* `Discussion group <>`_


* `Hadoop MapReduce <>`_
* `Elastic MapReduce <>`_

More Information

* `PyCon 2011 mrjob overview <>`_
* `Introduction to Recommendations and MapReduce with mrjob <>`_
  (`source code <>`_)
* `Social Graph Analysis Using Elastic MapReduce and PyPy <>`_

Thanks to `Greg Killion <>`_
(` <>`_) for the logo.
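
====================== ====== ========== =========== =====
File                   Type   Py Version Uploaded on Size
====================== ====== ========== =========== =====
mrjob-0.4.tar.gz (md5) Source            2013-05-01  148KB
====================== ====== ========== =========== =====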