apiarist

Python Hive query framework

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Natural Language
- English
Operating System
- OS Independent
Programming Language
Topic
- System :: Distributed Computing

Project description

# Apiarist

A Python 2.5+ package for defining Hive queries that can be run on AWS EMR.

In its current form it addresses only a very narrow use case: reading large CSV files into a Hive database, running a Hive query, and writing the results to a CSV file.

Future versions will endeavour to extend the input/output formats and be runnable locally.

It is modeled on [mrjob](https://github.com/Yelp/mrjob) and attempts to present a similar API and use similar common variables to cooperate with `boto`.

## A simple Hive job

You will need to provide four methods:

- `table()` the name of the table that your query will select from.
- `input_columns()` the columns in the source data file, as `(name, type)` pairs.
- `output_columns()` the columns that your query will output, as `(name, type)` pairs.
- `query()` the HiveQL query.

This code lives in `/examples`.

```python
from apiarist.job import HiveJob

class EmailRecipientsSummary(HiveJob):

    def table(self):
        return 'emails_sent'

    def input_columns(self):
        return [
            ('day', 'STRING'),
            ('weekday', 'INT'),
            ('sent', 'BIGINT')
        ]

    def output_columns(self):
        return [
            ('year', 'INT'),
            ('weekday', 'INT'),
            ('sent', 'BIGINT')
        ]

    def query(self):
        return "SELECT YEAR(day), weekday, SUM(sent) FROM emails_sent GROUP BY YEAR(day), weekday;"

if __name__ == "__main__":
    EmailRecipientsSummary().run()
```
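To make the role of `table()` and `input_columns()` more concrete, here is a minimal sketch of how a list of `(name, type)` pairs could be turned into a HiveQL `CREATE TABLE` statement for comma-delimited data. The helper `create_table_ddl` is hypothetical and is not part of Apiarist; the statement the framework actually generates may differ.

```python
def create_table_ddl(table, columns):
    """Build a HiveQL CREATE TABLE statement for comma-delimited text data.

    Hypothetical helper for illustration only; not Apiarist's own code.
    """
    # Each (name, type) pair becomes one "name TYPE" column definition.
    column_defs = ", ".join(
        "{0} {1}".format(name, hive_type) for name, hive_type in columns
    )
    return (
        "CREATE TABLE {0} ({1}) "
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';"
    ).format(table, column_defs)


# Same table name and columns as the EmailRecipientsSummary job.
input_columns = [('day', 'STRING'), ('weekday', 'INT'), ('sent', 'BIGINT')]
ddl = create_table_ddl('emails_sent', input_columns)
print(ddl)
```

Seen this way, the column list acts as the schema for the source CSV, which is why the types must be valid Hive types.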

### Try it out

Locally (must have a Hive server available):

    python email_recipients_summary.py -r local /path/to/your/local/file.csv

EMR:

    python email_recipients_summary.py -r emr s3://path/to/your/S3/files/

## Command-line options

Arguments can be passed to jobs on the command line, or programmatically with an array of options. Argument handling uses the [optparse](https://docs.python.org/2/library/optparse.html) module.

The following options control how the job runs. Most of them apply to the AWS/EMR settings.

- `-r` the run mode. Either `local` or `emr` (default is `local`)
- `--output-dir` where the results of the job will go.
- `--s3-scratch-uri` the bucket in which all the temporary files can go.
- `--ec2-instance-type` the base instance type. Default is `m3.xlarge`
- `--ec2-master-instance-type` if you want the master type to be different.
- `--num-ec2-instances` number of instances (including the master). Default is `2`.
- `--ami-version` the ami version. Default is `latest`.
- `--hive-version` the Hive version. Default is `latest`.
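As an illustration of how an array of options is parsed, the sketch below declares the options listed above with plain `optparse` and parses a list of arguments. It is not Apiarist's own parser: the `dest` names and the `build_parser` function are assumptions made for this example, and only the flags and defaults come from the list above.

```python
from optparse import OptionParser


def build_parser():
    """Declare the documented options on a standalone optparse parser.

    The dest names are assumptions for illustration, not Apiarist's own.
    """
    parser = OptionParser(usage="%prog [options] INPUT_PATH")
    # Passing `choices` makes optparse reject any run mode other than these.
    parser.add_option("-r", dest="runner", choices=["local", "emr"],
                      default="local")
    parser.add_option("--output-dir", dest="output_dir")
    parser.add_option("--s3-scratch-uri", dest="s3_scratch_uri")
    parser.add_option("--ec2-instance-type", dest="ec2_instance_type",
                      default="m3.xlarge")
    parser.add_option("--ec2-master-instance-type",
                      dest="ec2_master_instance_type")
    parser.add_option("--num-ec2-instances", dest="num_ec2_instances",
                      type="int", default=2)
    parser.add_option("--ami-version", dest="ami_version", default="latest")
    parser.add_option("--hive-version", dest="hive_version", default="latest")
    return parser


# An "array of options", as it could be passed programmatically.
args = ["-r", "emr", "--num-ec2-instances", "4", "s3://path/to/your/S3/files/"]
options, positional = build_parser().parse_args(args)

print(options.runner)             # run mode taken from the arguments
print(options.num_ec2_instances)  # converted to an integer by optparse
print(options.ec2_instance_type)  # falls back to the documented default
print(positional)                 # remaining arguments: the input path
```

Options that are not supplied keep their defaults, and anything that is not an option (here the S3 path) is returned as a positional argument.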

### Passing options to your jobs

Jobs can be configured to accept arguments.

To do this, add the following method to your job class to configure the options:

    def configure_options(self):
        super(EmailRecipientsSummary, self).configure_options()
        self.add_passthrough_option('--year', dest='year')

And then use the option by providing it in the command line arguments, like this:

    python email_recipients_summary.py -r local /path/to/your/local/file.csv --year 2014

Then incorporate it into your HiveQL query like this:

    def query(self):
        q = "SELECT YEAR(day), weekday, SUM(sent) "
        q += "FROM emails_sent "
        q += "WHERE YEAR(day) = {0} ".format(self.options.year)
        q += "GROUP BY YEAR(day), weekday;"
        return q
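Because the option value arrives from the command line as a string and is formatted directly into the query, it may be worth validating it first. The following is a minimal standalone sketch, assuming the year should always be a whole number; `build_query` is a hypothetical function written for this example, not an Apiarist API.

```python
def build_query(year):
    """Return the HiveQL query for one year, rejecting non-numeric input.

    Hypothetical standalone version of the query() method above.
    """
    # int() raises ValueError for anything that is not a whole number,
    # so stray HiveQL in the option value never reaches the query string.
    year = int(year)
    q = "SELECT YEAR(day), weekday, SUM(sent) "
    q += "FROM emails_sent "
    q += "WHERE YEAR(day) = {0} ".format(year)
    q += "GROUP BY YEAR(day), weekday;"
    return q


print(build_query("2014"))
```

Inside a job class, the same check could be applied to `self.options.year` before the query string is built.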

## License

Apiarist source code is released under the Apache 2 License. Check the LICENSE file for more information.

Release history

0.2.3

Nov 14, 2018

0.2.2

Nov 14, 2018

0.2.1

Nov 8, 2018

0.2.0

Oct 6, 2017

0.1.14

Dec 4, 2015

0.1.13

Dec 4, 2015

0.1.12

Jun 29, 2015

0.1.11

Dec 12, 2014

0.1.10

Sep 16, 2014

0.1.9

Sep 16, 2014

0.1.8

Sep 10, 2014

0.1.7

Sep 10, 2014

0.1.5

Sep 9, 2014

0.1.4

Sep 9, 2014

0.1.3

Sep 1, 2014

0.1.2

Sep 1, 2014

0.1.1

Aug 27, 2014

0.1.0

Aug 25, 2014

0.0.14

Aug 22, 2014

0.0.13

Aug 12, 2014

This version

0.0.12

Aug 12, 2014

0.0.11

Aug 12, 2014

0.0.10

Aug 12, 2014

0.0.9

Aug 12, 2014

0.0.8

Aug 11, 2014

0.0.7

Aug 4, 2014

0.0.6

Jul 29, 2014

0.0.3

Jul 3, 2014

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

apiarist-0.0.12.tar.gz (38.6 kB view hashes)

Uploaded Aug 12, 2014 Source

Hashes for apiarist-0.0.12.tar.gz
Algorithm	Hash digest
SHA256	`18faee3bb14d86e77bb07235fa261f9f8be98622b4229262d06cc21691244fc6`
MD5	`af3041566f8b37266e152327996caedb`
BLAKE2b-256	`7d3576500653e099cc05a169beacdf4bddc4e21f8306bc03e429b266d409004d`