A command-line tool for BigML.io, the public BigML API

BigMLer - A command-line tool for BigML's API
=============================================

BigMLer makes `BigML <https://bigml.com>`_ even easier.

BigMLer wraps `BigML's API Python bindings <http://bigml.readthedocs.org>`_ to
offer a high-level command-line script to easily create and publish datasets
and models, create ensembles,
make local predictions from multiple models, and simplify many other machine
learning tasks. For additional information, see
the
`full documentation for BigMLer on Read the Docs <http://bigmler.readthedocs.org>`_.

BigMLer is open sourced under the `Apache License, Version
2.0 <http://www.apache.org/licenses/LICENSE-2.0.html>`_.

Support
=======

Please report problems and bugs to our `BigML.io issue
tracker <https://github.com/bigmlcom/io/issues>`_.

Discussions about the different bindings take place in the general
`BigML mailing list <http://groups.google.com/group/bigml>`_. Or join us
in our `Campfire chatroom <https://bigmlinc.campfirenow.com/f20a0>`_.

Requirements
============

Python 2.7 is currently supported by BigMLer.

BigMLer requires `bigml 1.2.2 <https://github.com/bigmlcom/python>`_ or
higher. Using the proportional missing strategy (for example, in local
predictions) may additionally require the `numpy <http://www.numpy.org/>`_ and
`scipy <http://www.scipy.org/>`_ libraries. They are not
installed automatically as dependencies, because they are quite heavy and
are only needed in this case, so you must install them yourself if you
need them. Check the bindings documentation for more info.

BigMLer Installation
====================

To install the latest stable release with
`pip <http://www.pip-installer.org/>`_::

    $ pip install bigmler

You can also install the development version of bigmler directly
from the Git repository::

    $ pip install -e git://github.com/bigmlcom/bigmler.git#egg=bigmler

For a detailed description of install instructions on Windows see the
`BigMLer on Windows <#bigmler-on-windows>`_ section.


BigML Authentication
====================

All the requests to BigML.io must be authenticated using your username
and `API key <https://bigml.com/account/apikey>`_ and are always
transmitted over HTTPS.

The BigML module looks for your username and API key in the environment
variables ``BIGML_USERNAME`` and ``BIGML_API_KEY`` respectively. You can
add the following lines to your ``.bashrc`` or ``.bash_profile`` to set
those variables automatically when you log in::

    export BIGML_USERNAME=myusername
    export BIGML_API_KEY=ae579e7e53fb9abd646a6ff8aa99d4afe83ac291

Otherwise, you can pass your credentials directly when running the BigMLer
script as follows::

    bigmler --train data/iris.csv --username myusername --api_key ae579e7e53fb9abd646a6ff8aa99d4afe83ac291
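
As a rough illustration of the two options above, the following Python sketch
resolves credentials from explicit values first and then from the
``BIGML_USERNAME`` and ``BIGML_API_KEY`` environment variables. The helper name
``resolve_credentials`` is hypothetical, and the precedence it applies (explicit
values win over the environment) is an assumption rather than documented
BigMLer behaviour.

```python
import os


def resolve_credentials(username=None, api_key=None, environ=None):
    """Return (username, api_key), preferring explicit values over the environment.

    Hypothetical helper for illustration only; the precedence shown here
    (explicit arguments first) is an assumption.
    """
    # Allow a custom mapping so the lookup can be exercised without
    # touching the real process environment.
    environ = os.environ if environ is None else environ
    username = username or environ.get("BIGML_USERNAME")
    api_key = api_key or environ.get("BIGML_API_KEY")
    if not username or not api_key:
        raise ValueError(
            "BigML credentials not found: set BIGML_USERNAME and "
            "BIGML_API_KEY or pass them explicitly"
        )
    return username, api_key
```

Failing early with a clear message when either value is missing is a design
choice of this sketch; it avoids sending an unauthenticated request.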

For a detailed description of authentication instructions on Windows see the
`BigMLer on Windows <#bigmler-on-windows>`_ section.


BigMLer on Windows
==================

To install BigMLer on Windows environments, you'll need `Python for Windows
(v.2.7.x) <http://www.python.org/download/>`_ installed.

In addition to that, you'll need the ``pip`` tool to install BigMLer. To
install pip, first you need to open your command line window (write ``cmd`` in
the input field that appears when you click on ``Start`` and hit ``enter``),
download this `python file <http://python-distribute.org/distribute_setup.py>`_
and execute it::

    c:\Python27\python.exe distribute_setup.py

After that, you'll be able to install ``pip`` by typing the following command::

    c:\Python27\Scripts\easy_install.exe pip

And finally, to install BigMLer, just type::

    c:\Python27\Scripts\pip.exe install bigmler

and BigMLer should be installed on your computer. Then
issuing::

    bigmler --version

should show BigMLer version information.

Finally, to start using BigMLer to handle your BigML resources, you need to
set your credentials in BigML for authentication. If you want them to be
permanently stored in your system, use::

    setx BIGML_USERNAME myusername
    setx BIGML_API_KEY ae579e7e53fb9abd646a6ff8aa99d4afe83ac291


BigML Development Mode
======================

You can also instruct BigMLer to work in BigML's Sandbox
environment by using the parameter ``--dev``::

    bigmler --train data/iris.csv --dev

Using the development flag you can run tasks under 1 MB without spending any of
your BigML credits.

Using BigMLer
=============

To run BigMLer you can use the console script directly. The ``--help`` option will
describe all the available options::

    bigmler --help

Alternatively you can just call bigmler as follows::

    python bigmler.py --help

This will display the full list of optional arguments. You can read a brief
explanation for each option below.

Quick Start
===========

Let's see some basic usage examples. Check the `installation` and `authentication`
sections in `BigMLer on Read the Docs <http://bigmler.readthedocs.org>`_ if you are not familiar with BigML.

Basics
------

You can create a new model just with ::

    bigmler --train data/iris.csv

If you check your `dashboard at BigML <https://bigml.com/dashboard>`_, you will
see a new source, dataset, and model. Isn't it magic?

You can generate predictions for a test set using::

    bigmler --train data/iris.csv --test data/test_iris.csv

You can also specify a file name to save the newly created predictions::

    bigmler --train data/iris.csv --test data/test_iris.csv --output predictions

If you do not specify the path to an output file, BigMLer will auto-generate one for you under a
new directory named after the current date and time (e.g., ``MonNov1212_174715/predictions.csv``).
With the ``--prediction-info``
flag set to ``brief``, only the prediction result will be stored (the default is
``normal``, which includes confidence information).
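
The directory name in the example above looks like an abbreviated weekday,
month, day, two-digit year and time. The Python sketch below reproduces that
shape. The ``strftime`` pattern is inferred from the example names in this
document and the helper ``default_output_path`` is hypothetical, so treat it as
an approximation rather than BigMLer's actual code.

```python
from datetime import datetime


def default_output_path(now=None, filename="predictions.csv"):
    """Build a path shaped like MonNov1212_174715/predictions.csv.

    The strftime pattern is an assumption inferred from the example
    directory names; it is not confirmed BigMLer behaviour.
    """
    now = now or datetime.now()
    # %a = abbreviated weekday, %b = abbreviated month, %d = day,
    # %y = two-digit year, then hours, minutes and seconds.
    directory = now.strftime("%a%b%d%y_%H%M%S")
    return "%s/%s" % (directory, filename)


print(default_output_path(datetime(2012, 11, 12, 17, 47, 15)))
```

Note that ``%a`` and ``%b`` depend on the active locale, so the names can
differ if the process locale is not English.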

A different ``objective field`` (the field that you want to predict) can be selected using::

    bigmler --train data/iris.csv --test data/test_iris.csv --objective 'sepal length'

If you do not explicitly specify an objective field, BigML will default to the last
column in your dataset.
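
The default described above can be pictured with a small Python sketch that
picks the objective column from a CSV header. The helper ``objective_column``
and the sample header are hypothetical and only illustrate the "last column
unless told otherwise" rule; the real selection happens in BigML.

```python
import csv
import io


def objective_column(header, objective=None):
    """Return the index of the objective field in a list of column names.

    Falls back to the last column when no objective is given, mirroring
    the default described above. Hypothetical helper, not BigMLer code.
    """
    if objective is None:
        return len(header) - 1
    return header.index(objective)


# Sample header, assumed for illustration only.
sample = io.StringIO("sepal length,sepal width,petal length,petal width,species\n")
header = next(csv.reader(sample))
print(objective_column(header))                  # index of the last column
print(objective_column(header, "sepal length"))  # index of the named column
```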

Also, if your test file uses a particular field separator for its data,
you can tell BigMLer using ``--test-separator``.
For example, if your test file uses the tab character as field separator the
call should be like::

    bigmler --train data/iris.csv --test data/test_iris.tsv \
            --test-separator '\t'
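
Inside single quotes the shell passes ``\t`` as two literal characters, so a
tool receiving it has to turn the escape into a real tab before parsing. The
Python sketch below shows one way to do that with the standard ``csv`` module.
The helper names are hypothetical and this is not necessarily how BigMLer
handles the flag internally.

```python
import csv
import io


def decode_separator(raw):
    """Turn a two-character escape such as '\\t' into the real character."""
    return raw.encode("utf-8").decode("unicode_escape")


def read_rows(text, raw_separator=","):
    """Parse delimited text using the separator as typed on the command line.

    Hypothetical helper for illustration only.
    """
    delimiter = decode_separator(raw_separator)
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


# The second argument is backslash + "t", as the shell would deliver it.
print(read_rows("5.1\t3.5\tsetosa\n", "\\t"))
```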

If you don't provide a file name for your training source, BigMLer will try to
read it from the standard input::

    cat data/iris.csv | bigmler --train

BigMLer will try to use the locale of the model both to create a new source
(if ``--train`` flag is used) and to interpret test data. In case
it fails, it will try ``en_US.UTF-8``
or ``English_United States.1252`` and a warning message will be printed.
If you want to change this behaviour you can specify your preferred locale::

    bigmler --train data/iris.csv --test data/test_iris.csv \
            --locale "English_United States.1252"
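
The fallback order described above can be sketched in Python with the standard
``locale`` module. The helper ``pick_locale`` and its ``setter`` parameter are
hypothetical and exist only to make the idea concrete and testable; BigMLer's
own implementation may differ.

```python
import locale

# Fallback names taken from the paragraph above.
FALLBACK_LOCALES = ("en_US.UTF-8", "English_United States.1252")


def pick_locale(preferred, setter=None):
    """Return the first locale name that can be set, or None if none works.

    Tries the preferred locale first and then the fallbacks. ``setter``
    defaults to locale.setlocale for LC_ALL and can be swapped out in tests.
    Hypothetical helper, not BigMLer code.
    """
    if setter is None:
        setter = lambda name: locale.setlocale(locale.LC_ALL, name)
    for name in (preferred,) + FALLBACK_LOCALES:
        try:
            setter(name)
        except locale.Error:
            continue  # This locale is not available; try the next one.
        if name != preferred:
            print("Warning: falling back to locale %s" % name)
        return name
    return None
```

Which locale names are available depends on the operating system, which is
why both a POSIX-style and a Windows-style fallback are listed.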

If you check your working directory you will see that BigMLer creates a file
with the
model ids that have been generated (e.g., ``FriNov0912_223645/models``).
This file is handy if you later want to use those model ids to generate local
predictions. BigMLer also creates a file with the dataset id that has been
generated (e.g., ``TueNov1312_003451/dataset``) and another one summarizing
the steps taken in the session: ``bigmler_sessions``. You can also
store a copy of every created or retrieved resource in your output directory
(e.g., ``TueNov1312_003451/model_50c23e5e035d07305a00004f``) by setting the flag
``--store``.
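
If you want to reuse the stored ids from your own scripts, a minimal Python
sketch could look like the following. It assumes the file holds one resource
id per line, which this document states explicitly only for the file passed to
``--datasets``; the helper ``read_resource_ids`` is hypothetical.

```python
def read_resource_ids(path):
    """Return the non-empty lines of a file such as <output_dir>/models.

    Assumes one resource id per line (an assumption for the models file).
    Hypothetical helper, not part of BigMLer.
    """
    with open(path) as handle:
        # Strip whitespace and skip blank lines.
        return [line.strip() for line in handle if line.strip()]
```

For example, ``read_resource_ids("FriNov0912_223645/models")`` would return
the list of ids stored during that session.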

Prior Versions Compatibility Issues
-----------------------------------

BigMLer will accept flags written with underscore as word separator like
``--clear_logs`` for compatibility with prior versions. Also ``--field-names``
is accepted, although the more complete ``--field-attributes`` flag is
preferred. ``--stat_pruning`` and ``--no_stat_pruning`` are discontinued
and their effects can be achieved by setting the actual ``--pruning`` flag
to ``statistical`` or ``no-pruning`` values respectively.
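
One way to picture the underscore compatibility is a small pre-processing step
that rewrites option names before they are parsed. The Python sketch below is
a hypothetical illustration (``normalize_flags`` is not a BigMLer function)
and only touches option names, leaving option values alone.

```python
def normalize_flags(argv):
    """Rewrite option names such as --clear_logs to --clear-logs.

    Only tokens starting with "--" are touched, and only the part before
    any "=" sign, so option values keep their underscores.
    Hypothetical helper for illustration; not BigMLer's actual code.
    """
    normalized = []
    for token in argv:
        if token.startswith("--"):
            name, sep, value = token.partition("=")
            token = name.replace("_", "-") + sep + value
        normalized.append(token)
    return normalized


print(normalize_flags(["--clear_logs", "--api_key", "ab_cd"]))
```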

Running the Tests
-----------------

To run the tests you will need to install
`lettuce <http://packages.python.org/lettuce/tutorial/simple.html>`_::

    $ pip install lettuce

and set up your authentication via environment variables, as explained
above. With that in place, you can run the test suite simply by running::

    $ cd tests
    $ lettuce

Additional Information
----------------------

For additional information, see
the `full documentation for BigMLer on Read the Docs <http://bigmler.readthedocs.org>`_.


.. :changelog:

History
-------

1.8.1 (2014-05-04)
~~~~~~~~~~~~~~~~~~

- Changing the Gazibit report for shared resources to include the model
shared url in embedded format.
- Fixing bug: train and tests data could not be read from stdin

1.8.0 (2014-04-29)
~~~~~~~~~~~~~~~~~~

- Adding the ``analyze`` subcommand. The subcommand presents new features,
  such as:

  - ``--cross-validation``, which performs k-fold cross-validation,
  - ``--features``, which selects the best features to increase accuracy (or
    any other evaluation metric) using a smart search algorithm,
  - ``--nodes``, which selects the node threshold that ensures best accuracy (or
    any other evaluation metric) in a user-defined range of nodes.

1.7.1 (2014-04-21)
~~~~~~~~~~~~~~~~~~

- Fixing bug: --no-upload flag was not really used.

1.7.0 (2014-04-20)
~~~~~~~~~~~~~~~~~~

- Adding the --reports option to generate Gazibit reports.

1.6.0 (2014-04-18)
~~~~~~~~~~~~~~~~~~

- Adding the --shared flag to share the created dataset, model and evaluation.

1.5.1 (2014-04-04)
~~~~~~~~~~~~~~~~~~

- Fixing bug for model building, when objective field was specified and
no --max-category was present the user given objective was not used.
- Fixing bug: max-category data stored even when --max-category was not
used.

1.5.0 (2014-03-24)
~~~~~~~~~~~~~~~~~~

- Adding --missing-strategy option to allow different prediction strategies
when a missing value is found in a split field. Available for local
predictions, batch predictions and evaluations.
- Adding new --delete options: --newer-than and --older-than to delete lists
of resources according to their creation date.
- Adding --multi-dataset flag to generate a new dataset from a list of
equally structured datasets.

1.4.7 (2014-03-14)
~~~~~~~~~~~~~~~~~~

- Bug fixing: resume from multi-label processing from dataset was not working.
- Bug fixing: max parallel resource creation check did not check that all the
older tasks ended, only the last of the slot. This caused
more tasks than permitted to be sent in parallel.
- Improving multi-label training data uploads by zipping the extended file and
transforming booleans from True/False to 1/0.

1.4.6 (2014-02-21)
~~~~~~~~~~~~~~~~~~

- Bug fixing: dataset objective field is not updated each time --objective
is used, but only if it differs from the existing objective.

1.4.5 (2014-02-04)
~~~~~~~~~~~~~~~~~~

- Storing the --max-categories info (its number and the chosen `other` label)
in user_metadata.

1.4.4 (2014-02-03)
~~~~~~~~~~~~~~~~~~

- Fix when using the combined method in --max-categories models.
The combination function now uses confidence to choose the predicted
category.
- Allowing full content text fields to be also used as --max-categories
objective fields.
- Fix solving objective issues when its column number is zero.

1.4.3 (2014-01-28)
~~~~~~~~~~~~~~~~~~

- Adding the --objective-weights option to point to a CSV file containing the
weights assigned to each class.
- Adding the --label-aggregates option to create new aggregate fields on the
multi label fields such as count, first or last.

1.4.2 (2014-01-24)
~~~~~~~~~~~~~~~~~~

- Fix in local random forests' predictions. Sometimes the fields used in all
  the models were not correctly retrieved and some predictions could be
  erroneous.

1.4.1 (2014-01-23)
~~~~~~~~~~~~~~~~~~

- Fix to allow the input data for multi-label predictions to be expanded.
- Fix to retrieve from the models definition info the labels that were
given by the user in its creation in multi-label models.

1.4.0 (2014-01-20)
~~~~~~~~~~~~~~~~~~

- Adding new --balance option to automatically balance all the classes evenly.
- Adding new --weight-field option to use the field contents as weights for
the instances.

1.3.0 (2014-01-17)
~~~~~~~~~~~~~~~~~~

- Adding new --source-attributes, --ensemble-attributes,
--evaluation-attributes and --batch-prediction-attributes options.
- Refactoring --multi-label resources to include its related info in
the user_metadata attribute.
- Refactoring the main routine.
- Adding --batch-prediction-tag for delete operations.

1.2.3 (2014-01-16)
~~~~~~~~~~~~~~~~~~

- Fix to transmit --training-separator when creating remote sources.

1.2.2 (2014-01-14)
~~~~~~~~~~~~~~~~~~

- Fix for multiple multi-label fields: headers did not match rows contents in
some cases.

1.2.1 (2014-01-12)
~~~~~~~~~~~~~~~~~~

- Fix for datasets generated using the --new-fields option. The new dataset
was not used in model generation.

1.2.0 (2014-01-09)
~~~~~~~~~~~~~~~~~~

- Adding --multi-label-fields to provide a comma-separated list of multi-label
fields in a file.

1.1.0 (2014-01-08)
~~~~~~~~~~~~~~~~~~

- Fix for ensembles' local predictions when order is used in tie break.
- Fix for duplicated model ids in models file.
- Adding new --node-threshold option to allow node limit in models.
- Adding new --model-attributes option pointing to a JSON file containing
model attributes for model creation.

1.0.1 (2014-01-06)
~~~~~~~~~~~~~~~~~~

- Fix for missing modules during installation.

1.0 (2014-01-02)
~~~~~~~~~~~~~~~~~~

- Adding the --max-categories option to handle datasets with a high number of
categories.
- Adding the --method combine option to produce predictions with the sets
of datasets generated using --max-categories option.
- Fixing problem with --max-categories when the categorical field is not
a preferred field of the dataset.
- Changing the --datasets option behaviour: it points to a file where
dataset ids are stored, one per line, and now it reads all of them to be
used in model and ensemble creation.

0.7.2 (2013-12-20)
~~~~~~~~~~~~~~~~~~

- Adding confidence to predictions output in full format

0.7.1 (2013-12-19)
~~~~~~~~~~~~~~~~~~

- Bug fixing: multi-label predictions failed when the --ensembles option
is used to provide the ensemble information

0.7.0 (2013-11-24)
~~~~~~~~~~~~~~~~~~

- Bug fixing: --dataset-price could not be set.
- Adding the threshold combination method to the local ensemble.

0.6.1 (2013-11-23)
~~~~~~~~~~~~~~~~~~

- Bug fixing: --model-fields option with absolute field names was not
compatible with multi-label classification models.
- Changing resource type checking function.
- Bug fixing: evaluations did not use the given combination method.
- Bug fixing: evaluation of an ensemble had turned into evaluations of its
models.
- Adding pruning to the ensemble creation configuration options

0.6.0 (2013-11-08)
~~~~~~~~~~~~~~~~~~

- Changing fields_map column order: previously mapped dataset column
number to model column number, now maps model column number to
dataset column number.
- Adding evaluations to multi-label models.
- Bug fixing: unicode characters greater than ascii-127 caused crash in
multi-label classification

0.5.0 (2013-10-08)
~~~~~~~~~~~~~~~~~~

- Adapting to predictions issued by the high performance prediction server and
the 0.9.0 version of the python bindings.
- Support for shared models using the same version on python bindings.
- Support for different server names using environment variables.

0.4.1 (2013-10-02)
~~~~~~~~~~~~~~~~~~

- Adding ensembles' predictions for multi-label objective fields
- Bug fixing: in evaluation mode, evaluation for --dataset and
--number-of-models > 1 did not select the 20% hold out instances to test the
generated ensemble.

0.4.0 (2013-08-15)
~~~~~~~~~~~~~~~~~~

- Adding text analysis through the corresponding bindings

0.3.7 (2013-09-17)
~~~~~~~~~~~~~~~~~~

- Adding support for multi-label objective fields
- Adding --prediction-headers and --prediction-fields to improve
--prediction-info formatting options for the predictions file
- Adding the ability to read --test input data from stdin
- Adding --seed option to generate different splits from a dataset

0.3.6 (2013-08-21)
~~~~~~~~~~~~~~~~~~

- Adding --test-separator flag

0.3.5 (2013-08-16)
~~~~~~~~~~~~~~~~~~

- Bug fixing: resume crash when remote predictions were not completed
- Bug fixing: Fields object for input data dict building lacked fields
- Bug fixing: test data was repeated in remote prediction function
- Bug fixing: Adding replacement=True as default for ensembles' creation

0.3.4 (2013-08-09)
~~~~~~~~~~~~~~~~~~

- Adding --max-parallel-evaluations flag
- Bug fixing: matching seeds in models and evaluations for cross validation

0.3.3 (2013-08-09)
~~~~~~~~~~~~~~~~~~
- Changing --model-fields and --dataset-fields flag to allow adding/removing
fields with +/- prefix
- Refactoring local and remote prediction functions
- Adding 'full data' option to the --prediction-info flag to join test input
data with prediction results in predictions file
- Fixing errors in documentation and adding install for windows info

0.3.2 (2013-07-04)
~~~~~~~~~~~~~~~~~~
- Adding new flag to control predictions file information
- Bug fixing: using default sample-rate in ensemble evaluations
- Adding standard deviation to evaluation measures in cross-validation
- Bug fixing: using only-model argument to download fields in models

0.3.1 (2013-05-14)
~~~~~~~~~~~~~~~~~~

- Adding delete for ensembles
- Creating ensembles when the number of models is greater than one
- Remote predictions using ensembles

0.3.0 (2013-04-30)
~~~~~~~~~~~~~~~~~~

- Adding cross-validation feature
- Using user locale to create new resources in BigML
- Adding --ensemble flag to use ensembles in predictions and evaluations

0.2.1 (2013-03-03)
~~~~~~~~~~~~~~~~~~

- Deep refactoring of main resources management
- Fixing bug in batch_predict for no headers test sets
- Fixing bug for wide dataset's models that need query-string to retrieve all fields
- Fixing bug in test asserts to catch subprocess raise
- Adding default missing tokens to models
- Adding stdin input for --train flag
- Fixing bug when reading descriptions in --field-attributes
- Refactoring to get status from api function
- Adding confidence to combined predictions

0.2.0 (2012-01-21)
~~~~~~~~~~~~~~~~~~
- Evaluations management
- console monitoring of process advance
- resume option
- user defaults
- Refactoring to improve readability

0.1.4 (2012-12-21)
~~~~~~~~~~~~~~~~~~

- Improved locale management.
- Adds progressive handling for large numbers of models.
- More options in field attributes update feature.
- New flag to combine local existing predictions.
- More methods in local predictions: plurality, confidence weighted.

0.1.3 (2012-12-06)
~~~~~~~~~~~~~~~~~~

- New flag for locale settings configuration.
- Filtering only finished resources.

0.1.2 (2012-12-06)
~~~~~~~~~~~~~~~~~~

- Fix to ensure windows compatibility.

0.1.1 (2012-11-07)
~~~~~~~~~~~~~~~~~~

- Initial release.
