###########################################################
pandalone: process data-trees with reconfigurable-paths
###########################################################
|python-ver| |travis-status| |appveyor-status| |cover-status| |docs-status| \
|pypi-ver| |dependencies| |downloads-count| |github-issues| |proj-license|

.. image:: doc/_static/pandalone_logo.png
    :width: 300 px
    :align: center

**pandalone** is an open source Python 2/3 library for building
*component-functions* to process *hierarchical-data* using
*reconfigurable-paths*.

:Release: 0.1.3
:Documentation: https://pandalone.readthedocs.org/
:Source: https://github.com/pandalone/pandalone
:PyPI repo: https://pypi.python.org/pypi/pandalone
:Keywords: calculation, data, dependencies, engineering, excel, library,
    numpy, pandas, processing, python, resolution, scientific,
    simulink, tree, utility
:Copyright: 2015 European Commission (`JRC-IET
    <https://ec.europa.eu/jrc/en/institutes/iet>`_)
:License: `EUPL 1.1+ <https://joinup.ec.europa.eu/software/page/eupl>`_

Currently only 2 portions of the envisioned functionality are ready for use:

- **xleash**: A mini-language for "throwing the rope" around rectangular areas
  of Excel-sheets.
- **mappings**: Hierarchical string-like objects that may be used for
  indexing, facilitating renaming keys and column-names at a later stage.


Our goal is to facilitate the composition of *engineering-models* from
loosely-coupled *components*.
The library was initially envisioned as an *indirection-framework* around
*pandas* coupled with a *dependency-resolver*: every such model should
auto-adapt and process only the values available, and allow *remapping* of
the paths accessing them, so that it can run on renamed/relocated
*value-trees* without component-code modifications.

It is written for *python-3.4* but tested under both *python-2.7* and
*python-3.3+*, for *Windows* and *Linux*.

.. Note::
    The project, as of May-2015, is considered to be at an alpha-stage,
    without any released version in *pypi* yet.


.. _end-opening:

.. contents:: Table of Contents
    :backlinks: top

.. _begin-intro:

Introduction
============

Overview
--------

At the most fundamental level, an "execution" or a "run" of any data-processing
can be thought of like this::

    .--------------.      _____________      .--------------.
    ;   DataTree   ;     |             |     ;   DataTree   ;
    ;--------------; ==> |  <cfunc_1>  | ==> ;--------------;
    ; /some/data   ;     |  <cfunc_2>  |     ; /some/data   ;
    ; /some/other  ;     |     ...     |     ; /some/other  ;
    ; /foo/bar     ;     |_____________|     ; /foo/bar     ;
    '--------------'                         '--------------'


- The *data-tree* might come from *json*, *hdf5*, *excel-workbooks*, or
  plain dictionaries and lists.
  Its values are strings and numbers, *numpy-lists*, *pandas* or
  *xray-datasets*, etc.

- The *component-functions* must abide by the following simple signature::

      cfunc_do_something(pandelone, datatree)

  and must not return any value, just read and write into the data-tree.

- Here is a simple component-function:

  .. code-block:: python

      def cfunc_standardize(pandelone, datatree):
          # Fetch the input (`pin`) and output (`pon`) reconfigurable-paths.
          pin, pon = pandelone.paths()
          df = datatree.get(pin.A)
          # Divide column `B` by its standard deviation and store it as `B_std`.
          df[pon.A.B_std] = df[pin.A.B] / df[pin.A.B].std()

- Notice the use of the *reconfigurable-paths* marked specifically as input or
  output.
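
To make the idea concrete, here is a minimal, self-contained sketch of how such
a component-function could run against a plain dictionary. The ``Step`` and
``FakePandelone`` classes are hypothetical stand-ins written only for this
illustration, they are not part of the pandalone API, and plain lists with
``statistics.stdev`` replace the *pandas* column of the example above:

```python
from statistics import stdev


class Step(str):
    """Hypothetical string-like path step (NOT the pandalone API).

    Attribute access yields a child step; a `renames` dict can
    substitute the actual key behind each step name.
    """

    def __new__(cls, name, renames=None):
        renames = renames or {}
        self = super().__new__(cls, renames.get(name, name))
        self._renames = renames
        return self

    def __getattr__(self, name):
        # Only reached for names that `str` itself does not define.
        if name.startswith("_"):
            raise AttributeError(name)
        return Step(name, self._renames)


class FakePandelone:
    """Hypothetical stand-in for the `pandelone` argument."""

    def __init__(self, renames=None):
        self._renames = renames or {}

    def paths(self):
        # One root step for inputs and one for outputs.
        return Step("", self._renames), Step("", self._renames)


def cfunc_standardize(pandelone, datatree):
    """Read column B of table A, write its standardized values as B_std."""
    pin, pon = pandelone.paths()
    table = datatree[pin.A]
    column = table[pin.A.B]
    sigma = stdev(column)  # sample standard deviation
    table[pon.A.B_std] = [value / sigma for value in column]


# Default key names.
tree = {"A": {"B": [2.0, 4.0, 6.0]}}
cfunc_standardize(FakePandelone(), tree)
print(tree["A"]["B_std"])  # [1.0, 2.0, 3.0]

# Same component, renamed keys, no change to the component code.
renamed = {"engine": {"rpm": [2.0, 4.0, 6.0]}}
cfunc_standardize(FakePandelone({"A": "engine", "B": "rpm"}), renamed)
print(renamed["engine"]["B_std"])  # [1.0, 2.0, 3.0]
```

The component only ever spells ``pin.A.B``; which keys those steps actually
resolve to is decided outside the function, which is the point of
*reconfigurable-paths*.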

Project files and folders
-------------------------
The files and folders of the project are listed below::

    +--pandalone/           ## (package) Python-code
    +--pandalone/xleash     ## (package) Python-Lassoing xl-refs
    +--tests/               ## (package) Test-cases
    +--doc/                 ## Documentation folder
    +--setup.py             ## (script) The entry point for `setuptools`, installing, testing, etc
    +--requirements/        ## (txt-files) Various pip and conda dependencies.
    +--README.rst
    +--CHANGES.rst
    +--AUTHORS.rst
    +--CONTRIBUTING.rst
    +--LICENSE.txt


Design
------
See `architecture live-document
<https://docs.google.com/document/d/1P73jgcAEzR_Vw491DQR0zogdunJOj3qh0h_lvphdaHk>`_.



.. _faq:

FAQ
===

Why another XXX? What about YYY?
---------------------------------
These are the known related Python projects:

- `OpenMDAO <http://openmdao.org/>`_:
  It has influenced pandalone's design.
  We plan to interoperate with it by converting to and from its data-types.
  But it works on python-2 only, and its architecture needs attention from
  programmers (no `setup.py`, no official test-cases).

- `PyDSTool <http://www2.gsu.edu/~matrhc/PyDSTool.htm>`_:
  It does not overlap, since it does not cover IO and dependencies of data.
  We also plan to interoperate with it (as soon as we have
  a better grasp of it :-).
  It has some issues with the documentation, but they are working on it.

- `xray <http://xray.readthedocs.org/en/stable/faq.html>`_:
  Pandas for higher dimensions; data-trees should in principle work
  with "xray".

- `Blaze <http://blaze.pydata.org>`_:
  NumPy and Pandas interface to Big Data; data-trees should in principle work
  with "blaze".

- `netCDF4 <http://unidata.github.io/netcdf4-python/>`_:
  Hierarchical file-data-format similar to `hdf5`; a data-tree may derive
  in principle from "netCDF4".

- `hdf5 <http://www.h5py.org/>`_:
  Hierarchical file-data-format, `supported natively by pandas
  <http://pandas.pydata.org/pandas-docs/version/0.15.2/io.html#io-hdf5>`_;
  a data-tree may derive in principle from "hdf5".

Which other projects/ideas have you reviewed when building this library?
------------------------------------------------------------------------
- `bubbles ETL <http://bubbles.databrewery.org/documentation.html>`_:
  Processing-pipelines for (mostly) categorical data.

- `Data-protocols <http://dataprotocols.org/>`_:

  - `JTSKit <https://github.com/okfn/jtskit-py>`_, a utility library for
    working with `JSON Table Schema <http://dataprotocols.org/json-table-schema/>`_
    in Python.
  - `Data Packages <http://dataprotocols.org/data-packages/>`_

- `Celery <http://www.celeryproject.org/>`_:
  Executes distributed asynchronous tasks using message passing on one or
  more worker servers using multiprocessing, Eventlet, or gevent.

- `Fuzzywuzzy <https://github.com/seatgeek/fuzzywuzzy>`_ and
  `Jellyfish <https://github.com/sunlightlabs/jellyfish>`_:
  Fuzzy string matching in Python. Use them for writing code that can read
  coarsely-known column-names.

- `"Other people's messy data (and how not to hate it)"
  <https://youtu.be/_eQ_8U5kruQ>`_,
  PyCon 2015 (Canada) presentation by Mali Akmanalp.


.. _glossary:

Glossary
========

data-tree
    The *container* of data consumed and produced by a :term:`model`, which
    may also contain the model itself.
    Its values are accessed using :term:`paths <path>`.
    It is implemented by :class:`pandalone.pandata.Pandel` as
    a mergeable stack of :term:`JSON-schema` abiding trees of strings and
    numbers, formed with:

    - sequences,
    - dictionaries,
    - :mod:`pandas` instances, and
    - URI-references.

value-tree
    That part of the :term:`data-tree` that relates only to the I/O data
    processed.

model
    A collection of :term:`components <component>` and accompanying
    :term:`mappings`.

component
    Encapsulates a data-transformation function, using :term:`paths <path>`
    to refer to its inputs/outputs within the :term:`value-tree`.

path
    A `/file/like` string functioning as the *id* of data-values
    in the :term:`data-tree`.
    It is composed of :term:`steps <step>`, and it follows the syntax of
    the :term:`JSON-pointer`.

step
pstep
path-step
    The parts between two consecutive slashes (`/`) within
    a :term:`path`. The :class:`Pstep` class facilitates their manipulation.

pmod
pmods
pmods-hierarchy
mapping
mappings
    Specifies a transformation of an "origin" path to
    a "destination" one (also called "from" and "to" paths).
    The mapping always transforms the *final* path-step, and it can
    either *rename* or *relocate* that step, like this::

        ORIGIN          DESTINATION   RESULT_PATH
        ------          -----------   -----------
        /rename/path    foo       --> /rename/foo        ## renaming
        /relocate/path  foo/bar   --> /relocate/foo/bar  ## relocation
        /root           a/b/c     --> /a/b/c             ## Relocates all /root sub-paths.

    The hierarchy is formed by :class:`Pmod` instances,
    which are built when parsing the :term:`mappings` list, above.

JSON-schema
    The `JSON schema <http://json-schema.org/>`_ is an `IETF draft
    <http://tools.ietf.org/html/draft-zyp-json-schema-03>`_
    that provides a *contract* for what JSON-data is required for
    a given application and how to interact with it.
    JSON Schema is intended to define validation, documentation,
    hyperlink navigation, and interaction control of JSON data.
    You can learn more about it from this `excellent guide
    <http://spacetelescope.github.io/understanding-json-schema/>`_,
    and experiment with this `on-line validator <http://www.jsonschema.net/>`_.

JSON-pointer
    JSON Pointer (:rfc:`6901`) defines a string syntax for identifying
    a specific value within a JavaScript Object Notation (JSON) document.
    It aims to serve the same purpose as *XPath* from the XML world,
    but it is much simpler.
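
The *path*, *step* and *mappings* terms above can be tied together in a short
sketch. It is only an illustration under stated assumptions: the function names
``split_steps``, ``resolve`` and ``remap`` are hypothetical and do not belong to
pandalone, and the rule that the longest matching origin wins is an assumption
of this sketch, not documented behaviour. Step splitting follows the escaping
rules of RFC 6901 (``~1`` for ``/``, ``~0`` for ``~``):

```python
def split_steps(path):
    """Split a '/file/like' path into its unescaped steps (RFC 6901)."""
    if path == "":
        return []  # the empty pointer addresses the whole tree
    if not path.startswith("/"):
        raise ValueError("path must be empty or start with '/': %r" % path)
    # Unescape '~1' before '~0', as RFC 6901 requires.
    return [s.replace("~1", "/").replace("~0", "~")
            for s in path.split("/")[1:]]


def resolve(tree, path):
    """Walk nested dicts/lists following the steps of `path`."""
    node = tree
    for step in split_steps(path):
        if isinstance(node, list):
            node = node[int(step)]
        else:
            node = node[step]
    return node


def remap(path, mappings):
    """Replace the final step of the matching origin with its destination.

    Assumption of this sketch: the longest origin that is a step-wise
    prefix of `path` is the one applied; sub-paths follow their parent.
    """
    best = None
    for origin, destination in mappings.items():
        if path == origin or path.startswith(origin + "/"):
            if best is None or len(origin) > len(best[0]):
                best = (origin, destination)
    if best is None:
        return path
    origin, destination = best
    parent = origin.rsplit("/", 1)[0]  # origin without its final step
    return parent + "/" + destination + path[len(origin):]


# The three mappings of the table above.
mappings = {
    "/rename/path": "foo",
    "/relocate/path": "foo/bar",
    "/root": "a/b/c",
}
print(remap("/rename/path", mappings))    # /rename/foo
print(remap("/relocate/path", mappings))  # /relocate/foo/bar
print(remap("/root/x", mappings))         # /a/b/c/x

tree = {"a": {"b": {"c": {"x": 42}}}}
print(resolve(tree, remap("/root/x", mappings)))  # 42
```

A component written against ``/root/x`` thus keeps working on a tree where
that value lives under ``/a/b/c/x``.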



.. _begin-replacements:

.. |virtualenv| replace:: *virtualenv* (isolated Python environment)
.. _virtualenv: http://docs.python-guide.org/en/latest/dev/virtualenvs/

.. |pypi| replace:: *PyPi* repo
.. _pypi: https://pypi.python.org/pypi/pandalone

.. |winpython| replace:: *WinPython*
.. _winpython: http://winpython.github.io/

.. |anaconda| replace:: *Anaconda*
.. _anaconda: http://docs.continuum.io/anaconda/

.. |travis-status| image:: https://travis-ci.org/pandalone/pandalone.svg
    :alt: Travis build status
    :scale: 100%
    :target: https://travis-ci.org/pandalone/pandalone

.. |appveyor-status| image:: https://ci.appveyor.com/api/projects/status/jayah84y3ae7ddfc?svg=true
    :alt: AppVeyor build status
    :scale: 100%
    :target: https://ci.appveyor.com/project/ankostis/pandalone

.. |cover-status| image:: https://coveralls.io/repos/pandalone/pandalone/badge.svg
    :target: https://coveralls.io/r/pandalone/pandalone

.. |docs-status| image:: https://readthedocs.org/projects/pandalone/badge/
    :alt: Documentation status
    :scale: 100%
    :target: https://readthedocs.org/builds/pandalone/

.. |pypi-ver| image:: https://img.shields.io/pypi/v/pandalone.svg
    :target: https://pypi.python.org/pypi/pandalone/
    :alt: Latest Version in PyPI

.. |python-ver| image:: https://img.shields.io/pypi/pyversions/pandalone.svg
    :target: https://pypi.python.org/pypi/pandalone/
    :alt: Supported Python versions

.. |downloads-count| image:: https://img.shields.io/pypi/dm/pandalone.svg?period=month
    :target: https://pypi.python.org/pypi/pandalone/
    :alt: Downloads

.. |github-issues| image:: https://img.shields.io/github/issues/pandalone/pandalone.svg
    :target: https://github.com/pandalone/pandalone/issues
    :alt: Issues count

.. |proj-license| image:: https://img.shields.io/badge/license-EUPL%201.1%2B-blue.svg
    :target: https://raw.githubusercontent.com/pandalone/pandalone/master/LICENSE.txt
    :alt: Project License

.. |dependencies| image:: https://img.shields.io/requires/github/pandalone/pandalone.svg
    :alt: Dependencies up-to-date?
