tafra

Tafra: innards of a dataframe

Project description

The tafra began life as a thought experiment: how could we reduce the idea of a dataframe (as expressed in libraries like pandas or languages like R) to its useful essence, while carving away the cruft? The original proof of concept stopped at “group by”.

This library expands on the proof of concept to produce a practically useful tafra, which we hope you may find to be a helpful lightweight substitute for certain uses of pandas.

A tafra is, more-or-less, a set of named columns or dimensions. Each of these is a typed numpy array of consistent length, representing the values for each column by rows.
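As a rough mental model (not the library's actual implementation), named, typed columns of equal length can be sketched with a plain dict of numpy arrays. The `check_columns` helper below is hypothetical and exists only for illustration.

```python
import numpy as np

def check_columns(columns):
    """Hypothetical helper: verify all columns share one length and return it."""
    lengths = {name: len(values) for name, values in columns.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"columns differ in length: {lengths}")
    # Every column has the same length, so take the first (0 if there are none)
    return next(iter(lengths.values()), 0)

columns = {
    'x': np.array([1, 2, 3, 4]),
    'y': np.array(['one', 'two', 'one', 'two'], dtype=object),
}

rows = check_columns(columns)
# dtype.kind is a one-letter code: 'i' for signed integers, 'O' for objects
dtypes = {name: values.dtype.kind for name, values in columns.items()}
```

Each row is then the tuple of values found at the same index in every column.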

The library provides lightweight syntax for manipulating rows and columns, support for managing data types, iterators for rows and sub-frames, pandas-like “transform” support, conversion from pandas DataFrames, and SQL-style “group by” and join operations.

Tafra	Tafra
Aggregations	Union, GroupBy, Transform, IterateBy, InnerJoin, LeftJoin, CrossJoin
Aggregation Helpers	union, union_inplace, group_by, transform, iterate_by, inner_join, left_join, cross_join
Constructors	as_tafra, from_dataframe, from_series
Destructors	to_records, to_list, to_array
Properties	rows, columns, data, dtypes, size, ndim, shape
Iter Methods	iterrows, itertuples, itercols
Dict-like Methods	keys, values, items, get, update, update_inplace, update_dtypes, update_dtypes_inplace
Other Helper Methods	rename, rename_inplace, coalesce, coalesce_inplace, _coalesce_dtypes, delete, delete_inplace
Printer Methods	pprint, pformat, to_html

Getting Started

Install the library with pip:

pip install tafra

A short example

>>> import numpy as np
>>> from tafra import Tafra

>>> t = Tafra({
...    'x': np.array([1, 2, 3, 4]),
...    'y': np.array(['one', 'two', 'one', 'two'], dtype='object'),
... })

>>> t.pformat()
Tafra(data = {
 'x': array([1, 2, 3, 4]),
 'y': array(['one', 'two', 'one', 'two'])},
dtypes = {
 'x': 'int', 'y': 'object'},
rows = 4)

>>> print('List:', '\n', t.to_list())
List:
 [array([1, 2, 3, 4]), array(['one', 'two', 'one', 'two'], dtype=object)]

>>> print('Records:', '\n', tuple(t.to_records()))
Records:
 ((1, 'one'), (2, 'two'), (3, 'one'), (4, 'two'))

>>> gb = t.group_by(
...     ['y'], {'x': sum}
... )

>>> print('Group By:', '\n', gb.pformat())
Group By:
Tafra(data = {
 'x': array([4, 6]), 'y': array(['one', 'two'])},
dtypes = {
 'x': 'int', 'y': 'object'},
rows = 2)
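To make the aggregation above concrete, here is a minimal numpy-only sketch of what a "group by y, sum x" computes. It is an illustration of the idea under the same sample data, not a description of how `group_by` is implemented inside tafra.

```python
import numpy as np

x = np.array([1, 2, 3, 4])
y = np.array(['one', 'two', 'one', 'two'], dtype=object)

# np.unique returns the sorted distinct keys and, for each row,
# the position of that row's key within the sorted keys
keys, inverse = np.unique(y, return_inverse=True)

# Sum the x values of the rows that belong to each key
sums = np.array([x[inverse == i].sum() for i in range(len(keys))])

print(dict(zip(keys, sums.tolist())))
```

The rows with `y == 'one'` hold the `x` values 1 and 3, and the rows with `y == 'two'` hold 2 and 4, which gives the sums 4 and 6 shown in the output above.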

Flexibility

Have some code that works with pandas, or just a way of doing things that you prefer? tafra is flexible:

>>> import pandas as pd

>>> df = pd.DataFrame(np.c_[
...     np.array([1, 2, 3, 4]),
...     np.array(['one', 'two', 'one', 'two'])
... ], columns=['x', 'y'])

>>> t = Tafra.from_dataframe(df)

And going back is just as simple:

>>> df = pd.DataFrame(t.data)

Timings

In this case, lightweight also means performant. Beyond any additional features added to the library, tafra should provide the necessary base for organizing data structures for numerical processing. One of the most important aspects is fast access to the data itself. By minimizing the abstraction around the underlying numpy arrays, tafra makes direct access to a column more than an order of magnitude faster than the equivalent pandas operation in the timings below.

Important note: If you assign directly to the Tafra.data or Tafra._data attributes, you must call Tafra._coalesce_dtypes afterwards to ensure the typing is consistent.

Construct a Tafra and a DataFrame:

>>> t = Tafra({
...    'x': np.array([1, 2, 3, 4]),
...    'y': np.array(['one', 'two', 'one', 'two'], dtype='object'),
... })

>>> df = pd.DataFrame(t.data)

Read Operations

Direct access:

>>> %timeit x = t._data['x']
55.3 ns ± 5.64 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Indirect with some penalty to support Tafra slicing and numpy’s advanced indexing:

>>> %timeit x = t['x']
219 ns ± 71.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

pandas timing:

>>> %timeit x = df['x']
1.55 µs ± 105 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

As fast as pandas gets:

>>> where_col = list(df.columns).index('x')
>>> %timeit x = df.values[:, where_col]
48 µs ± 7.77 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

Assignment Operations

Direct access:

>>> x = np.arange(4)

>>> %timeit t._data['x'] = x
65 ns ± 5.55 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

Indirect:

>>> %timeit t['x'] = x
7.39 µs ± 950 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

pandas timing:

>>> %timeit df['x'] = x
47.8 µs ± 3.53 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
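The `%timeit` magic used above is only available in IPython. A rough equivalent for a plain Python script, using the standard-library `timeit` module, might look like the sketch below. To keep it dependency-free, the `Wrapper` class is a hypothetical stand-in for any container that adds a `__getitem__` layer over a dict. It is not the Tafra class, and absolute numbers will vary by machine.

```python
import timeit

class Wrapper:
    """Hypothetical stand-in for a container that wraps a dict of columns."""

    def __init__(self, data):
        self._data = data

    def __getitem__(self, key):
        # One extra method call on top of the underlying dict lookup
        return self._data[key]

w = Wrapper({'x': [1, 2, 3, 4]})

n = 100_000
# timeit.timeit returns the total seconds taken by n calls
direct = timeit.timeit(lambda: w._data['x'], number=n)
indirect = timeit.timeit(lambda: w['x'], number=n)

# Report the average time per call in nanoseconds
print(f"direct:   {direct / n * 1e9:.1f} ns per loop")
print(f"indirect: {indirect / n * 1e9:.1f} ns per loop")
```

To reproduce the measurements in this section, the same pattern should work with a `Tafra` or `DataFrame` in place of `Wrapper`.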

Version History

1.0.1

Add iter functions
Add map functions
Various constructor improvements

1.0.0

Initial Release

Release history

1.0.10

Nov 22, 2022

1.0.9

Oct 5, 2020

1.0.8

Oct 2, 2020

1.0.7

Jul 9, 2020

1.0.6

Jun 17, 2020

1.0.5

Jun 13, 2020

1.0.4

Jun 12, 2020

1.0.3

Jun 7, 2020

1.0.2

Jun 4, 2020

1.0.1

Jun 2, 2020

1.0.0

Jun 1, 2020

Download files

Source Distribution

tafra-1.0.1.tar.gz (29.8 kB)

Uploaded Jun 2, 2020 Source

Built Distribution

tafra-1.0.1-py3-none-any.whl (33.5 kB)

Uploaded Jun 2, 2020 Python 3

Hashes for tafra-1.0.1.tar.gz

Algorithm	Hash digest
SHA256	`4baada568127080ff2f322e655f1530db872bde8b8017e23524735eda5d89ecc`
MD5	`2922060737661a89f3055c562b725a93`
BLAKE2b-256	`2f8976afde964798be6b0a8b4571a62573a57c3b51ec3eaa70dda0ebd9e6f289`

Hashes for tafra-1.0.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`58b2ef0e0c190845bf2e06e27b2e30b3ca919927deecb54322679dac60f6e7a8`
MD5	`ba563c59f8bffff17ab9c2b8199c4de2`
BLAKE2b-256	`7fce752163f60c519f6100abb2f0e54470b724d143a3efd8477c17732baaa214`