pandagg

A Python package that makes Elasticsearch aggregations easy, inspired by the pandas library.

Project description

What is it?

pandagg is a Python package providing a simple interface to manipulate Elasticsearch queries and aggregations.

Disclaimer: this is a pre-release version.

Features

flexible declaration of aggregations and search queries
query validation based on the provided mapping
parsing of aggregation results into handy formats: tree with interactive navigation, csv-like tabular breakdown, and others
interactive navigation of mappings
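
To make the "csv-like tabular breakdown" idea concrete, here is a minimal plain-Python sketch that flattens a nested bucket aggregation response into rows. It does not use pandagg: the `flatten_buckets` helper and the sample response are hypothetical, and the sketch assumes list-style buckets (as returned by a terms aggregation) with single-value metrics.

```python
def flatten_buckets(aggs, path=()):
    """Yield one flat dict per leaf bucket of a nested bucket-aggregation response.

    Hypothetical helper for illustration only, not part of pandagg.
    """
    for name, body in aggs.items():
        if not isinstance(body, dict) or "buckets" not in body:
            continue  # metrics are read from inside each bucket below
        for bucket in body["buckets"]:
            row_path = path + ((name, bucket["key"]),)
            # sub-aggregations are the dict-valued entries of a bucket
            children = {k: v for k, v in bucket.items() if isinstance(v, dict)}
            nested = {k: v for k, v in children.items() if "buckets" in v}
            if nested:
                # go one level deeper; metrics at this level are ignored here
                yield from flatten_buckets(nested, row_path)
            else:
                row = dict(row_path)
                row["doc_count"] = bucket["doc_count"]
                # single-value metrics (avg, sum, ...) expose a "value" key
                for k, v in children.items():
                    if "value" in v:
                        row[k] = v["value"]
                yield row


# made-up response, shaped like the "aggregations" part of a search response
response_aggs = {
    "genres": {"buckets": [
        {"key": "Action", "doc_count": 3, "by_gender": {"buckets": [
            {"key": "F", "doc_count": 1, "avg_rank": {"value": 7.5}},
            {"key": "M", "doc_count": 2, "avg_rank": {"value": 6.0}},
        ]}},
        {"key": "Thriller", "doc_count": 1, "by_gender": {"buckets": [
            {"key": "F", "doc_count": 1, "avg_rank": {"value": 8.0}},
        ]}},
    ]}
}

rows = list(flatten_buckets(response_aggs))
for row in rows:
    print(row)
```

Each row carries one key per grouping level plus the metrics of the deepest bucket, which is the shape a tabular export or a pandas DataFrame would expect.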

Usage

Documentation

Full documentation and a user guide are available on Read the Docs.

Quick sneak peek

Elasticsearch dict syntax

>>> from pandagg.query import Query

>>> expected_query = {'bool': {'must': [
    {'terms': {'genres': ['Action', 'Thriller']}},
    {'range': {'rank': {'gte': 7}}},
    {'nested': {
        'path': 'roles',
        'query': {'bool': {'must': [
            {'term': {'roles.gender': {'value': 'F'}}},
            {'term': {'roles.role': {'value': 'Reporter'}}}]}
         }
    }}
]}}
>>> q = Query(expected_query)
>>> q
<Query>
bool
└── must
    ├── nested
    │   ├── path="roles"
    │   └── query
    │       └── bool
    │           └── must
    │               ├── term, field=roles.gender, value="F"
    │               └── term, field=roles.role, value="Reporter"
    ├── range, field=rank, gte=7
    └── terms, field=genres, values=['Action', 'Thriller']

DSL syntax

>>> from pandagg.query import Nested, Bool, Query, Range, Term, Terms
>>> q = Query(
    Bool(must=[
        Terms('genres', terms=['Action', 'Thriller']),
        Range('rank', gte=7),
        Nested(
            path='roles', 
            query=Bool(must=[
                Term('roles.gender', value='F'),
                Term('roles.role', value='Reporter')
            ])
        )
    ])
)

# serialized query is computed by `query_dict` method
>>> q.query_dict() == expected_query
True

Chained syntax

>>> from pandagg.query import Query, Range, Term

>>> q = Query()\
    .query({'terms': {'genres': ['Action', 'Thriller']}})\
    .nested(path='roles', _name='nested_roles', query=Term('roles.gender', value='F'))\
    .query(Range('rank', gte=7))\
    .query(Term('roles.role', value='Reporter'), parent='nested_roles')

>>> q
<Query>
bool
└── must
    ├── nested
    │   ├── path="roles"
    │   └── query
    │       └── bool
    │           └── must
    │               ├── term, field=roles.gender, value="F"
    │               └── term, field=roles.role, value="Reporter"
    ├── range, field=rank, gte=7
    └── terms, field=genres, values=['Action', 'Thriller']

Notes:

both the DSL and dict syntaxes are accepted by the Query compound clause methods (query, nested, must, etc.).
the last query call uses the _name of the nested clause, passed as parent, to determine where the new clause should be inserted.
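
The effect of inserting a clause under a named clause can be illustrated on plain dicts. The following is only a sketch of the idea, not pandagg's implementation: the `find_named` and `add_under` helpers are hypothetical, and the sketch assumes the named clause holds its sub-query under a `query` key, as a nested clause does.

```python
def find_named(clause, name):
    """Return the body of the first clause whose `_name` equals `name`, else None."""
    if isinstance(clause, dict):
        for body in clause.values():
            if isinstance(body, dict) and body.get("_name") == name:
                return body
            found = find_named(body, name)
            if found is not None:
                return found
    elif isinstance(clause, list):
        for item in clause:
            found = find_named(item, name)
            if found is not None:
                return found
    return None


def add_under(query, name, new_clause):
    """Add `new_clause` to the sub-query of the clause named `name` (in place)."""
    body = find_named(query, name)
    if body is None:
        raise KeyError(name)
    existing = body["query"]
    if "bool" in existing:
        existing["bool"].setdefault("must", []).append(new_clause)
    else:
        # wrap the single existing clause and the new one in a bool/must
        body["query"] = {"bool": {"must": [existing, new_clause]}}


query = {"bool": {"must": [
    {"terms": {"genres": ["Action", "Thriller"]}},
    {"nested": {
        "path": "roles",
        "_name": "nested_roles",
        "query": {"term": {"roles.gender": {"value": "F"}}},
    }},
]}}

add_under(query, "nested_roles", {"term": {"roles.role": {"value": "Reporter"}}})
nested_query = query["bool"]["must"][1]["nested"]["query"]
print(nested_query)
```

After the call, the nested clause holds a bool/must with both term clauses, mirroring the tree shown in the chained example.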

Installation

pip install pandagg

Dependencies

Hard dependency: treelib (1.6.1 or higher)

Soft dependency: pandas, needed only to parse aggregation results as a tabular dataframe

Motivations

pandagg focuses only on read operations (queries and aggregations). A high-level Python client, elasticsearch-dsl, already exists for Elasticsearch, but despite its many qualities, its API is not always convenient when dealing with deeply nested queries and aggregations.

The fundamental difference between those libraries is how they deal with the tree structure of aggregation queries and their responses.

Suppose we have the following aggregation structure (the types of the aggregations don't matter). Let's call A, B, C and D our aggregation nodes, and the whole structure our tree.

A           (Terms agg)
└── B       (Filters agg)
    ├── C   (Avg agg)
    └── D   (Sum agg)

The question is: who is in charge of storing the tree structure (how nodes are connected)?

In the elasticsearch-dsl library, each aggregation node is responsible for knowing its direct children.

In pandagg, nodes know nothing about their parents or children; a tree object is in charge of storing this structure. It is thus possible to add, update or remove aggregation nodes or sub-trees at specific locations of the initial tree, which allows more flexible ways to build your queries.
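
The design choice can be sketched in a few lines of plain Python. This is an illustration under simplified assumptions, not pandagg's or treelib's actual code: the `Node` and `Tree` classes below are hypothetical, and nodes are identified by their name.

```python
class Node:
    """An aggregation node that knows nothing about its parent or children."""

    def __init__(self, name, agg_type):
        self.name = name
        self.agg_type = agg_type


class Tree:
    """Stores how nodes are connected, so the nodes themselves stay agnostic."""

    def __init__(self):
        self.nodes = {}     # name -> Node
        self.children = {}  # name -> list of child names
        self.parent = {}    # name -> parent name (None for the root)

    def insert(self, node, parent=None):
        self.nodes[node.name] = node
        self.children[node.name] = []
        self.parent[node.name] = parent
        if parent is not None:
            self.children[parent].append(node.name)

    def remove(self, name):
        """Remove a node together with its whole sub-tree."""
        for child in list(self.children[name]):
            self.remove(child)
        parent = self.parent.pop(name)
        if parent is not None:
            self.children[parent].remove(name)
        del self.children[name]
        del self.nodes[name]

    def show(self, name, depth=0):
        """Return the sub-tree under `name` as indented lines."""
        lines = ["    " * depth + "%s (%s)" % (name, self.nodes[name].agg_type)]
        for child in self.children[name]:
            lines.extend(self.show(child, depth + 1))
        return lines


# build the A/B/C/D structure from the text, then edit it in place
tree = Tree()
tree.insert(Node("A", "terms"))
tree.insert(Node("B", "filters"), parent="A")
tree.insert(Node("C", "avg"), parent="B")
tree.insert(Node("D", "sum"), parent="B")

tree.remove("C")                               # drop one node
tree.insert(Node("E", "max"), parent="B")      # add another at a chosen location
print("\n".join(tree.show("A")))
```

Because only the tree holds the connections, removing C or attaching E under B never touches the other node objects.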

Contributing

All contributions, bug reports, bug fixes, documentation improvements, enhancements and ideas are welcome.

Roadmap

choose simple example to showcase pandagg in readme
write sphinx documentation
implement CI workflow: python2/3 tests, coverage
nested fields: automatic handling and validation in Query instances
Query.query, Agg.agg, Agg.groupby methods: allow passing of tree instance, in addition to current dict and node syntaxes
documentation: explain challenges induced by nested nodes syntaxes, for instance why nested query clauses are saved in the children attribute before tree deserialization
extend test coverage on named queries serialization
evaluate the interest and tradeoffs of using metaclasses, as the elasticsearch-dsl library does, to declare node classes
on aggregation nodes, ensure all allowed fields are listed
on aggregation response tree, use Query DSL to compute bucket filters
package versions for different ElasticSearch versions
remove Bucket nodes knowledge of their depth once this treelib issue is resolved

Release history

0.2.4: Mar 8, 2022
0.2.3: Sep 7, 2021
0.2.2: Aug 30, 2021
0.2.1: Jul 6, 2021
0.2.0: Feb 15, 2021
0.1.4: Jul 20, 2020
0.1.3: Jul 9, 2020
0.1.2: Jun 29, 2020
0.1.1: Jun 22, 2020
0.1.0: Jun 21, 2020
0.0.9: Jun 11, 2020
0.0.8: Jun 9, 2020
0.0.7: May 26, 2020
0.0.6: May 25, 2020
0.0.5: May 25, 2020
0.0.4: May 11, 2020
0.0.3: May 10, 2020
0.0.2: Apr 13, 2020
0.0.1: Mar 2, 2020 (this version)

Download files

Source Distribution

pandagg-0.0.1.tar.gz (68.2 kB)

Uploaded Mar 2, 2020 Source

Built Distribution

pandagg-0.0.1-py2-none-any.whl (139.9 kB)

Uploaded Mar 2, 2020 Python 2

Hashes for pandagg-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`1c4c2d761842238be6f48c2216989d5b49bbb4a9ec442f351fd13d0290992d0f`
MD5	`adb4aa04ad12a7e01681cb502d599c45`
BLAKE2b-256	`adb3627e2a6863c946e7ca2410f82e58489b9ac82aad083f939896b73fe52b24`

Hashes for pandagg-0.0.1-py2-none-any.whl
Algorithm	Hash digest
SHA256	`256a0436568af02cc1a6ed444bc9346c2e3a5b4c7977e55b4558bc22a95604b6`
MD5	`37899bc6b23755697b4cca53c8bc2693`
BLAKE2b-256	`1311aa2bb303cc8ecc6f7f22e48444135130258c2899156456cd297b27f34522`