Project description

data-tools(et)

data-toolset simplifies data processing tasks by providing a more user-friendly alternative to traditional JAR utilities such as avro-tools and parquet-tools. This Python package handles several data file formats, including Avro and Parquet, through a simple command-line interface.

Installation

Python 3.8, 3.9, and 3.10 are supported and tested (to some extent).

$ python -m pip install --user data-toolset

Usage

$ data-toolset -h
usage: data-toolset [-h] {head,tail,meta,schema,stats,query,validate,merge,count,to_json,to_csv,to_avro,to_parquet,random_sample} ...

positional arguments:
  {head,tail,meta,schema,stats,query,validate,merge,count,to_json,to_csv,to_avro,to_parquet,random_sample}
                        commands
    head                Print the first N records from a file
    tail                Print the last N records from a file
    meta                Print a file's metadata
    schema              Print the Avro schema for a file
    stats               Print statistics about a file
    query               Query a file
    validate            Validate a file
    merge               Merge multiple files into one
    count               Count the number of records in a file
    to_json             Convert a file to JSON format
    to_csv              Convert a file to CSV format
    to_avro             Convert a file to Avro format
    to_parquet          Convert a file to Parquet format
    random_sample       Randomly sample records from a file
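
For readers curious how a subcommand interface like the one above is typically wired up, here is a minimal sketch using Python's standard argparse module. It is not data-toolset's actual source: the `files` argument name and the default of 20 for `-n` are assumptions made for illustration; only the subcommand names, their help strings, and the `-n` flag on `head` come from this page.

```python
import argparse

# Subcommand names and help strings copied from the `data-toolset -h` output.
COMMANDS = {
    "head": "Print the first N records from a file",
    "tail": "Print the last N records from a file",
    "meta": "Print a file's metadata",
    "schema": "Print the Avro schema for a file",
    "stats": "Print statistics about a file",
    "query": "Query a file",
    "validate": "Validate a file",
    "merge": "Merge multiple files into one",
    "count": "Count the number of records in a file",
    "to_json": "Convert a file to JSON format",
    "to_csv": "Convert a file to CSV format",
    "to_avro": "Convert a file to Avro format",
    "to_parquet": "Convert a file to Parquet format",
    "random_sample": "Randomly sample records from a file",
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with one sub-parser per command (illustrative only)."""
    parser = argparse.ArgumentParser(prog="data-toolset")
    subparsers = parser.add_subparsers(dest="command", help="commands")
    for name, description in COMMANDS.items():
        sub = subparsers.add_parser(name, help=description)
        # Assumption: every command takes one or more file paths.
        sub.add_argument("files", nargs="+", help="file path(s)")
        if name in ("head", "tail"):
            # Assumption: the default record count is hypothetical.
            sub.add_argument("-n", type=int, default=20, help="number of records")
    return parser


args = build_parser().parse_args(["head", "my_data.parquet", "-n", "10"])
print(args.command, args.files, args.n)
```

Registering the commands from a single dictionary keeps the help text and the dispatch table in one place, which is one common way to keep a growing CLI consistent.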

Examples

Print up to the first 10 records of a Parquet file (the sample file here holds a single record, hence the shape (1, 7)):

$ data-toolset head my_data.parquet -n 10
shape: (1, 7)
┌───────────┬─────┬──────────┬────────┬──────────────────────────┬────────────────────────────┬──────────────────┐
│ character ┆ age ┆ is_human ┆ height ┆ quote                    ┆ friends                    ┆ appearance       │
│ ---       ┆ --- ┆ ---      ┆ ---    ┆ ---                      ┆ ---                        ┆ ---              │
│ str       ┆ i64 ┆ bool     ┆ f64    ┆ str                      ┆ list[str]                  ┆ struct[2]        │
╞═══════════╪═════╪══════════╪════════╪══════════════════════════╪════════════════════════════╪══════════════════╡
│ Alice     ┆ 10  ┆ true     ┆ 150.5  ┆ Curiouser and curiouser! ┆ ["Rabbit", "Cheshire Cat"] ┆ {"blue","small"} │
└───────────┴─────┴──────────┴────────┴──────────────────────────┴────────────────────────────┴──────────────────┘

Query a Parquet file using a SQL-like expression:

$ data-toolset query my_data.parquet "SELECT * FROM 'my_data.parquet' WHERE height > 165"
shape: (2, 7)
┌─────────────────┬─────┬──────────┬────────┬───────────────────────┬────────────────────────────────────┬───────────────────┐
│ character       ┆ age ┆ is_human ┆ height ┆ quote                 ┆ friends                            ┆ appearance        │
│ ---             ┆ --- ┆ ---      ┆ ---    ┆ ---                   ┆ ---                                ┆ ---               │
│ str             ┆ i64 ┆ bool     ┆ f64    ┆ str                   ┆ list[str]                          ┆ struct[2]         │
╞═════════════════╪═════╪══════════╪════════╪═══════════════════════╪════════════════════════════════════╪═══════════════════╡
│ Mad Hatter      ┆ 35  ┆ true     ┆ 175.2  ┆ I'm late!             ┆ ["Alice"]                          ┆ {"green","tall"}  │
│ Queen of Hearts ┆ 50  ┆ false    ┆ 165.8  ┆ Off with their heads! ┆ ["White Rabbit", "King of Hearts"] ┆ {"red","average"} │
└─────────────────┴─────┴──────────┴────────┴───────────────────────┴────────────────────────────────────┴───────────────────┘
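
When the query command is driven from a script, the argument list can be assembled programmatically. The sketch below mirrors the example above, where the file path appears both as a positional argument and, quoted, in the FROM clause. The helper name `build_query_command` is hypothetical, and the sketch assumes the path contains no single quotes.

```python
from typing import List


def build_query_command(path: str, where: str) -> List[str]:
    """Return an argv list for `data-toolset query`, following the example."""
    # The example refers to the file by its quoted path in the FROM clause.
    sql = f"SELECT * FROM '{path}' WHERE {where}"
    return ["data-toolset", "query", path, sql]


cmd = build_query_command("my_data.parquet", "height > 165")
print(cmd)
```

Passing the result as a list to `subprocess.run` avoids shell quoting problems with the embedded SQL string.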

Merge multiple Avro files into one:

$ data-toolset merge file1.avro file2.avro file3.avro merged_file.avro

Convert an Avro file to Parquet:

$ data-toolset to_parquet my_data.avro output.parquet

Convert a Parquet file to JSON:

$ data-toolset to_json my_data.parquet output.json
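
The conversion commands above can also be called from Python. The following is a sketch under stated assumptions: it picks the subcommand from the output file's extension and assumes that to_csv and to_avro take the same input-then-output argument order as the to_parquet and to_json examples. The helper names `conversion_command` and `convert` are hypothetical and not part of the package.

```python
import shutil
import subprocess
from pathlib import Path
from typing import List

# Output extension -> conversion subcommand, taken from the help text.
CONVERTERS = {
    ".json": "to_json",
    ".csv": "to_csv",
    ".avro": "to_avro",
    ".parquet": "to_parquet",
}


def conversion_command(source: str, target: str) -> List[str]:
    """Build the argv list for converting `source` into `target`."""
    suffix = Path(target).suffix.lower()
    if suffix not in CONVERTERS:
        raise ValueError(f"no converter for extension {suffix!r}")
    # Assumption: all converters take <input> <output>, as in the examples.
    return ["data-toolset", CONVERTERS[suffix], source, target]


def convert(source: str, target: str) -> None:
    """Run the conversion, failing early if the CLI is not installed."""
    if shutil.which("data-toolset") is None:
        raise RuntimeError("data-toolset is not on PATH")
    subprocess.run(conversion_command(source, target), check=True)


print(conversion_command("my_data.avro", "output.parquet"))
```

Separating command construction from execution keeps the mapping easy to check without the tool installed.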

Contributing

Contributions are welcome! If you have any suggestions, bug reports, or feature requests, please open an issue on GitHub.

TODO

optimizations [TBD]
benchmarking

Project details

Release history

0.1.7 (this version): Oct 25, 2023
0.1.6: Oct 25, 2023
0.1.5: Oct 18, 2023
0.1.4: Sep 21, 2023
0.1.2: Sep 15, 2023
0.1.1: Sep 14, 2023
0.1.0: Sep 14, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

data_toolset-0.1.7.tar.gz (11.8 kB), uploaded Oct 25, 2023 (Source)

Built Distribution

data_toolset-0.1.7-py3-none-any.whl (13.9 kB), uploaded Oct 25, 2023 (Python 3)

Hashes for data_toolset-0.1.7.tar.gz
Algorithm	Hash digest
SHA256	`c4d6508c5ae85687612630f0a17f58dc84e8ab89091d8e344571a66ea6d80d57`
MD5	`ffe4fd05188b531039c4408a3ea91c50`
BLAKE2b-256	`b35123257607436ca8de19e1c23341cd7a970b75bc61b862ff5f03e0dc5697f2`

Hashes for data_toolset-0.1.7-py3-none-any.whl
Algorithm	Hash digest
SHA256	`aa9ff68869323a6a258b12b717bef3229f414032026624ad97b90d25537f7f03`
MD5	`a7c1dcef040bba1e61828cd2e7dd8e3d`
BLAKE2b-256	`07ddc77831d0b11daaeb4e528ac8d993d7567a3a953a343a0fd8ade2092cff16`