data-toolset
data-toolset simplifies data processing tasks by providing a more user-friendly alternative to traditional JAR utilities such as avro-tools and parquet-tools. With this Python package, you can handle data file formats such as Avro and Parquet through a simple command-line interface.
Installation
Python 3.9 and 3.10 are supported. Both are tested, though test coverage is still partial.
pip install poetry
pip install git+https://github.com/luminousmen/data-toolset.git
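Before installing, it can help to confirm that the interpreter is one of the versions listed above. The following is a minimal sketch of such a check. The names `SUPPORTED_VERSIONS` and `is_supported` are hypothetical and are not part of data-toolset.

```python
import sys

# (major, minor) pairs this README lists as supported and tested.
SUPPORTED_VERSIONS = {(3, 9), (3, 10)}


def is_supported(version_info=sys.version_info):
    """Return True if the interpreter's (major, minor) pair is listed above."""
    return tuple(version_info[:2]) in SUPPORTED_VERSIONS


if __name__ == "__main__":
    if not is_supported():
        found = ".".join(str(part) for part in sys.version_info[:2])
        print(f"Warning: Python {found} is not listed as supported by data-toolset")
```

Other versions may still work. The check only reflects what this README states.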
Usage
$ data-toolset -h
usage: data-toolset [-h] {head,tail,meta,schema,stats,query,validate,merge,count} ...

positional arguments:
  {head,tail,meta,schema,stats,query,validate,merge,count}
                        commands
    head                Print the first N records from a file
    tail                Print the last N records from a file
    meta                Print a file's metadata
    schema              Print the Avro schema for a file
    stats               Print statistics about a file
    query               Query a file
    validate            Validate a file
    merge               Merge multiple files into one
    count               Count the number of records in a file

optional arguments:
  -h, --help            show this help message and exit
Examples
Print the first 10 records of a Parquet file:
data-toolset head my_data.parquet -n 10
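The same command can be driven from a Python script. The sketch below assumes that `data-toolset` is installed and on `PATH`, and that `head` accepts the `-n` option exactly as in the example above. The helpers `head_command` and `run_head` are hypothetical wrappers, not functions the package exposes.

```python
import shutil
import subprocess


def head_command(path, n=10):
    """Build the argument list for `data-toolset head <path> -n <n>`."""
    return ["data-toolset", "head", str(path), "-n", str(n)]


def run_head(path, n=10):
    """Run the CLI and return whatever it writes to stdout, as text.

    Raises FileNotFoundError if the CLI is not on PATH, and
    subprocess.CalledProcessError if it exits with a non-zero status.
    """
    if shutil.which("data-toolset") is None:
        raise FileNotFoundError("data-toolset is not on PATH")
    result = subprocess.run(
        head_command(path, n),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout
```

Passing the arguments as a list avoids shell quoting issues with file names that contain spaces.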
Query a Parquet file using a SQL-like expression:
data-toolset query my_data.parquet "SELECT * FROM 'my_data.parquet' WHERE age > 25"
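The query string mixes double and single quotes, which is easy to get wrong when a script assembles the command. The sketch below builds the same argument list in Python and uses the standard library's `shlex.join` to print a shell-safe form. The helper `query_command` is hypothetical, and the only query shape it produces is the one in the example above. Other SQL forms are not verified here.

```python
import shlex


def query_command(path, where):
    """Build argv for a query of the same shape as the README example.

    Assumption: the file path is repeated inside the SQL text in single
    quotes, as the example does. A path that itself contains a single
    quote would break that SQL text and is not handled here.
    """
    sql = f"SELECT * FROM '{path}' WHERE {where}"
    return ["data-toolset", "query", str(path), sql]


argv = query_command("my_data.parquet", "age > 25")

# shlex.join quotes each argument so the line can be pasted into a POSIX shell.
print(shlex.join(argv))
```

When calling the tool with `subprocess.run`, pass `argv` directly. The quoted string is only needed when the command is typed or logged.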
Merge multiple Avro files into one:
data-toolset merge file1.avro file2.avro file3.avro merged_file.avro
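In the example above the output file appears to be the last positional argument, after the inputs. The sketch below relies on that reading, which is an assumption drawn from the example rather than from the tool's documentation. The helper `merge_command` and its two sanity checks are hypothetical.

```python
def merge_command(inputs, output):
    """Build argv for `data-toolset merge <inputs...> <output>`.

    Assumption: the last positional argument is the output file,
    as the README example suggests.
    """
    inputs = [str(path) for path in inputs]
    output = str(output)
    if len(inputs) < 2:
        raise ValueError("merging needs at least two input files")
    if output in inputs:
        raise ValueError("output file must not also be an input")
    return ["data-toolset", "merge", *inputs, output]
```

For example, `merge_command(["file1.avro", "file2.avro", "file3.avro"], "merged_file.avro")` yields the same arguments as the command above.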
Contributing
Contributions are welcome! If you have any suggestions, bug reports, or feature requests, please open an issue on GitHub.
TODO
- proper online documentation
- update README
- proper method docstrings
- add tests for validate, merge, and count
- create an artifact on PyPI
- create random_sample function
- create schema_evolution function
- mature create_sample function
- to/from csv and json functionality
- optimizations TBD
- test coverage
- support Python 3.11+
Hashes for data_toolset-0.1.0-py3-none-any.whl
Algorithm | Hash digest
---|---
SHA256 | 88d4298acdfddbd244ea459513b3af74865448ee94ccb944f28e656791b5fde6
MD5 | 53516f6ce928252ef9a87fb206acfbc2
BLAKE2b-256 | 9af3f7b7df52538bd4b62018ab85b8f2af310e017742e151472f2b120caef3d2