DataGristle 0.3
Tough and flexible connectors and tools for data analysis, transformation, validation and movement.
DataGristle is a toolbox of tough and flexible data connectors and analyzers. It is an interactive mix of ETL and data analysis, optimized for rapid analysis and manipulation of a wide variety of data.
It's neither an enterprise ETL tool, nor an enterprise analysis, reporting, or data mining tool. It's intended to be an easily-adopted tool for technical analysts that combines the most useful subset of data transformation and analysis capabilities necessary to do 80% of the work. Its open source Python codebase allows it to be easily extended with custom code to handle that always-challenging last 20%.
Current Status: Strong support for easy analysis and simple transformations of csv files.
Next Steps:
- README markdown
- attractive PDF output of gristle_determinator.py
- metadata database population
Its objectives include:
- multi-platform (Linux, Mac OS, Windows)
- multi-language (primarily Python)
- free - no cripple-licensing
- primary audience is programming data analysts - not non-technical analysts
- primary environment is the command line rather than a windowed graphical desktop or Eclipse
- extensible
- allows bi-directional iteration between ETL & data analysis
- can quickly perform initial data analysis prior to longer-duration, deeper analysis with heavier-weight tools
DEPENDENCIES
- Python 2.6
EXISTING UTILITIES:
- gristle_determinator.py
  - Identifies file formats, generates metadata, prints file analysis report.
- gristle_diff.py
  - Shows differences between two files.
- gristle_file_converter.py
  - Converts a csv from one dialect to another. Can handle multi-character field delimiters as well as record delimiters.
- gristle_filter.py
  - Applies simple filter logic to a file.
  - Very simplistic utility.
- gristle_freq.py
  - Prints a frequency distribution of any column of an input file.
- gristle_graphviz_generator.py
  - Generates a graphviz dot file based upon an input file and command-line preferences.
- gristle_scalar.py
  - Performs scalar operations (min, max, avg, count unique, etc.) on a file.
  - Very simplistic utility.
- gristle_slicer.py
  - Extracts a subset of columns and rows out of an input file.
- gristle_viewer.py
  - Shows one record from a file at a time, formatted based on metadata.
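
To give a feel for the kind of work the utilities above perform, here is a minimal sketch of csv dialect detection followed by a frequency distribution of one column. It uses only the Python standard library (`csv.Sniffer` and `collections.Counter`) and is written for Python 3; it is not DataGristle's own code. The function names `detect_dialect` and `column_frequency`, the candidate delimiter set, and the sample data are all assumptions made for illustration.

```python
import csv
import io
from collections import Counter


def detect_dialect(sample_text):
    """Guess the csv dialect from a text sample.

    The candidate delimiters are an assumption for this sketch.
    """
    return csv.Sniffer().sniff(sample_text, delimiters=",|\t;")


def column_frequency(text, column, dialect, has_header=True):
    """Count how often each value occurs in the given 0-based column."""
    reader = csv.reader(io.StringIO(text), dialect)
    if has_header:
        next(reader, None)  # skip the header row
    return Counter(row[column] for row in reader if len(row) > column)


# Hypothetical pipe-delimited sample data
sample = "id|state|amount\n1|CO|10\n2|NY|20\n3|CO|30\n"
dialect = detect_dialect(sample)
freq = column_frequency(sample, 1, dialect)
for value, count in freq.most_common():
    print(value, count)
```

Detecting the dialect first and passing it to the reader keeps the counting logic independent of the file's delimiter, which is the same separation the determinator and freq utilities suggest.
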
FUTURE UTILITIES:
- gristle_metadata.py
  - Manages metadata - allows users to query, add, update, and delete file, field, transformation, and reporting descriptions.
- gristle_generator
  - Generates test data based on gristle metadata.
- gristle_validator
  - Confirms validity of database and file structure and contents.
- gristle_file_joiner.py
  - Joins two files on their common keys and produces a new file.
- gristle_grouper.py
  - Reads a file, aggregates on a given set of fields, and produces a new file.
- gristle_db_loader.py
  - Loads a file into a database.
- gristle_db_extractor.py
  - Extracts data from a database into a file.
- gristle_field_merge.py
  - Prints the matched values from multiple files side by side along with counts.
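
As an illustration of what joining two files on their common keys could look like, the sketch below performs an inner join of two csv inputs with the standard library. Since the joiner is only planned, this is one possible approach rather than its design; the function name `join_on_key`, the column names, and the sample data are hypothetical, and the code targets Python 3.

```python
import csv
import io


def join_on_key(left_text, right_text, key):
    """Inner-join two csv texts on a shared column name (simple hash join)."""
    # Index the right-hand rows by key value so each lookup is cheap
    right_rows = {}
    for row in csv.DictReader(io.StringIO(right_text)):
        right_rows.setdefault(row[key], []).append(row)

    joined = []
    for left in csv.DictReader(io.StringIO(left_text)):
        for right in right_rows.get(left[key], []):
            merged = dict(left)
            merged.update(right)
            joined.append(merged)
    return joined


# Hypothetical sample data
customers = "cust_id,name\n1,Ann\n2,Bob\n"
orders = "cust_id,amount\n1,10\n1,15\n3,99\n"
result = join_on_key(customers, orders, "cust_id")
for row in result:
    print(row["cust_id"], row["name"], row["amount"])
```

Rows whose key appears in only one input (customer 2 and order 3 here) are dropped, which is the behavior of an inner join; an outer join would need to keep them with empty fields.
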
- Author: Ken Farmer
- Maintainer: Ken Farmer
- Home Page: http://kenfar.github.com/DataGristle/
- Keywords: database data etl analysis
- License: BSD
- Package Index Owner: kenfar
