skip to navigation
skip to content

Not Logged In

datagristle 0.53

A toolbox and library of ETL & data analysis tools

Latest Version: 0.56

Datagristle is a toolbox of tough and flexible data connectors and analyzers.
It's kind of an interactive mix between ETL and data analysis optimized for
rapid analysis and manipulation of a wide variety of data.

It's neither an enterprise ETL tool, nor an enterprise analysis, reporting,
or data mining tool.  It's intended to be an easily-adopted tool for technical
analysts that combines the most useful subset of data transformation and
analysis capabilities necessary to do 80% of the work.  Its open source python
codebase allows it to be easily extended to with custom code to handle that
always challenging last 20%.

Current Status:  Strong support for easy analysis, simple transformations of
csv files, and ability to create data dictionaries.

More info is on the DataGristle wiki here:
   https://github.com/kenfar/DataGristle/wiki


#Next Steps:

   * attractive PDF output of gristle_determinator.py
   * metadata database population

#Its objectives include:

   * multi-platform (unix, linux, mac os, windows with effort)
   * multi-language (primarily python)
   * free - no cripple-licensing
   * primary audience is programming data analysts - not non-technical analysts
   * primary environment is command-line rather than windows, graphical desktop
     or eclipse
   * extensible
   * allow a bi-directional iteration between ETL & data analysis
   * can quickly perform initial data analysis prior to longer-duration, deeper
     analysis with heavier-weight tools.


#Installation

   * Using [pip](http://www.pip-installer.org/en/latest/) (preferred) or [easyinstall](http://peak.telecommunity.com/DevCenter/EasyInstall):

       ~~~
       $ pip install datagristle
       $ easy_install datagristle
       ~~~

   * Or install manually from [pypi](https://pypi.python.org/pypi/datagristle):

       ~~~
       $ mkdir ~\Downloads
       $ wget https://pypi.python.org/packages/source/d/datagristle/datagristle-0.51.tar.gz
       $ tar -xvf easy_install datagristle
       $ cd ~\Downloads\datagristle-*
       $ python setup.py install
       ~~~


#Dependencies

   * Python 2.6 or Python 2.7


#Mature Utilities Provided in This Release:

   * gristle_slicer
       - Used to extract a subset of columns and rows out of an input file.
   * gristle_freaker
       - Produces a frequency distribution of multiple columns from input file.
   * gristle_viewer
       - Shows one record from a file at a time - formatted based on metadata.
   * gristle_determinator
       - Identifies file formats, generates metadata, prints file analysis report
       - This is the most mature - and also used by the other utilities so that
         you generally do not need to enter file structure info.

#gristle_slicer
    Extracts subsets of input files based on user-specified columns and rows.
    The input csv file can be piped into the program through stdin or identified
    via a command line option.  The output will default to stdout, or redirected
    to a filename via a command line option.

    The columns and rows are specified using python list slicing syntax -
    so individual columns or rows can be listed as can ranges.   Inclusion
    or exclusion logic can be used - and even combined.

    Examples:
       $ gristle_slicer sample.csv
                    Prints all rows and columns
       $ gristle_slicer sample.csv -c":5, 10:15" -C 13
                    Prints columns 0-4 and 10,11,12,14 for all records
       $ gristle_slicer sample.csv -C:-1
                    Prints all columns except for the last for all records
       $ gristle_slicer sample.csv -c:5 -r-100
                    Prints columns 0-4 for the last 100 records
       $ gristle_slicer sample.csv -c:5 -r-100 -d'|' --quoting=quote_all
                    Prints columns 0-4 for the last 100 records, csv
                    dialect info (delimiter, quoting) provided manually)
       $ cat sample.csv | gristle_slicer -c:5 -r-100 -d'|' --quoting=quote_all
                    Prints columns 0-4 for the last 100 records, csv
                    dialect info (delimiter, quoting) provided manually)


#gristle_freaker
    Creates a frequency distribution of values from columns of the input file
    and prints it out in columns - the first being the unique key and the last
    being the count of occurances.


    Examples:
       $ gristle_freaker sample.csv -d '|'  -c 0
                    Creates two columns from the input - the first with
                    unique keys from column 0, the second with a count of
                    how many times each exists.
       $ gristle_freaker sample.csv -d '|'  -c 0 --sortcol 1 --sortorder forward --writelimit 25
                    In addition to what was described in the first example,
                    this example adds sorting of the output by count ascending
                    and just prints the first 25 entries.
       $ gristle_freaker sample.csv -d '|'  -c 0 --sampling_rate 3 --sampling_method interval
                    In addition to what was described in the first example,
                    this example adds a sampling in which it only references
                    every third record.
       $ gristle_freaker sample.csv -d '|'  -c 0,1
                    Creates three columns from the input - the first two
                    with unique key combinations from columns 0 & 1, the
                    third with the number of times each combination exists.
       $ gristle_freaker sample.csv -d '|'  -c -1
                    Creates two columns from the input - the first with unique
                    keys from the last column of the file (negative numbers
                    wrap), then a second with the number of times each exists.
       $ gristle_freaker sample.csv -d '|'  --columntype all
                    Creates two columns from the input - all columns combined
                    into a key, then a second with the number of times each
                    combination exists.
       $ gristle_freaker sample.csv -d '|'  --columntype each
                    Unlike the other examples, this one performs a separate
                    analysis for every single column of the file.  Each analysis
                    produces three columns from the input - the first is a
                    column number, second is a unique value from the column,
                    and the third is the number of times that value appeared.
                    This output is repeated for each column.


#gristle_viewer
    Displays a single record of a file, one field per line, with field names
    displayed as labels to the left of the field values.  Also allows simple
    navigation between records.

    Examples:
       $ gristle_viewer sample.csv -r 3
                    Presents the third record in the file with one field per line
                    and field names from the header record as labels in the left
                    column.
       $ gristle_viewer sample.csv -r 3  -d '|' -q quote_none
                    In addition to what was described in the first example this
                    adds explicit csv dialect overrides.

#gristle_determinator
    Analyzes the structures and contents of csv files in the end producing a
    report of its findings.  It is intended to speed analysis of csv files by
    automating the most common and frequently-performed analysis tasks.  It's
    useful in both understanding the format and data and quickly spotting issues.

    Examples:
       $ gristle_determinator japan_station_radiation.csv
                    This command will analyze a file with radiation measurements
                    from various Japanese radiation stations.

        File Structure:
        format type:       csv
        field cnt:         4
        record cnt:        100
        has header:        True
        delimiter:
        csv quoting:       False
        skipinitialspace:  False
        quoting:           QUOTE_NONE
        doublequote:       False
        quotechar:         "
        lineterminator:    '\n'
        escapechar:        None

        Field Analysis Progress:
        Analyzing field: 0
        Analyzing field: 1
        Analyzing field: 2
        Analyzing field: 3

        Fields Analysis Results:

            ------------------------------------------------------
            Name:             station_id
            Field Number:     0
            Wrong Field Cnt:  0
            Type:             timestamp
            Min:              1010000001
            Max:              1140000006
            Unique Values:    99
            Known Values:     99
            Top Values not shown - all values are unique

            ------------------------------------------------------
            Name:             datetime_utc
            Field Number:     1
            Wrong Field Cnt:  0
            Type:             timestamp
            Min:              2011-02-28 15:00:00
            Max:              2011-02-28 15:00:00
            Unique Values:    1
            Known Values:     1
            Top Values:
                2011-02-28 15:00:00                      x 99 occurrences

            ------------------------------------------------------
            Name:             sa
            Field Number:     2
            Wrong Field Cnt:  0
            Type:             integer
            Min:              -999
            Max:              52
            Unique Values:    35
            Known Values:     35
            Mean:             2.45454545455
            Median:           38.0
            Variance:         31470.2681359
            Std Dev:          177.398613681
            Top Values:
                41                                       x 7 occurrences
                42                                       x 7 occurrences
                39                                       x 6 occurrences
                37                                       x 5 occurrences
                46                                       x 5 occurrences
                17                                       x 4 occurrences
                38                                       x 4 occurrences
                40                                       x 4 occurrences
                45                                       x 4 occurrences
                44                                       x 4 occurrences

            ------------------------------------------------------
            Name:             ra
            Field Number:     3
            Wrong Field Cnt:  0
            Type:             integer
            Min:              -888
            Max:              0
            Unique Values:    2
            Known Values:     2
            Mean:             -556.121212121
            Median:           -888.0
            Variance:         184564.833792
            Std Dev:          429.610095077
            Top Values:
                -888                                     x 62 occurrences
                0                                        x 37 occurrences

#gristle_metadata
    Gristle_metadata provides a command-line interface to the metadata database.
    It's mostly useful for scripts, but also useful for occasional direct
    command-line access to the metadata.

    Examples:
       $ gristle_metadata --table schema --action list
                    Prints a list of all rows for the schema table.
       $ gristle_metadata --table element --action put --prompt
                    Allows the user to input a row into the element table and
                    prompts the user for all fields necessary.

#gristle_metadata and gristle_md_reporter
    Gristle_md_reporter allows the user to create data dictionary reports that
    combine information about the collection and fields along with field value
    descriptions and frequencies.

    Examples:
       $ gristle_md_reporter --report datadictionary --collection_id 2
                    Prints a data dictionary report of collection_id 2.
       $ gristle_md_reporter --report datadictionary --collection_name presidents
                    Prints a data dictionary report of the president collection.
       $ gristle_md_reporter --report datadictionary --collection_id 2 --field_id 3
                    Prints a data dictionary report of the president collection,
                    only shows field-level information for field_id 3.



#Licensing

   * Gristle uses the BSD license - see the separate LICENSE file for further
     information


#Copyright

   * Copyright 2011,2012,2013 Ken Farmer
 
File Type Py Version Uploaded on Size
datagristle-0.53.tar.gz (md5) Source 2014-01-06 389KB
  • Downloads (All Versions):
  • 56 downloads in the last day
  • 344 downloads in the last week
  • 1339 downloads in the last month