eparse

Excel spreadsheet crawler and table parser for data extraction and querying

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Project description

eparse

Description

Excel spreadsheet crawler and table parser for data extraction and querying.

Features

Command-line interface
Recursive Excel file discovery
Tabular data extraction
SQLite database interface
CLI query tool
Summary data metrics

Installation

To install eparse, you can use pip and the latest version on PyPI:

$ pip install eparse

Or you can clone this repo and install from source:

$ git clone https://github.com/ChrisPappalardo/eparse.git
$ cd eparse
$ pip install .

Usage

eparse is intended to be used from the command-line. You can view supported commands and usage with --help as follows:

$ eparse --help
Usage: eparse [OPTIONS] COMMAND [ARGS]...

excel parser

Options:
  -i, --input TEXT   input dir(s) or file(s)
  -o, --output TEXT  output destination
  -d, --debug        use debug mode
  -l, --loose        find tables loosely
  -r, --recursive    find files recursively
  -t, --truncate     truncate dataframe output
  -v, --verbose      increase output verbosity
  --help             Show this message and exit.

Commands:
  parse  parse table(s) found in sheet for target(s)
  query  query eparse output
  scan   scan for excel files in target

Scan

To scan one or more directories for Excel files with descriptive information, you can use the scan command like so:

$ eparse -v -i <path_to_files> scan

Increase the verbosity with additional flags, such as -vvv, for more descriptive information about the file(s), including sheet names.

Parse

Excel files can be parsed as follows:

$ eparse -v -i <path_to_files> parse

This mode will list each table found in each Excel file to the command-line. This mode is useful for initial discovery for parseable data.

eparse uses a simple algorithm for identifying tables. Table “corners” are identified as cells that contain empty cells above and to the right (or sheet boundaries). A densely or sparsely populated 2x2 table must follow in order for data to be extracted in relation to that cell. eparse will automatically adjust for rowspan labels and empty table corners and the dense vs. sparse criterion can be controlled with the --loose flag.

eparse was written to accomodate various types of output formats and endpoints, including null:///, stdout:///, and sqlite3:///.

null

This mode is useful for validating files and generating descriptive info, and is the default. The command above with -v is an example of this mode, which lists out the tables found.

stdout

This mode is good for viewing data extracted from Excel files in the console. For example, you could view all tables found in Sheet1 with the following command:

$ eparse -i <path_to_files> -o stdout:/// parse -s "Sheet1"

eparse uses pandas to handle table data. You can view larger tables without truncation using the -t flag as follows:

$ eparse -t -i <path_to_files> -o stdout:/// parse -s "Sheet1"

Data in table format is useful for human viewing, but a serialized form is better for data interfacing. Serialize your output with the -z flag as follows:

$ eparse -t -i <path_to_files> -o stdout:/// parse -z

Each cell of extracted table data is serialized as follows:

row - 0-indexed table row number
column - 0-indexed table column number
value - the value of the cell as a str
type - the implied python type of the data found
c_header - the column header
r_header - the row header
excel_RC - the RC reference from the spreadsheet (e.g. B10)
sheet - the name of the sheet
f_name - the name of the file

sqlite3

eparse uses the peewee package for ORM and database integration. The interfaces module contains an ExcelParse model that provides data persistence and a common interface.

To create a SQLite3 database with your parsed Excel data, use the following command:

$ mkdir .files
$ eparse -i <path_to_files> -o sqlite3:/// parse

This command will automatically generate a unique database filename using the uuid python package in the .files/ sub-directory of the working directory. You may need to create this directory before running this command, as shown.

Query

Once you have stored parsed data, you can begin to query it using the peewee ORM. This can be done with the tool or directly with the database.

For example, query distinct column header names from a generated SQLite3 database as follows:

$ eparse -o stdout:/// query -i from_sqlite3 .files/<db_file> -m get_c_header
               c_header  Total Rows  Data Types  Distinct Values
  0             ABC-col         150           2               76
  1             DEF-col        3981           3               15
  2             GHI-col          20           1                2
  ..                ...         ...         ...              ...

This command will give descriptive information of each distinct c_header found, including total rows, unique data types, and distinct values.

You can also get raw un-truncated data as follows, which is the default behavior:

$ eparse -t -o stdout:/// query -i from_sqlite3 .files/<db_file>

Filtering data on content is easy. Use the --filter option as follows:

$ eparse -t -o stdout:/// query -i from_sqlite3 .files/<db_file> --filter f_name "somefile.xlsx"

The above command will filter all rows from an Excel file named somefile.xlsx. You can use any of the following django-style filters:

__eq equals X
__lt less than X
__lte less than or equal to X
__gt greater than X
__gte greater than or equal to X
__ne not equal to X
__in X is in
__is is X
__like like expression, such as %somestr%, case sensitive
__ilike like expression, such as %somestr%, case insensitive
__regexp regular expression matching such as ^.*?foo.*?$

Filters are applied to the ORM fields like so:

--filter row__gte 4 all extracted table rows >= 5
--filter f_name__ilike "%foo%" all data from filenames with foo
--filter value__ne 100 all data with values other than 100

Queried data can even be stored into a new database for creating curated data subsets, as follows:

$ eparse -t -o sqlite3:/// query -i from_sqlite3 .files/<db_file>

Since database files the tool generates when using sqlite3:/// are SQLite native, you can also use SQLite database client tools and execute raw SQL like so:

$ sudo apt-get install -y sqlite3-tools
$ sqlite3 .files/<db_file>
SQLite version 3.37.2 2022-01-06 13:25:41
Enter ".help" for usage hints.
sqlite> .schema
CREATE TABLE IF NOT EXISTS "excelparse" ("id" INTEGER NOT NULL PRIMARY KEY, "row" INTEGER NOT NULL, "column" INTEGER NOT NULL, "value" VARCHAR(255) NOT NULL, "type" VARCHAR(255) NOT NULL, "c_header" VARCHAR(255) NOT NULL, "r_header" VARCHAR(255) NOT NULL, "excel_RC" VARCHAR(255) NOT NULL, "name" VARCHAR(255) NOT NULL, "sheet" VARCHAR(255) NOT NULL, "f_name" VARCHAR(255) NOT NULL);
sqlite> .header on
sqlite> SELECT * FROM excelparse limit 1;
id|row|column|value|type|c_header|r_header|excel_RC|name|sheet|f_name
1|0|0|ABC|<class 'str'>|SomeCol|SomeRow|B2|MyTable|Sheet1|myfile.xlsm

Contributing

As an open-source project, contributions are always welcome. Please see Contributing for more information.

License

eparse is licensed under the MIT License. See the LICENSE file for more details.

Contact

Thanks for your support of eparse. Feel free to contact me at cpappala@gmail.com or connect with me on Github.

History

0.1.0 (2023-06-06)

First release on PyPI.

0.1.1 (2023-06-06)

Updated requirements
Updated README

0.1.2 (2023-06-07)

Updated README

0.2.0 (2023-06-12)

Added migrate command
Added 0.1.2 to 0.2.0 migration

Project details

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 2 - Pre-Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Programming Language

Release history Release notifications | RSS feed

0.7.3

Aug 23, 2023

0.7.2

Aug 22, 2023

0.7.1

Aug 20, 2023

0.7.0

Aug 18, 2023

0.6.3

Jul 30, 2023

0.6.2

Jul 2, 2023

0.6.1

Jun 22, 2023

0.6.0

Jun 15, 2023

0.5.0

Jun 14, 2023

0.4.1

Jun 14, 2023

0.3.0

Jun 14, 2023

This version

0.2.1

Jun 13, 2023

0.2.0

Jun 13, 2023

0.1.2

Jun 8, 2023

0.1.1

Jun 7, 2023

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

eparse-0.2.1.tar.gz (34.7 kB view hashes)

Uploaded Jun 13, 2023 Source

Built Distribution

eparse-0.2.1-py2.py3-none-any.whl (12.4 kB view hashes)

Uploaded Jun 13, 2023 Python 2 Python 3

Hashes for eparse-0.2.1.tar.gz

Hashes for eparse-0.2.1.tar.gz
Algorithm	Hash digest
SHA256	`f1382e2fd3482f09ee5e7b542b2321141422fd75d69eeb82a7973b1f5f8d3ac1`
MD5	`1ca1cbd2a77f43369c2ea7d850913a8a`
BLAKE2b-256	`5ab34a52321773fbe761669174d6773d5623b8116340e2a5371d57190e5dd90a`

Hashes for eparse-0.2.1-py2.py3-none-any.whl

Hashes for eparse-0.2.1-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`00652eaaa189a2269082358b1c3f288eba04c0cdd77a5438fcd1e3325f9137fa`
MD5	`750f52e2b57c21c4a18a88fe714d14d7`
BLAKE2b-256	`5462d64a93cd72d61ccce876c0a3001d392b805343680e36bd5e764a86b2dd8c`