
pent Extracts Numerical Text


Mini-language driven parser for structured numerical data


Do you have structured numerical data stored as text?

Does the idea of writing regex to parse it fill you with loathing?

pent can help!

Say you have data in a text file that looks like this:

$vibrational_frequencies
18
    0        0.000000
    1        0.000000
    2        0.000000
    3        0.000000
    4        0.000000
    5        0.000000
    6      194.490162
    7      198.587114
    8      389.931897
    9      402.713910
   10      538.244274
   11      542.017838
   12      548.246738
   13      800.613516
   14     1203.096114
   15     1342.200360
   16     1349.543713
   17     1885.157022

What’s the most efficient way to get that list of floats extracted into a numpy array? There’s clearly structure here, but how to exploit it?

It would work to import the text into a spreadsheet, split columns appropriately, re-export just the one column to CSV, and import to Python from there, but that’s just exhausting drudgery if there are dozens of files involved.

Automating the parsing via a line-by-line string search would work fine (this is how cclib implements its data imports), but a new line-by-line method must be implemented any time one encounters a new kind of dataset, and any time the formatting of a given dataset changes between software versions.
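For comparison, a line-by-line reader for the block above might look like the following sketch. The function name `read_frequencies` and the shortened three-row sample are made up for illustration; this is not how cclib or pent implement it.

```python
def read_frequencies(text):
    """Scan line by line for the $vibrational_frequencies block (illustrative only)."""
    lines = iter(text.splitlines())

    # Advance to the header line; fail loudly if it never appears.
    for line in lines:
        if line.strip() == "$vibrational_frequencies":
            break
    else:
        raise ValueError("header not found")

    # The line after the header gives the number of data rows.
    count = int(next(lines))

    freqs = []
    for _ in range(count):
        _index, value = next(lines).split()  # e.g. "6", "194.490162"
        freqs.append(float(value))
    return freqs


# Shortened, hypothetical sample in the same layout as the data above.
sample = """\
$vibrational_frequencies
3
    0        0.000000
    1      194.490162
    2     1885.157022
"""

freqs = read_frequencies(sample)
print(freqs)
```

Every assumption about the layout (header text, count line, two columns) is hard-coded into the function, which is why a format change means rewriting it.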

It’s not too hard to write regex that will parse it, but a repeated regex capture group retains only its last match, so two patterns are needed: one to capture the entire block, including the header (to ensure other, similarly-formatted data isn’t also captured); and one to iterate line-by-line over just the data block to extract the individual values. And, of course, one has to actually write (and proofread, and maintain) the regex.
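As a rough illustration of the regex route, here is a sketch using Python's standard `re` module. The pattern names, the decoy `$other_block` section, and the shortened sample are invented for this example.

```python
import re

# Pattern 1: anchor on the header and the count line, then grab the
# whole run of "index  value" lines as a single group.
block_pat = re.compile(
    r"^\$vibrational_frequencies[ \t]*\n"
    r"[ \t]*\d+[ \t]*\n"
    r"(?P<body>(?:[ \t]*\d+[ \t]+-?\d+\.\d+[ \t]*\n)+)",
    re.MULTILINE,
)

# Pattern 2: pull the float out of each line of the captured body.
line_pat = re.compile(
    r"^[ \t]*\d+[ \t]+(?P<value>-?\d+\.\d+)[ \t]*$",
    re.MULTILINE,
)

# Hypothetical sample: a similarly formatted decoy block comes first.
sample = """\
$other_block
2
    0        1.000000
    1        2.000000
$vibrational_frequencies
3
    0        0.000000
    1      194.490162
    2     1885.157022
"""

match = block_pat.search(sample)
values = [float(m.group("value")) for m in line_pat.finditer(match.group("body"))]
print(values)
```

The first pattern keeps the decoy block out; the second is needed because the repeated group in the first cannot hand back each row separately.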

pent provides a better way.

The data above comes from this file, C2F4_01.hess. With pent, the data can be pulled into numpy in just a couple of lines, without writing any regex at all:

>>> import pathlib
>>> import numpy as np
>>> import pent
>>> with (pathlib.Path() / "pent" / "test" / "C2F4_01.hess").open() as f:
...     data = f.read()
>>> prs = pent.Parser(
...     head=("@.$vibrational_frequencies", "#.+i"),
...     body="#.+i #!..f",
... )
>>> arr = np.array(prs.capture_body(data), dtype=float)
>>> print(arr)
[[[   0.      ]
  [   0.      ]
  [   0.      ]
  [   0.      ]
  [   0.      ]
  [   0.      ]
  [ 194.490162]
  [ 198.587114]
  [ 389.931897]
  [ 402.71391 ]
  [ 538.244274]
  [ 542.017838]
  [ 548.246738]
  [ 800.613516]
  [1203.096114]
  [1342.20036 ]
  [1349.543713]
  [1885.157022]]]

The result comes out as a length-one list of 2-D matrices, since the search pattern occurs only once in the data file; converted to numpy, that gives an array of shape (1, 18, 1). The single 2-D matrix is laid out as a column vector, because the data runs down the column in the file.
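If a flat list of values is more convenient, the nesting can be unwound by hand. The sketch below uses a small hand-written stand-in for the captured structure so that it runs without pent installed; that the captured values arrive as strings is an assumption here, not something taken from the pent documentation.

```python
# Hypothetical stand-in for the nested structure described above:
# one match -> one 2-D block -> one captured value per row.
captured = [[["0.000000"], ["194.490162"], ["1885.157022"]]]

# Unpack the single match; this raises ValueError if the pattern
# matched more (or fewer) than one block.
(block,) = captured

# Each row holds exactly one captured value, so take element 0.
freqs = [float(row[0]) for row in block]
print(freqs)
```

With the numpy array from the example above, `arr[0, :, 0]` or `arr.ravel()` gives the same values as a 1-D array.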

pent can handle larger, more deeply nested data as well. Take this 18x18 matrix within C2F4_01.hess, for example. Here, it’s necessary to pass a Parser as the body of another Parser:

>>> prs_hess = pent.Parser(
...     head=("@.$hessian", "#.+i"),
...     body=pent.Parser(
...         head="#++i",
...         body="#.+i #!+.f"
...     )
... )
>>> result = prs_hess.capture_body(data)
>>> arr = np.column_stack([np.array(_, dtype=float) for _ in result[0]])
>>> print(arr[:3, :7])
[[ 0.468819 -0.006771  0.020586 -0.38269   0.017874 -0.05449  -0.044552]
 [-0.006719  0.022602 -0.016183  0.010997 -0.033397  0.014422 -0.01501 ]
 [ 0.020559 -0.016184  0.066859 -0.033601  0.014417 -0.072836  0.045825]]

The need for the for/in iteration expression, the [0] index into result, and the composition via np.column_stack arises due to the manner in which pent returns data from a nested match like this. See the documentation for more information.
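One way to read the example above: the matrix is printed in the file as several sections, each holding a few columns, and `result[0]` holds one 2-D block per section. The sketch below spells out the side-by-side joining step with plain lists. The `sections` data is invented, and the layout is an inference from the example rather than a statement of pent's documented return format.

```python
# Hypothetical stand-in for result[0]: a 2x5 matrix that was printed
# in two sections, the first with three columns and the second with two.
sections = [
    [["0.1", "0.2", "0.3"], ["1.1", "1.2", "1.3"]],  # columns 0-2
    [["0.4", "0.5"], ["1.4", "1.5"]],                # columns 3-4
]

# Join the sections side by side: row i of the full matrix is row i of
# every section, concatenated in order.
n_rows = len(sections[0])
matrix = [
    [float(v) for section in sections for v in section[i]]
    for i in range(n_rows)
]

for row in matrix:
    print(row)
```

This is the same joining that `np.column_stack` performs on the per-section arrays in the example above.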

The grammar of the pent mini-language is designed to be flexible enough that it should handle essentially all well-formed structured data, and even some data that’s not especially well formed. Some datasets will require post-processing of the data structures generated by pent before they can be pulled into numpy (see, e.g., this test, parsing this data block).


Alpha release(s) available on PyPI: pip install pent

Full documentation (pending) is hosted at Read The Docs.

Source on GitHub. Bug reports, feature requests, and Parser pattern composition help requests are welcomed at the Issues page there.

Copyright (c) Brian Skinn 2018

License: The MIT License. See LICENSE.txt for full license terms.
