tableschema

A utility library for working with Table Schema in Python

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

A library for working with Table Schema in Python.

Features

- Table to work with data tables described by Table Schema
- Schema representing Table Schema
- Field representing Table Schema field
- validate to validate Table Schema
- infer to infer Table Schema from data
- built-in command-line interface to validate and infer schemas
- storage/plugins system to connect tables to different storage backends like SQL Database

Important Notes

There are BREAKING changes in v1 (pre-release):
- the package on PyPI has been renamed to tableschema
- the following deprecated API has been removed from the package:
  - tableschema.push/pull_resource (use tableschema.Table)
  - tableschema.Validator (use tableschema.validate)
  - tableschema.storage (use tableschema.Storage)
  - tableschema.model (use tableschema.Schema)
  - tableschema.types (use tableschema.Field)
- rebased on Table Schema v1 null/types/constraints semantics
- Field.cast_value/test_value now accept a constraints=bool/list argument instead of skip_constraints=bool and constraint=str
- other changes could be introduced before the final release
- documentation for the previous release (v0.10) could be found here

There are deprecating changes in v0.7:
- a renewed API has been introduced in a non-breaking manner
- documentation for the deprecated API could be found here

Getting Started

Installation

$ pip install jsontableschema # v0.10
$ pip install tableschema --pre # v1.0-alpha

Example

from tableschema import Table

# Create table
table = Table('path.csv', schema='schema.json')

# Print schema descriptor
print(table.schema.descriptor)

# Print cast rows in a dict form
for keyed_row in table.iter(keyed=True):
    print(keyed_row)

Table

Table represents data described by Table Schema:

# pip install sqlalchemy tableschema-sql
import sqlalchemy as sa
from pprint import pprint
from tableschema import Table

# Data source
SOURCE = 'https://raw.githubusercontent.com/frictionlessdata/tableschema-py/master/data/data_infer.csv'

# Create SQL database
db = sa.create_engine('sqlite://')

# Data processor
def skip_under_30(erows):
    for number, headers, row in erows:
        krow = dict(zip(headers, row))
        if krow['age'] >= 30:
            yield (number, headers, row)

# Work with table
table = Table(SOURCE, post_cast=[skip_under_30])
table.schema.save('tmp/persons.json') # Save INFERRED schema
table.save('persons', backend='sql', engine=db) # Save data to SQL
table.save('tmp/persons.csv')  # Save data to DRIVE

# Check the result
pprint(Table('persons', backend='sql', engine=db).read(keyed=True))
pprint(Table('tmp/persons.csv').read(keyed=True))
# Will print (twice)
# [{'age': 39, 'id': 1, 'name': 'Paul'},
#  {'age': 36, 'id': 3, 'name': 'Jane'}]
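
The post_cast processors used above are plain generators over extended rows, so they can be tried out without a data source. The following is a minimal sketch, not the library's implementation: the row numbers are illustrative, and upper_names and apply_processors are hypothetical helpers introduced only to show how several processors could be chained.

```python
# Extended rows have the shape (row number, headers, row values).
# The values mirror the sample data used in this README; the row
# numbers are illustrative only.
HEADERS = ['id', 'age', 'name']
EXTENDED_ROWS = [
    (1, HEADERS, [1, 39, 'Paul']),
    (2, HEADERS, [2, 23, 'Jimmy']),
    (3, HEADERS, [3, 36, 'Jane']),
    (4, HEADERS, [4, 28, 'Judy']),
]


def skip_under_30(erows):
    """Keep only rows whose 'age' value is 30 or more."""
    for number, headers, row in erows:
        krow = dict(zip(headers, row))
        if krow['age'] >= 30:
            yield (number, headers, row)


def upper_names(erows):
    """Hypothetical second processor: upper-case the 'name' value."""
    for number, headers, row in erows:
        krow = dict(zip(headers, row))
        krow['name'] = krow['name'].upper()
        yield (number, headers, [krow[header] for header in headers])


def apply_processors(erows, processors):
    """Chain processors in order; each consumes the previous one's output."""
    for processor in processors:
        erows = processor(erows)
    return erows


processed = apply_processors(EXTENDED_ROWS, [skip_under_30, upper_names])
result = [dict(zip(headers, row)) for _, headers, row in processed]
print(result)
```

Because each processor is a generator, rows flow through the chain one at a time and nothing is held in memory until the final list is built.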

Schema

A model of a schema with helpful methods for working with the schema and supported data. Schema instances can be initialized with a schema source as a file path or URL to a JSON file, or a Python dict. The schema is validated on initialization (see validate below), and an exception is raised if it is not a valid Table Schema.

from tableschema import Schema

# Init schema
schema = Schema('path.json')

# Cast a row
schema.cast_row(['12345', 'a string', 'another field'])

Properties and methods available on Schema instances:

descriptor - return schema descriptor
fields - an array of the schema’s Field instances
headers - an array of the schema headers
primary_key - the primary key field for the schema as an array
foreign_keys - the foreign key property for the schema as an array
get_field(name) - return the field object for given name
has_field(name) - return a bool indicating whether the field exists in the schema
cast_row(row, no_fail_fast=False) - return row cast against schema
save(target) - save schema to filesystem

When no_fail_fast is set to True, cast_row collects all the errors it encounters and raises a single exceptions.MultipleInvalid (if there are any errors).
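
The difference between the default behaviour and no_fail_fast can be illustrated without the library. This is a sketch under stated assumptions: cast_row_sketch and its casters argument are hypothetical stand-ins for a schema's fields, and the MultipleInvalid class below is a local stand-in that only mimics the errors attribute.

```python
class MultipleInvalid(Exception):
    """Local stand-in for an exception carrying a list of collected errors."""

    def __init__(self, errors):
        super().__init__('%d errors' % len(errors))
        self.errors = errors


def cast_row_sketch(row, casters, no_fail_fast=False):
    """Cast each value with the caster at the same position.

    casters is a hypothetical list of callables (e.g. int, str) standing
    in for the fields of a schema.
    """
    result, errors = [], []
    for value, caster in zip(row, casters):
        try:
            result.append(caster(value))
        except ValueError as error:
            if not no_fail_fast:
                raise  # fail fast: stop at the first bad value
            errors.append(error)  # otherwise keep going and remember it
    if errors:
        raise MultipleInvalid(errors)
    return result


try:
    cast_row_sketch(['x', 'a string', 'y'], [int, str, int], no_fail_fast=True)
except MultipleInvalid as exception:
    collected = exception.errors
    print(len(collected))
```

With no_fail_fast left at False the same call would stop at 'x' and the problem with 'y' would only surface after the first one was fixed.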

Field

from tableschema import Field

# Init field
field = Field({'name': 'name', 'type': 'number'})

# Cast a value
field.cast_value('12345') # -> 12345

Data values can be cast to native Python objects with a Field instance. Field instances are initialized with field descriptors, which is where formats and constraints are defined.

Casting a value checks that it is of the expected type, is in the correct format, and complies with any constraints imposed by the schema. E.g. a date value (in ISO 8601 format) can be cast with a field of type date. Values that can’t be cast will raise an InvalidCastError exception.

Casting a value that doesn’t meet the constraints will raise a ConstraintError exception.
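
The two failure modes described above (a value that cannot be cast versus a value that casts but breaks a constraint) can be sketched in plain Python. Everything here is an illustration rather than the library's code: the exception classes are local stand-ins, cast_integer is a hypothetical helper, and the handling of constraints as True, False, or a list of names is an assumed reading of the constraints=bool/list argument. The constraint names minimum, maximum, and enum come from the Table Schema specification.

```python
class InvalidCastError(Exception):
    """Stand-in: the value cannot be converted to the field type."""


class ConstraintError(Exception):
    """Stand-in: the value was cast but violates a constraint."""


# Checks for a few constraint names from the Table Schema specification.
CHECKS = {
    'minimum': lambda value, limit: value >= limit,
    'maximum': lambda value, limit: value <= limit,
    'enum': lambda value, allowed: value in allowed,
}


def cast_integer(value, field_constraints, constraints=True):
    """Cast value to int, then apply the selected constraints.

    Assumption: constraints=True checks everything, False checks nothing,
    and a list of names checks only the named constraints.
    """
    try:
        result = int(value)
    except (TypeError, ValueError):
        raise InvalidCastError('cannot cast %r to integer' % (value,))
    if constraints is False:
        return result
    for name, expected in field_constraints.items():
        if constraints is not True and name not in constraints:
            continue  # this constraint was not selected
        if name in CHECKS and not CHECKS[name](result, expected):
            raise ConstraintError('%r violates %s=%r' % (result, name, expected))
    return result


print(cast_integer('42', {'minimum': 18}))
print(cast_integer('12', {'minimum': 18}, constraints=False))
```

The first call passes the minimum check; the second returns 12 only because constraint checking is switched off.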

validate

Given a schema as a JSON file, a URL to a JSON file, or a Python dict, validate returns True for a valid Table Schema, or raises a SchemaValidationError exception. It validates only the schema itself, not data against the schema!

import io
import json

import tableschema

with io.open('schema_to_validate.json') as stream:
    descriptor = json.load(stream)

try:
    tableschema.validate(descriptor)
except tableschema.exceptions.SchemaValidationError as exception:
    # handle error
    print(exception)

It may be useful to report multiple errors when validating a schema. This can be done with the no_fail_fast flag set to True.

try:
    tableschema.validate(descriptor, no_fail_fast=True)
except tableschema.exceptions.MultipleInvalid as exception:
    for error in exception.errors:
        # handle error
        print(error)
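
The validate snippets read their descriptor from a JSON file. For readers who do not have one at hand, the following sketch writes a small descriptor and loads it back. The field names are illustrative, and the file is written to a temporary directory (an assumption made here so the example leaves the working directory untouched); the keys fields, name, type, constraints, and primaryKey are taken from the Table Schema specification.

```python
import json
import os
import tempfile

# A small descriptor: a list of fields plus an optional primary key.
descriptor = {
    'fields': [
        {'name': 'id', 'type': 'integer', 'constraints': {'required': True}},
        {'name': 'age', 'type': 'integer'},
        {'name': 'name', 'type': 'string'},
    ],
    'primaryKey': 'id',
}

# Write the descriptor to a temporary location.
target = os.path.join(tempfile.mkdtemp(), 'schema_to_validate.json')
with open(target, 'w') as stream:
    json.dump(descriptor, stream, indent=2)

# Load it back the same way the validate example does.
with open(target) as stream:
    loaded = json.load(stream)

print(loaded == descriptor)
```

The loaded dict can then be passed to validate in place of the descriptor read from disk.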

infer

Given headers and data, infer will return a Table Schema as a Python dict based on the data values. Given the data file, data_to_infer.csv:

id,age,name
1,39,Paul
2,23,Jimmy
3,36,Jane
4,28,Judy

Call infer with headers and values from the datafile:

import io
import csv

from tableschema import infer

filepath = 'data_to_infer.csv'
with io.open(filepath) as stream:
    headers = stream.readline().rstrip('\n').split(',')
    # Read all rows before the file is closed; csv.reader is lazy
    values = list(csv.reader(stream))

schema = infer(headers, values)

schema is now a schema dict:

{u'fields': [
    {
        u'description': u'',
        u'format': u'default',
        u'name': u'id',
        u'title': u'',
        u'type': u'integer'
    },
    {
        u'description': u'',
        u'format': u'default',
        u'name': u'age',
        u'title': u'',
        u'type': u'integer'
    },
    {
        u'description': u'',
        u'format': u'default',
        u'name': u'name',
        u'title': u'',
        u'type': u'string'
    }]
}

The number of rows used by infer can be limited with the row_limit argument.
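
To give an intuition for what inference involves, here is a deliberately simplified sketch that guesses a type per column by trying progressively looser casts. It is not the algorithm used by infer: guess_type and infer_sketch are hypothetical names, only three types are considered, and the output omits the title, description, and format keys shown above.

```python
import csv
import io

SAMPLE = """id,age,name
1,39,Paul
2,23,Jimmy
3,36,Jane
4,28,Judy
"""


def guess_type(values):
    """Return the narrowest of integer, number, string that fits all values."""
    for type_name, caster in (('integer', int), ('number', float)):
        try:
            for value in values:
                caster(value)
            return type_name
        except ValueError:
            continue  # at least one value did not fit; try a looser type
    return 'string'


def infer_sketch(headers, rows, row_limit=None):
    """Build a descriptor-like dict from headers and an iterable of rows."""
    rows = list(rows)[:row_limit]  # row_limit=None keeps every row
    columns = zip(*rows)  # transpose rows into columns
    return {'fields': [{'name': header, 'type': guess_type(column)}
                       for header, column in zip(headers, columns)]}


reader = csv.reader(io.StringIO(SAMPLE))
headers = next(reader)
schema = infer_sketch(headers, reader)
print(schema)
```

Limiting the rows trades accuracy for speed: a column that looks like integers in the first few rows may contain text further down.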

CLI

It’s a provisional API excluded from SemVer. If you use it as a part of another program, please pin a concrete tableschema version in your requirements file.

Table Schema features a CLI called tableschema. This CLI exposes the infer and validate functions for command line use.

Example of validate usage:

$ tableschema validate path/to-schema.json

Example of infer usage:

$ tableschema infer path/to/data.csv

The response is a schema as JSON. The optional argument --encoding allows a character encoding to be specified for the data file. The default is utf-8.

Storage

The library includes an interface declaration for implementing tabular Storage:

Storage

An implementor should follow the tableschema.Storage interface to write their own storage backend. This backend can then be used with the Table class. See the plugins system below to learn how to integrate a custom storage plugin.
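
As a rough idea of what a backend involves, the sketch below keeps buckets in memory and follows the method names and signatures listed for Storage in the API reference. It is an illustration under assumptions: MemoryStorage is a hypothetical class that does not subclass tableschema.Storage, and the meaning given to force (overwrite an existing bucket) and ignore (skip missing buckets) is a guess at the intended semantics.

```python
class MemoryStorage(object):
    """In-memory sketch of the Storage interface; not a real plugin."""

    def __init__(self, **options):
        self._descriptors = {}
        self._rows = {}

    @property
    def buckets(self):
        return sorted(self._descriptors)

    def create(self, bucket, descriptor, force=False):
        # Assumption: force=True replaces an existing bucket.
        if bucket in self._descriptors and not force:
            raise RuntimeError('Bucket "%s" already exists' % bucket)
        self._descriptors[bucket] = descriptor
        self._rows[bucket] = []

    def delete(self, bucket=None, ignore=False):
        # Assumption: bucket=None deletes everything; ignore skips missing.
        names = self.buckets if bucket is None else [bucket]
        for name in names:
            if name not in self._descriptors:
                if ignore:
                    continue
                raise RuntimeError('Bucket "%s" does not exist' % name)
            del self._descriptors[name]
            del self._rows[name]

    def describe(self, bucket, descriptor=None):
        if descriptor is not None:
            self._descriptors[bucket] = descriptor
        return self._descriptors[bucket]

    def iter(self, bucket):
        for row in self._rows[bucket]:
            yield row

    def read(self, bucket):
        return list(self.iter(bucket))

    def write(self, bucket, rows):
        self._rows[bucket].extend(list(row) for row in rows)


storage = MemoryStorage()
storage.create('persons', {'fields': [{'name': 'id', 'type': 'integer'},
                                      {'name': 'name', 'type': 'string'}]})
storage.write('persons', [[1, 'Paul'], [3, 'Jane']])
print(storage.buckets)
print(storage.read('persons'))
```

A real backend would translate the descriptor into the native schema of its storage (for example SQL column types) inside create and describe.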

plugins

Table Schema has a plugin system. Any package with a name like tableschema_<name> can be imported as:

from tableschema.plugins import <name>

If a plugin is not installed, an ImportError will be raised with a message describing how to install the plugin.
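
The behaviour described above (map a short name to a tableschema_<name> package and fail with an installation hint) can be approximated with importlib. This is only a sketch: load_plugin is a hypothetical helper, the library's actual import mechanism may work differently, and the tableschema-<name> distribution name in the hint is inferred from the tableschema-sql install command shown earlier.

```python
import importlib


def load_plugin(name):
    """Import the package tableschema_<name>, or explain how to install it."""
    module_name = 'tableschema_%s' % name
    try:
        return importlib.import_module(module_name)
    except ImportError:
        raise ImportError(
            'Plugin "%s" is not installed. Try: pip install tableschema-%s'
            % (name, name))


try:
    load_plugin('nonexistent_demo')
except ImportError as exception:
    message = str(exception)
    print(message)
```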

A list of officially supported plugins:

BigQuery Storage - https://github.com/frictionlessdata/tableschema-bigquery-py
Pandas Storage - https://github.com/frictionlessdata/tableschema-pandas-py
SQL Storage - https://github.com/frictionlessdata/tableschema-sql-py

API Reference

Snapshot

Table(source, schema=None, post_cast=None, backend=None, **options)
    stream -> tabulator.Stream
    schema -> Schema
    name -> str
    iter(keyed/extended=False) -> (generator) (keyed/extended)row[]
    read(keyed/extended=False, limit=None) -> (keyed/extended)row[]
    save(target, backend=None, **options)
Schema(descriptor)
    descriptor -> dict
    fields -> Field[]
    headers -> str[]
    primary_key -> str[]
    foreign_keys -> str[]
    get_field(name) -> Field
    has_field(name) -> bool
    cast_row(row, no_fail_fast=False) -> row
    save(target)
Field(descriptor)
    descriptor -> dict
    name -> str
    type -> str
    format -> str
    constraints -> dict
    cast_value(value, constraints=True) -> value
    test_value(value, constraints=True) -> bool
validate(descriptor, no_fail_fast=False) -> bool
infer(headers, values) -> descriptor
exceptions
~cli
---
Storage(**options)
    buckets -> str[]
    create(bucket, descriptor, force=False)
    delete(bucket=None, ignore=False)
    describe(bucket, descriptor=None) -> descriptor
    iter(bucket) -> (generator) row[]
    read(bucket) -> row[]
    write(bucket, rows)
plugins

Detailed

Contributing

Please read the contribution guideline:

How to Contribute

Thanks!

Project details

Release history

1.20.11

Apr 1, 2024

1.20.10

Mar 22, 2024

1.20.9

Mar 13, 2024

1.20.7

Mar 13, 2024

1.20.6

Mar 12, 2024

1.20.5

Mar 12, 2024

1.20.4

Mar 12, 2024

1.20.3

Mar 12, 2024

1.20.2

Feb 24, 2021

1.20.1

Feb 24, 2021

1.20.0

Oct 6, 2020

1.19.5

Sep 26, 2020

1.19.4

Sep 12, 2020

1.19.3

Aug 15, 2020

1.19.2

Jun 3, 2020

1.18.0

May 20, 2020

1.17.2

May 18, 2020

1.17.0

Apr 29, 2020

1.16.4

Apr 27, 2020

1.16.2

Apr 24, 2020

1.16.1

Apr 24, 2020

1.16.0

Apr 23, 2020

1.15.3

Mar 26, 2020

1.15.2

Mar 26, 2020

1.15.0

Mar 3, 2020

1.14.0

Mar 2, 2020

1.13.1

Feb 19, 2020

1.13.0

Feb 18, 2020

1.12.5

Feb 10, 2020

1.12.4

Feb 5, 2020

1.12.3

Jan 13, 2020

1.12.2

Dec 17, 2019

1.12.1

Dec 15, 2019

1.12.0

Dec 10, 2019

1.11.0

Nov 25, 2019

1.10.0

Oct 31, 2019

1.9.0

Oct 31, 2019

1.8.0

Oct 9, 2019

1.7.4

Sep 27, 2019

1.7.3

Sep 26, 2019

1.7.2

Sep 18, 2019

1.7.1

Sep 18, 2019

1.7.0

Sep 3, 2019

1.6.0

Jul 8, 2019

1.5.4

Jun 28, 2019

1.5.3

Jun 23, 2019

1.5.2

Jun 10, 2019

1.5.1

Jun 6, 2019

1.5.0

May 23, 2019

1.4.1

Apr 17, 2019

1.4.0

Apr 11, 2019

1.3.3

Mar 25, 2019

1.3.2

Mar 25, 2019

1.3.1

Mar 25, 2019

1.3.0

Nov 26, 2018

1.2.5

Oct 18, 2018

1.2.4

Oct 9, 2018

1.2.3

Oct 8, 2018

1.2.2

Sep 19, 2018

1.2.1

Sep 12, 2018

1.2.0

Aug 16, 2018

1.1.0

May 29, 2018

1.0.13

Apr 12, 2018

1.0.12

Feb 20, 2018

1.0.11

Dec 20, 2017

1.0.10

Nov 20, 2017

1.0.9

Nov 20, 2017

1.0.8

Oct 1, 2017

1.0.7

Sep 30, 2017

1.0.6

Sep 30, 2017

1.0.5

Sep 30, 2017

1.0.4

Sep 27, 2017

1.0.3

Sep 20, 2017

1.0.2

Sep 18, 2017

1.0.1

Sep 7, 2017

1.0.0

Sep 4, 2017

1.0.0a14 pre-release

Aug 31, 2017

1.0.0a13 pre-release

Aug 29, 2017

1.0.0a12 pre-release

Aug 22, 2017

1.0.0a11 pre-release

Aug 22, 2017

1.0.0a10 pre-release

Aug 22, 2017

1.0.0a9 pre-release

Aug 19, 2017

1.0.0a8 pre-release

Jul 27, 2017

This version

1.0.0a7 pre-release

Jun 9, 2017

1.0.0a5 pre-release

May 25, 2017

1.0.0a4 pre-release

Apr 11, 2017

1.0.0a3 pre-release

Apr 5, 2017

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

tableschema-1.0.0a7.tar.gz (55.8 kB)

Uploaded Jun 9, 2017 Source

Built Distribution

tableschema-1.0.0a7-py2.py3-none-any.whl (55.5 kB)

Uploaded Jun 9, 2017 Python 2 Python 3

Hashes for tableschema-1.0.0a7.tar.gz
Algorithm	Hash digest
SHA256	`4feebf7e34a14531d4e3fbbc2dc82606b929a3efa17c695c5cbfc00a068c443a`
MD5	`2107083501d81c20fbe313f68c3bc918`
BLAKE2b-256	`63702f9fc8691a1cafc2189b070c0b0823deaba998aa65be68670f7b061e2926`

Hashes for tableschema-1.0.0a7-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`cd014bf96be51a12eb0128612ed1516cca2d4a4ba0e4ec1bce3cd86766c166f0`
MD5	`1cb5f99457a8347b2055257f1ea6a88f`
BLAKE2b-256	`3ad9d77c2fed04f73b433e7426f68de23c8b1ab31bf70413a3ae42a800ae8087`