
Probabilistic type inference


1 Introduction

This repository provides the source code of a Python package for ptype and its extension ptype-cat.

1.1 ptype

ptype is a probabilistic approach to type inference, which is the task of identifying the data type (e.g. Boolean, date, integer or string) of a given column of data.

Existing approaches often fail on type inference for messy datasets where data is missing or anomalous. With ptype, our goal is to develop a robust method that can deal with such data.

https://raw.githubusercontent.com/alan-turing-institute/ptype/release/notes/motivation.png

In the right-hand figure, normal, missing and anomalous values are shown in green, yellow and red, respectively.

ptype uses Probabilistic Finite-State Machines (PFSMs) to model known data types, missing and anomalous data. Given a column of data, we can infer a plausible column type, and also identify any values which (conditional on that type) are deemed missing or anomalous. In contrast to more familiar finite-state machines, such as regular expressions, that either accept or reject a given data value, PFSMs assign probabilities to different values. They therefore offer the advantage of generating weighted predictions when a column of messy data is consistent with more than one type assignment.
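The difference between accept/reject matching and probabilistic scoring can be illustrated with a toy sketch. This is not ptype's implementation: the single-state machines, the alphabets, the stopping probability, the mixture weights and every function name below are assumptions made for illustration only. Each machine emits characters uniformly from its alphabet and stops with a fixed probability, so it assigns a probability to every string. A column type is then scored by mixing the type machine with a missing-data model and an anomaly machine.

```python
import math
import string

P_STOP = 0.3                 # assumed probability of stopping after each character
WEIGHTS = (0.90, 0.05, 0.05) # assumed mixture weights: type, missing, anomaly

# Assumed alphabets for two toy types and for the catch-all anomaly machine.
TYPE_ALPHABETS = {
    "integer": string.digits,
    "string": string.ascii_letters + string.digits + " ",
}
ANOMALY_ALPHABET = string.digits + string.ascii_letters + string.punctuation + " "
MISSING_TOKENS = ("NA", "", "-")  # assumed missing-data encodings


def pfsm_prob(value, alphabet, p_stop=P_STOP):
    """Probability of `value` under a one-state machine that emits at least one
    character uniformly from `alphabet`, then stops with probability `p_stop`."""
    if not value or any(ch not in alphabet for ch in value):
        return 0.0
    p_char = 1.0 / len(alphabet)
    return p_char * ((1.0 - p_stop) * p_char) ** (len(value) - 1) * p_stop


def missing_prob(value):
    """Uniform probability over the assumed missing-data tokens."""
    return 1.0 / len(MISSING_TOKENS) if value in MISSING_TOKENS else 0.0


def components(value, type_name, weights=WEIGHTS):
    """Weighted probability of `value` being normal, missing or anomalous."""
    w_type, w_missing, w_anomaly = weights
    return {
        "type": w_type * pfsm_prob(value, TYPE_ALPHABETS[type_name]),
        "missing": w_missing * missing_prob(value),
        "anomaly": w_anomaly * pfsm_prob(value, ANOMALY_ALPHABET),
    }


def column_posterior(column):
    """Posterior over column types under a uniform prior."""
    log_lik = {}
    for type_name in TYPE_ALPHABETS:
        # The floor avoids log(0) for values that no machine can emit.
        log_lik[type_name] = sum(
            math.log(max(sum(components(v, type_name).values()), 1e-300))
            for v in column
        )
    top = max(log_lik.values())
    unnorm = {t: math.exp(ll - top) for t, ll in log_lik.items()}
    total = sum(unnorm.values())
    return {t: u / total for t, u in unnorm.items()}


def label_values(column, type_name):
    """Label each value with its most probable role, given the column type."""
    labels = []
    for v in column:
        comp = components(v, type_name)
        labels.append(max(comp, key=comp.get))
    return labels


column = ["1", "22", "NA", "3x", "405"]
posterior = column_posterior(column)
print(posterior)
print(label_values(column, "integer"))
```

A regular expression such as `^\d+$` would simply reject "NA" and "3x". In this sketch, both candidate types receive a weight, and the two irregular values are labelled as missing and anomalous conditional on the integer type.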

If you use this package, please cite the ptype paper, using the following BibTeX entry:

@article{ceritli2020ptype,
  title={ptype: probabilistic type inference},
  author={Ceritli, Taha and Williams, Christopher K I and Geddes, James},
  journal={Data Mining and Knowledge Discovery},
  year={2020},
  volume = {34},
  number = {3},
  pages={870--904},
  doi = {10.1007/s10618-020-00680-1},
}

1.2 ptype-cat

A weakness of ptype is that it does not handle type inference well for non-Boolean categorical variables. For example, most existing methods, including ptype, treat the “Class Name” and “Rating” columns in the example below as string and integer types respectively, rather than as categoricals. The user therefore has to convert the assigned types manually.

https://raw.githubusercontent.com/alan-turing-institute/ptype/release/notes/motivation-ptype-cat.png

The data on the left-hand side are sampled from a dataset about clothing.

To (semi-)automate this manual task, we introduce ptype-cat, an extension of ptype that detects the general categorical type, including non-Boolean categorical variables. ptype-cat combines the output of ptype with additional features, such as the number of unique values in a column. When ptype labels a column as integer or string, ptype-cat runs a logistic regression classifier on these features to decide whether the column denotes a categorical variable.
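The idea can be sketched in a few lines of plain Python. This is a hedged illustration, not the ptype-cat implementation: the feature set, the hand-picked coefficients and the function names are all hypothetical, and the type posterior passed in stands in for whatever ptype would actually return. A real classifier would learn its coefficients from labelled columns.

```python
import math

# Hypothetical coefficients, hand-picked for illustration (not learned from data).
BIAS = 3.0
WEIGHTS = {
    "p_integer": 0.2,
    "p_string": 0.1,
    "log_n_unique": -0.5,
    "unique_ratio": -6.0,
}


def column_features(values, type_posterior):
    """Combine the (assumed) type posterior with unique-value statistics."""
    observed = [v for v in values if v != ""]
    n_unique = len(set(observed))
    return {
        "p_integer": type_posterior.get("integer", 0.0),
        "p_string": type_posterior.get("string", 0.0),
        "log_n_unique": math.log1p(n_unique),
        "unique_ratio": n_unique / len(observed) if observed else 0.0,
    }


def prob_categorical(values, type_posterior):
    """Logistic-regression score that a column is categorical.

    Returns None when the most probable type is neither integer nor string,
    since only those columns are passed to the classifier.
    """
    predicted = max(type_posterior, key=type_posterior.get)
    if predicted not in ("integer", "string"):
        return None
    x = column_features(values, type_posterior)
    z = BIAS + sum(WEIGHTS[name] * x[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


# Made-up example columns: a rating with few distinct values, and row identifiers.
ratings = ["5", "4", "5", "3", "5", "4", "1", "5", "4", "5"] * 10
ids = [str(i) for i in range(100)]
integer_posterior = {"integer": 0.98, "string": 0.02}  # assumed ptype output

print(prob_categorical(ratings, integer_posterior))
print(prob_categorical(ids, integer_posterior))
```

With these illustrative coefficients, the rating column scores well above 0.5 because it has only four distinct values, while the identifier column, where every value is unique, scores close to zero.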

Please see the ptype-cat paper for details. To cite it, use the following BibTeX entry:

@inproceedings{ptype-cat,
  title={ptype-cat: Inferring the Type and Values of Categorical Variables},
  author={Ceritli, Taha and Williams, Christopher K I},
  booktitle={21st ECML-PKDD Automating Data Science Workshop},
  year={2021},
}

2 Install requirements

You can simply install ptype from PyPI:

pip install ptype

3 Usage

See the demo notebooks in the notebooks folder. You can also view them online via Binder.
