ptype

Probabilistic type inference

License
- OSI Approved :: MIT License
Operating System
- OS Independent
Programming Language
- Python :: 3
Topic
- Scientific/Engineering
- Utilities

1 Introduction

This repository provides the source code of a Python package for ptype and its extension ptype-cat.

1.1 ptype

ptype is a probabilistic approach to type inference, which is the task of identifying the data type (e.g. Boolean, date, integer or string) of a given column of data.

Existing approaches often fail on type inference for messy datasets where data is missing or anomalous. With ptype, our goal is to develop a robust method that can deal with such data.

Figure: https://raw.githubusercontent.com/alan-turing-institute/ptype/release/notes/motivation.png. In the right-hand figure, normal, missing and anomalous values are shown in green, yellow and red, respectively.

ptype uses Probabilistic Finite-State Machines (PFSMs) to model known data types, missing and anomalous data. Given a column of data, we can infer a plausible column type, and also identify any values which (conditional on that type) are deemed missing or anomalous. In contrast to more familiar finite-state machines, such as regular expressions, that either accept or reject a given data value, PFSMs assign probabilities to different values. They therefore offer the advantage of generating weighted predictions when a column of messy data is consistent with more than one type assignment.
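The difference between accept/reject matching and weighted predictions can be illustrated with a toy sketch. This is not ptype's implementation: regular expressions stand in for the PFSMs, and every name and number below (`TYPE_MODELS`, `MISSING_TOKENS`, `P_MISSING`, `P_ANOMALY`, the per-type likelihoods) is a hypothetical assumption chosen for illustration.

```python
import math
import re

# Hypothetical stand-ins for ptype's PFSMs: one regex per type, plus a made-up
# likelihood assigned to every value that the regex accepts.
TYPE_MODELS = {
    "integer": (re.compile(r"[+-]?\d+"), 0.5),
    "float": (re.compile(r"[+-]?\d+(\.\d+)?"), 0.3),
    "string": (re.compile(r".+"), 0.05),
}
MISSING_TOKENS = {"", "NA", "NULL", "-"}  # assumed missing-data encodings
P_MISSING = 0.2    # assumed likelihood of a missing-data token
P_ANOMALY = 0.001  # assumed likelihood of a value the type cannot explain


def label_value(value, type_name):
    """Label one value as missing, normal or anomalous, given a column type."""
    pattern, _ = TYPE_MODELS[type_name]
    if value in MISSING_TOKENS:
        return "missing"
    return "normal" if pattern.fullmatch(value) else "anomalous"


def value_likelihood(value, type_name):
    """Likelihood of one value under a type, based on its label."""
    label = label_value(value, type_name)
    if label == "missing":
        return P_MISSING
    if label == "anomalous":
        return P_ANOMALY
    return TYPE_MODELS[type_name][1]


def column_posterior(values):
    """Posterior over column types, assuming a uniform prior over types."""
    log_scores = {
        name: sum(math.log(value_likelihood(v, name)) for v in values)
        for name in TYPE_MODELS
    }
    top = max(log_scores.values())  # subtract the max for numerical stability
    weights = {name: math.exp(score - top) for name, score in log_scores.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


column = ["1", "2", "NA", "3", "x1"]
posterior = column_posterior(column)
best_type = max(posterior, key=posterior.get)
labels = [label_value(v, best_type) for v in column]
print(posterior, best_type, labels)
```

A plain regular expression for integers would simply reject this column because of "NA" and "x1". With the assumed numbers above, the integer type still receives most of the posterior weight (about 0.79), "NA" is labelled missing and "x1" anomalous, while the float and string types keep a smaller share of the weight.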

If you use this package, please cite the ptype paper, using the following BibTeX entry:

@article{ceritli2020ptype,
  title={ptype: probabilistic type inference},
  author={Ceritli, Taha and Williams, Christopher K I and Geddes, James},
  journal={Data Mining and Knowledge Discovery},
  year={2020},
  volume = {34},
  number = {3},
  pages={870--904},
  doi = {10.1007/s10618-020-00680-1},
}

1.2 ptype-cat

A weakness of ptype is that it handles type inference poorly for non-Boolean categorical variables. For example, most existing methods, including ptype, treat the “Class Name” and “Rating” columns in the example below as string and integer types respectively, rather than as categoricals. The user therefore has to convert the assigned types manually.

Figure: https://raw.githubusercontent.com/alan-turing-institute/ptype/release/notes/motivation-ptype-cat.png. The data on the left-hand side are sampled from a dataset about clothing.

To (semi-)automate this manual task, we introduce ptype-cat, an extension of ptype that detects the general categorical type, including non-Boolean categorical variables. When ptype labels a column as integer or string, ptype-cat combines ptype's output with additional features, such as the number of unique values in the column, and runs a logistic regression classifier to decide whether the column denotes a categorical variable.
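The classification step can be sketched as follows. This is a minimal illustration of a logistic regression decision over column-level features, not the ptype-cat code: the feature set, the function names and the weights in `WEIGHTS` are hypothetical and hand-picked here, whereas a real classifier would learn its weights from labelled columns. The `p_type` argument stands for the probability ptype assigned to the column's inferred type and is supplied by hand.

```python
import math

# Hypothetical, hand-picked weights (bias first); a trained model would
# estimate these from labelled data.
WEIGHTS = [2.0, -0.5, -6.0, 0.5]


def column_features(values, p_type):
    """Build a feature vector for one column.

    Features: bias term, log of the number of unique values, fraction of
    values that are unique, and the (assumed) ptype probability for the
    column's inferred integer or string type.
    """
    if not values:
        raise ValueError("column must contain at least one value")
    n_unique = len(set(values))
    return [1.0, math.log(n_unique), n_unique / len(values), p_type]


def p_categorical(values, p_type):
    """Logistic regression: sigmoid of the weighted sum of the features."""
    z = sum(w * x for w, x in zip(WEIGHTS, column_features(values, p_type)))
    return 1.0 / (1.0 + math.exp(-z))


ratings = ["5", "4", "5", "3", "4", "5", "1", "2", "5", "4"] * 10
ids = [str(i) for i in range(100)]
print(p_categorical(ratings, 0.9))  # few unique values: above 0.5
print(p_categorical(ids, 0.9))      # all values unique: below 0.5
```

With these assumed weights, a column with five distinct ratings repeated many times scores as categorical, while a column of unique identifiers does not, even though both would be typed as integer.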

Please see the ptype-cat paper for details. To cite it, use the following BibTeX entry:

@inproceedings{ptype-cat,
  title={ptype-cat: Inferring the Type and Values of Categorical Variables},
  author={Ceritli, Taha and Williams, Christopher K I},
  booktitle={21st ECML-PKDD Automating Data Science Workshop},
  year={2021},
}

2 Install requirements

You can simply install ptype from PyPI:

pip install ptype

3 Usage

See the demo notebooks in the notebooks folder. They can also be viewed online via Binder.

ptype 0.2.17
