FuzzTypes is a Pydantic extension for annotating autocorrecting fields
FuzzTypes
FuzzTypes is a set of "autocorrecting" annotation types that expands upon Pydantic's included data conversions. Designed for simplicity, it provides powerful normalization capabilities (e.g. named entity linking) to ensure structured data is composed of "smart things" not "dumb strings".
Basic Use Case
todo compare and contrast with default Pydantic data conversion
Structured Data Generation Use Case
Several libraries (e.g. Instructor, Outlines, Marvin) use Pydantic to define models for structured data generation using Large Language Models (LLMs) via function calling or a grammar/regex based sampling approach based on the JSON schema generated by Pydantic.
This approach lets you enumerate the allowed values with Python's Literal or Enum, or with JSON Schema's examples field, directly in your Pydantic class declaration, which the LLM then uses to generate valid values. It works exceptionally well for low-cardinality fields (few unique allowed values), such as the world's continents (7 in total).
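As a minimal sketch of the low-cardinality case, assuming Pydantic v2, the continents can be enumerated with Literal and then read back from the generated JSON schema. The Place model and its field names are hypothetical and exist only for illustration.

```python
from typing import Literal

from pydantic import BaseModel

# The seven continents, enumerated directly in the type annotation.
Continent = Literal[
    "Africa", "Antarctica", "Asia", "Europe",
    "North America", "Oceania", "South America",
]


class Place(BaseModel):  # hypothetical model, for illustration only
    name: str
    continent: Continent


# The allowed values surface as an "enum" list in the generated JSON schema,
# which a structured-generation library can constrain its output against.
schema = Place.model_json_schema()
allowed = schema["properties"]["continent"]["enum"]
```

Every allowed value ends up in the schema, which is why this style becomes impractical once the list grows large.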
This approach, however, doesn't scale well to high-cardinality fields (many unique allowed values), such as known human genomic variants (~325M). Where exactly the cutoff lies between "low" and "high" cardinality is an exercise left to the reader and their use case.
That's where FuzzTypes comes in. The allowed values are managed by the FuzzTypes annotations, and the values are resolved during the Pydantic validation process.
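The general mechanism of resolving a value during validation can be sketched with plain Pydantic v2, using Annotated and BeforeValidator. This is not the FuzzTypes API: the alias table, resolve_continent, and the Trip model are hypothetical stand-ins that only illustrate keeping the allowed values outside the schema.

```python
from typing import Annotated

from pydantic import BaseModel, BeforeValidator

# Hypothetical lookup table: lowercase alias -> canonical name.
_CONTINENT_ALIASES = {
    "europe": "Europe",
    "eu": "Europe",
    "north america": "North America",
    "na": "North America",
}


def resolve_continent(value: str) -> str:
    """Map a raw string to its canonical name, or raise ValueError."""
    key = str(value).strip().lower()
    if key not in _CONTINENT_ALIASES:
        raise ValueError(f"unknown continent: {value!r}")
    return _CONTINENT_ALIASES[key]


# The resolver runs before Pydantic's own str validation.
ContinentName = Annotated[str, BeforeValidator(resolve_continent)]


class Trip(BaseModel):  # hypothetical model, for illustration only
    destination: ContinentName


trip = Trip(destination="  EU ")
print(trip.destination)
```

Because the lookup table lives in the validator rather than in the type's enumerated values, its size does not affect the generated JSON schema.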
Base Types
Type | Description |
---|---|
Alias | Match by name or alias. |
Function | Match by calling a custom function. |
Fuzz | Match by name or alias via fuzzy string similarity using RapidFuzz. |
Hybrid | Match by name or alias via reciprocal rank fusion of semantic and fuzzy similarity. |
Name | Match by name only. |
Regex | Match by regular expression pattern using re standard library. |
Semantic | Match by name or alias via vector-based semantic similarity using PyNNDescent. |
Typeahead | Match by name or alias prefix via Trie lookups with fuzzy or semantic fallback. |
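The reciprocal rank fusion mentioned for Hybrid can be illustrated with a short standard-library sketch. This shows the general technique, not FuzzTypes' own implementation; the constant k=60 and the two example rankings are assumptions made up for illustration.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked candidate lists into one ranked list.

    Each candidate scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1. k=60 is a commonly used constant, assumed here.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, candidate in enumerate(ranking, start=1):
            scores[candidate] += 1.0 / (k + rank)
    # Highest fused score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda c: (-scores[c], c))


fuzzy_ranking = ["Georgia", "Germany", "Ghana"]       # made-up fuzzy results
semantic_ranking = ["Germany", "Austria", "Georgia"]  # made-up semantic results
fused = reciprocal_rank_fusion([fuzzy_ranking, semantic_ranking])
print(fused)
```

A candidate that ranks well in both lists (here Germany) outranks one that tops only a single list, which is the point of fusing the two similarity measures.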
Usable Types
Type | Description |
---|---|
ASCII | Convert Unicode string to ASCII equivalent using anyascii. |
Airport | Represents airport names (e.g., O'Hare International Airport) for detailed aviation-related data. |
AirportCode | Manages airport codes (e.g., ORD) for quick and standardized airport identification. |
CleanURL | Normalized URL with trackers removed using url-normalize. |
Country | Represents country names, such as Germany or United States, for standardized country identification. |
CountryCode | Handles ISO country codes (e.g., DE, GB, US) for concise representation of countries. |
Currency | Handles currency codes (e.g., USD) for financial transactions and currency representation. |
Date | Convert date strings to Date object using DateParser. |
Email | Regex for extracting a single valid email from a string. |
Emoji | Matches emojis based on Unicode Consortium aliases. Utilizes the Emoji project for matching. |
Integer | Convert number or ordinal text to an int using NumberParser. |
Language | Manages full language names (e.g., English, German) for clear language specification. |
LanguageCode | Deals with ISO language codes (e.g., en, de) for brief language identification. |
Person | Parse human name into subfields (e.g. first, last, suffix) using python-nameparser. |
Quantity | Converts strings to Quantity objects, combining value and unit of measurement, via Pint. |
SSN | Regex for extracting a single social security number from a string. |
Time | Convert date time strings to DateTime object using DateParser. |
USState | Represents U.S. state names (e.g., Ohio) for detailed geographical categorization within the United States. |
USStateCode | Manages U.S. state codes (e.g., OH) for abbreviated state representation. |
Zipcode | Regex for extracting a 5 or 9 digit zipcode from a string. |
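To make the regex-based types concrete, here is a rough sketch of extracting a 5 or 9 digit zipcode with the re standard library. The pattern and the extract_zipcode helper are assumptions for illustration; the pattern FuzzTypes actually uses may differ (for example in how it treats the hyphen).

```python
import re

# Assumed pattern: five digits, optionally followed by a hyphen and four more.
ZIP_PATTERN = re.compile(r"\b\d{5}(?:-\d{4})?\b")


def extract_zipcode(text: str) -> str:
    """Return the single zipcode found in text, or raise ValueError."""
    matches = ZIP_PATTERN.findall(text)
    if len(matches) != 1:
        raise ValueError(f"expected exactly one zipcode, found {len(matches)}")
    return matches[0]


print(extract_zipcode("Columbus, OH 43215-1234"))
```

Requiring exactly one match mirrors the "single" wording in the table: ambiguous input is rejected rather than guessed at.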
Common Arguments
Argument | Type | Description |
---|---|---|
case_sensitive | bool | If False, matches regardless of case. If True, matches only if case is exact. Default False. |
examples | list | Example values used in schema generation. |
notfound_mode | Literal | raise: Raises an error if key not found. none: Returns None if key not found. allow: Returns key if not found. |
tiebreaker_mode | Literal | raise: Raises error if tied (value, priority). lesser: Returns lower value answer. greater: Returns greater value answer. |
validator_mode | str | before: Resolves value before validation. Currently the only tested option. |
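The documented behavior of case_sensitive and notfound_mode can be mimicked by a small plain-Python helper. The lookup function below is hypothetical and is not part of FuzzTypes; it only restates the table's semantics in executable form.

```python
def lookup(key, names, notfound_mode="raise", case_sensitive=False):
    """Hypothetical helper mirroring the documented argument semantics."""
    if case_sensitive:
        index = {name: name for name in names}
        probe = key
    else:
        # Compare in lowercase, but return the canonical spelling.
        index = {name.lower(): name for name in names}
        probe = key.lower()

    if probe in index:
        return index[probe]
    if notfound_mode == "none":
        return None  # "none": return None if key not found
    if notfound_mode == "allow":
        return key   # "allow": pass the unknown key through unchanged
    raise KeyError(f"key not found: {key!r}")  # "raise" (the fallback here)


states = ["Ohio", "Texas"]
print(lookup("ohio", states))
print(lookup("Guam", states, notfound_mode="allow"))
```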
Lazy Dependencies
FuzzTypes leverages several powerful libraries to extend its functionality.
These dependencies are not installed by default with FuzzTypes to keep the installation lightweight. Instead, they are optional and can be installed as needed depending on which types you use.
Below is a list of these dependencies, including their licenses and what specific Types require them.
Type | Dependency | License | Usage |
---|---|---|---|
ASCII | anyascii | ISC | An alternative to unidecode for Unicode to ASCII conversion, offering extensive character mapping. |
ASCII | unidecode | GPL | Converts Unicode strings to their ASCII equivalents, providing broad character support with minimal size. |
Date | dateparser | BSD-3 | Parses date strings in almost any format to Date objects, supporting multiple locales. |
Emoji | emoji | BSD | Matches emojis based on Unicode Consortium aliases, enhancing text processing with emoji support. |
Fuzz | rapidfuzz | MIT | Performs fuzzy string matching to find close matches to names or aliases with high performance. |
Integer | number-parser | BSD-3 | Converts number or ordinal text to integers, handling both written and numerical forms. |
Person | nameparser | LGPL | Parses human names into subfields (e.g., first, last, suffix), aiding in structured name handling. |
Semantic | pynndescent | MIT | Fast Approximate Nearest Neighbors library for retrieving similar text. |
Semantic | sentence-transformers | MIT | Default embedding library for encoding text into dense vector embeddings. |
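One common way to keep such dependencies optional is to import them only when a type that needs them is first used. The sketch below shows that pattern with importlib; lazy_import is a hypothetical helper, not necessarily how FuzzTypes loads its dependencies.

```python
import importlib


def lazy_import(module_name: str, package_name: str):
    """Import module_name on demand, with an install hint if it is missing."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        raise ImportError(
            f"'{module_name}' is required for this type; "
            f"install it with: pip install {package_name}"
        ) from exc


# Demonstrated with a standard-library module so the sketch runs anywhere.
json_module = lazy_import("json", "json")
print(json_module.loads("[1, 2, 3]"))
```

A type such as Date would call the helper for dateparser inside its own conversion function, so users who never touch that type never need the package installed.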