Simple tokenizers: n-gram and char-gram splitting, whitespace splitting, or splitting with a configurable regular expression. Based on the Span and Token objects from the tokenspan package.
Tokenization for language processing
This package contains generic, configurable tools for cutting a string into sub-parts (cf. Wikipedia), called `Token` objects, and for grouping them into sequences called `Tokens`. A `Token` is a sub-string of a parent string (say, the initial complete text), with an arbitrary number of associated ranges of non-overlapping characters. A `Tokens` is a collection of `Token` objects. These two classes make it possible to attach a collection of attributes to any `Token` in a versatile way, to pass these attributes along while cutting a `Token` into sub-parts (collected as `Tokens`), and eventually to re-merge them into larger `Token` objects.
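To illustrate the idea of a token that remembers its position in the parent string, here is a minimal sketch in plain Python. It is only a conceptual stand-in, not the actual `Token` API of this package: the helper names `split_with_ranges` and `chargrams` are invented for this example.

```python
import re

# Illustration only: a token is represented as a (substring, (start, stop))
# pair, so it keeps track of its character range in the parent string.
# The real Token class is richer (attributes, several ranges, merging, ...).
def split_with_ranges(text, pattern=r"\S+"):
    """Split `text` on a regular expression and keep each match's range."""
    return [(m.group(), (m.start(), m.end())) for m in re.finditer(pattern, text)]

def chargrams(text, n=3):
    """All contiguous n-character substrings with their (start, stop) range."""
    return [(text[i:i + n], (i, i + n)) for i in range(len(text) - n + 1)]

tokens = split_with_ranges("Simple text to cut")
# -> [('Simple', (0, 6)), ('text', (7, 11)), ('to', (12, 14)), ('cut', (15, 18))]

grams = chargrams("text", n=2)
# -> [('te', (0, 2)), ('ex', (1, 3)), ('xt', (2, 4))]
```

Because each piece carries its range, a sub-part can always be mapped back to the parent string, which is what makes later re-merging possible.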
The `Token` and `Tokens` classes support basic tokenization of text, such as word splitting, n-gram splitting, and char-gram splitting of arbitrary size. In addition, several non-overlapping sub-strings can be associated with a given `Token`, and arbitrary attributes can be attached to these parts. Two `Token` objects can be compared in terms of their attributes and/or ranges. Basic mathematical and logical operations (`+`, `-`, `*`, `/`) can also be applied to them, corresponding to the union, difference, intersection and symmetric difference implemented by Python sets; here the sets are the ranges of positions in the parent string.
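The correspondence between these four operations and Python set operations can be sketched with plain built-in sets of character positions. This is only an illustration of the semantics described above, not the package's own operator overloads:

```python
# Illustration only: the +, -, *, / operations described above correspond to
# union, difference, intersection and symmetric difference applied to the
# sets of character positions covered by each token in the parent string.
text = "Simple text"

a = set(range(0, 8))    # positions covered by a first token:  "Simple t"
b = set(range(7, 11))   # positions covered by a second token: "ext" + position 7

union        = a | b    # +  -> positions 0..10 (both tokens together)
difference   = a - b    # -  -> positions 0..6  (first token minus overlap)
intersection = a & b    # *  -> position 7      (the overlap)
sym_diff     = a ^ b    # /  -> positions 0..6 and 8..10 (everything but the overlap)
```

In the package itself these operations act on `Token` objects directly, producing new tokens whose ranges are the resulting position sets.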
Installation
- The documentation is available on https://nlp.frama.io/iamtokenizing/
- The PyPi package is available on https://pypi.org/project/iamtokenizing/
- The official repository is on https://framagit.org/nlp/iamtokenizing
From Python Package Index (PIP)
Running
pip install iamtokenizing
is sufficient.
From the repository
The official repository is on https://framagit.org/nlp/iamtokenizing
Once the repository has been downloaded (or cloned), one can install this package using pip:
git clone https://framagit.org/nlp/iamtokenizing.git
cd iamtokenizing/
pip install .
Once installed, one can run some tests using
cd tests/
python3 -m unittest -v
(the verbosity flag -v is optional).
Basic examples
Basic examples can be found in the documentation.
Versions
- Versions before 0.4 only provide the `Token` and `Tokens` classes. They were later split into three classes, named `Span`, `Token` and `Tokens`. Importantly, the methods `Token.append` and `Token.remove` no longer exist in later versions. They have been replaced by `Token.append_range`, `Token.append_ranges`, `Token.remove_range` and `Token.remove_ranges`.
- Version 0.4 adds the class `Span` alongside `Token` and `Tokens`. `Span` handles the sub-part splitting of a given string, whereas `Token` and `Tokens` now consume `Span` objects and handle the attributes of the `Token`.
- From version 0.5, the basic tools `Span`, `Token` and `Tokens` have been split off from the `iamtokenizing` package (see https://pypi.org/project/iamtokenizing/). Only the advanced tokenizers are now present in the `iamtokenizing` package, which depends on the `tokenspan` package. The objects `Span`, `Token` and `Tokens` can be imported as before from the newly deployed `tokenspan` package, available on https://pypi.org/project/tokenspan/.
About us
Package developed for Natural Language Processing at IAM: Unité d'Informatique et d'Archivistique Médicale, Service d'Informatique Médicale, Pôle de Santé Publique, Centre Hospitalo-Universitaire (CHU) de Bordeaux, France.
You are kindly encouraged to report any problems, and to propose improvements and/or suggestions to the authors, via issues or merge requests.
Last version: June 03, 2021