Simple tokenizers: n-gram and char-gram splitting, whitespace splitting, splitting on a configurable regular expression (REGEX), and REGEX detection with context-aware tokenization. Based on the Span and Token objects from the tokenspan package.

Tokenization for language processing

This package contains basic tools for cutting a string into sub-parts, called tokens (cf. Wikipedia).

The iamtokenizing classes provide basic text tokenization, such as

word splitting and n-gram splitting (using the NGrams class),
char-gram splitting of arbitrary size (using the CharGrams class).
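
To illustrate what these two splitting modes produce, here is a minimal sketch in plain Python. It does not use the iamtokenizing API: the helper names word_ngrams and char_grams, and their parameters, are hypothetical and only mimic the idea behind NGrams and CharGrams.

```python
import re


def word_ngrams(text, n, sep=r"\s+"):
    """Return word n-grams as (start, stop, substring) triples.

    Hypothetical helper, not part of iamtokenizing.
    """
    # Words are the non-empty stretches between separator matches.
    spans, start = [], 0
    for m in re.finditer(sep, text):
        if m.start() > start:
            spans.append((start, m.start()))
        start = m.end()
    if start < len(text):
        spans.append((start, len(text)))
    # Slide a window of n consecutive words over the text.
    grams = []
    for i in range(len(spans) - n + 1):
        a, b = spans[i][0], spans[i + n - 1][1]
        grams.append((a, b, text[a:b]))
    return grams


def char_grams(text, size, step=1):
    """Return all character windows of the given size (hypothetical helper)."""
    return [text[i:i + size] for i in range(0, len(text) - size + 1, step)]


text = "the quick brown fox"
print([g[2] for g in word_ngrams(text, 2)])
# ['the quick', 'quick brown', 'brown fox']
print(char_grams("token", 3))
# ['tok', 'oke', 'ken']
```

Keeping the start and stop positions next to each substring mirrors the idea of a token as a span of the parent string rather than a detached copy.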

NGrams also accepts any regular expression (REGEX) as the pattern on which the string is split. The RegexDetector class, conversely, extracts the matches of a REGEX as tokens. In addition, ContextDetector splits the text on one REGEX and detects a second REGEX inside each split, keeping track of the organisation (called context) of the text between the splitting and detection scales.
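
The difference between splitting on a REGEX, detecting a REGEX, and combining both with a context can be sketched with the standard re module. This is only an illustration under stated assumptions: the functions split_on, detect and detect_in_context are hypothetical names, not the signatures of NGrams, RegexDetector or ContextDetector.

```python
import re


def split_on(pattern, text):
    """Splitting: tokens are the non-empty stretches between matches."""
    spans, start = [], 0
    for m in re.finditer(pattern, text):
        if m.start() > start:
            spans.append((start, m.start()))
        start = m.end()
    if start < len(text):
        spans.append((start, len(text)))
    return spans


def detect(pattern, text, offset=0):
    """Detection: tokens are the non-empty matches themselves."""
    return [(m.start() + offset, m.end() + offset)
            for m in re.finditer(pattern, text) if m.end() > m.start()]


def detect_in_context(split_pattern, detect_pattern, text):
    """Split first, then detect inside each split.

    Each detected span stays attached to the split (its context) in which
    it was found, and all positions refer to the original text.
    """
    return [((a, b), detect(detect_pattern, text[a:b], offset=a))
            for a, b in split_on(split_pattern, text)]


text = "dose: 10 mg, 20 mg. weight: 70 kg."
for (a, b), found in detect_in_context(r"\.\s*", r"\d+", text):
    print(repr(text[a:b]), [text[i:j] for i, j in found])
# 'dose: 10 mg, 20 mg' ['10', '20']
# 'weight: 70 kg' ['70']
```

Here the sentence-like splits play the role of the context, and the numbers are the tokens detected inside each one.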

Installation

The documentation is available at https://nlp.frama.io/iamtokenizing/
The PyPI package is available at https://pypi.org/project/iamtokenizing/
The official repository is at https://framagit.org/nlp/iamtokenizing

From the Python Package Index (PyPI)

Simply run

pip install iamtokenizing

From the repository

Clone (or download) the repository, then install the package using pip:

git clone https://framagit.org/nlp/iamtokenizing.git
cd iamtokenizing/
pip install .

Once installed, one can run some tests using

cd tests/
python3 -m unittest -v

(the -v flag, which enables verbose output, is optional).

Basic examples

Basic examples can be found in the documentation.

Versions

Versions before 0.4 only provide the Token and Tokens classes. These were later split into three classes, named Span, Token and Tokens. Importantly, the methods Token.append and Token.remove no longer exist in later versions. They have been replaced by Token.append_range, Token.append_ranges, Token.remove_range and Token.remove_ranges.
Version 0.4 adds the Span class alongside Token and Tokens. Span handles the splitting of a given string into sub-parts, whereas Token and Tokens now consume Span objects and handle the attributes of the Token.
From version 0.5, the basic tools Span, Token and Tokens have been split off from the iamtokenizing package (see https://pypi.org/project/iamtokenizing/). Only the advanced tokenizers remain in the iamtokenizing package, which depends on the tokenspan package. The objects Span, Token and Tokens can be used as before from the newly deployed tokenspan package, available at https://pypi.org/project/tokenspan/.

About us

Package developed for Natural Language Processing at IAM: Unité d'Informatique et d'Archivistique Médicale, Service d'Informatique Médicale, Pôle de Santé Publique, Centre Hospitalo-Universitaire (CHU) de Bordeaux, France.

You are kindly encouraged to report any problems, and to propose improvements and/or suggestions to the authors, via issues or merge requests.

Last version: August 6, 2021

Release history

0.7.0: Jan 4, 2023
0.6.2: Feb 17, 2022
0.6.1: Feb 17, 2022
0.5.6: Feb 17, 2022 (this version)
0.5.5: Jan 27, 2022
0.5.1: Aug 6, 2021
0.5.0: Aug 6, 2021
0.4.2: Aug 6, 2021
0.3.1: May 28, 2021
0.3.0: May 28, 2021

Download files

Source distribution: iamtokenizing-0.5.6.tar.gz (21.6 kB), uploaded Feb 17, 2022
Built distribution: iamtokenizing-0.5.6-py3-none-any.whl (23.9 kB), uploaded Feb 17, 2022, Python 3

Hashes for iamtokenizing-0.5.6.tar.gz
Algorithm	Hash digest
SHA256	`ab2e81e18d3d51219a177349591a0f296e4eedf31b3ba4a3946315a3b0bd71d7`
MD5	`05c8d2bfe5b4b6f29065d169ecdbf61c`
BLAKE2b-256	`3a6403065f157010d60f1e1a05a6633af383d0a1dbcd7edb2ad5646e06705402`

Hashes for iamtokenizing-0.5.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`26e6b6051f59296f585cec518d4a71d3ef94716512912e344a55befb790e7b10`
MD5	`f9d61321c88d9af7921f9131883b659f`
BLAKE2b-256	`8ad120c0566473f1238fba8d1551ca0d24c7d8dc7181864027bdb53d2e59dbc1`
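
A downloaded file can be checked against the SHA256 digests listed above. The following is a minimal sketch using only the standard hashlib module; the helper name sha256_of_file and the assumption that the archive sits in the current directory are illustrative choices, not part of the package.

```python
import hashlib


def sha256_of_file(path, chunk_size=65536):
    """Return the hexadecimal SHA256 digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so that large files do not have to fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


# Digest published above for the source distribution.
EXPECTED = "ab2e81e18d3d51219a177349591a0f296e4eedf31b3ba4a3946315a3b0bd71d7"

# Assumed location of the downloaded archive; adapt the path as needed.
# print(sha256_of_file("iamtokenizing-0.5.6.tar.gz") == EXPECTED)
```

pip can perform the same check itself when hashes are pinned in a requirements file and the --require-hashes option is used.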