Tools to tokenize a string
Tokenization for language processing
This package provides generic, configurable tools for cutting a string into sub-parts (cf. Wikipedia), called Token, and for grouping them into sequences called Tokens. A Token is a sub-string of a parent string (say, the initial complete text), with an arbitrary number of associated non-overlapping character ranges. A Tokens is a collection of Token objects. These two classes make it possible to attach a collection of attributes to any Token in a versatile way, and to carry these attributes along while cutting a Token into sub-parts (collected as a Tokens) and eventually re-merging them into larger Token objects.
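To make the data structure concrete, here is a minimal sketch of the idea in plain Python. MiniToken is a hypothetical name invented for this illustration; it is not the actual iamtokenizing API, only a toy showing "a parent string plus non-overlapping ranges":

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class MiniToken:
    # Hypothetical toy class, NOT the iamtokenizing implementation.
    string: str                    # the parent string (the complete text)
    ranges: List[Tuple[int, int]]  # non-overlapping (start, stop) positions

    def __str__(self):
        # a Token renders as the concatenation of its covered sub-strings
        return ' '.join(self.string[a:b] for a, b in self.ranges)

text = 'Simple string for demonstration'
tok = MiniToken(text, [(0, 6), (18, 31)])
str(tok)  # -> 'Simple demonstration'
tok.string  # the parent string is kept in full
```

Each sub-part thus remembers where it sits in the original text, which is what allows later merging and comparison by position.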
The Token and Tokens classes support basic tokenization of text, such as word splitting, n-gram splitting, and char-gram splitting of arbitrary size. In addition, a given Token can hold several non-overlapping sub-strings, and arbitrary attributes can be attached to these parts. Two Token objects can be compared in terms of their attributes and/or ranges. One can also apply the basic mathematical operations +, -, * and / to them, corresponding to the union, difference, intersection and symmetric difference implemented by Python sets; here the sets are the ranges of positions in the parent string.
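The set semantics behind these operators can be illustrated with plain Python sets of character positions. This is only a conceptual sketch using the built-in set type, not a call into the library; the mapping of +, -, *, / onto these operations is the one described above:

```python
# Positions covered by two overlapping tokens of the string 'Simple str...'
a = set(range(0, 6))   # positions of 'Simple'
b = set(range(3, 10))  # positions of 'ple st'

union = a | b         # + : positions covered by either token
difference = a - b    # - : positions only in the first token
intersection = a & b  # * : positions common to both tokens
symmetric = a ^ b     # / : positions in exactly one of the two

sorted(intersection)  # -> [3, 4, 5]
```

Because the sets are positions in the same parent string, the result of any such operation is again a well-defined set of ranges, i.e. another Token.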
Installation
Once the repository has been downloaded (or cloned), one can install this package using pip:
pip install .
from the package main folder.
Basic example
Below is a simple example of usage of the Token and Tokens classes.
import re
from iamtokenizing import Token

string = 'Simple string for demonstration and for illustration.'
initial_token = Token(string)

# char-gram generation
chargrams = initial_token.slice(0, len(initial_token), 3)
str(chargrams[2])
# returns 'mpl'
# each char-gram keeps a memory of the initial string
chargrams[2].string
# returns 'Simple string for demonstration and for illustration.'

cuts = [(r.start(), r.end()) for r in re.finditer(r'\w+', string)]
tokens = initial_token.split(cuts)
# --> this is a Tokens instance, not a Token one! (see documentation for explanation)
# tokens keeps the cut parts, but behaves like a list,
# so one takes only the odd elements
interesting_tokens = tokens[1::2]

# n-gram construction
ngram = interesting_tokens.slice(0, len(interesting_tokens), 2)
ngram[2]
# returns Token('for demonstration', 2 ranges)
str(ngram[2])
# returns 'for demonstration'

# add attributes to a Token
tok0 = interesting_tokens[0]
tok0.setattr('name_of_attribute', {'some_key': 'some_value'})
# and get the attribute back
tok0.name_of_attribute
# returns {'some_key': 'some_value'}

# are the two 'for' Token objects the same?
interesting_tokens[2] == interesting_tokens[-2]
# returns False, because they are not at the same position

# reconstruction of a Token
simple_demonstration = interesting_tokens[0:5:3].join()
# one could have done interesting_tokens.join(0, 5, 3) as well
# it contains two non-overlapping sub-parts
str(simple_demonstration)
# returns 'Simple demonstration'

# basic string methods from Python are still available
simple_demonstration.lower()
# returns 'simple demonstration'
Other examples can be found in the documentation folder.
About us
Package developed for Natural Language Processing at IAM: Unité d'Informatique et d'Archivistique Médicale, Service d'Informatique Médicale, Pôle de Santé Publique, Centre Hospitalo-Universitaire (CHU) de Bordeaux, France.
You are kindly encouraged to report any trouble, and to propose improvements and/or suggestions to the authors, via issues or merge requests.
Last version: April 28, 2021
Hashes for iamtokenizing-0.3.0-py3-none-any.whl

Algorithm | Hash digest
---|---
SHA256 | 845fed2aa5b9d58c755bfb051eec7ed7651f884e77b33c54f110b52f4692fae9
MD5 | 86d2891a9c1b160e7a50885a2fc01e65
BLAKE2b-256 | 8c4763da9abccb50001cf5d1065d6a462d54c3bbbc8dc41a237e395a2b072c09