mosestokenizer

Wrappers for several pre-processing scripts from the Moses toolkit.

Project description

This package provides wrappers for some pre-processing Perl scripts from the Moses toolkit, namely, normalize-punctuation.perl, tokenizer.perl, detokenizer.perl and split-sentences.perl.

Sample Usage

All provided classes are importable from the package mosestokenizer.

>>> from mosestokenizer import *

Each class has a constructor that takes a two-letter language code as its argument ('en', 'fr', 'de', etc.), and the resulting objects are callable.

When created, these wrapper objects launch the corresponding Perl script as a background process. When the objects are no longer needed, you should call the .close() method to close the background process and free system resources.

The objects also support the context manager interface. Thus, if used within a with block, the .close() method is invoked automatically when the block exits.
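The lifecycle described above (a background process started on construction, released by .close() or by leaving a with block) can be sketched with the standard library alone. This is a minimal illustration, not the package's actual implementation: the class name LineProcessWrapper and the stand-in child program CHILD_CODE are hypothetical, and a small Python child process is used in place of a Perl script so the sketch runs without Moses installed.

```python
import subprocess
import sys

# Hypothetical stand-in for a Perl script: echoes every input line upper-cased.
CHILD_CODE = (
    "import sys\n"
    "for line in sys.stdin:\n"
    "    sys.stdout.write(line.upper())\n"
    "    sys.stdout.flush()\n"
)


class LineProcessWrapper:
    """Hypothetical wrapper: sends one line to a child process, reads one line back."""

    def __init__(self, argv):
        # The child process is launched as soon as the object is created.
        self._proc = subprocess.Popen(
            argv,
            stdin=subprocess.PIPE,
            stdout=subprocess.PIPE,
            text=True,
            bufsize=1,
        )

    def __call__(self, line):
        # Write one line, flush so the child sees it, then read its one-line reply.
        self._proc.stdin.write(line + "\n")
        self._proc.stdin.flush()
        return self._proc.stdout.readline().rstrip("\n")

    def close(self):
        # Closing stdin signals end of input; the child then exits on its own.
        if self._proc.poll() is None:
            self._proc.stdin.close()
            self._proc.wait()
        self._proc.stdout.close()

    def __enter__(self):
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        # Leaving the with block always releases the child process.
        self.close()


if __name__ == "__main__":
    with LineProcessWrapper([sys.executable, "-c", CHILD_CODE]) as shout:
        print(shout("hello world"))
```

Because __exit__ calls close(), the child process is released even if the body of the with block raises an exception, which is the main reason to prefer the with form over an explicit .close() call.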

The following two usages of MosesTokenizer are equivalent:

>>> # here we will call .close() explicitly at the end:
>>> tokenize = MosesTokenizer('en')
>>> tokenize('Hello World!')
['Hello', 'World', '!']
>>> tokenize.close()

>>> # here we take advantage of the context manager interface:
>>> with MosesTokenizer('en') as tokenize:
...     tokenize('Hello World!')
...
['Hello', 'World', '!']

As shown above, MosesTokenizer callable objects take a string and return a list of tokens (strings).

By contrast, MosesDetokenizer takes a list of tokens and returns a string:

>>> with MosesDetokenizer('en') as detokenize:
...     detokenize(['Hello', 'World', '!'])
...
'Hello World!'
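To make the two input and output contracts concrete without requiring Perl or the package, here is a deliberately naive, pure-Python sketch. The functions naive_tokenize and naive_detokenize are hypothetical and are not part of mosestokenizer; they use simple regular expressions and do not reproduce the rules of the Moses scripts. They only mirror the shapes involved: string to list of tokens, and list of tokens to string.

```python
import re


def naive_tokenize(text):
    """Hypothetical stand-in: split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def naive_detokenize(tokens):
    """Hypothetical stand-in: join with spaces, then reattach common closing punctuation."""
    joined = " ".join(tokens)
    return re.sub(r"\s+([!?.,;:])", r"\1", joined)


tokens = naive_tokenize("Hello World!")
print(tokens)                    # ['Hello', 'World', '!']
print(naive_detokenize(tokens))  # Hello World!
```

The real scripts handle language-specific cases (abbreviations, apostrophes, quotation marks) that this sketch ignores.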

MosesSentenceSplitter does more than its name suggests. Besides splitting sentences, it also unwraps text, i.e. it tries to guess whether a sentence continues on the next line. It takes a list of lines (strings) and returns a list of sentences (strings):

>>> with MosesSentenceSplitter('en') as splitsents:
...     splitsents([
...         'Mr. Smith is away.  Do you want to',
...         'leave a message?'
...     ])
...
['Mr. Smith is away.', 'Do you want to leave a message?']
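The idea of unwrapping lines and then splitting on sentence boundaries can be sketched in a few lines of plain Python. This is a simplified illustration under stated assumptions, not the algorithm used by split-sentences.perl: the function naive_split_sentences and the tiny NONBREAKING_PREFIXES set are hypothetical, and the only heuristic is "a word ending in '.', '?' or '!' closes a sentence if it is not a known abbreviation and the next word starts with an uppercase letter".

```python
# Hypothetical, tiny abbreviation list used only for this illustration.
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr", "Prof"}


def naive_split_sentences(lines):
    """Join wrapped lines into one word stream, then cut it at likely sentence ends."""
    words = " ".join(lines).split()  # unwrapping: line breaks become ordinary spaces
    sentences, current = [], []
    for index, word in enumerate(words):
        current.append(word)
        next_word = words[index + 1] if index + 1 < len(words) else ""
        ends_sentence = (
            word[-1] in ".?!"
            and word.rstrip(".?!") not in NONBREAKING_PREFIXES
            and next_word[:1].isupper()
        )
        if ends_sentence:
            sentences.append(" ".join(current))
            current = []
    if current:
        # Whatever remains after the last boundary is the final sentence.
        sentences.append(" ".join(current))
    return sentences


print(naive_split_sentences([
    "Mr. Smith is away.  Do you want to",
    "leave a message?",
]))
# ['Mr. Smith is away.', 'Do you want to leave a message?']
```

Note how "Mr." does not end a sentence because it is in the abbreviation set, and how the line break after "to" disappears because the lines are joined before any splitting happens.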

MosesPunctuationNormalizer objects take a string as argument and return a string:

>>> with MosesPunctuationNormalizer('en') as normalize:
...     normalize('«Hello World» — she said…')
...
'"Hello World" - she said...'
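The kind of mapping shown in this example can be approximated with a translation table. The sketch below is an assumption-laden illustration, not the rule set of normalize-punctuation.perl: the function naive_normalize_punctuation and its small table are hypothetical and cover only a handful of characters.

```python
# Hypothetical, partial mapping from typographic characters to ASCII equivalents.
_PUNCTUATION_TABLE = str.maketrans({
    "«": '"',
    "»": '"',
    "“": '"',
    "”": '"',
    "—": "-",
    "…": "...",
})


def naive_normalize_punctuation(text):
    """Replace each mapped character; all other characters pass through unchanged."""
    return text.translate(_PUNCTUATION_TABLE)


print(naive_normalize_punctuation("«Hello World» — she said…"))
# "Hello World" - she said...
```

The Perl script applies many more rules, some of them language-dependent, which is why the wrapper takes a language code.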

License

This library is free software; you can redistribute it and/or modify it under the terms of the GNU Lesser General Public License as published by the Free Software Foundation; either version 2.1 of the License, or (at your option) any later version.

This library is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU Lesser General Public License for more details.

You should have received a copy of the GNU Lesser General Public License along with this library; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA

Release history

1.2.1 (Oct 22, 2021)
1.1.0 (Oct 24, 2019)
1.0.0 (May 24, 2017), this version
0.5.0 (May 24, 2017)
0.3.0 (Aug 17, 2016)

Download files

Source Distribution: mosestokenizer-1.0.0.tar.gz (33.7 kB), uploaded May 24, 2017

Built Distribution: mosestokenizer-1.0.0-py3-none-any.whl (51.2 kB), uploaded May 24, 2017, Python 3

Hashes for mosestokenizer-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`2d65a781add83e93612a5e491a2cfc9c3740048b8a028556a4e23fceb1a7d48a`
MD5	`ab8f1fea7c23bfbc132f36aee4eb1808`
BLAKE2b-256	`dc12cdc143b9e13c3f235ff10de86a16c8074982be6b5b22be9724603bb4872a`

Hashes for mosestokenizer-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`4de94102c00ad21ea26c1d8327bf72d38288c6c83c9b2920fb9a86c66eddf8b7`
MD5	`dd56b4ad98df0fef082caceb0f3b2a9d`
BLAKE2b-256	`45c6913c968e5cbcaff6cdd2a54a1008330c01a573ecadcdf9f526058e3d33a0`