spoteno is a library for spoken text normalization for ASR

spoteno

spoteno (Spoken-Text-Normalization) is a tool to clean up text transcripts for speech recognition systems. These systems normally expect target transcripts to contain only characters from a restricted set.

Installation

Install the latest development version:

pip install git+https://github.com/ynop/spoteno.git

Examples

The default use case is to normalize a sentence. This forces the output string to contain only valid characters (as defined by the configuration).

import spoteno

sentence = ('Am 11. Januar geht er um 5m nach links,'
            'weshalb er $d schon "ziemlich" müde ist.')

norm = spoteno.Normalizer.de()
outsent = norm.normalize(sentence)
print(outsent)

# >>> am elfte januar geht er um fünf m nach links weshalb er d schon ziemlich müde ist

With force=False, the final cleanup can be disabled. Invalid characters may then occur in the output if the configuration hasn't handled them specifically.

outsent = norm.normalize(sentence, force=False)
print(outsent)

# >>> am elfte januar geht er um fünf m nach links weshalb er $d schon ziemlich müde ist
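
To make the difference between the two modes concrete, the following is a minimal sketch of what a forced final cleanup could look like. It is not spoteno's implementation: the function name force_cleanup and the character set VALID_CHARS are assumptions made for illustration only.

```python
# Hypothetical set of valid characters; the set used by spoteno's German
# configuration is not shown in this README and may differ.
VALID_CHARS = set("abcdefghijklmnopqrstuvwxyzäöüß ")


def force_cleanup(text, valid_chars=VALID_CHARS):
    """Return text with every character outside valid_chars removed."""
    return "".join(ch for ch in text if ch in valid_chars)


unforced = "weshalb er $d schon ziemlich müde ist"
print(force_cleanup(unforced))
# weshalb er d schon ziemlich müde ist
```

Dropping characters silently is the simplest strategy. It guarantees a valid output, but it can hide gaps in the configuration, which is why inspecting the unforced output is useful.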

With the debug method, one can retrieve the set of invalid characters in the final output. This can be used to create or debug a configuration. Additionally, the outputs of the individual configuration steps are printed.

outsent, error = norm.debug(sentence)
print(error)

# >>> START               Am 11. Januar geht er um 5m nach links,weshalb er $d schon "ziemlich" müde ist.
# >>> Strip               ['Am 11. Januar geht er um 5m nach links,weshalb er $d schon "ziemlich" müde ist.']
# >>> Lower               ['am 11. januar geht er um 5m nach links,weshalb er $d schon "ziemlich" müde ist.']
# >>> StripChar           ['am 11. januar geht er um 5m nach links,weshalb er $d schon "ziemlich" müde ist']
# >>> ReplaceIfNotSurroundedByDigits['am 11. januar geht er um 5m nach links weshalb er $d schon "ziemlich" müde ist']
# >>> ReplaceIfNotPrecededByDigit['am 11. januar geht er um 5m nach links weshalb er $d schon "ziemlich" müde ist']
# >>> ReplaceRegex        ['am 11. januar geht er um 5m nach links weshalb er $d schon "ziemlich" müde ist']
# >>> ReplaceChar         ['am 11. januar geht er um 5m nach links weshalb er $d schon  ziemlich  müde ist']
# >>> ReplaceChar         ['am 11. januar geht er um 5m nach links weshalb er $d schon  ziemlich  müde ist']
# >>> WhitespaceTokenize  ['am', '11.', 'januar', 'geht', 'er', 'um', '5m', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> SplitNumberSuffix   ['am', '11.', 'januar', 'geht', 'er', 'um', '5', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> NumberToWords       ['am', '11.', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> OrdinalNumberToWords['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> ReplaceChar         ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> ReplaceFull         ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> RemoveDiacritics    ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> Strip               ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']
# >>> END                 ['am', 'elfte', 'januar', 'geht', 'er', 'um', 'fünf', 'm', 'nach', 'links', 'weshalb', 'er', '$d', 'schon', 'ziemlich', 'müde', 'ist']

# >>> {'$'}
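
The trace above suggests a chain of steps that each take a list of strings and return a new list. The sketch below imitates that idea with three simplified steps and a check for leftover invalid characters. All names here (debug_pipeline, STEPS, VALID_CHARS and the step functions) are hypothetical and only loosely modelled on the step names in the trace; spoteno's actual classes and configuration may work differently.

```python
import re

# Hypothetical valid character set, chosen for illustration only.
VALID_CHARS = set("abcdefghijklmnopqrstuvwxyzäöüß ")


def lower(tokens):
    """Lowercase every string."""
    return [t.lower() for t in tokens]


def whitespace_tokenize(tokens):
    """Split every string on whitespace into separate tokens."""
    return [part for t in tokens for part in t.split()]


def split_number_suffix(tokens):
    """Split tokens such as '5m' into a number and a letter suffix."""
    out = []
    for t in tokens:
        match = re.fullmatch(r"(\d+)([^\W\d_]+)", t)
        out.extend(match.groups() if match else [t])
    return out


STEPS = [
    ("Lower", lower),
    ("WhitespaceTokenize", whitespace_tokenize),
    ("SplitNumberSuffix", split_number_suffix),
]


def debug_pipeline(text, steps=STEPS, valid_chars=VALID_CHARS):
    """Run all steps, recording each intermediate result and leftover invalid characters."""
    tokens = [text]
    trace = [("START", tokens)]
    for name, step in steps:
        tokens = step(tokens)
        trace.append((name, tokens))
    invalid = {ch for ch in " ".join(tokens) if ch not in valid_chars}
    return tokens, trace, invalid


tokens, trace, invalid = debug_pipeline("Um 5m $d")
for name, result in trace:
    print(f"{name:<20}{result}")
print(sorted(invalid))
```

In this reduced pipeline the digit 5 is also reported as invalid, because no step converts numbers to words. A report like this shows which step is still missing from a configuration.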

Development

Prerequisites

It's recommended to use a virtual environment when developing spoteno. To create one, execute the following command in the project's root directory:

python -m venv .

To install spoteno and all its dependencies, execute:

pip install -e .

Running the test suite

pip install -e .[dev]
python setup.py test

With PyCharm you might have to change the default test runner. Otherwise, it might only suggest using nose. To do so, go to File > Settings > Tools > Python Integrated Tools (on the Mac it's PyCharm > Preferences > Settings > Tools > Python Integrated Tools) and change the test runner to py.test.

Versions

Versions are handled using bump2version. To bump the version:

bump2version [major,minor,patch,release,num]

To go directly to a final release version (skipping .dev/.rc/...):

bump2version [major,minor,patch] --new-version x.x.x

Release

Commands to create a new release on PyPI:

rm -rf build
rm -rf dist

python setup.py sdist
python setup.py bdist_wheel
twine upload dist/*
