ArmSpeech is an offline Armenian speech recognition library (speech-to-text) and CLI tool based on Coqui STT (🐸STT) and trained on the ArmSpeech dataset.

ArmSpeech: Armenian Speech Recognition Library.

ArmSpeech is an offline Armenian speech recognition library (speech-to-text) and CLI tool based on Coqui STT (🐸STT) and trained on the ArmSpeech dataset. Coqui STT (🐸STT) is an open-source implementation of Baidu’s Deep Speech deep neural network. The engine is based on a recurrent neural network (RNN) and consists of 5 layers of hidden units.

The acoustic model and the language model work together to improve prediction accuracy. The acoustic model uses a sequence-to-sequence algorithm to learn which acoustic signals correspond to which letters of the language's alphabet; it outputs probabilities for each character class, not for whole words. To distinguish homophones (words that sound the same but are spelled differently), a language model predicts which words are likely to follow each other in a sequence (n-gram modeling).
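As a rough illustration of the n-gram idea, the sketch below builds a toy bigram model and uses it to choose between two candidate transcripts that sound alike. The corpus, the English example words, and the helper name `bigram_logprob` are made up for illustration; the actual language model is built with KenLM, not with this code.

```python
import math
from collections import Counter

# Toy training text (hypothetical; it stands in for the scraped news corpus).
corpus = (
    "i want to go to the park . "
    "they went to the park too . "
    "two dogs ran to the park"
).split()

unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))
vocab_size = len(unigrams)


def bigram_logprob(sentence: str) -> float:
    """Sum of add-one-smoothed bigram log-probabilities of a sentence."""
    words = sentence.split()
    total = 0.0
    for prev, word in zip(words, words[1:]):
        total += math.log((bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab_size))
    return total


# Two candidates that an acoustic model alone could confuse.
candidates = ["went to the park", "went two the park"]
best = max(candidates, key=bigram_logprob)
print(best)
```

The candidate whose word pairs were seen more often in the training text gets the higher score, which is how a language model can settle cases the acoustic model cannot.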

The acoustic model was trained and validated on the ArmSpeech Armenian spoken language corpus, 15.7 hours in total. The language model was trained with the KenLM Language Model Toolkit. The text data for language model training was scraped from Armenian news website articles about medicine, sport, culture, lifestyle, and politics.

If you want to help me increase the accuracy of transcriptions, then

API

ArmSpeech can be used both as a Python module and a CLI tool. The library can be used in two ways:

transcribe a WAV audio file,
transcribe an audio stream from the microphone.

In both cases the audio must have the same parameters:

WAV audio format,
mono channel,
16000 Hz sample rate.
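A file can be checked against these parameters before transcription with Python's standard `wave` module. This is a minimal sketch; `check_wav_format` is a hypothetical helper and not part of ArmSpeech.

```python
import wave
from typing import List


def check_wav_format(wav_path: str) -> List[str]:
    """Return a list of problems; an empty list means the file matches."""
    problems = []
    # wave.open raises wave.Error if the file is not a valid WAV file.
    with wave.open(wav_path, "rb") as wav_file:
        channels = wav_file.getnchannels()
        sample_rate = wav_file.getframerate()
    if channels != 1:
        problems.append(f"expected mono, got {channels} channels")
    if sample_rate != 16000:
        problems.append(f"expected 16000 Hz, got {sample_rate} Hz")
    return problems
```

For example, `check_wav_format('path/to/wav/audio')` returns an empty list when the file is mono at 16000 Hz.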

Python

Function name	Description
`set_beam_width(self, beam_width: int) -> int`	Set the beam width of the model, that is, the beam width used by the CTC decoder when building candidate transcriptions. A larger beam width generates better results but increases decoding time. The function takes an integer (`beam_width`) and returns zero on success and non-zero on failure. The default value is 1024.
`set_scorer_alpha_beta(alpha: float, beta: float) -> int`	Set the `alpha` and `beta` hyperparameters of the external scorer: the language model weight (`alpha`) and the word insertion weight (`beta`) of the decoder. The function takes two floats (`alpha`, `beta`) and returns zero on success and non-zero on failure. The default values are 0.931289039105002 for `alpha` and 1.1834137581510284 for `beta`.
`from_wav(self, wav_path: str, get_metadata: bool = False) -> str`	Transcribe a WAV audio file. The function takes two parameters: the absolute path of the audio file (`wav_path`) and a boolean (`get_metadata`) that enables metadata generation. `get_metadata` is optional and defaults to false. The function returns either the transcript or a tuple of metadata that also includes the transcript.
`from_mic(self, vad_aggresivness: int = 3, spinner: bool = False, wav_save_path: str = None, get_metadata = False)`	Transcribe an audio stream taken from the microphone. The generator function takes four parameters: an integer (`vad_aggresivness`) in the range [0, 3] for voice activity detection aggressiveness, a boolean (`spinner`) for showing a spinner in the console while voice activity is detected, an absolute path (`wav_save_path`) where transcribed speech is saved, and a boolean (`get_metadata`) that enables metadata generation. All parameters are optional; the defaults are 3 for `vad_aggresivness`, false for `get_metadata` and `spinner`, and empty for `wav_save_path`. The generator yields either the transcript or a tuple of metadata that also includes the transcript.
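
To give a feel for what `alpha` and `beta` do, here is a small sketch of the scoring rule described in the Deep Speech paper, where a candidate is ranked by its acoustic log-probability plus `alpha` times its language model log-probability plus `beta` times its word count. The candidate strings and their log-probabilities are invented numbers, and `combined_score` and `best_candidate` are hypothetical helpers, not part of ArmSpeech or Coqui STT; the real decoder applies these weights during beam search.

```python
def combined_score(acoustic_logprob, lm_logprob, word_count, alpha, beta):
    """Acoustic score plus weighted language model score plus word bonus."""
    return acoustic_logprob + alpha * lm_logprob + beta * word_count


# (transcript, acoustic log-probability, LM log-probability); invented values.
candidates = [
    ("went to the park", -4.2, -6.0),
    ("went two the park", -4.0, -11.5),
]


def best_candidate(candidates, alpha, beta):
    """Return the transcript with the highest combined score."""
    def score(candidate):
        text, acoustic, lm = candidate
        return combined_score(acoustic, lm, len(text.split()), alpha, beta)
    return max(candidates, key=score)[0]


# With the language model switched off, the acoustic score alone decides.
print(best_candidate(candidates, alpha=0.0, beta=0.0))
# With weights close to the defaults above, the language model tips the ranking.
print(best_candidate(candidates, alpha=0.93, beta=1.18))
```

A larger `alpha` makes the decoder trust the language model more, while a larger `beta` favors transcripts with more words.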

The from_mic() generator function uses voice activity detection to detect speech by distinguishing between silence and speech. This is done with the free Python module “webrtcvad”, a Python interface to the WebRTC Voice Activity Detector (VAD) developed by Google. The application determines voice activity from the ratio of non-null to null frames within 300 milliseconds: at least 75% of the frames in that window must be non-null.
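
The 75% rule can be sketched as a sliding window over per-frame speech decisions. The sketch assumes 30 ms frames, so 10 frames cover 300 ms, and takes the per-frame decisions as plain booleans; in practice they would come from something like `webrtcvad.Vad.is_speech`. The function name and the frame length are assumptions, not ArmSpeech internals.

```python
from collections import deque
from typing import Iterable, Optional


def first_trigger_index(
    frame_flags: Iterable[bool],
    window_frames: int = 10,   # assumed: 10 frames of 30 ms = 300 ms
    ratio: float = 0.75,
) -> Optional[int]:
    """Return the index of the frame at which a full window first
    contains at least `ratio` speech frames, or None if it never does."""
    window = deque(maxlen=window_frames)
    for index, is_speech in enumerate(frame_flags):
        window.append(is_speech)
        if len(window) == window_frames and sum(window) >= ratio * window_frames:
            return index
    return None


# Six silent frames followed by ten speech frames.
flags = [False] * 6 + [True] * 10
print(first_trigger_index(flags))
```

Requiring a ratio over a window instead of a single speech frame keeps short clicks or noise bursts from starting a transcription.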

In the from_mic() and from_wav() functions, setting the get_metadata parameter to true returns metadata for the audio file or stream, which includes the transcript, the confidence score, and the position of each token in seconds. An example of returned metadata is below:

('հայերն աշխարհի հնագույն ազգերից մեկն են', -7.672598838806152, ('հ', 0.29999998211860657), ('ա', 0.41999998688697815), ('յ', 0.4399999976158142), ('ե', 0.5), ('ր', 0.5199999809265137), ('ն', 0.5399999618530273), (' ', 0.6800000071525574), ('ա', 0.699999988079071), ('շ', 0.7400000095367432), ('խ', 0.8999999761581421), ('ա', 0.9399999976158142), ('ր', 0.9599999785423279), ('հ', 1.0), ('ի', 1.0399999618530273), (' ', 1.1799999475479126), ('հ', 1.1999999284744263), ('ն', 1.2400000095367432), ('ա', 1.399999976158142), ('գ', 1.5), ('ո', 1.5199999809265137), ('ւ', 1.5799999237060547), ('յ', 1.6799999475479126), ('ն', 1.7799999713897705), (' ', 1.7999999523162842), ('ա', 2.0799999237060547), ('զ', 2.0999999046325684), ('գ', 2.2200000286102295), ('ե', 2.3399999141693115), ('ր', 2.379999876022339), ('ի', 2.4600000381469727), ('ց', 2.4800000190734863), (' ', 2.5), ('մ', 2.679999828338623), ('ե', 2.700000047683716), ('կ', 2.8399999141693115), ('ն', 2.93999981880188), (' ', 2.9600000381469727), ('ե', 2.9800000190734863), ('ն', 3.319999933242798))
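
Since the metadata is a flat tuple (the transcript, the confidence score, then one `(character, time)` pair per token), it can be unpacked and grouped into words. The sketch below uses a shortened copy of the example above with rounded times and an illustrative confidence value; `words_with_start_times` is a hypothetical helper, not part of the library.

```python
# Shortened sample in the same layout as the example above (values rounded).
metadata = (
    'հայերն աշխարհի', -7.67,
    ('հ', 0.30), ('ա', 0.42), ('յ', 0.44), ('ե', 0.50), ('ր', 0.52), ('ն', 0.54),
    (' ', 0.68),
    ('ա', 0.70), ('շ', 0.74), ('խ', 0.90), ('ա', 0.94), ('ր', 0.96), ('հ', 1.00), ('ի', 1.04),
)


def words_with_start_times(metadata):
    """Split metadata into transcript, confidence, and (word, start time) pairs."""
    transcript, confidence, *tokens = metadata
    words = []
    current, start = "", None
    for char, time in tokens:
        if char == " ":
            # A space token closes the current word.
            if current:
                words.append((current, start))
            current, start = "", None
        else:
            if start is None:
                start = time
            current += char
    if current:
        words.append((current, start))
    return transcript, confidence, words


transcript, confidence, words = words_with_start_times(metadata)
for word, start in words:
    print(f"{start:.2f}s  {word}")
```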

CLI

The CLI takes seven optional parameters: wav_path, beam_width, alpha_beta, get_metadata, spinner, vad_aggresivness, and wav_save_path. Their descriptions and return values are the same as for the Python API. If the wav_path parameter is given, the audio file is transcribed; otherwise microphone streaming starts.

Install

pip install armspeech

Usage examples

Python

# Import the library
from armspeech import ArmSpeech_STT

# Create the recognizer object
armspeech_stt = ArmSpeech_STT()

# Transcribe a WAV audio file
result = armspeech_stt.from_wav(wav_path='path/to/wav/audio', get_metadata=True)
print(result)

# Start microphone streaming
for result in armspeech_stt.from_mic(vad_aggresivness=2, spinner=True, wav_save_path='path/to/transcribed/speeches', get_metadata=False):
    print(result)

CLI

armspeech_stt_cli --wav_path path/to/wav/audio --beam_width 2048 --alpha_beta 0.7 1.3 --get_metadata True



Release history

This version

0.1.4

Jun 6, 2023

Download files


Source Distribution

armspeech-0.1.4.tar.gz (21.1 MB)

Uploaded Jun 6, 2023 Source

Hashes for armspeech-0.1.4.tar.gz

Algorithm	Hash digest
SHA256	`29723aee9b13f59c77970b3087f10552aebaadc1ca2fba42bf81ab397c507d9f`
MD5	`5322014bb3973353ecc9048119dfca0c`
BLAKE2b-256	`ee1ab5b452600042f633bb26ad39a46bb6560ef160721a0377251017bc168aa2`