audiodatasets

Pulls and pre-processes major Open Source (non-commercial mostly) datasets for spoken audio

Supported Datasets:
- Librispeech (60GB)
- TEDLIUM_release2 (35GB)
- VCTK-Corpus (11GB)

This library is intended for use on Linux servers. It is expected that you will use it to feed a machine learning system; that is not required, but it is the point of collecting these datasets.

The software is MIT licensed, but please note that the datasets themselves are generally for non-commercial use only.

Features

- Downloads common Open Source datasets and performs basic preprocessing on them
- Provides iterables that produce Numpy arrays from the audio data in common formats
- Uses sphfile to directly access sph files instead of needing to convert to wav first
- Uses a single shared location for the datasets, intended to be used by multiple projects

Installation/Setup

You need to create the download directory and make it writable by the running user. Preferably you will do that via group-based permissions to allow sharing; the commands below assign ownership to a specific user and group:

$ mkdir -p /var/datasets
$ chown user:group /var/datasets
$ chmod g+rw /var/datasets

If /var/datasets doesn’t exist, or isn’t writable, the downloader will instead populate ~/.config/datasets with the data. You may wish to link that directory to /var/datasets so that you can use default instantiations of the corpora:

$ ln -s /var/datasets ~/.config/datasets
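
The fallback rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated, not the library's own code; the function name `resolve_dataset_dir` and its parameters are hypothetical.

```python
import os


def resolve_dataset_dir(preferred="/var/datasets", fallback="~/.config/datasets"):
    """Return `preferred` if it is an existing, writable directory.

    Otherwise return the expanded `fallback` path. (Hypothetical helper,
    assumed for illustration; the downloader's real logic may differ.)
    """
    if os.path.isdir(preferred) and os.access(preferred, os.W_OK):
        return preferred
    return os.path.expanduser(fallback)


if __name__ == "__main__":
    # Shows which directory would be chosen on this machine
    print(resolve_dataset_dir())
```

Running this before a large download is a cheap way to confirm where 100+GB of data would end up.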

Note that the downloader expects the following tools to be available; this may not yet be the case in a Docker or minimal OS installation:

- tar
- wget
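
If you want to check for these tools from Python before starting a download (for example in a container entrypoint), a minimal sketch using the standard library's `shutil.which` could look like this. The helper name `missing_tools` is hypothetical and not part of audiodatasets.

```python
import shutil


def missing_tools(tools=("tar", "wget")):
    """Return the names in `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing required tools: " + ", ".join(missing))
    else:
        print("tar and wget are available")
```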

Now you can download the datasets.

From a command prompt:

$ pip install audiodatasets
# this will download 100+GB and then unpack it on disk, it will take a while...
$ audiodatasets-download

Creating MFCC data-files:

# this will generate Mel-frequency cepstral coefficient (MFCC) summaries for the
# audio datasets (and download them if that hasn't been done). This isn't necessary
# if you are doing only raw-audio processing
$ audiodatasets-preprocess

Playing some audio:

# this will iterate through playing every utterance that includes 'moon' in the transcript
$ audiodatasets-search 'moon'

Usage

Once set up, you will likely want to iterate over the datasets using, for instance, a partition to separate out train/test/validation data. To iterate over the raw audio:

from audiodatasets.corpora import build_corpora, partition
import random

def train_valid_test():
    """Create training, validation and test datasets

    returns three iterators yielding (array[10:512],transcript) batches
    """
    utterances = []
    for corpus in build_corpora():
        utterances.extend( corpus.iter_utterances())
    random.shuffle(utterances)
    train, test,valid = partition( utterances, (3,1,1) )
    def generation( utterances ):
        while True:
            offset = random.randint(0,511)
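            for name,transcript,audio_file in utterances:
                # NOTE: `t` is not defined in this snippet as published; bind it
                # to the object that provides iter_batches before running
                for batch in t.iter_batches( audio_file, batch_size=10, input=512, offset=offset ):
                    yield batch,transcript
    # returned in the order the function name promises: train, valid, test
    return generation(train),generation(valid),generation(test)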
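
To make the `(3,1,1)` ratio concrete, here is a minimal sketch of a ratio-based split. It is a stand-in written for illustration and is not the library's `partition`; the name `partition_by_ratio` is hypothetical, and the real function may round or return its pieces differently.

```python
def partition_by_ratio(items, ratios):
    """Split `items` into consecutive slices sized in proportion to `ratios`.

    Slice boundaries use integer arithmetic on the cumulative ratio, so
    every item lands in exactly one slice.
    """
    total = sum(ratios)
    parts = []
    start = 0
    cumulative = 0
    for ratio in ratios:
        cumulative += ratio
        stop = len(items) * cumulative // total
        parts.append(items[start:stop])
        start = stop
    return parts


# Ten items split 3:1:1 gives slices of 6, 2 and 2 items
train, test, valid = partition_by_ratio(list(range(10)), (3, 1, 1))
print(len(train), len(test), len(valid))  # 6 2 2
```

Because the utterances are shuffled before partitioning, consecutive slices like these are effectively random samples.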

To iterate over the MFCC preprocessed data, which yields 20 frequency bins per 10ms processing window:

from audiodatasets.corpora import build_corpora, partition
import random

def train_valid_test():
    """Create training, validation and test datasets

    Note: within each batch, *time* is the fastest-varying (last) axis,
    while the frequency bins are the second-fastest axis.

    See: `LibRosa MFCC <https://librosa.github.io/librosa/generated/librosa.feature.mfcc.html>`_

    returns three iterators yielding (array[10:20:63],transcript) batches
    """
    utterances = []
    for corpus in build_corpora():
        utterances.extend( corpus.mfcc_utterances())
    random.shuffle(utterances)
    train, test,valid = partition( utterances, (3,1,1) )
    def generation( utterances ):
        while True:
            offset = random.randint(0,62)
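            for name,transcript,audio_file in utterances:
                # NOTE: `t` is not defined in this snippet as published; bind it
                # to the object that provides iter_batches before running
                for batch in t.iter_batches( audio_file, batch_size=10, input=63, offset=offset ):
                    yield batch,transcript
    # returned in the order the function name promises: train, valid, test
    return generation(train),generation(valid),generation(test)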
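
The `array[10:20:63]` shape in the docstring can be illustrated with plain NumPy. The sketch below assumes an MFCC matrix laid out as (frequency bins, time frames) and cuts it into batches of 10 windows of 63 frames each, starting at a given offset. It is not the library's `iter_batches`; the function `iter_windowed_batches` is hypothetical, and the behaviour of the real `batch_size`, `input` and `offset` arguments is assumed from the snippet above.

```python
import numpy as np


def iter_windowed_batches(features, batch_size=10, window=63, offset=0):
    """Yield arrays of shape (batch_size, n_bins, window) cut from `features`.

    `features` has shape (n_bins, n_frames), with time on the last axis.
    Frames before `offset`, and trailing frames that do not fill a
    complete batch, are dropped.
    """
    n_bins, n_frames = features.shape
    usable = (n_frames - offset) // window  # complete windows after the offset
    for first in range(0, usable - batch_size + 1, batch_size):
        windows = [
            features[:, offset + (first + i) * window : offset + (first + i + 1) * window]
            for i in range(batch_size)
        ]
        yield np.stack(windows)


# Random stand-in for an MFCC matrix: 20 bins by 1400 frames
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 1400))
print(next(iter_windowed_batches(features, offset=5)).shape)  # (10, 20, 63)
```

Choosing a fresh random offset on each pass, as the generator above does, shifts the window boundaries so the model does not always see the same slices.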

Project details

Version 1.0.0, released Jun 2, 2017

Source Distribution

audiodatasets-1.0.0.tar.gz (17.1 kB)

Uploaded Jun 2, 2017 Source

Hashes for audiodatasets-1.0.0.tar.gz
Algorithm	Hash digest
SHA256	`c176afd3190f93a3da7ed7701b058156b361cecbb1bb04ac10bce18775109856`
MD5	`ab5ef00365338c1f6b66bfd87357dad6`
BLAKE2b-256	`ebcadb5183673bfb710d0d6d2e47633d3e3f48ad462c4cf16ee11787c630c010`
