Automated Audio Captioning datasets in PyTorch.

Unofficial source code for the Automated Audio Captioning datasets AudioCaps [1], Clotho [2], and MACS [3], designed for PyTorch.

Installation

pip install git+https://github.com/Labbeti/aac_datasets

Or clone the repository:

git clone https://github.com/Labbeti/aac_datasets
pip install -e aac_datasets

Examples

Create Clotho dataset

from aac_datasets import Clotho

dataset = Clotho(root=".", subset="dev", download=True)
audio, captions, *_ = dataset[0]
# audio: Tensor of shape (n_channels=1, audio_max_size)
# captions: list of str captions

Build a PyTorch DataLoader with MACS

from torch.utils.data import DataLoader
from aac_datasets import MACS
from aac_datasets.utils import BasicCollate

dataset = MACS(root=".", download=True)
dataloader = DataLoader(dataset, batch_size=4, collate_fn=BasicCollate())

for audio_batch, captions_batch in dataloader:
    # audio_batch: Tensor of shape (batch_size=4, n_channels=2, audio_max_size)
    # captions_batch: list of list of str captions
    ...
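
The audio_max_size dimension in the batch shape suggests that clips of different lengths are brought to a common length before being stacked. A minimal sketch of that idea follows, written on plain Python lists so that it runs without PyTorch installed. `pad_collate` is a hypothetical helper, not the library's BasicCollate, and right-padding with zeros is an assumption about how such a collate function might behave.

```python
from typing import List, Sequence, Tuple

# A waveform as nested lists: (n_channels, n_samples).
Waveform = List[List[float]]


def pad_collate(
    batch: Sequence[Tuple[Waveform, List[str]]],
    pad_value: float = 0.0,
) -> Tuple[List[Waveform], List[List[str]]]:
    """Right-pad every waveform to the longest clip in the batch."""
    max_len = max(len(channel) for audio, _ in batch for channel in audio)
    audio_batch = [
        [channel + [pad_value] * (max_len - len(channel)) for channel in audio]
        for audio, _ in batch
    ]
    # Captions are kept as a list of lists, one inner list per audio clip.
    captions_batch = [list(captions) for _, captions in batch]
    return audio_batch, captions_batch


# Two toy mono clips of different lengths (made-up values).
toy_batch = [
    ([[0.1, 0.2, 0.3]], ["a dog barks"]),
    ([[0.5]], ["rain falls", "water drips"]),
]
audio_batch, captions_batch = pad_collate(toy_batch)
print(audio_batch)
```

With tensors, the same padding is usually done with torch.nn.functional.pad followed by torch.stack.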

Datasets stats

Here are the statistics for each dataset:

|                  | AudioCaps          | Clotho                         | MACS                           |
| Subset(s)        | train, val, test   | dev, val, eval, test, analysis | full                           |
| Sample rate (Hz) | 32000              | 44100                          | 48000                          |
| Estimated size   | 43GB               | 27GB                           | 13GB                           |
| Audio source     | AudioSet (YouTube) | Freesound                      | TAU Urban Acoustic Scenes 2019 |
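
To relate these sample rates to the audio tensor sizes, the number of samples per channel is the clip duration multiplied by the sample rate. The sketch below is illustrative arithmetic only: the durations are the upper bounds of the training-subset duration ranges reported in this section, and it assumes the audio is loaded at the listed sample rate without resampling.

```python
# Sample rates (Hz) from the table above.
SAMPLE_RATES_HZ = {"AudioCaps": 32000, "Clotho": 44100, "MACS": 48000}

# Longest clip per dataset in seconds (assumption: upper bound of the
# "Audio duration range" row of the training-subset statistics).
LONGEST_CLIP_S = {"AudioCaps": 10, "Clotho": 30, "MACS": 10}


def n_samples(duration_s: float, sample_rate_hz: int) -> int:
    """Number of samples per channel for a clip of the given duration."""
    return round(duration_s * sample_rate_hz)


for name, sample_rate in SAMPLE_RATES_HZ.items():
    print(name, n_samples(LONGEST_CLIP_S[name], sample_rate))
```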

Here are the statistics for the training subset of each dataset:

|                       | AudioCaps/train | Clotho/dev | MACS/full |
| Nb audios             | 49838           | 3840       | 3930      |
| Total audio duration  | 136.6h (1)      | 24.0h      | 10.9h     |
| Audio duration range  | 0.5-10s         | 15-30s     | 10s       |
| Nb captions per audio | 1               | 5          | 2-5       |
| Nb captions           | 49838           | 19195      | 17275     |
| Total nb words (2)    | 402482          | 217362     | 160006    |
| Nb words range (2)    | 1-52            | 8-20       | 5-40      |

1 This duration is an estimate based on 46230 of the 49838 files, which total 126.7h.
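
The source does not state how the estimate was computed, but the 136.6h figure is consistent with a simple linear extrapolation, assuming the remaining files have the same mean duration as the measured ones. A sketch of that arithmetic:

```python
# Figures taken from the footnote above.
measured_hours = 126.7
measured_files = 46230
total_files = 49838

# Assumption: the unmeasured files have the same mean duration as the
# measured ones, so the total scales linearly with the file count.
mean_hours_per_file = measured_hours / measured_files
estimated_total_hours = mean_hours_per_file * total_files

print(f"{estimated_total_hours:.1f}h")
```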

2 The sentences are cleaned (lowercase+remove punctuation) and tokenized using the spacy tokenizer to count the words.
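
The cleaning and counting step can be sketched as follows. This is not the script used to produce the table: the helper names are hypothetical, and a plain whitespace split stands in for the spaCy tokenizer, so counts may differ slightly on some captions.

```python
import string
from typing import List


def clean_caption(caption: str) -> str:
    """Lowercase the caption and remove ASCII punctuation."""
    table = str.maketrans("", "", string.punctuation)
    return caption.lower().translate(table)


def count_words(caption: str) -> int:
    # Stand-in tokenizer: whitespace split. The source uses the spaCy
    # tokenizer, which may split some strings differently.
    return len(clean_caption(caption).split())


# Made-up captions for illustration.
captions: List[str] = ["A dog barks, then a door slams!", "Rain falls."]
print([count_words(c) for c in captions])
```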

Requirements

Python packages

The requirements are automatically installed when using pip on this repository.

torch >= 1.10.1
torchaudio >= 0.10.1
py7zr >= 0.17.2
pyyaml >= 6.0
tqdm >= 4.64.0

External requirements (AudioCaps only)

Downloading AudioCaps requires two external programs, ffmpeg and youtube-dl. On Ubuntu, both can be installed with:

sudo apt install ffmpeg youtube-dl

You can also override their paths for AudioCaps:

from aac_datasets import AudioCaps
AudioCaps.FFMPEG_PATH = "/my/path/to/ffmpeg"
AudioCaps.YOUTUBE_DL_PATH = "/my/path/to/youtube_dl"
_ = AudioCaps(root=".", download=True)
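
Before starting a long download, it can help to check where these programs are installed. The sketch below uses only the standard library; `find_tools` is a hypothetical helper and not part of aac_datasets.

```python
import shutil
from typing import Dict, Optional, Sequence


def find_tools(
    names: Sequence[str] = ("ffmpeg", "youtube-dl"),
) -> Dict[str, Optional[str]]:
    """Map each program name to its full path, or None if not on PATH."""
    return {name: shutil.which(name) for name in names}


tools = find_tools()
for name, path in tools.items():
    print(f"{name}: {path if path is not None else 'not found on PATH'}")
```

A path found this way could then be assigned to AudioCaps.FFMPEG_PATH or AudioCaps.YOUTUBE_DL_PATH as in the snippet above.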

Command line download

To download a dataset, pass download=True when constructing it. To download a dataset separately instead, use the following command:

python -m aac_datasets.download --root "./data" clotho --version "v2.1"
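
The same command can be launched from a Python script. The sketch below only assembles the argument list, using exactly the arguments shown in the command above. `build_download_command` is a hypothetical helper, and no other options of the download module are assumed.

```python
import sys
from typing import List


def build_download_command(root: str, dataset: str, version: str) -> List[str]:
    """Build the argument list for the download command shown above."""
    return [
        sys.executable, "-m", "aac_datasets.download",
        "--root", root,
        dataset,
        "--version", version,
    ]


cmd = build_download_command("./data", "clotho", "v2.1")
print(" ".join(cmd))

# To actually run it (requires aac_datasets to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Using sys.executable keeps the download in the same Python environment as the calling script.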

References

[1] C. D. Kim, B. Kim, H. Lee, and G. Kim, “AudioCaps: Generating captions for audios in the wild,” in NAACL-HLT, 2019. Available: https://aclanthology.org/N19-1011/

[2] K. Drossos, S. Lipping, and T. Virtanen, “Clotho: An Audio Captioning Dataset,” arXiv:1910.09387 [cs, eess], Oct. 2019. Available: http://arxiv.org/abs/1910.09387

[3] F. Font, A. Mesaros, D. P. W. Ellis, E. Fonseca, M. Fuentes, and B. Elizalde, Proceedings of the 6th Workshop on Detection and Classification of Acoustic Scenes and Events (DCASE 2021). Barcelona, Spain: Music Technology Group - Universitat Pompeu Fabra, Nov. 2021. Available: https://doi.org/10.5281/zenodo.5770113
