DiarizationLM

Overview

Here we open-source the functions and tools used in the DiarizationLM paper.


Disclaimer

This is NOT an official Google product.

Instructions


Install the package

You can install the package with:

pip install diarizationlm

Once installed, you can directly use many of the existing functions from the package. For example:

import diarizationlm

src_text = "hello good morning hi how are you pretty good"
src_spk = "1 1 1 2 2 2 2 1 1"
tgt_text = "hello morning hi hey are you be good"
tgt_spk = "1 2 2 2 1 1 2 1"
transferred_spk = diarizationlm.transcript_preserving_speaker_transfer(
    src_text, src_spk, tgt_text, tgt_spk)
print(transferred_spk)

Data format

We assume all internal data are stored in JSON files. An example is testdata/example_data.json. The field "utterances" stores a list of utterances, and in each utterance we have these string fields:

Field                 Description
"utterance_id"        A unique ID for the utterance.
"hyp_text"            The sequence of hypothesis words, joined by spaces.
"hyp_spk"             The sequence of hypothesis speaker labels, joined by spaces.
"hyp_diarized_text"   The text representation of the hypothesis words and speakers. It can be used for debugging and for building the prompts to the LLM.
"ref_*"               Similar to the "hyp_*" fields, but containing the ground-truth references rather than the hypotheses.
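Putting the fields together, a minimal utterance entry might look like the following. This is a hypothetical sketch with made-up content; see testdata/example_data.json for a real example.

```python
import json

# A hypothetical utterance entry following the fields described above.
utterance = {
    "utterance_id": "utt_001",
    "hyp_text": "good morning how are you",
    "hyp_spk": "1 1 2 2 2",
    "hyp_diarized_text": "<spk:1> good morning <spk:2> how are you",
    "ref_text": "good morning how are you",
    "ref_spk": "1 1 2 2 2",
    "ref_diarized_text": "<spk:1> good morning <spk:2> how are you",
}

data = {"utterances": [utterance]}
print(json.dumps(data, indent=2))
```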

Conversion between representations

In the paper, we mentioned two representations:

  1. The word sequence and speaker sequence representation.
  2. The pure text representation.

Example:

Word sequence:         ["good", "morning", "how", "are", "you"]
Speaker sequence:      [1, 1, 2, 2, 2]
Text representation:   "<spk:1> good morning <spk:2> how are you"

We provide the functions in diarizationlm/utils.py to convert between these two representations:

  • create_diarized_text() converts the word and speaker sequences to the pure text representation.
  • extract_text_and_spk() converts the pure text representation to the word and speaker sequences.
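To illustrate how the two representations relate, here is a simplified reimplementation for exposition only. It is not the library code; the actual functions live in diarizationlm/utils.py and their signatures may differ.

```python
from typing import List, Tuple

def create_diarized_text_sketch(words: List[str], speakers: List[str]) -> str:
    """Join words into a tagged text representation (illustrative sketch).

    A <spk:N> tag is emitted whenever the speaker changes.
    """
    out = []
    prev_spk = None
    for word, spk in zip(words, speakers):
        if spk != prev_spk:
            out.append(f"<spk:{spk}>")
            prev_spk = spk
        out.append(word)
    return " ".join(out)

def extract_text_and_spk_sketch(text: str) -> Tuple[List[str], List[str]]:
    """Recover word and speaker sequences from tagged text (illustrative sketch)."""
    words, speakers = [], []
    current_spk = "1"
    for token in text.split():
        if token.startswith("<spk:") and token.endswith(">"):
            current_spk = token[len("<spk:"):-1]
        else:
            words.append(token)
            speakers.append(current_spk)
    return words, speakers

text = create_diarized_text_sketch(
    ["good", "morning", "how", "are", "you"], ["1", "1", "2", "2", "2"])
print(text)  # <spk:1> good morning <spk:2> how are you
```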

Transcript-preserving speaker transfer (TPST)

TPST is a critical data processing algorithm used in multiple places in our paper.

A Python implementation is available in diarizationlm/utils.py, defined as:

def transcript_preserving_speaker_transfer(
    src_text: str, src_spk: str, tgt_text: str, tgt_spk: str
) -> str
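The core idea can be sketched with a standard sequence alignment. The following is a simplified illustration using difflib; the library's actual implementation uses a Levenshtein-style alignment, also consumes tgt_spk, and handles edge cases differently, so its outputs may differ from this sketch.

```python
import difflib

def tpst_sketch(src_text: str, src_spk: str, tgt_text: str) -> str:
    """Transfer src speakers onto tgt words via word alignment (illustrative sketch)."""
    src_words = src_text.split()
    src_speakers = src_spk.split()
    tgt_words = tgt_text.split()

    # Align the two word sequences and copy speakers across matched words.
    transferred = [None] * len(tgt_words)
    matcher = difflib.SequenceMatcher(a=src_words, b=tgt_words, autojunk=False)
    for block in matcher.get_matching_blocks():
        for k in range(block.size):
            transferred[block.b + k] = src_speakers[block.a + k]

    # Unaligned target words inherit the nearest preceding aligned speaker
    # (or the first aligned speaker if they appear at the beginning).
    last = next((s for s in transferred if s is not None), "1")
    for i, s in enumerate(transferred):
        if s is None:
            transferred[i] = last
        else:
            last = s
    return " ".join(transferred)

print(tpst_sketch(
    "hello good morning hi how are you pretty good",
    "1 1 1 2 2 2 2 1 1",
    "hello morning hi hey are you be good"))
```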


Training data preparation

We provide a Python script train_data_prep.py for preparing datasets for finetuning LLMs (i.e., the prompt builder module described in the paper). This tool will:

  1. Segment the prompts and completions based on the input and output length limit.
  2. Optionally apply prefix and suffix to prompts and completions.
  3. Store prompt-completion pairs in different file formats.

The segmentation length, prefix, and suffix are passed in as flags to train_data_prep.py. In Python code, they are configured as PromptOptions defined in utils.py.
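The segmentation step above can be roughly illustrated as follows. This is a hypothetical sketch: it greedily splits on word boundaries and naively pairs the completion with the same word span of the reference, whereas the real logic in train_data_prep.py is driven by PromptOptions and differs in detail.

```python
def segment_pairs(hyp_text: str, ref_text: str, emit_input_length: int = 1000,
                  prompt_suffix: str = " --> ", completion_suffix: str = " [eod]"):
    """Split a long transcript into prompt/completion pairs (illustrative sketch)."""
    hyp_words = hyp_text.split()
    ref_words = ref_text.split()
    pairs, start = [], 0
    while start < len(hyp_words):
        # Greedily grow the segment until it would exceed the input length limit.
        end, segment = start, ""
        while end < len(hyp_words):
            candidate = (segment + " " + hyp_words[end]).strip()
            if len(candidate) > emit_input_length:
                break
            segment, end = candidate, end + 1
        end = max(end, start + 1)  # always make progress
        pairs.append({
            "prompt": " ".join(hyp_words[start:end]) + prompt_suffix,
            "completion": " ".join(ref_words[start:end]) + completion_suffix,
        })
        start = end
    return pairs

pairs = segment_pairs("<spk:1> hello <spk:2> hi there",
                      "<spk:1> hello <spk:2> hi there",
                      emit_input_length=15)
for pair in pairs:
    print(pair)
```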

We support four different output file formats:

Format     Description
tfrecord   The TFRecord format can be consumed by various machine learning libraries.
json       This format is more human-readable and useful for debugging. It is also useful for finetuning PaLM models via the Google Cloud API.
csv        This format can be used by many existing tools. OpenAI also provides a tool to convert csv files to jsonl files.
jsonl      This format can be directly used by the OpenAI API for finetuning GPT models.

Example command:

python3 train_data_prep.py \
--input="testdata/example_data.json" \
--output="/tmp/example_data.jsonl" \
--output_type=jsonl \
--emit_input_length=1000 \
--emit_target_length=1000 \
--prompt_suffix=" --> " \
--completion_suffix=" [eod]" \
--input_feature_key="prompt" \
--output_feature_key="completion"
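With the flags above, each line of the output jsonl file is one JSON object keyed by the chosen feature keys. The transcript content below is made up for illustration.

```python
import json

# A hypothetical prompt-completion pair as one line of the output jsonl file,
# using the suffixes and feature keys from the command above.
record = {
    "prompt": "<spk:1> hello how are you <spk:2> i am fine --> ",
    "completion": "<spk:1> hello how are you <spk:2> i am fine [eod]",
}
line = json.dumps(record)
print(line)
```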

LLM finetuning and inference

Warning: This step can be very costly! Proceed with caution at your own risk. Also note that GPT models are very different from PaLM models, so reproducibility is not guaranteed.

In our paper, we used Google's internal tools to finetune PaLM 2 models and to run the model inference. Google's policy does not allow us to disclose any details about the tools and the PaLM 2 models.

However, if you are interested in reproducing some of our experiments, one option is to use other alternative LLMs, such as OpenAI's GPT models.

Using the train_data_prep.py tool mentioned above, you can create csv files and use OpenAI's tooling to convert them to the jsonl format. Example command:

openai tools fine_tunes.prepare_data -f train_data.csv

Once you have the training data in jsonl format, you can finetune GPT models with the data, either via the API or using OpenAI's web UI. For example:

openai api fine_tunes.create -t "train_data.jsonl"

After you have finetuned a model, we provide a Python script run_finetuned_gpt.py to run the GPT model inference on the test data. You need to provide your --api_key and --engine to the script.

Completion parser

During inference, the prompts are sent to the LLM, and the LLM generates the completions. We provide a postprocess_completions.py script that serves as the completion parser module described in the paper. It will:

  1. Truncate the completion suffix and any text generated after it.
  2. Concatenate the completions of all segments from the same utterance.
  3. Transfer the speakers to the original hypothesis ASR transcript.
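The first two steps can be sketched as follows; this is an illustrative simplification, not the actual postprocess_completions.py code. The third step then reuses the transcript-preserving speaker transfer described earlier.

```python
from typing import List

def parse_completions(segment_completions: List[str],
                      completion_suffix: str = " [eod]") -> str:
    """Truncate each completion at the suffix, then concatenate segments (sketch)."""
    cleaned = []
    for completion in segment_completions:
        # Keep only the text before the completion suffix; drop anything after it.
        idx = completion.find(completion_suffix)
        if idx != -1:
            completion = completion[:idx]
        cleaned.append(completion.strip())
    return " ".join(cleaned)

parsed = parse_completions([
    "<spk:1> hello there [eod] junk tokens",
    "<spk:2> how are you [eod]",
])
print(parsed)  # <spk:1> hello there <spk:2> how are you
```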

Citation

Our paper can be cited as:

@article{wang2024diarizationlm,
  title={{DiarizationLM: Speaker Diarization Post-Processing with Large Language Models}},
  author={Quan Wang and Yiling Huang and Guanlong Zhao and Evan Clark and Wei Xia and Hank Liao},
  journal={arXiv preprint arXiv:2401.03506},
  year={2024}
}
