llm-dataset-converter

Python 3 library for converting between various LLM dataset formats.

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

The llm-dataset-converter allows the conversion between various dataset formats for large language models (LLMs). Filters can be supplied as well, e.g., for cleaning up the data.

Dataset formats:

pairs: alpaca (r/w), csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w)
pretrain: csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w), txt (r/w)
translation: csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w), txt (r/w)

Compression formats:

bzip
gzip
xz
zstd

Examples:

Simple conversion with logging info:

llm-convert \
  from-alpaca \
    -l INFO \
    --input ./alpaca_data_cleaned.json \
  to-csv-pr \
    -l INFO \
    --output alpaca_data_cleaned.csv
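
To make the shape of the data concrete, here is a minimal Python sketch of what an Alpaca-to-CSV conversion amounts to. It is not the library's implementation: the function name `alpaca_to_csv` is hypothetical, and reusing the Alpaca keys (`instruction`, `input`, `output`) as CSV column names is an assumption made for the example.

```python
import csv
import io
import json

# Keys of an Alpaca-style record. Reusing them as CSV column names is an
# assumption for this sketch, not necessarily the header the tool writes.
FIELDS = ["instruction", "input", "output"]


def alpaca_to_csv(alpaca_json: str) -> str:
    """Convert an Alpaca-style JSON array (as text) into CSV text."""
    records = json.loads(alpaca_json)
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    for record in records:
        # Missing keys become empty cells instead of raising an error.
        writer.writerow({name: record.get(name, "") for name in FIELDS})
    return buffer.getvalue()


sample = json.dumps([
    {"instruction": "Add the numbers.", "input": "1, 2", "output": "3"},
])
print(alpaca_to_csv(sample))
```

The `csv` module quotes the `1, 2` cell automatically because it contains the delimiter.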

Automatic decompression/compression (based on file extension):

llm-convert \
  from-alpaca \
    --input ./alpaca_data_cleaned.json.xz \
  to-csv-pr \
    --output alpaca_data_cleaned.csv.gz
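
Extension-based (de)compression can be sketched with the Python standard library as follows. This is an illustration of the idea only, not the tool's code: `open_text` and `OPENERS` are hypothetical names, and zstd is left out because it typically requires a third-party package.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# Map file extensions to opener functions. zstd is omitted here because it
# usually needs a third-party package; this table is an assumption of the
# sketch, not the tool's actual lookup.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_text(path, mode="r"):
    """Open a text file, choosing a (de)compressor from the file extension."""
    opener = OPENERS.get(Path(path).suffix.lower(), open)
    # gzip/bz2/lzma default to binary mode, so request text mode explicitly.
    return opener(path, mode + "t", encoding="utf-8")
```

With this helper, `open_text("data.csv.gz", "w")` writes gzip-compressed text, while `open_text("data.csv")` falls back to the built-in `open`.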

Filtering:

llm-convert \
  -l INFO \
  from-alpaca \
    -l INFO \
    --input alpaca_data_cleaned.json \
  keyword \
    -l INFO \
    --keyword function \
    --location any \
    --action keep \
  to-alpaca \
    -l INFO \
    --output alpaca_data_cleaned-filtered.json
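
The effect of the keyword filter in this pipeline can be approximated in a few lines of Python. The sketch below is hypothetical (`keyword_filter` is not the library's API) and assumes case-insensitive substring matching, which may differ from how the actual filter matches keywords.

```python
from typing import Dict, Iterable, Iterator, Optional, Sequence

# Record fields that "any" expands to in this sketch (an assumption).
LOCATIONS = ("instruction", "input", "output")


def keyword_filter(
    records: Iterable[Dict[str, Optional[str]]],
    keywords: Sequence[str],
    location: str = "any",
    action: str = "keep",
) -> Iterator[Dict[str, Optional[str]]]:
    """Yield records depending on whether a keyword occurs in the location."""
    fields = LOCATIONS if location == "any" else (location,)
    lowered = [keyword.lower() for keyword in keywords]
    for record in records:
        # Treat missing or None fields as empty text.
        text = " ".join(str(record.get(name) or "") for name in fields).lower()
        found = any(keyword in text for keyword in lowered)
        # "keep" passes matching records; any other action passes the rest.
        if found == (action == "keep"):
            yield record
```

Because the function is a generator, records are processed one at a time, so it can sit between a reader and a writer without loading the whole dataset.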

Changelog

0.0.4 (2023-12-19)

pairs-to-llama2 filter now has an optional --prefix parameter to use with the instruction
added the pretrain-sentences-to-pairs filter for generating artificial prompt/response datasets from pretrain data
now requires seppl>=0.0.11
the LDC_MODULES_EXCL environment variable is now used for specifying modules to be excluded from the registration process (e.g., used when generating help screens for derived libraries that shouldn’t output the base plugins as well)
llm-registry and llm-help now allow specifying excluded modules via the -e/--excluded_modules option
to-alpaca writer now has the -a/--ensure_ascii flag to enforce ASCII compatibility in the output
added global option -u/--update_interval to the convert tool to customize how often the number of processed records is output to the console (default: 1000)
text-length filter now handles None values by ignoring them
locations (i.e., input/instructions/output/etc.) can now be specified multiple times
the llm-help tool can now generate index files for all plugins; for markdown output, the index links to the other markdown files
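
What "enforcing ASCII compatibility" means for JSON output, as with the ensure_ascii flag of the to-alpaca writer, can be seen with Python's standard `json` module. Whether the writer uses `json.dumps` internally is an assumption; the snippet only demonstrates the standard-library behavior.

```python
import json

record = {"instruction": "Translate to German", "input": "", "output": "Grüße"}

# ensure_ascii=True escapes every non-ASCII character as \uXXXX.
ascii_only = json.dumps(record, ensure_ascii=True)

# ensure_ascii=False writes the characters as-is (the file must then be
# saved with an encoding such as UTF-8).
unicode_kept = json.dumps(record, ensure_ascii=False)

print(ascii_only)
print(unicode_kept)
```

Both strings decode to the same record; only the on-disk representation differs.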

0.0.3 (2023-11-10)

added the record-window filter
added the llm-registry tool for querying the registry from the command-line
added the replace_patterns method to ldc.text_utils module
added the replace-patterns filter
added -p/--pretty-print flag to to-alpaca writer
added pairs-to-llama2 and llama2-to-pairs filters (since llama2 has instruction as part of the string, it is treated as pretrain data)
added to-llama2-format filter for pretrain records (no [INST]…[/INST] block)
now requiring seppl>=0.0.8 in order to raise Exceptions when encountering unknown arguments
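
The reason llama2 data is treated as pretrain data is that instruction and response live in one string. The sketch below illustrates this, assuming the commonly used Llama 2 chat layout `<s>[INST] ... [/INST] ... </s>`; the exact string the filters produce is not stated here, and both function names are hypothetical.

```python
import re
from typing import Tuple

# Assumed layout: "<s>[INST] instruction [/INST] response </s>".
_LLAMA2 = re.compile(r"<s>\[INST\] (.*?) \[/INST\] (.*?) </s>", re.DOTALL)


def pairs_to_llama2(instruction: str, output: str) -> str:
    """Fold a prompt/response pair into a single llama2-style string."""
    return f"<s>[INST] {instruction} [/INST] {output} </s>"


def llama2_to_pairs(text: str) -> Tuple[str, str]:
    """Split a llama2-style string back into (instruction, output)."""
    match = _LLAMA2.fullmatch(text)
    if match is None:
        raise ValueError("text does not follow the assumed llama2 layout")
    return match.group(1), match.group(2)


combined = pairs_to_llama2("Name a prime number.", "7")
print(combined)
print(llama2_to_pairs(combined))
```

A string without the `[INST]` block carries no recoverable instruction, which is why it can only be handled as plain pretrain text.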

0.0.2 (2023-10-31)

added text-stats filter
stream writers accept iterable of data records now as well to improve throughput
text_utils.apply_max_length now uses simple whitespace splitting instead of searching for nearest word boundary to break a line, which results in a massive speed improvement
fix: text_utils.remove_patterns no longer multiplies the generated lines when using more than one pattern
added remove-patterns filter
pretrain and translation text writers now buffer records by default (-b, --buffer_size) in order to improve throughput
jsonlines writers for pair, pretrain and translation data are now stream writers
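
The whitespace-splitting approach mentioned for text_utils.apply_max_length can be sketched as a greedy word wrap. This is a hypothetical re-creation under stated assumptions (`wrap_on_whitespace` is not the library function, and words longer than the limit are left unbroken), meant only to show why the approach is cheap: each word is visited once.

```python
from typing import List


def wrap_on_whitespace(line: str, max_length: int) -> List[str]:
    """Greedily pack whitespace-separated words into lines of max_length."""
    if max_length <= 0:
        # Assumption of this sketch: a non-positive limit disables wrapping.
        return [line]
    lines: List[str] = []
    current = ""
    for word in line.split():
        candidate = word if not current else current + " " + word
        if current and len(candidate) > max_length:
            # The word does not fit: close the current line, start a new one.
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


print(wrap_on_whitespace("the quick brown fox jumps", 10))
```

A single word longer than `max_length` ends up on its own line rather than being cut.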

0.0.1 (2023-10-26)

initial release

Release history

0.2.3 (May 6, 2024)
0.2.2 (May 3, 2024)
0.2.1 (May 2, 2024)
0.2.0 (Feb 27, 2024)
0.1.1 (Feb 15, 2024)
0.1.0 (Feb 5, 2024)
0.0.5 (Jan 23, 2024)
0.0.4 (Dec 19, 2023), this version
0.0.3 (Nov 10, 2023)
0.0.2 (Oct 31, 2023)
0.0.1 (Oct 26, 2023)

Download files

Source Distribution

llm-dataset-converter-0.0.4.tar.gz (73.2 kB)

Uploaded Dec 19, 2023 Source

Hashes for llm-dataset-converter-0.0.4.tar.gz
Algorithm	Hash digest
SHA256	`b20f3da37e560169b96df5e1a56a389baa7cb5f747911822bae99677fcf193f0`
MD5	`7b7722c94bd4694268ac80b39e391b8d`
BLAKE2b-256	`4fb17279cc1fc243a96fed3a84171eeb2e9a0e6209a583d66e95f1bbd03342cc`