Python3 library for converting between various LLM dataset formats.
Project description
The llm-dataset-converter converts between various dataset formats for large language models (LLMs). Filters can be supplied as well, e.g., for cleaning up the data.
Dataset formats:
* pairs: alpaca (r/w), csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w)
* pretrain: csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w), txt (r/w)
* translation: csv (r/w), jsonl (r/w), parquet (r/w), tsv (r/w), txt (r/w)

Compression formats:
* bzip
* gzip
* xz
* zstd
Examples:
Simple conversion with logging info:
llm-convert \
  from-alpaca \
    -l INFO \
    --input ./alpaca_data_cleaned.json \
  to-csv-pr \
    -l INFO \
    --output alpaca_data_cleaned.csv
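For readers who want to see what such a conversion amounts to, the following is a minimal Python sketch using only the standard library. It is not the library's implementation: the function name `alpaca_to_csv` is hypothetical, and the CSV column names are assumed to mirror the Alpaca keys (`instruction`, `input`, `output`); the columns written by `to-csv-pr` may differ.

```python
import csv
import json

# Assumption: the CSV columns mirror the keys of Alpaca records.
FIELDS = ["instruction", "input", "output"]


def alpaca_to_csv(src, dst):
    """Read an Alpaca-style JSON array from `src` and write CSV rows to `dst`.

    Both arguments are text file objects. Returns the number of records written.
    """
    records = json.load(src)
    writer = csv.DictWriter(dst, fieldnames=FIELDS, extrasaction="ignore")
    writer.writeheader()
    for rec in records:
        # Missing keys become empty cells rather than raising an error.
        writer.writerow({field: rec.get(field, "") for field in FIELDS})
    return len(records)
```

When calling it with real files, open the output with `newline=""`, as the `csv` module documentation recommends.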
Automatic decompression/compression (based on file extension):
llm-convert \
  from-alpaca \
    --input ./alpaca_data_cleaned.json.xz \
  to-csv-pr \
    --output alpaca_data_cleaned.csv.gz
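Choosing a codec from the file extension can be sketched in Python as follows. This is an illustration under stated assumptions, not the converter's own code: `OPENERS` and `open_by_extension` are hypothetical names, and zstd is left out because it is not available in the standard library of most Python versions.

```python
import bz2
import gzip
import lzma
from pathlib import Path

# Hypothetical mapping from file extension to a standard-library opener.
OPENERS = {
    ".bz2": bz2.open,
    ".gz": gzip.open,
    ".xz": lzma.open,
}


def open_by_extension(path, mode="rt", encoding="utf-8"):
    """Open a text file, compressing or decompressing based on its extension.

    Files with an unrecognized extension are opened as plain text.
    """
    opener = OPENERS.get(Path(path).suffix.lower(), open)
    return opener(path, mode, encoding=encoding)
```

With such a helper, reading `alpaca_data_cleaned.json.xz` and writing `alpaca_data_cleaned.csv.gz` use the same call, differing only in the mode (`"rt"` versus `"wt"`).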
Filtering:
llm-convert \
  -l INFO \
  from-alpaca \
    -l INFO \
    --input alpaca_data_cleaned.json \
  keyword \
    -l INFO \
    --keyword function \
    --location any \
    --action keep \
  to-alpaca \
    -l INFO \
    --output alpaca_data_cleaned-filtered.json
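The effect of a keyword filter with `--location any` and `--action keep` can be approximated by the Python sketch below. It is a guess at the behavior, not the library's code: the function `keyword_filter` is hypothetical, matching is assumed to be case-insensitive on whole words, and the `"discard"` action name is an assumption (only `keep` appears in the example above).

```python
import re

# Text fields of an Alpaca-style record.
FIELDS = ("instruction", "input", "output")


def keyword_filter(records, keywords, location="any", action="keep"):
    """Yield records depending on whether they contain any of the keywords.

    location: "any" searches all fields, otherwise a single field name.
    action:   "keep" yields matching records; any other value
              (e.g. the assumed "discard") yields the non-matching ones.
    """
    wanted = {k.lower() for k in keywords}
    fields = FIELDS if location == "any" else (location,)
    for rec in records:
        words = set()
        for field in fields:
            # Assumption: whole-word, case-insensitive matching.
            words.update(re.findall(r"\w+", (rec.get(field) or "").lower()))
        found = bool(wanted & words)
        if found == (action == "keep"):
            yield rec
```

Writing the filter as a generator means records can pass through one at a time between a reader and a writer, without loading the whole dataset first.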
Changelog
0.0.2 (2023-10-31)
added text-stats filter
stream writers accept iterable of data records now as well to improve throughput
text_utils.apply_max_length now uses simple whitespace splitting instead of searching for nearest word boundary to break a line, which results in a massive speed improvement
fix: text_utils.remove_patterns no longer multiplies the generated lines when using more than one pattern
added remove_patterns filter
pretrain and translation text writers now buffer records by default (-b, --buffer_size) in order to improve throughput
jsonlines writers for pair, pretrain and translation data are now stream writers
0.0.1 (2023-10-26)
initial release
Source Distribution
Hashes for llm-dataset-converter-0.0.2.tar.gz
Algorithm | Hash digest |
---|---|
SHA256 | 0a75f8c1bef12aa90f9184184c3534a35590c82626cb62ce9c9b8985073366f0 |
MD5 | 81b61722b63e646fe2aa8a1789cbaa83 |
BLAKE2b-256 | 564e899366eddf224716ec63b66e285e58ee36161d63aca3b6e58cd972aa11d4 |