Scheduled training for machine translation systems

Trainer

The purpose of the trainer is to provide a flexible way of scheduling various sources of input data, as well as to augment the training data with title casing, all caps, and similar modifications. This is particularly useful when you have multiple data sources and want to pretrain the model first on backtranslated data, gradually add other sources of data, and finally fine-tune, all in one go.

This tool is also well suited to training multilingual models, as it provides an easy way to define the desired mixture of datasets from different language sources.

Configuration file

Define your training process via a configuration file. You define the datasets at the top, then the stages, and for each stage a set of mixing ratios and a termination criterion. An example configuration file is provided below. The trainer entry points to any neural network trainer that accepts training data on stdin.

# Datasets are already TSV files
datasets:
  clean: test/data/clean
  medium: test/data/medium
  dirty: test/data/dirty

stages:
  - start
  - mid
  - end

start:
  - clean 0.8
  - medium 0.2
  - dirty 0
  - until clean 2 # Until two epochs of clean

mid:
  - clean 0.6
  - medium 0.3
  - dirty 0.1
  - until medium 1

end:
  - clean 0.4
  - medium 0.3
  - dirty 0.3
  - until dirty 5 # Use `inf` to continue this stage indefinitely

modifiers:
  - uppercase 0.05 # Apply uppercase randomly to 5% of sentences. Use 0 to disable
  - titlecase 0.05 # Apply titlecase randomly to 5% of sentences. Use 0 to disable

seed: 1111
trainer: /path/to/trainer/run.py
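
To make the stage semantics concrete, below is a minimal sketch (not the trainer's actual implementation) of how the weights and modifiers of a single stage could be applied. The file names, weights, seed, and the 5% uppercase probability mirror the example configuration above:

import random

# Hypothetical illustration of the "mid" stage of the schedule above.
weights = {"clean": 0.6, "medium": 0.3, "dirty": 0.1}
readers = {name: open(f"test/data/{name}", encoding="utf-8") for name in weights}

def read_line(name):
    # Read one line; on EOF, restart the file (the real trainer re-shuffles here).
    line = readers[name].readline()
    if not line:
        readers[name].seek(0)
        line = readers[name].readline()
    return line

rng = random.Random(1111)  # mirrors the `seed` setting
names, probs = list(weights), list(weights.values())

for _ in range(10):  # emit ten mixed training lines
    line = read_line(rng.choices(names, weights=probs, k=1)[0]).rstrip("\n")
    if rng.random() < 0.05:  # the `uppercase 0.05` modifier
        line = line.upper()
    print(line)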

Usage

% ./trainer.py --help
usage: trainer.py [-h] --config CONFIG [--temporary-directory TEMPORARY_DIR] [--state STATE_FILE] [--do-not-resume] [--sync] [trainer-command [arguments]]

Feeds marian tsv data for training.

options:
  -h, --help            show this help message and exit
  --config CONFIG, -c CONFIG
                        YML configuration input.
  --temporary-directory TEMPORARY_DIR, -t TEMPORARY_DIR
                        Temporary dir, used for shuffling and tracking state
  --state STATE_FILE    Path to trainer state file which stores how much of
                        each dataset has been read. Defaults to ${CONFIG}.state
  --sync                Do not shuffle in the background
  --do-not-resume, -d   Do not resume from the previous training state

Once you fix the paths in the configuration file train_config.yml, you can run a test case by doing:

./trainer.py -c train_config.yml

You can check the resulting mixed file in /tmp/test. If your neural network trainer doesn't support training from stdin, you can use this tool to generate a training dataset up front and then disable data reordering or shuffling in your trainer implementation, as the training input is already balanced and shuffled.
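
The trainer entry in the configuration just needs to point at an executable that consumes training lines on stdin. As a sketch of that contract (not the project's actual interface), a hypothetical minimal run.py could look like this:

#!/usr/bin/env python3
import sys

# Hypothetical stand-in for the `trainer:` command: it consumes the mixed
# TSV stream (source<TAB>target per line) that the scheduler writes to stdin.
for i, line in enumerate(sys.stdin, start=1):
    source, _, target = line.rstrip("\n").partition("\t")
    # A real trainer would tokenize and run an update step here; we just
    # report progress to show the stream is being consumed.
    if i % 100000 == 0:
        print(f"consumed {i} sentence pairs", file=sys.stderr)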

At the start of training, all datasets are shuffled. Each time a dataset's end is reached, it is re-shuffled. Shuffling happens in the system temp directory, but the location can be changed with --temporary-directory or the TMPDIR environment variable. By default, the training state is kept in the same place as the configuration file. If training is interrupted, re-running the trainer should resume from where it left off (depending on how much your neural network trainer had buffered, that part will be skipped).
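
The state itself only needs to record how far into each dataset the trainer has read. Purely as an illustration (the real state file format may differ), resuming can be thought of as skipping the already-consumed prefix of each dataset:

import json

STATE_FILE = "train_config.yml.state"  # hypothetical; defaults to ${CONFIG}.state

def save_state(lines_read: dict) -> None:
    # Persist how many lines of each dataset have been consumed so far.
    with open(STATE_FILE, "w", encoding="utf-8") as fh:
        json.dump(lines_read, fh)

def restore(reader, lines_read: dict, name: str) -> None:
    # Skip the prefix that was already fed to the trainer before the interruption.
    for _ in range(lines_read.get(name, 0)):
        reader.readline()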

Generating vocabulary and placeholders before training

To use the placeholder code to augment your training data with placeholders before training, look at this example script:

#!/usr/bin/env bash
# Get the placeholders
../placeholders/placeholders.py -c train_config_bgen.yml --dump_placeholders > my_placeholders
# train vocabulary
spm_train --bos_id=-1 --eos_id=0 --unk_id=1 --user_defined_symbols_file my_placeholders \
  --model_prefix="test/vocab.bgen" --vocab_size=12000 \
  --input="/home/dheart/uni_stuff/postdoc/empty-train/trainer/test/data/clean.bgen" \
  --shuffle_input_sentence=true --character_coverage 1

# Move vocabulary to the new location
mv test/vocab.bgen.model test/vocab.bgen.spm

# Make all datasets placeholded
for myfile in test/data/*.bgen; do
	../placeholders/placeholders.py -n --strict --encode -c train_config_bgen.yml < ${myfile} > ${myfile}.pls
done

You need to augment the training configuration with additional placeholder settings:

vocab: /home/dheart/uni_stuff/postdoc/empty-train/trainer/test/vocab.bgen.spm
placeholder-symbol: "<PLACEHOLDER>"
num-placeholders: 4
regexes:
    - (https?:\/\/www\.\w{1,63}\.\w{1,63}(?:\/\w{0,63}){0,})
    - (www\.\w{1,63}\.\w{1,63}(?:\/\w{0,63}){0,})
    - ([a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+)
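
As an illustration of what the encoding step produces, here is a minimal, hypothetical sketch of regex-driven placeholder substitution (the real placeholders.py also handles strict mode, decoding, and vocabulary integration):

import re

# Hypothetical re-implementation of the settings above: replace up to
# `num-placeholders` regex matches per sentence with the placeholder symbol.
PLACEHOLDER = "<PLACEHOLDER>"
NUM_PLACEHOLDERS = 4
REGEXES = [
    re.compile(r"https?://www\.\w{1,63}\.\w{1,63}(?:/\w{0,63})*"),
    re.compile(r"[a-zA-Z0-9_.+-]+@[a-zA-Z0-9-]+\.[a-zA-Z0-9-.]+"),
]

def encode(sentence: str) -> str:
    budget = NUM_PLACEHOLDERS
    for regex in REGEXES:
        while budget > 0 and regex.search(sentence):
            sentence = regex.sub(PLACEHOLDER, sentence, count=1)
            budget -= 1
    return sentence

# Prints: Mail <PLACEHOLDER> or visit <PLACEHOLDER>
print(encode("Mail me@example.com or visit https://www.example.com/docs"))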

After the vocabulary is trained and the data is preprocessed, proceed with a normal training run.

Future work

  • Terminology support (using a dictionary). We should augment the training data with terminology (possibly stemmed on the source side) so that it can be used in real-world models
  • One-click training runs

Acknowledgements

This project has received funding from the European Union's Horizon Europe research and innovation programme under grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government's Horizon Europe funding guarantee [grant number 10052546].
