
A Python package to write TFRecord files easily and seamlessly into Tensorflow Datasets format.


smart-tfrecord-writer


smart-tfrecord-writer helps researchers and practitioners convert their data to TFRecord format with ease and only a few lines of code! This repo is under active development, so please let us know if you encounter bugs, have feature requests, or find the documentation or code unclear!


Why use smart-tfrecord-writer?

TFRecord format has many benefits over other file formats, but converting data to TFRecord format can be difficult and cumbersome without prior experience. smart-tfrecord-writer aims to provide a simple, lightweight codebase (you only need to modify 2 small functions!) to speed up this data transformation, allowing researchers and practitioners to spend more time creating models and less time on data conversion. Using smart-tfrecord-writer lets you take advantage of the conveniences that come with TensorFlow Datasets, like tfds.load() and tfds.features, for your own custom (or not-yet-converted) datasets!

The benefits of the TFRecord format include efficient data storage, optimized data loading, and seamless integration with ML pipelines. More can be learned from the TensorFlow documentation.

Installation

PyPI

pip install smart-tfrecord-writer

Usage

smart-tfrecord-writer aims to assist researchers in saving their data into TFRecord format with minimal code. A base writer class, smart_tfrecord_writer.Writer(), is provided, and two (2) small functions must be defined when subclassing Writer.

Required Subclassing Functions

  1. Writer.features() uses the TensorFlow Datasets (tfds) features module so that the writer can understand the structure of the data being saved. A simple example can be seen below and in the examples directory:

    def features(self):
        features = tfds.features.FeaturesDict(
            {
                "rf_signal": tfds.features.Tensor(
                    shape=(1024, 2),
                    dtype=np.float32,
                    doc="A radio signal with I and Q components.",
                ),
                "label": tfds.features.ClassLabel(
                    names=["OOK", "4ASK",..., "OQPSK"],
                ),
                "snr": tfds.features.Scalar(
                    dtype=np.float32, doc="Average SNR of the signal."
                ),
            }
        )
    
        return features
    

    In this example, there are three (3) fields in each example within the dataset: rf_signal, label, and snr. To get a better understanding of the different types of features, we recommend starting at the tfds.features documentation. In general, if your data does not fit within the provided features, tfds.features.Tensor is generic enough to provide flexibility.

  2. Writer.process_data() processes a single element/example within your dataset from the iterator you provide to Writer.write_records(). If your data is already processed and is just being iterated over (e.g., numpy arrays), then this function can just return the values being iterated over. However, the output of this function must be a Python dictionary with the same keys as in the features object from Writer.features(). Here is a simple example that can also be found in the examples directory:

    def process_data(self, index):
        parsed_instance = {}
    
        with h5py.File(self.source_directory, "r") as f:
            rf_signal = f["X"][index]
            label = f["Y"][index]
            snr = f["Z"][index]
    
            parsed_instance[
                "rf_signal"
            ] = rf_signal  # Generic Tensor feature expects shape of (1024, 2)
            parsed_instance["label"] = np.argmax(
                label
            )  # ClassLabel feature expects single integer value
            parsed_instance["snr"] = np.squeeze(
                snr.astype(np.float32)
            )  # Scalar feature expects a single value
    
        return parsed_instance
    

    Notice how parsed_instance has the same keys as features (rf_signal, label, snr). The values in parsed_instance also have datatypes that are supported by the respective components inside features.

    If your data pipeline requires more processing, loading images for example, you can adjust the process_data() function as needed. It will always receive a single element from the iterator you provide. In the example above, indices within an HDF5 file were provided. These could also have been filepaths to load, single rows within a numpy array, etc. The main takeaway is that you define how to process a single example.
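
    As an illustration of that flexibility, here is a rough sketch of what a filepath-based process_data() could look like, assuming the iterator yields (filepath, label) tuples and that the corresponding features() defines an "image" and a "label" field; the PIL dependency and the field names are assumptions for this sketch, not part of smart-tfrecord-writer:

    def process_data(self, element):
        # Hypothetical variant: `element` is assumed to be a (filepath, label)
        # tuple coming from the iterator passed to Writer.write_records().
        from PIL import Image
        import numpy as np

        filepath, label = element

        # Load the image from disk into a numpy array.
        image = np.asarray(Image.open(filepath), dtype=np.uint8)

        # Keys must match those defined in features(), e.g. an Image feature
        # named "image" and a ClassLabel named "label" (assumed names).
        return {"image": image, "label": int(label)}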

Optional Functions

If you would like to provide additional meta information for your dataset, you can also override the Writer.extend_meta_data() function. This simple function allows you to provide additional details about your dataset. Currently, the supported additional meta information fields are: description, homepage, supervised_keys, and citation. Writer.extend_meta_data() assumes that these are returned in a tuple with this exact order, as seen in the examples directory. Some of these are self-explanatory, but a potentially useful field is supervised_keys.
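
A rough sketch of overriding this function is shown below; all values are placeholders (not taken from the package's examples), and the tuple order follows the description above:

    def extend_meta_data(self):
        # Placeholder values, returned in the order the writer expects:
        # (description, homepage, supervised_keys, citation).
        description = "Radio signals with modulation labels and SNR values."
        homepage = "https://example.com/my-dataset"  # placeholder URL
        supervised_keys = ("rf_signal", "label")  # (data, label) keys from features()
        citation = "@misc{placeholder, title={Placeholder Dataset}}"
        return description, homepage, supervised_keys, citation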

If your dataset contains labeled information for classification, you can provide the (data, label) pair from your features dictionary and then use tfds.load(..., as_supervised=True), which yields a TensorFlow dataset (tf.data.Dataset) that can be iterated over, with each iteration returning a (data, label) pair compatible with most classification models and data pipelines. An example of this can be seen in the examples directory.
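
For instance, assuming the dataset was written under the hypothetical name "my_dataset" into the default tfds data directory (both assumptions for this sketch), loading it as a supervised dataset could look like:

    import tensorflow_datasets as tfds

    # "my_dataset" is a placeholder; substitute the name/directory your writer
    # actually produced.
    ds = tfds.load("my_dataset", split="train", as_supervised=True)

    for rf_signal, label in ds.take(1):
        print(rf_signal.shape, label)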

extended_dataset_info() may also be overridden and will write additional dataset information to extended_dataset_info.json. This may be helpful if you have additional information about your dataset that does not conform to the tfds.DatasetInfo object. This is a simple function that returns a dictionary of the information you would like to save. Because the output of extended_dataset_info() is saved to a JSON file, the values must be JSON serializable. An example can be seen below:

    def extended_dataset_info(self):
        return {
            "additional_info": "This is an example of additional dataset information.",
            "more_info": "This is another example of additional dataset information.",
        }

Data Structures for Writer.write_records()

If using splits_info, the assumed data structure is a list of Python dictionaries. Each list element is assumed to contain two key-value pairs with the structure:

{"name": "<split_name>", "info": split_info}

Replace <split_name> with a string of the name of the respective split (e.g., train, val, test, etc.). For example:

    # Example indices
    train_indexes = list(range(3_000))
    test_indexes = list(range(1_000))

    # Structure splits information for the writer into a list of dictionaries
    splits_info = [
        {"name": "train", "info": train_indexes},
        {"name": "test", "info": test_indexes},
    ]

split_info is intended to be generic to fit many use cases. The only restriction is that it needs to be an iterable whose individual elements can be processed by the Writer.process_data() function. For example, this could be a list of indices, file names, etc.! The only current limitation is that the length of the iterator (len(iterator)) must be defined; this is a result of wanting to make our code compatible with the tfds.load() functionality. Support for unknown-length iterators may be added in the future. An example usage can be seen in the examples directory, where the iterator is a list of indices for the train and test splits.

If you want to define your own shards within each split, you can pass splits_shards instead of splits_info to Writer.write_records() with a similar data structure, but instead of "info", the key will be "shards", and each element of "shards" is an iterable representing one shard. For example:

      # Structure shards information for the writer into a list of dictionaries
      splits_shards = [
          {"name": "train", "shards": [[1, 2, 3], [4, 5, 6]]},
          {"name": "test", "shards": [[7, 8, 9], [10, 11, 12]]},
      ]

Additional Writer.write_records() Parameters

  • shuffle: Whether or not to shuffle the provided data in "info" or "shards".
  • random_seed: Random seed used for shuffling, for reproducibility.
  • examples_per_shard: If you know how many examples you want in each shard, provide it here!
  • mb_per_shard: If you aren't sure how many examples you want per shard, but you know approximately how much memory each shard should use, let smart-tfrecord-writer figure out the number of examples per shard for you! A good rule of thumb is 100 MB+ per shard to take advantage of parallelism; more on that in the TensorFlow documentation. Note that this is just an estimate and does not guarantee an exact amount of memory per shard. A hypothetical call using this parameter is sketched after this list.
  • n_estimates_mb_per_example: If you are using mb_per_shard, this tells smart-tfrecord-writer how many examples it should iterate over to estimate how much memory each example will use. If your data is uniform in shape (e.g., all images are $224 \times 224 \times 3$), then 1 is sufficient. If your data is variable in shape, a higher value gives a better estimate but takes more time.
  • n_estimates_mb_per_split_example: Similar to n_estimates_mb_per_example, but accepts a dictionary of {split: num_estimates} so that each split computes its own estimate of examples per shard. This may be useful if, for example, the evaluation set has different shapes than the training set.
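
Putting a few of these parameters together, here is a hypothetical call; the subclass name, constructor arguments, and the exact write_records() signature are assumptions based on the descriptions above, so refer to the examples directory for authoritative usage:

    # Hypothetical usage; the constructor arguments and exact signature are
    # assumptions based on the parameter descriptions above.
    writer = MyWriter(source_directory="data.hdf5", destination_directory="tfrecords/")
    writer.write_records(
        splits_info,
        shuffle=True,
        random_seed=42,
        mb_per_shard=150,               # target roughly 150 MB per shard
        n_estimates_mb_per_example=10,  # sample 10 examples to estimate size
    )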

Contributing

Contributions are welcome! If you'd like to contribute to the project, please follow these guidelines:

  1. Fork the repository.
  2. Create a new branch.
  3. Make your changes.
  4. Test your changes thoroughly.
  5. Submit a pull request.
  6. Add caharper as a reviewer.

Please ensure your code adheres to the project's coding style and conventions.

Why use smart-tfrecord-writer instead of TFDS CLI?

The TFDS CLI is great! The purpose of smart-tfrecord-writer is to further reduce the complexity of formatting and saving data into the TFRecord format. Instead of having to read verbose documentation, we hope to make the process as simple as possible: provide just a few parameters and override a few functions to get the most out of your ML pipeline.

Most projects could benefit from using TFRecords, but doing so requires a lot of reading, studying several code examples, etc. Our project aims to limit the time spent converting your data to the TFRecord format so you can take advantage of TFRecord's speed and start training your models rather than losing time elsewhere. The TFDS CLI gives you more control at the cost of more documentation and formatting, but most datasets can be handled in a simpler way: enter smart-tfrecord-writer!

Gotchas

Right now, tf.data.Dataset objects are not supported. A few assumptions in our codebase (knowing the length of the dataset for splitting into shards, and writing individual shards instead of looping over a single object) cause this limitation. The second is problematic for a tf.data.Dataset because iteration over it is not guaranteed to be deterministic, so we would need a separate handler for tf.data.Dataset objects.

License

This project is licensed under the Apache License.

Contact

If you have any questions, suggestions, or feedback, feel free to reach out.
