Fondant - Composable pipelines for foundation model finetuning
Reason this release was yanked:
Old development version
Fondant
Fondant is a framework that speeds up the creation of Kubeflow pipelines that process large datasets and train foundation models on them, such as:
- Stable Diffusion
- CLIP
- Large Language Models (LLMs such as GPT-3)
Installation
Fondant can be installed using pip:
pip install fondant
Usage
Fondant is built on Kubeflow, a cloud-agnostic framework built by Google to orchestrate machine learning workflows on Kubernetes. An important part of Kubeflow is its pipelines, each of which consists of a set of components executed one after the other. A pipeline typically transforms data and optionally trains a machine learning model on it. Check out this page if you want to learn more about Kubeflow pipelines and components.
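To make the idea of a pipeline concrete, the following plain-Python sketch runs a list of components one after the other, feeding each output into the next. It does not use the Kubeflow SDK; the names run_pipeline, load, clean, and count_words are hypothetical and exist only for illustration.

```python
from typing import Any, Callable, List

# A component is modelled here as any callable that takes the previous
# step's output and returns its own output (an assumption for this sketch).
Component = Callable[[Any], Any]


def run_pipeline(components: List[Component], data: Any = None) -> Any:
    """Run each component in order, passing the previous output along."""
    for component in components:
        data = component(data)
    return data


def load(_: Any) -> List[str]:
    # Hypothetical loader: returns a tiny in-memory dataset.
    return ["  A photo of a cat ", "", "two dogs playing"]


def clean(rows: List[str]) -> List[str]:
    # Strip surrounding whitespace and drop rows that end up empty.
    return [row.strip() for row in rows if row.strip()]


def count_words(rows: List[str]) -> List[int]:
    # Count whitespace-separated words per row.
    return [len(row.split()) for row in rows]


result = run_pipeline([load, clean, count_words])
print(result)
```

In a real Kubeflow pipeline each step would run in its own container and exchange data through storage rather than in memory; the sketch only shows the ordering.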
Fondant offers ready-made components and helper functions that take care of the boilerplate, so you can create Kubeflow pipelines faster. To implement your own component, subclass one of the components available in Fondant and override the relevant method. In the example below, we subclass PandasTransformComponent and override its transform method.
from typing import Dict, List, Optional

import pandas as pd
import pyarrow as pa
from pyarrow.dataset import Scanner
from fondant.components.pandas_components import PandasTransformComponent, PandasDataset, PandasDatasetDraft

class MyFirstTransform(PandasTransformComponent):
    @classmethod
    def transform(cls, data: PandasDataset, extra_args: Optional[Dict] = None) -> PandasDatasetDraft:
        # Reading data: the index and the input data source
        index: List[str] = data.load_index()
        my_data: Scanner = data.load("my_data_source")
        # Transforming data: convert to pandas, modify, convert back to Arrow
        table: pa.Table = my_data.to_table()
        df: pd.DataFrame = table.to_pandas()
        # ...
        transformed_table = pa.Table.from_pandas(df)
        # Returning output: keep the index and add the transformed data source
        return (
            data.extend()
            .with_index(index)
            .with_data_source(
                "my_transformed_data_source",
                Scanner.from_batches(transformed_table.to_batches(), schema=transformed_table.schema),
            )
        )
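The subclass-and-override pattern used above can be tried out without Fondant installed. The following is a minimal sketch with a hypothetical base class, BaseTransformComponent, standing in for a ready-made component; it is not Fondant's actual implementation, and the run method and the suffix argument are assumptions made for illustration.

```python
from typing import Dict, List, Optional


class BaseTransformComponent:
    """Hypothetical stand-in for a ready-made base component."""

    @classmethod
    def transform(cls, rows: List[str], extra_args: Optional[Dict] = None) -> List[str]:
        # The hook that subclasses are expected to override.
        raise NotImplementedError("Subclasses must override transform()")

    @classmethod
    def run(cls, rows: List[str], extra_args: Optional[Dict] = None) -> List[str]:
        # Shared boilerplate lives in the base class. A real component would
        # also load inputs and write outputs; here it only calls the hook.
        return cls.transform(rows, extra_args)


class UppercaseTransform(BaseTransformComponent):
    @classmethod
    def transform(cls, rows: List[str], extra_args: Optional[Dict] = None) -> List[str]:
        # Read an optional setting, falling back to an empty suffix.
        suffix = (extra_args or {}).get("suffix", "")
        return [row.upper() + suffix for row in rows]


output = UppercaseTransform.run(["cat", "dog"], {"suffix": "!"})
print(output)
```

Because the base class owns run, a new component only needs to supply transform, which is the division of labour the paragraph above describes.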
Components zoo
Available components include:
- Non-distributed Pandas components:
fondant.components.pandas_components.{PandasTransformComponent, PandasLoaderComponent}
Planned components include:
- Spark-based components and base image.
- HuggingFace Datasets components.
With Kubeflow, it's possible to share and reuse components across different pipelines. To see an example, check out this sample notebook that showcases how you can save and load a component.
Note that Google's AI Hub also contains components that you can easily re-use. Some interesting examples:
- Gather training data by querying BigQuery
- Bigquery to TFRecords converter
- Executing an Apache Beam Python job in Cloud Dataflow
- Submitting a Cloud ML training job as a pipeline step
- Deploying a trained model to Cloud Machine Learning Engine
- Batch predicting using Cloud Machine Learning Engine
Pipeline zoo
To do: add ready-made pipelines.
Examples
Example use cases of Fondant include:
- collect additional image-text pairs based on a few seed images and fine-tune Stable Diffusion
- filter an image-text dataset to only include "count" examples and fine-tune CLIP to improve its counting capabilities
Check out the examples folder for some illustrations.
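As a rough illustration of the second use case, the sketch below filters image-text pairs down to captions that mention a number. The criterion (a regular expression over number words and digits), the field names, and the sample data are all assumptions for this example and are not taken from the examples folder.

```python
import re
from typing import Dict, List

# Hypothetical criterion: a caption counts as a "count" example when it
# contains a digit sequence or a number word from one to ten.
COUNT_PATTERN = re.compile(
    r"\b(one|two|three|four|five|six|seven|eight|nine|ten|\d+)\b",
    re.IGNORECASE,
)


def is_count_example(caption: str) -> bool:
    """Return True when the caption mentions a number as a whole word."""
    return COUNT_PATTERN.search(caption) is not None


def filter_count_examples(pairs: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep only the image-text pairs whose caption mentions a number."""
    return [pair for pair in pairs if is_count_example(pair["caption"])]


pairs = [
    {"image": "img_0.jpg", "caption": "Three apples on a table"},
    {"image": "img_1.jpg", "caption": "A dog running on the beach"},
    {"image": "img_2.jpg", "caption": "2 birds on a wire"},
    {"image": "img_3.jpg", "caption": "Someone standing alone"},
]

filtered = filter_count_examples(pairs)
print([pair["image"] for pair in filtered])
```

A word-level heuristic like this still yields false positives (for example "no one is here"), so a real pipeline would likely need a stricter filter.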
Contributing
We use poetry and pre-commit to enable a smooth developer flow. Run the following commands to set up your development environment:
pip install poetry
poetry install
pre-commit install