fondant

Fondant - Sweet data-centric foundation model fine-tuning

Fondant helps you create high-quality datasets to train or fine-tune foundation models such as:

🎨 Stable Diffusion
📄 GPT-like Large Language Models (LLMs)
🔎 CLIP
✂️ Segment Anything (SAM)
➕ And many more

🪤 Why Fondant?

Foundation models simplify inference by solving multiple tasks across modalities with a simple prompt-based interface. But what they've gained in the front, they've lost in the back. These models require enormous amounts of data, moving complexity towards data preparation, and leaving few parties able to train their own models.

We believe that innovation is a group effort, requiring collaboration. While the community has been building and sharing models, everyone is still building their data preparation from scratch. Fondant is the platform where we meet to build and share data preparation workflows.

Fondant offers a framework to build composable data preparation pipelines, with reusable components, optimized to handle massive datasets. Stop building from scratch, and start reusing components to:

Extend your data with public datasets
Generate new modalities using captioning, segmentation, translation, image generation, ...
Distill knowledge from existing foundation models
Filter out low-quality data
Deduplicate data

And create high-quality datasets to fine-tune your own foundation models.
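
As a rough illustration of the filtering and deduplication steps in the list above, the sketch below cleans a handful of text records in plain Python. It does not use Fondant's API: the function names `filter_short` and `deduplicate`, the `min_chars` threshold, and the sample records are all hypothetical and chosen only for this example.

```python
import hashlib


def filter_short(records, min_chars=10):
    """Drop records whose text is shorter than min_chars (a stand-in quality check)."""
    return [r for r in records if len(r["text"].strip()) >= min_chars]


def deduplicate(records):
    """Keep the first occurrence of each text, compared by a hash of the normalized text."""
    seen = set()
    unique = []
    for record in records:
        # Normalize case and whitespace so trivial variations count as duplicates.
        normalized = " ".join(record["text"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique


records = [
    {"id": 1, "text": "A photo of a cozy living room"},
    {"id": 2, "text": "a photo of a  cozy living room"},  # duplicate after normalization
    {"id": 3, "text": "ok"},  # too short to be useful
    {"id": 4, "text": "A minimalist kitchen with wooden cabinets"},
]

cleaned = deduplicate(filter_short(records))
print([r["id"] for r in cleaned])
```

In a real pipeline each of these steps would typically live in its own reusable component, so that it can be shared and recombined.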

💨 Getting Started

Eager to get started? Here's a step-by-step guide to get your first pipeline up and running.

🪄 Example pipelines

Curious to see what Fondant can do? Have a look at our example pipelines:

Fine-tuning ControlNet

Our example pipeline to generate data for ControlNet fine-tuning allows you to create models that you can control using inpainting, segmentation, and regeneration. All you need to get started is a set of prompts describing the type of images to generate.

For instance, our ControlNet model fine-tuned on interior design images allows you to generate the room of your dreams:

[Figure: an input image next to the corresponding output image]

Want to try out the resulting model yourself? Head over to our Hugging Face space!

Fine-tuning Stable Diffusion

Using our example pipeline to fine-tune Stable Diffusion allows you to create models that generate better images within a specific domain. All you need to get started is a small seed dataset of example images.

For example, generating logos:

[Figure: logos generated by Stable Diffusion 1.5 next to logos generated by the fine-tuned Stable Diffusion 1.5]

Training StarCoder

Using our example pipeline to train StarCoder provides a starting point to create datasets for training code assistants.

🧩 Reusable components

Fondant comes with a library of reusable components, which can jumpstart your pipeline.

COMPONENT	DESCRIPTION
Data loading / writing
load_from_hf_hub	Load a dataset from the Hugging Face Hub
write_to_hf_hub	Write a dataset to the Hugging Face Hub
prompt_based_laion_retrieval	Retrieve image-text pairs from LAION using prompt similarity
embedding_based_laion_retrieval	Retrieve image-text pairs from LAION using embedding similarity
download_images	Download images from URLs
Image processing
embed_images	Create embeddings for images using a model from the HF Hub
image_resolution_extraction	Extract the resolution from images
filter_image_resolution	Filter images based on their resolution
caption_images	Generate captions for images using a model from the HF Hub
segment_images	Generate segmentation maps for images using a model from the HF Hub
image_cropping	Intelligently crop out image borders
Code processing
pii_redaction	Redact Personally Identifiable Information (PII)
filter_comments	Filter code based on code to comment ratio
filter_line_length	Filter code based on line length
Language processing	Coming soon
Clustering	Coming soon
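
To give a feel for what the code processing filters in the table do, here is a minimal sketch in plain Python. It is not the implementation of the `filter_comments` or `filter_line_length` components: the function names, the thresholds, and the naive rule that a comment is any line starting with `#` are assumptions made for this example.

```python
def comment_ratio(code: str) -> float:
    """Fraction of non-empty lines that are `#` comments (a naive, Python-only heuristic)."""
    lines = [line.strip() for line in code.splitlines() if line.strip()]
    if not lines:
        return 0.0
    comments = sum(1 for line in lines if line.startswith("#"))
    return comments / len(lines)


def max_line_length(code: str) -> int:
    """Length of the longest line in the snippet (0 for an empty snippet)."""
    return max((len(line) for line in code.splitlines()), default=0)


def keep_snippet(code, min_ratio=0.1, max_ratio=0.9, max_length=100):
    """Keep code with a moderate comment ratio and no overly long lines."""
    ratio = comment_ratio(code)
    return min_ratio <= ratio <= max_ratio and max_line_length(code) <= max_length


documented = "# Add two numbers\ndef add(a, b):\n    return a + b\n"
undocumented = "def add(a, b):\n    return a + b\n"
minified = "x = [" + ", ".join(str(i) for i in range(200)) + "]\n# generated\n"

print(keep_snippet(documented))
print(keep_snippet(undocumented))
print(keep_snippet(minified))
```

Only the first snippet passes: the second has no comments at all, and the third contains a single very long line, which often indicates generated or minified code.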

⚒️ Installation

Fondant can be installed using pip:

pip install fondant

For the latest development version, you might want to install from source instead:

pip install git+https://github.com/ml6team/fondant.git

🧱 Deploying Fondant

There are two ways of using Fondant:

Leveraging Kubeflow Pipelines on any Kubernetes cluster. All Fondant needs is a URL pointing to the Kubeflow Pipelines host and an object storage provider (S3, GCS, etc.) to store data produced in the pipeline between steps. We have compiled some references and created some scripts to get you started with setting up the required infrastructure.
Or locally by using Docker Compose. This option is mainly aimed at helping you develop Fondant pipelines and components faster by making it easier to run things on a smaller scale.

The same pipeline can be used in both variants, allowing you to develop and iterate quickly using the local Docker Compose implementation and then use the power of Kubeflow Pipelines to run it at large scale.

👨‍💻 Usage

Pipeline

Fondant allows you to easily define data pipelines composed of both reusable and custom components. The following pipeline, for instance, uses the reusable load_from_hf_hub component to load a dataset from the Hugging Face Hub and processes it using a custom component:

from fondant.pipeline import ComponentOp, Pipeline, Client


def build_pipeline():
    # base_path points to the storage used for data passed between steps
    pipeline = Pipeline(pipeline_name="example pipeline", base_path="fs://bucket")

    # Reusable component, loaded from the Fondant registry
    load_from_hub_op = ComponentOp.from_registry(
        name="load_from_hf_hub",
        arguments={"dataset_name": "lambdalabs/pokemon-blip-captions"},
    )
    pipeline.add_op(load_from_hub_op)

    # Custom component, loaded from a local directory
    custom_op = ComponentOp(
        component_dir="components/custom_component",
        arguments={
            "min_width": 600,
            "min_height": 600,
        },
    )
    # Run the custom component after the load step has finished
    pipeline.add_op(custom_op, dependencies=load_from_hub_op)

    return pipeline


if __name__ == "__main__":
    # host is the URL of the Kubeflow Pipelines host
    client = Client(host="https://kfp-host.com/")
    pipeline = build_pipeline()
    client.compile_and_run(pipeline=pipeline)
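
The `dependencies` argument in the example above is what connects individual operations into a graph. The following sketch is not Fondant code and says nothing about how Fondant schedules work internally. It only illustrates, using the standard library's `graphlib` (Python 3.9+), how declared dependencies determine a valid execution order. The operation names mirror the example, with `write_to_hf_hub` added as a hypothetical third step.

```python
from graphlib import TopologicalSorter

# Map each operation to the set of operations it depends on.
dependencies = {
    "load_from_hf_hub": set(),
    "custom_component": {"load_from_hf_hub"},
    "write_to_hf_hub": {"custom_component"},
}

# static_order() yields every operation after all of its dependencies.
execution_order = list(TopologicalSorter(dependencies).static_order())
print(execution_order)
```

Because each step here depends on exactly one predecessor, there is only one valid order: load, then transform, then write.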

Component

To create a custom component, you first need to describe its contract as a YAML specification. It defines the data consumed and produced by the component and any arguments it takes.

name: Custom component
description: This is a custom component
image: custom_component:latest

consumes:
  images:
    fields:
      data:
        type: binary

produces:
  captions:
    fields:
      data:
        type: utf8

args:
  argument1:
    description: An argument passed to the component at runtime
    type: str
  argument2:
    description: Another argument passed to the component at runtime
    type: str
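
To make the contract more tangible, the sketch below writes the specification above as a Python dictionary and lists the fields the component consumes and produces. The helper `list_fields` and the variable name `subset` (used for the top-level keys under `consumes` and `produces`) are hypothetical and not part of Fondant. The dictionary is typed out by hand so that the example does not depend on a YAML parser.

```python
spec = {
    "name": "Custom component",
    "consumes": {"images": {"fields": {"data": {"type": "binary"}}}},
    "produces": {"captions": {"fields": {"data": {"type": "utf8"}}}},
    "args": {
        "argument1": {"type": "str"},
        "argument2": {"type": "str"},
    },
}


def list_fields(spec, section):
    """Return (subset, field, type) tuples for the 'consumes' or 'produces' section."""
    return [
        (subset, field, definition["type"])
        for subset, subset_def in spec.get(section, {}).items()
        for field, definition in subset_def["fields"].items()
    ]


print(list_fields(spec, "consumes"))
print(list_fields(spec, "produces"))
print(sorted(spec["args"]))
```

Read this way, the example component takes binary image data as input, adds text captions as output, and accepts two string arguments at runtime.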

Once you have your component specification, all you need to do is implement a constructor and a single .transform method, and Fondant will do the rest. You will get the data defined in your specification partition by partition as a pandas DataFrame.

import pandas as pd
from fondant.component import PandasTransformComponent
from fondant.executor import PandasTransformExecutor


class ExampleComponent(PandasTransformComponent):

    def __init__(self, *args, argument1, argument2) -> None:
        """
        Args:
            argumentX: An argument passed to the component
        """
        # Initialize your component here based on the arguments

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        """Implement your custom logic in this single method
        Args:
            dataframe: A Pandas dataframe containing the data
        Returns:
            A pandas dataframe containing the transformed data
        """
        # Placeholder: return the data unchanged until custom logic is added
        return dataframe

For more advanced use cases, you can use the DaskTransformComponent instead.
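
As an illustration of what a filled-in `transform` method might look like, the sketch below filters images by resolution, reusing the `min_width` and `min_height` arguments from the pipeline example. To keep it runnable with only pandas installed, the class is a plain stand-in that does not inherit from `PandasTransformComponent`. The class name `ResolutionFilter` and the column names `width` and `height` are assumptions for this example, and the loop over `partitions` merely imitates data arriving partition by partition.

```python
import pandas as pd


class ResolutionFilter:
    """Stand-in for a PandasTransformComponent subclass (no Fondant import needed)."""

    def __init__(self, *, min_width: int, min_height: int) -> None:
        self.min_width = min_width
        self.min_height = min_height

    def transform(self, dataframe: pd.DataFrame) -> pd.DataFrame:
        """Keep only the rows whose image is at least min_width x min_height."""
        mask = (dataframe["width"] >= self.min_width) & (
            dataframe["height"] >= self.min_height
        )
        return dataframe[mask]


# Two small partitions, transformed one at a time.
partitions = [
    pd.DataFrame({"width": [640, 320], "height": [640, 480]}),
    pd.DataFrame({"width": [1024, 800], "height": [768, 500]}),
]

component = ResolutionFilter(min_width=600, min_height=600)
result = pd.concat([component.transform(p) for p in partitions], ignore_index=True)
print(result)
```

Since `transform` only ever sees one partition, logic written this way should not rely on information from other partitions, such as global counts or cross-partition duplicates.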

Running your pipeline

Once you have a pipeline, you can easily compile and run it using the built-in CLI:

fondant run pipeline.py --local

To see all available arguments, you can check the fondant CLI help pages:

fondant --help

Or for a subcommand:

fondant <subcommand> --help

🚧 Current state and roadmap

Fondant is currently in the alpha stage, offering a minimum viable interface. While you should expect to run into rough edges, the foundations are ready and Fondant should already be able to speed up your data preparation work.

The following topics are on our roadmap:

Local pipeline execution
Non-linear pipeline DAGs
LLM-focused example pipelines and reusable components
Static validation, caching, and partial execution of pipelines
Data lineage and experiment tracking
Distributed execution, both on and off cluster
Support other dataframe libraries such as HF Datasets, Polars, Spark
Move reusable components into a decentralized component registry
Create datasets of copyright-free data for fine-tuning
Create reusable components for bias detection and mitigation

The roadmap and priorities are defined based on community feedback. To provide input, you can join our Discord or submit an idea in our GitHub Discussions.

For a detailed view of the roadmap and day-to-day development, you can check our GitHub project board.

👭 Contributing

We welcome contributions of different kinds:


Issues	If you encounter any issue or bug, please submit it as a GitHub issue. You can also submit a pull request directly to fix any clear bugs.
Suggestions and feedback	If you have any suggestions or feedback, please reach out via our Discord server or GitHub Discussions!
Framework code contributions	If you want to help with the development of the Fondant framework, have a look at the issues marked with the good first issue label. If you want to add additional functionality, please submit an issue for it first.
Reusable components	Extending our library of reusable components is a great way to contribute. If you built a component which would be useful for other users, please submit a PR adding it to the components/ directory.
Example pipelines	If you built a pipeline with Fondant which can serve as an example to other users, please submit a PR adding it to the examples/ directory.

Environment setup

We use poetry and pre-commit to enable a smooth developer flow. Run the following commands to set up your development environment:

pip install poetry
poetry install
pre-commit install
