fmeval

Amazon Foundation Model Evaluations

Development Status
- 1 - Planning
Intended Audience
- Developers
License
- OSI Approved :: Apache Software License
Natural Language
- English

Project description

Foundation Model Evaluations Library

fmeval is a library to evaluate Large Language Models (LLMs) in order to help select the best LLM for your use case. The library evaluates LLMs for the following tasks:

- Open-ended generation: The production of natural human responses to text that does not have a predefined structure.
- Text summarization: The generation of a condensed summary that retains the key information contained in a longer text.
- Question answering: The generation of a relevant and accurate response to a question.
- Classification: Assigning a category, such as a label or score, to text based on its content.

The library contains:

- Algorithms to evaluate LLMs for Accuracy, Toxicity, Semantic Robustness and Prompt Stereotyping across different tasks.
- Implementations of the ModelRunner interface. ModelRunner encapsulates the logic for invoking different types of LLMs, exposing a predict method to simplify interactions with LLMs within the eval algorithm code. We have built-in support for Amazon SageMaker Endpoints and JumpStart models. Users can extend the interface for their own model classes by implementing the predict method.
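To illustrate the idea of extending the interface by implementing predict, here is a minimal sketch. It deliberately does not import fmeval: ModelRunnerSketch is a hypothetical stand-in for the real ModelRunner base class, and the (output text, log probability) return shape is an assumption rather than something stated above. EchoModelRunner is likewise a toy name invented for this example.

```python
from abc import ABC, abstractmethod
from typing import Optional, Tuple


class ModelRunnerSketch(ABC):
    """Hypothetical stand-in for the ModelRunner interface (not fmeval's class)."""

    @abstractmethod
    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        """Return (generated text, log probability); assumed shape, either may be None."""


class EchoModelRunner(ModelRunnerSketch):
    """Toy runner that echoes the prompt instead of calling a hosted LLM."""

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        # A real runner would send the prompt to its model here and parse the response.
        return f"echo: {prompt}", None


runner = EchoModelRunner()
output, log_prob = runner.predict("Hello")
```

In actual use you would subclass the ModelRunner class shipped with fmeval instead of the stand-in, so that eval algorithms can call predict on your runner.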

Installation

fmeval is developed under Python 3.10. To install the package, run:

pip install fmeval

Usage

You can see examples of running evaluations on your LLMs with built-in or custom datasets in the examples folder.

The main steps for using fmeval are:

- Create a ModelRunner that can invoke your LLM. fmeval provides built-in support for Amazon SageMaker Endpoints and JumpStart LLMs. You can also extend the ModelRunner interface for LLMs hosted elsewhere.
- Use any of the supported eval_algorithms.

For example,

from fmeval.eval_algorithms.toxicity import Toxicity, ToxicityConfig

# model_runner is the ModelRunner instance created in the first step.
eval_algo = Toxicity(ToxicityConfig())
eval_output = eval_algo.evaluate(model=model_runner)

Note: You can update the default eval config parameters for your specific use case.

Using a custom dataset for an evaluation

fmeval comes with built-in datasets configured, which the eval algorithms use to compute scores. To use a custom dataset instead, follow the steps below.

Create a DataConfig for your custom dataset

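from fmeval.data_loaders.data_config import DataConfig

# Describe where the dataset lives and which fields hold the input and target.
config = DataConfig(
    dataset_name="custom_dataset",
    dataset_uri="./custom_dataset.jsonl",
    dataset_mime_type="application/jsonlines",
    model_input_location="question",
    target_output_location="answer",
)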

Use an eval algorithm with a custom dataset

eval_algo = Toxicity(ToxicityConfig())
eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config)
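The DataConfig above points at a JSON Lines file and names question and answer as the fields holding the model input and the target output. As a rough sketch of what such a file could look like, the snippet below writes and reads back two records using only the standard library. The record contents and the write_jsonl and read_jsonl helpers are made up for illustration, and the file is written to a temporary directory rather than ./custom_dataset.jsonl.

```python
import json
import tempfile
from pathlib import Path

# Illustrative records: one JSON object per line, with the two fields
# referenced by model_input_location and target_output_location.
records = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "How many days are in a week?", "answer": "7"},
]


def write_jsonl(path: Path, rows: list) -> None:
    """Write each row as a single JSON object on its own line."""
    with path.open("w", encoding="utf-8") as f:
        for row in rows:
            f.write(json.dumps(row) + "\n")


def read_jsonl(path: Path) -> list:
    """Read a JSON Lines file back into a list of dicts, skipping blank lines."""
    with path.open("r", encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


dataset_path = Path(tempfile.mkdtemp()) / "custom_dataset.jsonl"
write_jsonl(dataset_path, records)
loaded = read_jsonl(dataset_path)
```

Reading the file back before running an evaluation is a cheap way to confirm that every line parses and carries both fields.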

Please refer to the developer guide and examples for more details around the usage of eval algorithms.

Troubleshooting

Users running fmeval on a Windows machine may encounter the error OSError: [Errno 0] AssignProcessToJobObject() failed when fmeval internally calls ray.init(). This OS error is a known Ray issue. Multiple users have reported that installing Python from the official Python website rather than from the Microsoft Store fixes it. Ray's webpage has more details on the limitations of running Ray on Windows.
If you run into the error error: can't find Rust compiler while installing fmeval on a Mac, try running the steps below. The toolchain names in these commands target Apple silicon (aarch64-apple-darwin).

curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh
rustup install 1.72.1
rustup default 1.72.1-aarch64-apple-darwin
rustup toolchain remove stable-aarch64-apple-darwin
rm -rf $HOME/.rustup/toolchains/stable-aarch64-apple-darwin
mv $HOME/.rustup/toolchains/1.72.1-aarch64-apple-darwin $HOME/.rustup/toolchains/stable-aarch64-apple-darwin

If you run into out of memory (OOM) errors, especially while running evaluations that use LLMs as evaluators, such as toxicity and summarization accuracy, your machine likely does not have enough memory to load the evaluator models. By default, fmeval loads multiple copies of the model into memory to maximize parallelization; the exact number depends on the number of cores on the machine. To reduce the number of models loaded in parallel, set the environment variable PARALLELIZATION_FACTOR to a value that suits your machine.
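One way to set PARALLELIZATION_FACTOR from Python is sketched below. The value "1" is only an illustrative choice, not a recommendation from this document, and setting the variable before any fmeval import is a conservative assumption, since the text above does not say when fmeval reads it.

```python
import os

# Assumption: "1" is an illustrative value; choose one that fits your machine's memory.
# Environment variable values must be strings.
os.environ["PARALLELIZATION_FACTOR"] = "1"

# Import and run fmeval only after this point so the setting is already in place.
parallelization_factor = int(os.environ["PARALLELIZATION_FACTOR"])
```

Exporting the variable in the shell before launching Python achieves the same effect.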

Development

Setup and the use of `devtool`

Once you have created a virtual environment with Python 3.10, run the following commands to set up the development environment:

./devtool install_deps_dev
./devtool install_deps
./devtool all

Before submitting a PR, rerun ./devtool all for testing and linting. It should run without errors.

Adding python dependencies

We use poetry to manage python dependencies in this project. If you want to add a new dependency, please update the pyproject.toml file, and run the poetry update command to update the poetry.lock file (which is checked in).

Other than this step to add dependencies, use devtool commands for installing dependencies, linting and testing. Execute the command ./devtool without any arguments to see a list of available options.

Adding your own Eval Algorithm

Details TBA

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.

Release history

- 1.0.3 (May 10, 2024)
- 1.0.2 (Apr 25, 2024)
- 1.0.1 (Apr 17, 2024)
- 1.0.0 (Mar 29, 2024), this version
- 0.4.0 (Feb 21, 2024)
- 0.3.0 (Dec 13, 2023)
- 0.2.1 (Dec 7, 2023)
- 0.2.0 (Nov 29, 2023)
- 0.1.0 (Nov 3, 2023), yanked. Reason this release was yanked: initial version

Download files

Source Distributions

No source distribution files available for this release.

Built Distribution

fmeval-1.0.0-py3-none-any.whl (129.7 kB)

Uploaded Mar 29, 2024 Python 3

Hashes for fmeval-1.0.0-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b075ab8870ec1fe7a8e6bed025a9da1e2e0d8d11ee194d1f0750736c76e8bc25`
MD5	`f5b3e8ab09739c500d89a8a0e6eeb65c`
BLAKE2b-256	`dd42c040db8017984b08ad96e149c0de9cb1c297ff69a27ad32a3238a9a0c5fd`
