Amazon Foundation Model Evaluations

Foundation Model Evaluations Library

FMEval is a library for evaluating Large Language Models (LLMs), helping you select the best LLM for your use case. The library can help evaluate LLMs on the following tasks:

  • Open-ended generation - the production of natural human responses to general questions that do not have a pre-defined structure.
  • Text summarization - the verbatim extraction of a few pieces of highly relevant text (extraction) or a condensed rewording of the original text (abstraction).
  • Question Answering - the generation of a relevant and accurate response to a question.
  • Classification - assigning a category, such as a label or score, to text based on its content.

The library contains the following:

  • Implementation of popular metrics (eval algorithms) such as Accuracy, Toxicity, Semantic Robustness and Prompt Stereotyping for evaluating LLMs across different tasks.
  • Implementation of the ModelRunner interface. ModelRunner encapsulates the logic for invoking LLMs, exposing a predict method that greatly simplifies interactions with LLMs within eval algorithm code. The interface can be extended by users for their own LLMs, as sketched below. Built-in support is provided for AWS SageMaker JumpStart Endpoints, AWS SageMaker Endpoints and Bedrock Models.
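
If you want to plug in a model that is not covered by the built-in runners, the sketch below shows one way a custom ModelRunner could look. It is only a sketch: the import path, the (generated text, log probability) return contract of predict, and the HTTP endpoint are assumptions to verify against the library documentation.

from typing import Optional, Tuple

import requests

# Assumed import path for the abstract ModelRunner interface.
from fmeval.model_runners.model_runner import ModelRunner


class HttpModelRunner(ModelRunner):
    """Sketch of a runner that invokes an LLM behind a hypothetical HTTP endpoint."""

    def __init__(self, endpoint_url: str):
        self._endpoint_url = endpoint_url

    def predict(self, prompt: str) -> Tuple[Optional[str], Optional[float]]:
        # Send the prompt to the endpoint and return the generated text.
        # The second tuple element is an optional log probability, which this
        # hypothetical endpoint does not provide, so None is returned.
        response = requests.post(self._endpoint_url, json={"prompt": prompt}, timeout=30)
        response.raise_for_status()
        return response.json().get("generated_text"), None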

Installation

To install the package from PyPI, run:

pip install fmeval

Usage

You can see examples of running evaluations on your LLMs with built-in or custom datasets in the examples folder.

Main steps for using fmeval are:

  1. Create a ModelRunner that can invoke your LLM. Built-in support is provided for AWS SageMaker JumpStart Endpoints, AWS SageMaker Endpoints and AWS Bedrock Models. You can also extend the ModelRunner interface for LLMs hosted anywhere.
  2. Use any of the supported eval_algorithms.
eval_algo = get_eval_algorithm("toxicity", ToxicityConfig())
eval_output = eval_algo.evaluate(model=model_runner)

Note: You can update the default eval config parameters for your specific use case.
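Putting the two steps together, a minimal end-to-end sketch might look like the following. The import paths are assumptions, so check the library documentation and the examples folder for the exact module layout.

# Assumed import paths; verify against the fmeval documentation.
from fmeval.eval import get_eval_algorithm
from fmeval.eval_algorithms.toxicity import ToxicityConfig

model_runner = ...  # any ModelRunner implementation from step 1 (built-in or custom)

eval_algo = get_eval_algorithm("toxicity", ToxicityConfig())
eval_output = eval_algo.evaluate(model=model_runner)
print(eval_output)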

Using a custom dataset for an evaluation

The library ships with built-in datasets that the eval algorithms consume to compute scores. You can use a custom dataset instead, as follows.

  1. Create a DataConfig for your custom dataset
config = DataConfig(
    dataset_name="custom_dataset",
    dataset_uri="./custom_dataset.jsonl",
    dataset_mime_type="application/jsonlines",
    model_input_location="question",
    target_output_location="answer",
)
  2. Use an eval algorithm with the custom dataset
eval_algo = get_eval_algorithm("toxicity", ToxicityConfig())
eval_output = eval_algo.evaluate(model=model_runner, dataset_config=config)
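
For reference, the custom dataset file above is in JSON Lines format, and each record must expose keys matching model_input_location ("question") and target_output_location ("answer"). The records below are purely illustrative; a small script like this produces a compatible file:

import json

# Illustrative records; each line holds one example whose keys match the
# model_input_location ("question") and target_output_location ("answer")
# fields declared in the DataConfig above.
records = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote Hamlet?", "answer": "William Shakespeare"},
]

with open("custom_dataset.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")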

Please refer to the code documentation and examples for further details on using the eval algorithms.

Development

Setup

Once a virtual environment with Python 3.10 is set up, run the following command to install all dependencies:

./devtool all

Adding python dependencies

We use Poetry to manage Python dependencies in this project. To add a new dependency, update the pyproject.toml file and run the poetry update command to refresh the poetry.lock file (which is checked in).

Aside from adding dependencies as described above, everything else should be managed with devtool commands.
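
For example, assuming the project keeps its runtime dependencies in the standard [tool.poetry.dependencies] table of pyproject.toml, adding a hypothetical package would mean adding an entry such as:

some-package = "^1.0"

and then regenerating the lock file:

poetry update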

Adding your own Eval Algorithm

Details TBA

Security

See CONTRIBUTING for more information.

License

This project is licensed under the Apache-2.0 License.
