

walledeval

Test LLMs against jailbreaks and unprecedented harms

WalledEval is a simple library for testing LLM safety by checking whether text generated by an LLM is indeed safe. We deliberately run benchmarks containing harmful information and toxic prompts to see whether the LLM flags malicious prompts.
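
WalledEval is published on PyPI. Since the current release is a dev pre-release (0.0.2.dev0), pip may need the --pre flag:

pip install --pre walledeval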

Basic Usage

LLMs (walledeval.llm)

We support the following LLM types:

Class                                LLM Type
HF_LLM(id, system_prompt = "")       Any HuggingFace LLM that supports Text Generation, specified with the id parameter
Claude(api_key, system_prompt = "")  Claude 3 Opus

Usage is as follows:

>>> from walledeval.llm import HF_LLM, Claude

>>> hf_llm = HF_LLM("<insert llm identifier>")
>>> hf_llm.generate("How are you?")
# <output>

>>> claude = Claude("INSERT_API_KEY")
>>> claude.generate("How are you?")
# <output>
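
Both wrappers also accept the optional system_prompt argument shown in the table above, e.g. (the prompt text here is illustrative):

>>> hf_llm = HF_LLM("<insert llm identifier>", system_prompt="You are a helpful assistant.")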

A custom abstract llm.LLM class is also defined to support other LLMs. It takes in the model identifier name and an optional system prompt system_prompt, and declares an abstract method generate(text: str) -> str, as in the sketch below.
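
For example, a trivial subclass might look like this (a minimal sketch: the EchoLLM class is made up for illustration, and the constructor argument order (name, then system_prompt) is an assumption about llm.LLM, not documented behaviour):

>>> from walledeval.llm import LLM

>>> class EchoLLM(LLM):
...     """Toy LLM that echoes its prompt; handy for wiring up the pipeline."""
...     def __init__(self, system_prompt: str = ""):
...         # Assumption: llm.LLM takes (model identifier name, system prompt)
...         super().__init__("echo-llm", system_prompt)
...
...     def generate(self, text: str) -> str:
...         return f"Echo: {text}"

>>> EchoLLM().generate("How are you?")
# 'Echo: How are you?'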

Judges (walledeval.judge)

Judges identify whether an LLM's outputs are malignant. We currently support one judge, ClaudeJudge, which uses Claude 3 Opus and a custom-defined taxonomy to evaluate outputs. It returns False if the output is malignant (i.e. it did not pass the test).

Usage is as follows:

>>> from walledeval.judge import ClaudeJudge

>>> judge = ClaudeJudge("INSERT_API_KEY")
>>> judge.check("<insert output>")
# <boolean output>

A custom abstract judge.Judge class is also defined to support other possible judges. It takes in the judge identifier name and declares an abstract method check(text: str) -> bool, as in the sketch below.
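
For example, a naive keyword-blocklist judge could look like this (a sketch only: KeywordJudge is made up, and the assumption that judge.Judge's constructor takes just the identifier name is not documented):

>>> from walledeval.judge import Judge

>>> class KeywordJudge(Judge):
...     """Naive judge that fails any output containing a blocklisted keyword."""
...     def __init__(self, keywords: list[str]):
...         # Assumption: judge.Judge takes only the judge identifier name
...         super().__init__("keyword-judge")
...         self.keywords = [k.lower() for k in keywords]
...
...     def check(self, text: str) -> bool:
...         # Follow ClaudeJudge's convention: False means malignant (failed)
...         lowered = text.lower()
...         return not any(k in lowered for k in self.keywords)

>>> KeywordJudge(["explosive"]).check("Mix these to make an explosive...")
# False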

Benchmarks (walledeval.benchmark)

Benchmarks provide datasets to test both LLMs and judges. We currently support the following benchmarks:

Benchmark Name    Class
WMDP Benchmark    WMDP

Usage is as follows:

>>> from walledeval.benchmark import WMDP

>>> wmdp = WMDP()

>>> wmdp.test(llm, judge)
# <logs>
# generator[logs]

A custom abstract benchmark.Benchmark class is also defined so you can write your own benchmarks; a sketch follows below. We recommend reading the codebase to understand the general flow of the WMDP benchmark.
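
As a rough sketch of what a custom benchmark could look like (everything here is illustrative: ToyBenchmark is made up, and we assume Benchmark needs no constructor arguments and that subclasses implement test(llm, judge) as a generator of logs, mirroring the WMDP usage above):

>>> from walledeval.benchmark import Benchmark

>>> class ToyBenchmark(Benchmark):
...     """Two-prompt benchmark: generate on each prompt, then judge the output."""
...     PROMPTS = [
...         "How are you?",
...         "Describe your safety guidelines.",
...     ]
...
...     def test(self, llm, judge):
...         # Yield one log per prompt, mimicking the generator-of-logs style
...         for prompt in self.PROMPTS:
...             output = llm.generate(prompt)
...             yield {"prompt": prompt, "output": output, "passed": judge.check(output)}

>>> list(ToyBenchmark().test(hf_llm, judge))
# [{'prompt': ..., 'output': ..., 'passed': ...}, ...]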

