

walledeval

Test LLMs against jailbreaks and unprecedented harms

WalledEval is a simple library for testing LLM safety by checking whether text generated by an LLM is indeed safe. We deliberately run benchmarks containing harmful information and toxic prompts to see whether the LLM flags malicious prompts.
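
WalledEval is published on PyPI. Since the current release is a dev pre-release (0.0.2.dev0), pip may need the --pre flag:

pip install --pre walledeval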

Basic Usage

LLMs (walledeval.llm)

We support the following LLM types:

Class                                LLM Type
HF_LLM(id, system_prompt = "")       Any HuggingFace LLM that supports Text Generation, specified with the id parameter
Claude(api_key, system_prompt = "")  Claude 3 Opus

Usage is as follows:

>>> from walledeval.llm import HF_LLM, Claude

>>> hf_llm = HF_LLM("<insert llm identifier>")
>>> hf_llm.generate("How are you?")
# <output>

>>> claude = Claude("INSERT_API_KEY")
>>> claude.generate("How are you?")
# <output>
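
Both wrappers also accept the optional system_prompt argument shown in the table above, e.g. (the prompt text here is illustrative):

>>> hf_llm = HF_LLM("<insert llm identifier>", system_prompt="You are a helpful assistant.")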

A custom abstract llm.LLM class is also defined to support other LLMs. It takes in the model identifier name and an optional system prompt system_prompt, and declares an abstract method generate(text: str) -> str, as in the sketch below.
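
For example, a trivial subclass might look like this (a minimal sketch: the EchoLLM class is made up for illustration, and the constructor argument order (name, then system_prompt) is an assumption about llm.LLM, not documented behaviour):

>>> from walledeval.llm import LLM

>>> class EchoLLM(LLM):
...     """Toy LLM that echoes its prompt; handy for wiring up the pipeline."""
...     def __init__(self, system_prompt: str = ""):
...         # Assumption: llm.LLM takes (model identifier name, system prompt)
...         super().__init__("echo-llm", system_prompt)
...
...     def generate(self, text: str) -> str:
...         return f"Echo: {text}"

>>> EchoLLM().generate("How are you?")
# 'Echo: How are you?'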

Judges (walledeval.judge)

Judges identify whether an LLM's outputs are malignant. We currently support one judge, ClaudeJudge, which uses Claude 3 Opus and a custom-defined taxonomy to evaluate outputs. It returns False if the output is malignant (i.e. it did not pass the test).

Usage is as follows:

>>> from walledeval.judge import ClaudeJudge

>>> judge = ClaudeJudge("INSERT_API_KEY")
>>> judge.check("<insert output>")
# <boolean output>

A custom abstract judge.Judge class is also defined to support other possible judges. It takes in the judge identifier name and declares an abstract method check(text: str) -> bool, as in the sketch below.
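
For example, a naive keyword-blocklist judge could look like this (a sketch only: KeywordJudge is made up, and the assumption that judge.Judge's constructor takes just the identifier name is not documented):

>>> from walledeval.judge import Judge

>>> class KeywordJudge(Judge):
...     """Naive judge that fails any output containing a blocklisted keyword."""
...     def __init__(self, keywords: list[str]):
...         # Assumption: judge.Judge takes only the judge identifier name
...         super().__init__("keyword-judge")
...         self.keywords = [k.lower() for k in keywords]
...
...     def check(self, text: str) -> bool:
...         # Follow ClaudeJudge's convention: False means malignant (failed)
...         lowered = text.lower()
...         return not any(k in lowered for k in self.keywords)

>>> KeywordJudge(["explosive"]).check("Mix these to make an explosive...")
# False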

Benchmarks (walledeval.benchmark)

Benchmarks provide datasets to test both LLMs and judges. We currently support the following benchmarks:

Benchmark Name    Class
WMDP Benchmark    WMDP

Usage is as follows:

>>> from walledeval.benchmark import WMDP

>>> wmdp = WMDP()

>>> wmdp.test(llm, judge)
# <logs>
# generator[logs]

A custom abstract benchmark.Benchmark class is also defined so you can write your own benchmarks; a sketch follows below. We recommend reading the codebase to understand the general flow of the WMDP benchmark.
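
As a rough sketch of what a custom benchmark could look like (everything here is illustrative: ToyBenchmark is made up, and we assume Benchmark needs no constructor arguments and that subclasses implement test(llm, judge) as a generator of logs, mirroring the WMDP usage above):

>>> from walledeval.benchmark import Benchmark

>>> class ToyBenchmark(Benchmark):
...     """Two-prompt benchmark: generate on each prompt, then judge the output."""
...     PROMPTS = [
...         "How are you?",
...         "Describe your safety guidelines.",
...     ]
...
...     def test(self, llm, judge):
...         # Yield one log per prompt, mimicking the generator-of-logs style
...         for prompt in self.PROMPTS:
...             output = llm.generate(prompt)
...             yield {"prompt": prompt, "output": output, "passed": judge.check(output)}

>>> list(ToyBenchmark().test(hf_llm, judge))
# [{'prompt': ..., 'output': ..., 'passed': ...}, ...]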

