PyTorch implementation of BERT score

BERTScore

Automatic Evaluation Metric described in the paper BERTScore: Evaluating Text Generation with BERT.

News:

Our arXiv paper has been updated to v2 with more experiments and analysis.
Updated to version 0.2.0
- Supporting BERT, XLM, XLNet, and RoBERTa models using huggingface's Transformers library
- Automatically picking the best model for a given language
- Automatically picking the layer based on the model
- IDF is no longer set as default, since we show in the new version of the paper that the improvement brought by importance weighting is not consistent

Authors:

*: Equal Contribution

Overview

BERTScore leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity. It has been shown to correlate with human judgment on sentence-level and system-level evaluation. Moreover, BERTScore computes precision, recall, and F1 measures, which can be useful for evaluating different language generation tasks.

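For an illustration, BERTScore precision can be computed by matching each candidate token to its most similar reference token under cosine similarity and averaging these maximum similarities over the candidate tokens.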
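As a minimal sketch of this computation, assuming token embeddings are already available as plain lists of floats, the greedy matching behind precision (and, symmetrically, recall) could look like the following. The helper names and the toy 2-d vectors are illustrative assumptions and are not part of the bert_score package, which works on BERT embeddings in PyTorch.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_scores(cand_vecs, ref_vecs):
    """Return (precision, recall, f1) from greedy cosine matching."""
    # Precision: match every candidate token to its most similar reference token.
    precision = sum(
        max(cosine(c, r) for r in ref_vecs) for c in cand_vecs
    ) / len(cand_vecs)
    # Recall: match every reference token to its most similar candidate token.
    recall = sum(
        max(cosine(r, c) for c in cand_vecs) for r in ref_vecs
    ) / len(ref_vecs)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy 2-d "embeddings", made up for illustration (not produced by BERT).
cand = [[1.0, 0.0], [0.0, 1.0]]
ref = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
p, r, f = greedy_match_scores(cand, ref)
```

In this toy case every candidate vector has an exact counterpart in the reference, so precision is 1, while the unmatched reference vector pulls recall below 1.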

If you find this repo useful, please cite:

@article{bert-score,
  title={BERTScore: Evaluating Text Generation with BERT},
  author={Zhang, Tianyi and Kishore, Varsha and Wu, Felix and Weinberger, Kilian Q. and Artzi, Yoav},
  journal={arXiv preprint arXiv:1904.09675},
  year={2019}
}

Installation

Python version >= 3.6
PyTorch version >= 1.0.0

Install from pip with:

pip install bert-score

Install from source with:

git clone https://github.com/Tiiiger/bert_score
cd bert_score
pip install .

You may then test your installation with:

python -m unittest discover

Usage

Command Line Interface (CLI)

We provide a command line interface (CLI) for BERTScore as well as a Python module. The CLI can be used as follows:

To evaluate English text files:

We provide example inputs under ./example.

bert-score -r example/refs.txt -c example/hyps.txt --lang en

You will get the following output at the end:

roberta-large_L17_no-idf_version=0.2.0 BERT-P: 0.950530 BERT-R: 0.949223 BERT-F1: 0.949839

where "roberta-large_L17_no-idf_version=0.2.0" is the hash code.
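If the scores need to be collected programmatically, the summary line can be split into the hash code and the three values. The following is a minimal sketch that assumes the line has exactly the layout shown above; parse_cli_summary is a hypothetical helper, not something the package provides.

```python
import re


def parse_cli_summary(line):
    """Split a bert-score summary line into its hash code and scores."""
    # The hash code is everything before the first space.
    hash_code, _, rest = line.strip().partition(" ")
    # Collect "BERT-P: x", "BERT-R: y", "BERT-F1: z" pairs as floats.
    scores = {
        name: float(value)
        for name, value in re.findall(r"(BERT-(?:P|R|F1)): ([0-9.]+)", rest)
    }
    return hash_code, scores


line = (
    "roberta-large_L17_no-idf_version=0.2.0 "
    "BERT-P: 0.950530 BERT-R: 0.949223 BERT-F1: 0.949839"
)
hash_code, scores = parse_cli_summary(line)
```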

To evaluate text files in other languages:

We currently support the 104 languages in multilingual BERT (full list).

Please specify the two-letter abbreviation of the language. For instance, use --lang zh for Chinese text.

See more options with bert-score -h.

Python Function

For the Python module, we provide a demo. Please refer to bert_score/score.py for more details.

Running BERTScore can be computationally intensive (because it uses BERT :p). Therefore, a GPU is usually necessary. If you don't have access to a GPU, you can try our demo on Google Colab.
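As a rough sketch of calling the Python module, the wrapper below passes aligned lists of candidate and reference sentences to bert_score.score. It assumes that score accepts a lang keyword mirroring the CLI's --lang option and returns per-sentence precision, recall, and F1 tensors; score_pairs itself is a hypothetical helper, so check bert_score/score.py for the exact signature in your version.

```python
def score_pairs(cands, refs, lang="en"):
    """Return mean (precision, recall, F1) for aligned sentence lists."""
    if len(cands) != len(refs):
        raise ValueError("cands and refs must have the same length")

    # Imported here so the helper can be defined without the package installed.
    from bert_score import score

    # Assumption: score returns three tensors with one entry per sentence pair.
    P, R, F1 = score(cands, refs, lang=lang)
    return P.mean().item(), R.mean().item(), F1.mean().item()
```

A call such as score_pairs(["The cat sat on the mat."], ["A cat was sitting on the mat."]) downloads the default model on first use, so it is best tried on a machine with a GPU or on Colab.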

Practical Tips

- Report the hash code (e.g., roberta-large_L17_no-idf_version=0.2.0) in your paper so that people know what setting you used. This is inspired by sacreBLEU.
- Unlike BERT, RoBERTa uses a GPT2-style tokenizer, which creates additional " " tokens when multiple spaces appear together. It is recommended to remove the extra spaces with sent = re.sub(r' +', ' ', sent) or sent = re.sub(r'\s+', ' ', sent).
- Using inverse document frequency (idf) on the reference sentences to weigh word importance may correlate better with human judgment. However, when the set of reference sentences becomes too small, the idf scores become inaccurate or invalid, so idf is now optional. To use idf, set --idf when using the CLI tool or idf=True when calling the bert_score.score function.
- When you are low on GPU memory, consider setting a smaller batch_size when calling the bert_score.score function.
- To use a particular model, set -m MODEL_TYPE when using the CLI tool or model_type=MODEL_TYPE when calling the bert_score.score function.
- We tune the layer to use based on the WMT16 metric evaluation dataset. You may use a different layer by setting -l LAYER when using the CLI tool or num_layers=LAYER when calling the bert_score.score function.
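The whitespace tip above can be applied to a whole list of sentences before scoring. This is a minimal sketch using the re.sub pattern quoted in the tip; normalize_spaces is a hypothetical helper name, and the extra strip() of leading and trailing spaces is an addition that the tip does not mention.

```python
import re


def normalize_spaces(sent):
    """Collapse runs of whitespace into one space and trim both ends."""
    return re.sub(r"\s+", " ", sent).strip()


sents = ["The  cat   sat.", "A\tquick \n test "]
cleaned = [normalize_spaces(s) for s in sents]
```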

Default Behavior

Default Model

Language	Model
en	roberta-large
zh	bert-base-chinese
others	bert-base-multilingual-cased

Default Layers

Model	Best Layer
bert-base-uncased	9
bert-large-uncased	18
bert-base-cased-finetuned-mrpc	9
bert-base-multilingual-cased	9
bert-base-chinese	8
roberta-base	10
roberta-large	17
roberta-large-mnli	19
xlnet-base-cased	5
xlnet-large-cased	7
xlm-mlm-en-2048	7
xlm-mlm-100-1280	11
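The two tables can be read as a simple lookup: the language selects a model, and the model selects a layer. The sketch below encodes them as dictionaries and rebuilds the hash code format shown earlier. The function names are hypothetical, and the "idf" tag used when idf=True is an assumption, since only the no-idf form appears in this document.

```python
# Default model per language, copied from the "Default Model" table.
DEFAULT_MODEL = {"en": "roberta-large", "zh": "bert-base-chinese"}
FALLBACK_MODEL = "bert-base-multilingual-cased"

# Best layer per model, copied from the "Default Layers" table.
DEFAULT_LAYER = {
    "bert-base-uncased": 9,
    "bert-large-uncased": 18,
    "bert-base-cased-finetuned-mrpc": 9,
    "bert-base-multilingual-cased": 9,
    "bert-base-chinese": 8,
    "roberta-base": 10,
    "roberta-large": 17,
    "roberta-large-mnli": 19,
    "xlnet-base-cased": 5,
    "xlnet-large-cased": 7,
    "xlm-mlm-en-2048": 7,
    "xlm-mlm-100-1280": 11,
}


def default_config(lang):
    """Return the (model, layer) pair the tables imply for a language code."""
    model = DEFAULT_MODEL.get(lang, FALLBACK_MODEL)
    return model, DEFAULT_LAYER[model]


def hash_code(lang, idf=False, version="0.2.0"):
    """Rebuild a hash code string in the format shown in the CLI output."""
    model, layer = default_config(lang)
    idf_tag = "idf" if idf else "no-idf"  # "idf" spelling is an assumption
    return f"{model}_L{layer}_{idf_tag}_version={version}"
```

With these tables, hash_code("en") reproduces the roberta-large_L17_no-idf_version=0.2.0 string from the CLI example.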

Acknowledgement

This repo wouldn't be possible without the awesome bert and pytorch-pretrained-BERT.

Release history

Version	Release date
0.3.13	Feb 20, 2023
0.3.12	Oct 14, 2022
0.3.11	Dec 10, 2021
0.3.10	Aug 5, 2021
0.3.9	Apr 17, 2021
0.3.8	Mar 3, 2021
0.3.7	Dec 6, 2020
0.3.6	Sep 3, 2020
0.3.5	Jul 17, 2020
0.3.4	Jun 10, 2020
0.3.3	May 10, 2020
0.3.2	Apr 18, 2020
0.3.1	Mar 5, 2020
0.3.0	Jan 14, 2020
0.2.3	Dec 22, 2019
0.2.2	Nov 30, 2019
0.2.1	Oct 29, 2019
0.2.0 (this version)	Oct 2, 2019
0.1.2	Apr 27, 2019
0.1.1	Apr 27, 2019
0.1.0	Apr 23, 2019
