Massive Text Embedding Benchmark

These details have been verified by PyPI

Maintainers

KennethEnevoldsen Muennighoff nouamanetazi nreimers

These details have not been verified by PyPI

Development Status
- 4 - Beta
Environment
- Console
Intended Audience
- Developers
- Information Technology
License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
- Python

Project description

Massive Text Embedding Benchmark - Internal Development Git

Installation

pip install git+https://github.com/embeddings-benchmark/mteb.git

Minimal use

Using a Python script:

from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model)

Using the CLI:

mteb --available_tasks

mteb -m average_word_embeddings_komninos \
    -t Banking77Classification NFCorpus \
    --output_folder results \
    --verbosity 3

Advanced usage

Tasks selection

Tasks can be selected by providing the list of tasks to run, but also

by their types (e.g. "Clustering" or "Classification")

evaluation = MTEB(task_types=['Clustering', 'Retrieval']) # Only select clustering and retrieval tasks

by their categories e.g. "S2S" (sentence to sentence) or "P2P" (paragraph to paragraph)

evaluation = MTEB(task_categories=['S2S']) # Only select sentence2sentence tasks

You can also specify which languages to load for multilingual/crosslingual tasks like this:

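from mteb import MTEB
from mteb.tasks.BitextMining import BUCCBitextMining
# Import path assumed by analogy with BUCCBitextMining above
from mteb.tasks.Classification import AmazonReviewsClassification

evaluation = MTEB(tasks=[
        BUCCBitextMining(langs=["de-en"]), # Only load the "de-en" subset of BUCC
        AmazonReviewsClassification(langs=["en", "fr"]) # Only load "en" and "fr" subsets of Amazon Reviews
])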
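
The selection rules above can be illustrated without installing anything. The following is a minimal sketch, not MTEB's implementation: `TaskInfo` and `select_tasks` are hypothetical names, the rows are copied from the Available tasks table, and combining the two filters with a logical AND is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskInfo:
    name: str
    type: str
    category: str


# A few rows copied from the "Available tasks" table.
TASKS = [
    TaskInfo("Banking77Classification", "Classification", "s2s"),
    TaskInfo("ImdbClassification", "Classification", "p2p"),
    TaskInfo("ArxivClusteringP2P", "Clustering", "p2p"),
    TaskInfo("ArxivClusteringS2S", "Clustering", "s2s"),
    TaskInfo("NFCorpus", "Retrieval", "s2s"),
    TaskInfo("STS12", "STS", "s2s"),
]


def select_tasks(tasks, task_types=None, task_categories=None):
    """Keep tasks that match every filter that is given (None means no filter)."""
    types = set(task_types) if task_types is not None else None
    # Categories are compared case-insensitively: the prose writes "S2S",
    # the table writes "s2s".
    cats = (
        {c.lower() for c in task_categories}
        if task_categories is not None
        else None
    )
    return [
        t
        for t in tasks
        if (types is None or t.type in types)
        and (cats is None or t.category.lower() in cats)
    ]


by_type = [t.name for t in select_tasks(TASKS, task_types=["Clustering", "Retrieval"])]
by_category = [t.name for t in select_tasks(TASKS, task_categories=["S2S"])]
print(by_type)
print(by_category)
```
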

Using a custom model

A custom model only needs to implement an encode function that takes a list of sentences as input and returns a list of embeddings (embeddings can be np.array, torch.tensor, etc.).

class MyModel():
    def encode(self, sentences, batch_size=32):
        """ Returns a list of embeddings for the given sentences.
        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding

        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass

model = MyModel()
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model)
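
For a concrete, dependency-free instance of this interface, here is a minimal sketch of a toy encoder. `HashingModel` is a hypothetical name, the hashed bag-of-words scheme is only a stand-in for a real model, and the extra `**kwargs` parameter is an assumption added so that unexpected keyword arguments are tolerated. It returns plain Python lists; for a real run you would most likely return np.ndarray or tensors as described above.

```python
import hashlib
import math


class HashingModel:
    """Toy encoder: hashed bag-of-words vectors, L2-normalised."""

    def __init__(self, dim=64):
        self.dim = dim

    def _embed(self, sentence):
        vec = [0.0] * self.dim
        for token in sentence.lower().split():
            # md5 gives a stable hash, so results are reproducible across runs.
            digest = hashlib.md5(token.encode("utf-8")).digest()
            index = int.from_bytes(digest[:4], "big") % self.dim
            vec[index] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        # An empty sentence stays a zero vector.
        return [v / norm for v in vec] if norm > 0 else vec

    def encode(self, sentences, batch_size=32, **kwargs):
        """Return one embedding per sentence, processed batch by batch."""
        embeddings = []
        for start in range(0, len(sentences), batch_size):
            batch = sentences[start:start + batch_size]
            embeddings.extend(self._embed(s) for s in batch)
        return embeddings


model = HashingModel(dim=64)
vectors = model.encode(["card not working", "card not working", ""], batch_size=2)
print(len(vectors), len(vectors[0]))
```

An instance of such a class would be passed to evaluation.run(model) in the same way as MyModel above.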

Evaluating on a custom task

To add a new task, implement a class that inherits from the AbsTask associated with the task type (e.g. AbsTaskReranking for reranking tasks). The supported task types are defined under mteb.abstasks.

from mteb import MTEB
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
from sentence_transformers import SentenceTransformer


class MindSmallReranking(AbsTaskReranking):
    @property
    def description(self):
        return {
            "name": "MindSmallReranking",
            "hf_hub_name": "mteb/mind_small",
            "description": "Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research",
            "reference": "https://www.microsoft.com/en-us/research/uploads/prod/2019/03/nl4se18LinkSO.pdf",
            "type": "Reranking",
            "category": "s2s",
            "eval_splits": ["validation"],
            "eval_langs": ["en"],
            "main_score": "map",
        }

model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MindSmallReranking()])
evaluation.run(model)
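
The "main_score" in the description above is "map", i.e. mean average precision. As a rough illustration of what that metric measures, here is a minimal sketch; it is not MTEB's evaluator, the function names are hypothetical, and it assumes every relevant candidate appears somewhere in the ranked list (as is the case when a fixed candidate set is reranked).

```python
def average_precision(relevance):
    """AP for one ranked list; relevance[i] is True if rank i+1 is relevant."""
    hits = 0
    precision_sum = 0.0
    for rank, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            # Precision at this rank, counted only at relevant positions.
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0


def mean_average_precision(ranked_lists):
    """Mean of the per-query average precisions."""
    if not ranked_lists:
        return 0.0
    return sum(average_precision(r) for r in ranked_lists) / len(ranked_lists)


queries = [
    [True, False, True],  # AP = (1/1 + 2/3) / 2 = 5/6
    [False, True],        # AP = (1/2) / 1 = 1/2
]
score = mean_average_precision(queries)
print(round(score, 4))
```
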

Note: for multilingual tasks, make sure your class also inherits from the MultilingualTask class.

Available tasks

Name	Hub URL	Description	Type	Category	N° Languages
BUCC	mteb/bucc-bitext-mining	BUCC bitext mining dataset	BitextMining	s2s	4
Tatoeba	mteb/tatoeba-bitext-mining	1,000 English-aligned sentence pairs for each language based on the Tatoeba corpus	BitextMining	s2s	112
AmazonCounterfactualClassification	mteb/amazon_counterfactual	A collection of Amazon customer reviews annotated for counterfactual detection pair classification.	Classification	s2s	4
AmazonPolarityClassification	mteb/amazon_polarity	Amazon Polarity Classification Dataset.	Classification	s2s	1
AmazonReviewsClassification	mteb/amazon_reviews_multi	A collection of Amazon reviews specifically designed to aid research in multilingual text classification.	Classification	s2s	6
Banking77Classification	mteb/banking77	Dataset composed of online banking queries annotated with their corresponding intents.	Classification	s2s	1
EmotionClassification	mteb/emotion	Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper.	Classification	s2s	1
ImdbClassification	mteb/imdb	Large Movie Review Dataset	Classification	p2p	1
MassiveIntentClassification	mteb/amazon_massive_intent	MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages	Classification	s2s	51
MassiveScenarioClassification	mteb/amazon_massive_scenario	MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages	Classification	s2s	51
MTOPDomainClassification	mteb/mtop_domain	MTOP: Multilingual Task-Oriented Semantic Parsing	Classification	s2s	6
MTOPIntentClassification	mteb/mtop_intent	MTOP: Multilingual Task-Oriented Semantic Parsing	Classification	s2s	6
ToxicConversationsClassification	mteb/toxic_conversations_50k	Collection of comments from the Civil Comments platform together with annotations if the comment is toxic or not.	Classification	s2s	1
TweetSentimentExtractionClassification	mteb/tweet_sentiment_extraction		Classification	s2s	1
ArxivClusteringP2P	mteb/arxiv-clustering-p2p	Clustering of titles+abstract from arxiv. Clustering of 30 sets, either on the main or secondary category	Clustering	p2p	1
ArxivClusteringS2S	mteb/arxiv-clustering-s2s	Clustering of titles from arxiv. Clustering of 30 sets, either on the main or secondary category	Clustering	s2s	1
BiorxivClusteringP2P	mteb/biorxiv-clustering-p2p	Clustering of titles+abstract from biorxiv. Clustering of 10 sets, based on the main category.	Clustering	p2p	1
BiorxivClusteringS2S	mteb/biorxiv-clustering-s2s	Clustering of titles from biorxiv. Clustering of 10 sets, based on the main category.	Clustering	s2s	1
MedrxivClusteringP2P	mteb/medrxiv-clustering-p2p	Clustering of titles+abstract from medrxiv. Clustering of 10 sets, based on the main category.	Clustering	p2p	1
MedrxivClusteringS2S	mteb/medrxiv-clustering-s2s	Clustering of titles from medrxiv. Clustering of 10 sets, based on the main category.	Clustering	s2s	1
RedditClustering	mteb/reddit-clustering	Clustering of titles from 199 subreddits. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences.	Clustering	s2s	1
RedditClusteringP2P	mteb/reddit-clustering-p2p	Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs.	Clustering	p2p	1
StackExchangeClustering	mteb/stackexchange-clustering	Clustering of titles from 121 stackexchanges. Clustering of 25 sets, each with 10-50 classes, and each class with 100 - 1000 sentences.	Clustering	s2s	1
StackExchangeClusteringP2P	mteb/stackexchange-clustering-p2p	Clustering of title+body from stackexchange. Clustering of 5 sets of 10k paragraphs and 5 sets of 5k paragraphs.	Clustering	p2p	1
TwentyNewsgroupsClustering	mteb/twentynewsgroups-clustering	Clustering of the 20 Newsgroups dataset (subject only).	Clustering	s2s	1
SprintDuplicateQuestions	mteb/sprintduplicatequestions-pairclassification	Duplicate questions from the Sprint community.	PairClassification	s2s	1
TwitterSemEval2015	mteb/twittersemeval2015-pairclassification	Paraphrase-Pairs of Tweets from the SemEval 2015 workshop.	PairClassification	s2s	1
TwitterURLCorpus	mteb/twitterurlcorpus-pairclassification	Paraphrase-Pairs of Tweets.	PairClassification	s2s	1
AskUbuntuDupQuestions	mteb/askubuntudupquestions-reranking	AskUbuntu Question Dataset - Questions from AskUbuntu with manual annotations marking pairs of questions as similar or non-similar	Reranking	s2s	1
MindSmallReranking	mteb/mind_small	Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research	Reranking	s2s	1
SciDocs	mteb/scidocs-reranking	Ranking of related scientific papers based on their title.	Reranking	s2s	1
StackOverflowDupQuestions	mteb/stackoverflowdupquestions-reranking	Stack Overflow Duplicate Questions Task for questions with the tags Java, JavaScript and Python	Reranking	s2s	1
ArguAna	nan	ArguAna: Retrieval of the Best Counterargument without Prior Topic Knowledge	Retrieval	s2s	1
ClimateFEVER	nan	CLIMATE-FEVER is a dataset adopting the FEVER methodology that consists of 1,535 real-world claims regarding climate-change.	Retrieval	s2s	1
CQADupstackRetrieval	nan	CQADupStack: A Benchmark Data Set for Community Question-Answering Research	Retrieval	s2s	1
DBPedia	nan	DBpedia-Entity is a standard test collection for entity search over the DBpedia knowledge base	Retrieval	s2s	1
FEVER	nan	FEVER (Fact Extraction and VERification) consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from.	Retrieval	s2s	1
FiQA2018	nan	Financial Opinion Mining and Question Answering	Retrieval	s2s	1
HotpotQA	nan	HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems.	Retrieval	s2s	1
MSMARCO	nan	MS MARCO is a collection of datasets focused on deep learning in search	Retrieval	s2s	1
MSMARCOv2	nan	nan	Retrieval	s2s	1
NFCorpus	nan	NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval	Retrieval	s2s	1
NQ	nan	Natural Questions: A Benchmark for Question Answering Research	Retrieval	s2s	1
QuoraRetrieval	nan	QuoraRetrieval is based on questions that are marked as duplicates on the Quora platform. Given a question, find other (duplicate) questions.	Retrieval	s2s	1
SCIDOCS	nan	SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation.	Retrieval	s2s	1
SciFact	nan	nan	Retrieval	s2s	1
Touche2020	nan	Touché Task 1: Argument Retrieval for Controversial Questions	Retrieval	s2s	1
TRECCOVID	nan	nan	Retrieval	s2s	1
BIOSSES	mteb/biosses-sts	Biomedical Semantic Similarity Estimation.	STS	s2s	1
SICK-R	mteb/sickr-sts	Semantic Textual Similarity SICK-R dataset.	STS	s2s	1
STS12	mteb/sts12-sts	SemEval STS 2012 dataset.	STS	s2s	1
STS13	mteb/sts13-sts	SemEval STS 2013 dataset.	STS	s2s	1
STS14	mteb/sts14-sts	SemEval STS 2014 dataset. Currently only the English dataset	STS	s2s	1
STS15	mteb/sts15-sts	SemEval STS 2015 dataset	STS	s2s	1
STS16	mteb/sts16-sts	SemEval STS 2016 dataset	STS	s2s	1
STS17	mteb/sts17-crosslingual-sts	STS 2017 dataset	STS	s2s	11
STS22	mteb/sts22-crosslingual-sts	SemEval 2022 Task 8: Multilingual News Article Similarity	STS	s2s	18
STSBenchmark	mteb/stsbenchmark-sts	Semantic Textual Similarity Benchmark (STSbenchmark) dataset.	STS	s2s	1
SummEval	mteb/summeval	SummEval: Re-evaluating Summarization Evaluation	Summarization	s2s	1

Release history

1.11.12

May 22, 2024

1.11.11

May 22, 2024

1.11.10

May 22, 2024

1.11.9

May 22, 2024

1.11.8

May 22, 2024

1.11.7

May 22, 2024

1.11.6

May 21, 2024

1.11.5

May 21, 2024

1.11.4

May 21, 2024

1.11.3

May 21, 2024

1.11.2

May 21, 2024

1.11.1

May 21, 2024

1.11.0

May 20, 2024

1.10.18

May 20, 2024

1.10.17

May 20, 2024

1.10.16

May 20, 2024

1.10.15

May 19, 2024

1.10.14

May 19, 2024

1.10.13

May 18, 2024

1.10.12

May 18, 2024

1.10.11

May 18, 2024

1.10.10

May 17, 2024

1.10.9

May 17, 2024

1.10.8

May 17, 2024

1.10.7

May 17, 2024

1.10.6

May 17, 2024

1.10.5

May 16, 2024

1.10.4

May 16, 2024

1.10.3

May 16, 2024

1.10.2

May 16, 2024

1.10.1

May 15, 2024

1.10.0

May 14, 2024

1.9.3

May 14, 2024

1.9.2

May 14, 2024

1.9.1

May 14, 2024

1.9.0

May 13, 2024

1.8.11

May 12, 2024

1.8.10

May 12, 2024

1.8.9

May 11, 2024

1.8.8

May 11, 2024

1.8.7

May 9, 2024

1.8.6

May 8, 2024

1.8.5

May 8, 2024

1.8.4

May 8, 2024

1.8.3

May 7, 2024

1.8.2

May 6, 2024

1.8.1

May 6, 2024

1.8.0

May 5, 2024

1.7.64

May 5, 2024

1.7.63

May 5, 2024

1.7.62

May 5, 2024

1.7.61

May 5, 2024

1.7.60

May 4, 2024

1.7.59

May 4, 2024

1.7.58

May 2, 2024

1.7.57

May 2, 2024

1.7.56

May 2, 2024

1.7.55

May 2, 2024

1.7.54

May 2, 2024

1.7.53

May 2, 2024

1.7.52

May 1, 2024

1.7.51

May 1, 2024

1.7.50

Apr 30, 2024

1.7.49

Apr 30, 2024

1.7.48

Apr 30, 2024

1.7.47

Apr 30, 2024

1.7.46

Apr 29, 2024

1.7.45

Apr 29, 2024

1.7.44

Apr 29, 2024

1.7.43

Apr 29, 2024

1.7.42

Apr 29, 2024

1.7.41

Apr 28, 2024

1.7.40

Apr 28, 2024

1.7.39

Apr 28, 2024

1.7.38

Apr 27, 2024

1.7.37

Apr 27, 2024

1.7.36

Apr 26, 2024

1.7.35

Apr 26, 2024

1.7.34

Apr 26, 2024

1.7.33

Apr 26, 2024

1.7.32

Apr 25, 2024

1.7.31

Apr 25, 2024

1.7.30

Apr 25, 2024

1.7.29

Apr 25, 2024

1.7.28

Apr 25, 2024

1.7.27

Apr 24, 2024

1.7.26

Apr 24, 2024

1.7.25

Apr 24, 2024

1.7.24

Apr 24, 2024

1.7.23

Apr 24, 2024

1.7.22

Apr 24, 2024

1.7.21

Apr 24, 2024

1.7.20

Apr 24, 2024

1.7.19

Apr 24, 2024

1.7.18

Apr 24, 2024

1.7.17

Apr 23, 2024

1.7.16

Apr 23, 2024

1.7.15

Apr 23, 2024

1.7.14

Apr 23, 2024

1.7.13

Apr 23, 2024

1.7.12

Apr 23, 2024

1.7.11

Apr 23, 2024

1.7.10

Apr 23, 2024

1.7.9

Apr 23, 2024

1.7.8

Apr 23, 2024

1.7.7

Apr 23, 2024

1.7.6

Apr 22, 2024

1.7.5

Apr 22, 2024

1.7.4

Apr 21, 2024

1.7.3

Apr 21, 2024

1.7.2

Apr 21, 2024

1.7.1

Apr 21, 2024

1.7.0

Apr 20, 2024

1.6.38

Apr 20, 2024

1.6.37

Apr 20, 2024

1.6.36

Apr 19, 2024

1.6.35

Apr 19, 2024

1.6.34

Apr 19, 2024

1.6.33

Apr 19, 2024

1.6.32

Apr 19, 2024

1.6.31

Apr 19, 2024

1.6.30

Apr 19, 2024

1.6.29

Apr 19, 2024

1.6.28

Apr 19, 2024

1.6.27

Apr 19, 2024

1.6.26

Apr 19, 2024

1.6.25

Apr 18, 2024

1.6.24

Apr 18, 2024

1.6.23

Apr 18, 2024

1.6.22

Apr 18, 2024

1.6.21

Apr 18, 2024

1.6.20

Apr 18, 2024

1.6.19

Apr 18, 2024

1.6.18

Apr 18, 2024

1.6.17

Apr 18, 2024

1.6.16

Apr 17, 2024

1.6.15

Apr 17, 2024

1.6.14

Apr 17, 2024

1.6.13

Apr 17, 2024

1.6.12

Apr 17, 2024

1.6.11

Apr 16, 2024

1.6.10

Apr 15, 2024

1.6.9

Apr 15, 2024

1.6.8

Apr 15, 2024

1.6.7

Apr 15, 2024

1.6.6

Apr 15, 2024

1.6.5

Apr 15, 2024

1.6.4

Apr 15, 2024

1.6.3

Apr 14, 2024

1.6.2

Apr 12, 2024

1.6.1

Apr 11, 2024

1.6.0

Apr 10, 2024

1.5.6

Apr 10, 2024

1.5.5

Apr 9, 2024

1.5.4

Apr 8, 2024

1.5.3

Apr 8, 2024

1.5.2

Apr 4, 2024

1.5.1

Apr 3, 2024

1.5.0

Apr 2, 2024

1.4.1

Apr 1, 2024

1.4.0

Apr 1, 2024

1.3.4

Apr 1, 2024

1.3.3

Mar 31, 2024

1.3.2

Mar 29, 2024

1.3.1

Mar 26, 2024

1.2.0

Mar 6, 2024

1.1.2

Feb 16, 2024

1.1.1

Sep 20, 2023

1.1.0

Jul 31, 2023

1.0.2

Mar 28, 2023

1.0.1

Nov 29, 2022

1.0.0

Oct 17, 2022

0.9.1

Oct 13, 2022

This version

0.0.1

Jun 30, 2022

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

mteb-0.0.1.tar.gz (63.1 kB)

Uploaded Jun 30, 2022 Source

Hashes for mteb-0.0.1.tar.gz
Algorithm	Hash digest
SHA256	`af62d0c58bfcbb7bf46b5775f893bbe75b83836f82b78b284b9f213a84233da3`
MD5	`af17063e58b4ca43e837abc1e6397726`
BLAKE2b-256	`e10fca4a2c0e221f24169c335a99ec34f357ce612df318583f80096d1f3711b8`
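
To check a downloaded archive against the SHA256 digest listed above, something like the following sketch could be used. The helper names are hypothetical, the local file path is whatever you saved the archive as, and the self-check at the end runs on a throwaway file rather than on the real archive.

```python
import hashlib
import os
import tempfile

# SHA256 digest copied from the table above.
EXPECTED_SHA256 = "af62d0c58bfcbb7bf46b5775f893bbe75b83836f82b78b284b9f213a84233da3"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large archives need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected=EXPECTED_SHA256):
    """True if the file at `path` has the expected SHA256 digest."""
    return sha256_of_file(path) == expected.lower()


# Self-check on a throwaway file containing the bytes b"abc".
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"abc")
demo_digest = sha256_of_file(tmp.name)
demo_matches = matches_expected(tmp.name)
os.remove(tmp.name)
print(demo_digest)
```

For the real archive, matches_expected("mteb-0.0.1.tar.gz") would be called with the path of the downloaded file.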