Massive Text Embedding Benchmark

Massive Text Embedding Benchmark - Internal Development Git

Installation

pip install git+https://github.com/embeddings-benchmark/mteb.git

Minimal use

  • Using a Python script:
from mteb import MTEB
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model)
  • Using the CLI:
mteb --available_tasks

mteb -m average_word_embeddings_komninos \
    -t Banking77Classification NFCorpus \
    --output_folder results \
    --verbosity 3

Advanced usage

Tasks selection

Tasks can be selected by providing the list of tasks to run, or:

  • by their types (e.g. "Clustering" or "Classification")
evaluation = MTEB(task_types=['Clustering', 'Retrieval']) # Only select clustering and retrieval tasks
  • by their categories (e.g. "S2S" for sentence to sentence or "P2P" for paragraph to paragraph)
evaluation = MTEB(task_categories=['S2S']) # Only select sentence2sentence tasks
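
The snippet below is a hedged sketch of how the two filters could behave, not MTEB's actual implementation. TASKS holds a few rows copied from the "Available tasks" table, and select_tasks is a hypothetical helper. It assumes that the filters combine with a logical AND and that matching is case-insensitive (the example above passes "S2S" while the table lists "s2s").

```python
from typing import Dict, List, Optional

# A few rows copied from the "Available tasks" table.
TASKS: List[Dict[str, str]] = [
    {"name": "Banking77Classification", "type": "Classification", "category": "s2s"},
    {"name": "ImdbClassification", "type": "Classification", "category": "p2p"},
    {"name": "ArxivClusteringP2P", "type": "Clustering", "category": "p2p"},
    {"name": "ArxivClusteringS2S", "type": "Clustering", "category": "s2s"},
    {"name": "NFCorpus", "type": "Retrieval", "category": "s2s"},
]


def select_tasks(
    tasks: List[Dict[str, str]],
    task_types: Optional[List[str]] = None,
    task_categories: Optional[List[str]] = None,
) -> List[str]:
    """Hypothetical helper: return the names of tasks matching every given filter."""
    # Assumption: matching is case-insensitive and a filter of None matches everything.
    wanted_types = None if task_types is None else {t.lower() for t in task_types}
    wanted_cats = None if task_categories is None else {c.lower() for c in task_categories}

    selected = []
    for task in tasks:
        if wanted_types is not None and task["type"].lower() not in wanted_types:
            continue
        if wanted_cats is not None and task["category"].lower() not in wanted_cats:
            continue
        selected.append(task["name"])
    return selected


by_type = select_tasks(TASKS, task_types=["Clustering", "Retrieval"])
by_category = select_tasks(TASKS, task_categories=["S2S"])
both = select_tasks(TASKS, task_types=["Clustering"], task_categories=["S2S"])
```

Under these assumptions, by_type keeps the two Arxiv clustering tasks and NFCorpus, and both keeps only ArxivClusteringS2S.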

You can also specify which languages to load for multilingual/crosslingual tasks like this:

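from mteb import MTEB
from mteb.tasks.BitextMining import BUCCBitextMining
# Assumed module path, by analogy with the BUCC import above
from mteb.tasks.Classification import AmazonReviewsClassification

evaluation = MTEB(tasks=[
        BUCCBitextMining(langs=["de-en"]), # Only load the "de-en" subset of BUCC
        AmazonReviewsClassification(langs=["en", "fr"]) # Only load the "en" and "fr" subsets of Amazon Reviews
])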

Using a custom model

A custom model must implement the following interface: an encode function that takes a list of sentences as input and returns a list of embeddings (each embedding can be an np.array, a torch.tensor, etc.).

class MyModel():
    def encode(self, sentences, batch_size=32):
        """ Returns a list of embeddings for the given sentences.
        Args:
            sentences (`List[str]`): List of sentences to encode
            batch_size (`int`): Batch size for the encoding

        Returns:
            `List[np.ndarray]` or `List[tensor]`: List of embeddings for the given sentences
        """
        pass

model = MyModel()
evaluation = MTEB(tasks=["Banking77Classification"])
evaluation.run(model)
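
As a concrete illustration of the interface, here is a minimal sketch of a toy encoder that relies only on NumPy. HashingBagOfWordsModel, its dim parameter, and the extra **kwargs are assumptions made for this example, not part of MTEB. The model hashes each token into a fixed-size count vector, so it is useful only for checking that the plumbing works.

```python
import hashlib
from typing import List

import numpy as np


class HashingBagOfWordsModel:
    """Toy encoder (hypothetical): hashes each token into a fixed-size count vector."""

    def __init__(self, dim: int = 64):
        self.dim = dim

    def _embed(self, sentence: str) -> np.ndarray:
        vec = np.zeros(self.dim, dtype=np.float32)
        for token in sentence.lower().split():
            # md5 gives a bucket index that is stable across Python processes
            digest = hashlib.md5(token.encode("utf-8")).hexdigest()
            vec[int(digest, 16) % self.dim] += 1.0
        norm = np.linalg.norm(vec)
        # L2-normalize unless the sentence was empty
        return vec / norm if norm > 0 else vec

    def encode(self, sentences: List[str], batch_size: int = 32, **kwargs) -> List[np.ndarray]:
        """Return one embedding per sentence, processed in chunks of batch_size."""
        # **kwargs is an assumption: it silently ignores any extra arguments a caller passes.
        embeddings: List[np.ndarray] = []
        for start in range(0, len(sentences), batch_size):
            batch = sentences[start:start + batch_size]
            embeddings.extend(self._embed(s) for s in batch)
        return embeddings


model = HashingBagOfWordsModel(dim=64)
vectors = model.encode(["how do I reset my card PIN", "card PIN reset"], batch_size=1)
```

An instance of this class could then be passed to evaluation.run(model) in place of MyModel().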

Evaluating on a custom task

To add a new task, implement a new class that inherits from the AbsTask class associated with the task type (e.g. AbsTaskReranking for reranking tasks). You can find the supported task types here.

from mteb import MTEB
from mteb.abstasks.AbsTaskReranking import AbsTaskReranking
from sentence_transformers import SentenceTransformer


class MindSmallReranking(AbsTaskReranking):
    @property
    def description(self):
        return {
            "name": "MindSmallReranking",
            "hf_hub_name": "mteb/mind_small",
            "description": "Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research",
            "reference": "https://www.microsoft.com/en-us/research/uploads/prod/2019/03/nl4se18LinkSO.pdf",
            "type": "Reranking",
            "category": "s2s",
            "eval_splits": ["validation"],
            "eval_langs": ["en"],
            "main_score": "map",
        }

model = SentenceTransformer("average_word_embeddings_komninos")
evaluation = MTEB(tasks=[MindSmallReranking()])
evaluation.run(model)

Note: for multilingual tasks, make sure your class also inherits from the MultilingualTask class like in this example.

Available tasks

Name | Hub URL | Description | Type | Category | N° Languages
BUCC | mteb/bucc-bitext-mining | BUCC bitext mining dataset | BitextMining | s2s | 4
Tatoeba | mteb/tatoeba-bitext-mining | 1,000 English-aligned sentence pairs for each language based on the Tatoeba corpus | BitextMining | s2s | 112
AmazonCounterfactualClassification | mteb/amazon_counterfactual | A collection of Amazon customer reviews annotated for counterfactual detection pair classification. | Classification | s2s | 4
AmazonPolarityClassification | mteb/amazon_polarity | Amazon Polarity Classification Dataset. | Classification | s2s | 1
AmazonReviewsClassification | mteb/amazon_reviews_multi | A collection of Amazon reviews specifically designed to aid research in multilingual text classification. | Classification | s2s | 6
Banking77Classification | mteb/banking77 | Dataset composed of online banking queries annotated with their corresponding intents. | Classification | s2s | 1
EmotionClassification | mteb/emotion | Emotion is a dataset of English Twitter messages with six basic emotions: anger, fear, joy, love, sadness, and surprise. For more detailed information please refer to the paper. | Classification | s2s | 1
ImdbClassification | mteb/imdb | Large Movie Review Dataset | Classification | p2p | 1
MassiveIntentClassification | mteb/amazon_massive_intent | MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages | Classification | s2s | 51
MassiveScenarioClassification | mteb/amazon_massive_scenario | MASSIVE: A 1M-Example Multilingual Natural Language Understanding Dataset with 51 Typologically-Diverse Languages | Classification | s2s | 51
MTOPDomainClassification | mteb/mtop_domain | MTOP: Multilingual Task-Oriented Semantic Parsing | Classification | s2s | 6
MTOPIntentClassification | mteb/mtop_intent | MTOP: Multilingual Task-Oriented Semantic Parsing | Classification | s2s | 6
ToxicConversationsClassification | mteb/toxic_conversations_50k | Collection of comments from the Civil Comments platform together with annotations if the comment is toxic or not. | Classification | s2s | 1
TweetSentimentExtractionClassification | mteb/tweet_sentiment_extraction | | Classification | s2s | 1
ArxivClusteringP2P | mteb/arxiv-clustering-p2p | Clustering of titles+abstract from arxiv. Clustering of 30 sets, either on the main or secondary category | Clustering | p2p | 1
ArxivClusteringS2S | mteb/arxiv-clustering-s2s | Clustering of titles from arxiv. Clustering of 30 sets, either on the main or secondary category | Clustering | s2s | 1
BiorxivClusteringP2P | mteb/biorxiv-clustering-p2p | Clustering of titles+abstract from biorxiv. Clustering of 10 sets, based on the main category. | Clustering | p2p | 1
BiorxivClusteringS2S | mteb/biorxiv-clustering-s2s | Clustering of titles from biorxiv. Clustering of 10 sets, based on the main category. | Clustering | s2s | 1
MedrxivClusteringP2P | mteb/medrxiv-clustering-p2p | Clustering of titles+abstract from medrxiv. Clustering of 10 sets, based on the main category. | Clustering | p2p | 1
MedrxivClusteringS2S | mteb/medrxiv-clustering-s2s | Clustering of titles from medrxiv. Clustering of 10 sets, based on the main category. | Clustering | s2s | 1
RedditClustering | mteb/reddit-clustering | Clustering of titles from 199 subreddits. Clustering of 25 sets, each with 10-50 classes, and each class with 100-1000 sentences. | Clustering | s2s | 1
RedditClusteringP2P | mteb/reddit-clustering-p2p | Clustering of title+posts from reddit. Clustering of 10 sets of 50k paragraphs and 40 sets of 10k paragraphs. | Clustering | p2p | 1
StackExchangeClustering | mteb/stackexchange-clustering | Clustering of titles from 121 stackexchanges. Clustering of 25 sets, each with 10-50 classes, and each class with 100-1000 sentences. | Clustering | s2s | 1
StackExchangeClusteringP2P | mteb/stackexchange-clustering-p2p | Clustering of title+body from stackexchange. Clustering of 5 sets of 10k paragraphs and 5 sets of 5k paragraphs. | Clustering | p2p | 1
TwentyNewsgroupsClustering | mteb/twentynewsgroups-clustering | Clustering of the 20 Newsgroups dataset (subject only). | Clustering | s2s | 1
SprintDuplicateQuestions | mteb/sprintduplicatequestions-pairclassification | Duplicate questions from the Sprint community. | PairClassification | s2s | 1
TwitterSemEval2015 | mteb/twittersemeval2015-pairclassification | Paraphrase-Pairs of Tweets from the SemEval 2015 workshop. | PairClassification | s2s | 1
TwitterURLCorpus | mteb/twitterurlcorpus-pairclassification | Paraphrase-Pairs of Tweets. | PairClassification | s2s | 1
AskUbuntuDupQuestions | mteb/askubuntudupquestions-reranking | AskUbuntu Question Dataset - Questions from AskUbuntu with manual annotations marking pairs of questions as similar or non-similar | Reranking | s2s | 1
MindSmallReranking | mteb/mind_small | Microsoft News Dataset: A Large-Scale English Dataset for News Recommendation Research | Reranking | s2s | 1
SciDocs | mteb/scidocs-reranking | Ranking of related scientific papers based on their title. | Reranking | s2s | 1
StackOverflowDupQuestions | mteb/stackoverflowdupquestions-reranking | Stack Overflow Duplicate Questions Task for questions with the tags Java, JavaScript and Python | Reranking | s2s | 1
ArguAna | nan | NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval | Retrieval | s2s | 1
ClimateFEVER | nan | CLIMATE-FEVER is a dataset adopting the FEVER methodology that consists of 1,535 real-world claims regarding climate-change. | Retrieval | s2s | 1
CQADupstackRetrieval | nan | CQADupStack: A Benchmark Data Set for Community Question-Answering Research | Retrieval | s2s | 1
DBPedia | nan | DBpedia-Entity is a standard test collection for entity search over the DBpedia knowledge base | Retrieval | s2s | 1
FEVER | nan | FEVER (Fact Extraction and VERification) consists of 185,445 claims generated by altering sentences extracted from Wikipedia and subsequently verified without knowledge of the sentence they were derived from. | Retrieval | s2s | 1
FiQA2018 | nan | Financial Opinion Mining and Question Answering | Retrieval | s2s | 1
HotpotQA | nan | HotpotQA is a question answering dataset featuring natural, multi-hop questions, with strong supervision for supporting facts to enable more explainable question answering systems. | Retrieval | s2s | 1
MSMARCO | nan | MS MARCO is a collection of datasets focused on deep learning in search | Retrieval | s2s | 1
MSMARCOv2 | nan | nan | Retrieval | s2s | 1
NFCorpus | nan | NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval | Retrieval | s2s | 1
NQ | nan | NFCorpus: A Full-Text Learning to Rank Dataset for Medical Information Retrieval | Retrieval | s2s | 1
QuoraRetrieval | nan | QuoraRetrieval is based on questions that are marked as duplicates on the Quora platform. Given a question, find other (duplicate) questions. | Retrieval | s2s | 1
SCIDOCS | nan | SciDocs, a new evaluation benchmark consisting of seven document-level tasks ranging from citation prediction, to document classification and recommendation. | Retrieval | s2s | 1
SciFact | nan | nan | Retrieval | s2s | 1
Touche2020 | nan | Touché Task 1: Argument Retrieval for Controversial Questions | Retrieval | s2s | 1
TRECCOVID | nan | nan | Retrieval | s2s | 1
BIOSSES | mteb/biosses-sts | Biomedical Semantic Similarity Estimation. | STS | s2s | 1
SICK-R | mteb/biosses-sts | Semantic Textual Similarity SICK-R dataset as described here: | STS | s2s | 1
STS12 | mteb/sts12-sts | SemEval STS 2012 dataset. | STS | s2s | 1
STS13 | mteb/sts13-sts | SemEval STS 2013 dataset. | STS | s2s | 1
STS14 | mteb/sts14-sts | SemEval STS 2014 dataset. Currently only the English dataset | STS | s2s | 1
STS15 | mteb/sts15-sts | SemEval STS 2015 dataset | STS | s2s | 1
STS16 | mteb/sts16-sts | SemEval STS 2016 dataset | STS | s2s | 1
STS17 | mteb/sts17-crosslingual-sts | STS 2017 dataset | STS | s2s | 11
STS22 | mteb/sts22-crosslingual-sts | SemEval 2022 Task 8: Multilingual News Article Similarity | STS | s2s | 18
STSBenchmark | mteb/stsbenchmark-sts | Semantic Textual Similarity Benchmark (STSbenchmark) dataset. | STS | s2s | 1
SummEval | mteb/summeval | Biomedical Semantic Similarity Estimation. | Summarization | s2s | 1
