
Language Model Fine-tuning Toolkit


LMFT: Language Model Fine-Tuning

Language Model Fine-Tuning: a toolkit for fine-tuning ChatGLM, BELLE, and LLaMA.

lmft implements fine-tuning of the ChatGLM-6B model.

Guide

Feature

ChatGLM-6B fine-tuning

  • Word2Vec: word-vector lookup over Tencent AI Lab's large-scale, high-quality Chinese word embeddings (8M-word light version; file: light_Tencent_AILab_ChineseEmbedding.bin, password: tawe). This project represents a sentence as the average of its word vectors.
  • SBERT (Sentence-BERT): a sentence-embedding model that trades off accuracy against efficiency; a classification head is trained on top with supervision, and at prediction time matching is a plain cosine between sentence vectors. This project re-implements Sentence-BERT training and prediction in PyTorch.
  • CoSENT (Cosine Sentence): CoSENT introduces a ranking loss that brings training closer to prediction, so it converges faster and scores better than Sentence-BERT. This project implements CoSENT training and prediction in PyTorch (a sketch of the loss follows this list).
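
The CoSENT ranking loss mentioned above can be stated compactly: for every two sentence pairs where one is labeled more similar than the other, the less-similar pair's cosine should rank below the more-similar pair's. A minimal PyTorch sketch under that reading (the function name and the scale=20 default are illustrative, not pinned down by this project):

import torch

def cosent_loss(cos_sim, labels, scale=20.0):
    # cos_sim: (batch,) cosine similarity of each sentence pair
    # labels: (batch,) gold similarity labels (higher = more similar)
    cos_sim = cos_sim * scale
    # diff[i, j] = cos_sim[i] - cos_sim[j]
    diff = cos_sim[:, None] - cos_sim[None, :]
    # keep pairs (i, j) where pair i is labeled less similar than pair j,
    # so cos_sim[i] should rank below cos_sim[j]
    diff = diff[labels[:, None] < labels[None, :]]
    # log(1 + sum(exp(diff))): prepend a 0 term so the sum inside log starts at 1
    zero = torch.zeros(1, dtype=diff.dtype, device=diff.device)
    return torch.logsumexp(torch.cat([zero, diff]), dim=0)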

Evaluation

Text Generation

  • Evaluation results on English matching datasets:

| Arch | Backbone | Model Name | English-STS-B |
|------|----------|------------|---------------|
| GloVe | glove | Avg_word_embeddings_glove_6B_300d | 61.77 |
| BERT | bert-base-uncased | BERT-base-cls | 20.29 |
| BERT | bert-base-uncased | BERT-base-first_last_avg | 59.04 |
| BERT | bert-base-uncased | BERT-base-first_last_avg-whiten(NLI) | 63.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-cls | 73.65 |
| SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-first_last_avg | 77.96 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 84.42 |
| CoSENT | bert-base-uncased | CoSENT-base-first_last_avg | 69.93 |
| CoSENT | sentence-transformers/bert-base-nli-mean-tokens | CoSENT-base-nli-first_last_avg | 79.68 |

  • Evaluation results on Chinese matching datasets:

| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
|------|----------|------------|------|----|-------|-------|-------|-----|-----|
| CoSENT | hfl/chinese-macbert-base | CoSENT-macbert-base | 50.39 | 72.93 | 79.17 | 60.86 | 80.51 | 68.77 | 3008 |
| CoSENT | Langboat/mengzi-bert-base | CoSENT-mengzi-base | 50.52 | 72.27 | 78.69 | 12.89 | 80.15 | 58.90 | 2502 |
| CoSENT | bert-base-chinese | CoSENT-bert-base | 49.74 | 72.38 | 78.69 | 60.00 | 80.14 | 68.19 | 2653 |
| SBERT | bert-base-chinese | SBERT-bert-base | 46.36 | 70.36 | 78.72 | 46.86 | 66.41 | 61.74 | 3365 |
| SBERT | hfl/chinese-macbert-base | SBERT-macbert-base | 47.28 | 68.63 | 79.42 | 55.59 | 64.82 | 63.15 | 2948 |
| CoSENT | hfl/chinese-roberta-wwm-ext | CoSENT-roberta-ext | 50.81 | 71.45 | 79.31 | 61.56 | 81.13 | 68.85 | - |
| SBERT | hfl/chinese-roberta-wwm-ext | SBERT-roberta-ext | 48.29 | 69.99 | 79.22 | 44.10 | 72.42 | 62.80 | - |

  • Chinese matching evaluation results for the models released by this project:

| Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
|------|----------|------------|------|----|-------|-------|-------|-----|-----|
| Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 33.86 | 23769 |
| SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 41.99 | 3138 |
| CoSENT | hfl/chinese-macbert-base | shibing624/lmft-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | 48.25 | 3008 |

Demo

Official Demo: https://www.mulanai.com/product/short_text_sim/

HuggingFace Demo: https://huggingface.co/spaces/shibing624/lmft

Run the example examples/gradio_demo.py to see the demo:

python examples/gradio_demo.py
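
A minimal sketch of what such a demo could look like, assuming the ChatGpt class and encode() API shown in the Usage section below (the gradio wiring here is illustrative, not the actual script):

import gradio as gr
from lmft import ChatGpt

model = ChatGpt("shibing624/lmft-base-chinese")

def embed_head(text):
    # encode() returns one embedding per input sentence
    emb = model.encode([text])[0]
    return str(emb[:10])  # show the first 10 dimensions

gr.Interface(fn=embed_head, inputs="text", outputs="text").launch()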

Install

pip install -U lmft

or

git clone https://github.com/shibing624/lmft.git
cd lmft
pip install -r requirements.txt
pip install --no-deps .
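
Either way, a quick import check verifies the install (this just prints where the package landed):

python -c "import lmft; print(lmft.__file__)"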

Usage

Text Generation

Example: examples/computing_embeddings_demo.py

import sys

sys.path.append('..')
from lmft import ChatGpt


def compute_emb(model):
    # Embed a list of sentences
    sentences = [
        '卡',
        '银行卡',
        'The quick brown fox jumps over the lazy dog.'
    ]
    sentence_embeddings = model.encode(sentences)
    print(type(sentence_embeddings), sentence_embeddings.shape)

    # The result is a list of sentence embeddings as numpy arrays
    for sentence, embedding in zip(sentences, sentence_embeddings):
        print("Sentence:", sentence)
        print("Embedding shape:", embedding.shape)
        print("Embedding head:", embedding[:10])
        print()


if __name__ == "__main__":
    t2v_model = ChatGpt("shibing624/lmft-base-chinese")
    compute_emb(t2v_model)

output:

<class 'numpy.ndarray'> (3, 768)
Sentence: 卡
Embedding shape: (768,)

Sentence: 银行卡
Embedding shape: (768,)
 ... 
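
With the embeddings in hand, sentence matching reduces to a cosine over the vectors. Continuing from sentence_embeddings in the example above (the cosine_sim helper is illustrative, not part of the lmft API):

import numpy as np

def cosine_sim(a, b):
    # cosine similarity between two 1-D embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# '卡' vs '银行卡' should score higher than '卡' vs the English sentence
print(cosine_sim(sentence_embeddings[0], sentence_embeddings[1]))
print(cosine_sim(sentence_embeddings[0], sentence_embeddings[2]))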

Dataset

  1. 0.5M generated Chinese ChatGPT output records
  2. 50K-example English Stanford Alpaca dataset
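
For the Alpaca portion, one way to pull the data is from the Hugging Face hub; the tatsu-lab/alpaca dataset id below is an assumption about a public mirror of the Stanford release, not something this project ships:

from datasets import load_dataset

# Stanford Alpaca: ~52K English instruction-following examples
alpaca = load_dataset("tatsu-lab/alpaca", split="train")
print(alpaca[0]["instruction"])
print(alpaca[0]["output"])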

Contact

  • Issues (suggestions): GitHub issues
  • Email me: xuming: xuming624@qq.com
  • WeChat me: add my WeChat ID xuming624 with the note "name-company-NLP" to join the NLP discussion group.

Citation

If you use lmft in your research, please cite it in the following format:

APA:

Xu, M. lmft: Language Model Fine-Tuning toolkit (Version 1.1.2) [Computer software]. https://github.com/shibing624/lmft

BibTeX:

@misc{lmft,
  author = {Xu, Ming},
  title = {lmft: Language Model Fine-Tuning toolkit},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/shibing624/lmft}},
}

License

Licensed under The Apache License 2.0, free for commercial use. Please include a link to lmft and the license in your product description.

Contribute

The project code is still rough. If you can improve it, contributions back to this project are welcome. Before submitting, please note two points:

  • Add corresponding unit tests under tests
  • Run python -m pytest -v to execute all unit tests and make sure they pass

Then submit your PR.
