Language Model Fine-tuning Toolkit
LMFT: Language Model Fine-Tuning
Language Model Fine-Tuning for ChatGLM, BELLE, and LLaMA.
lmft implements fine-tuning of the ChatGLM-6B model.
Guide
Features
ChatGLM-6B fine-tuning
- Word2Vec: sentence vectors built from Tencent AI Lab's open-source large-scale, high-quality Chinese word embeddings (8-million-word light version) (file: light_Tencent_AILab_ChineseEmbedding.bin, password: tawe). This project represents a sentence as the average of its word2vec word vectors.
- SBERT (Sentence-BERT): a sentence-embedding model that trades off accuracy against efficiency. Training fits a supervised classification head on top of the encoder; at matching time, prediction is a plain cosine similarity over the sentence vectors. This project reimplements Sentence-BERT training and prediction in PyTorch.
- CoSENT (Cosine Sentence): CoSENT introduces a ranking loss that brings training closer to prediction, so it converges faster and performs better than Sentence-BERT. This project implements CoSENT training and prediction in PyTorch; a sketch of the loss follows this list.
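The CoSENT ranking loss is compact enough to sketch directly. Below is a minimal PyTorch sketch of the published formulation (not necessarily this project's exact code): for every (negative pair, positive pair) combination in a batch, it penalizes the negative pair's cosine similarity exceeding the positive pair's, with the customary scale factor of 20.

```python
import torch


def cosent_loss(labels: torch.Tensor, cos_sim: torch.Tensor) -> torch.Tensor:
    """Minimal CoSENT ranking loss sketch.

    labels:  (batch,) 1 for similar sentence pairs, 0 for dissimilar pairs
    cos_sim: (batch,) cosine similarity of each sentence pair
    """
    scores = cos_sim * 20  # scale factor from the CoSENT formulation
    # pairwise score differences: diff[i, j] = scores[i] - scores[j]
    diff = scores[:, None] - scores[None, :]
    # keep only (i, j) where pair i is negative and pair j is positive,
    # i.e. combinations where we want scores[j] > scores[i]
    mask = (labels[:, None] < labels[None, :]).float()
    diff = diff - (1 - mask) * 1e12  # push invalid combinations toward -inf
    # loss = log(1 + sum(exp(diff))) == logsumexp over [0, diff...]
    diff = torch.cat([torch.zeros(1, dtype=diff.dtype, device=diff.device),
                      diff.reshape(-1)])
    return torch.logsumexp(diff, dim=0)
```

Because this objective directly optimizes the ordering of raw cosine similarities, inference is just encode-plus-cosine, which is why training stays consistent with prediction.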
Evaluation
Text Matching
- Evaluation results on English matching datasets:
Arch | Backbone | Model Name | English-STS-B |
---|---|---|---|
GloVe | glove | Avg_word_embeddings_glove_6B_300d | 61.77 |
BERT | bert-base-uncased | BERT-base-cls | 20.29 |
BERT | bert-base-uncased | BERT-base-first_last_avg | 59.04 |
BERT | bert-base-uncased | BERT-base-first_last_avg-whiten(NLI) | 63.65 |
SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-cls | 73.65 |
SBERT | sentence-transformers/bert-base-nli-mean-tokens | SBERT-base-nli-first_last_avg | 77.96 |
SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 84.42 |
CoSENT | bert-base-uncased | CoSENT-base-first_last_avg | 69.93 |
CoSENT | sentence-transformers/bert-base-nli-mean-tokens | CoSENT-base-nli-first_last_avg | 79.68 |
- Evaluation results on Chinese matching datasets:
Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
---|---|---|---|---|---|---|---|---|---|
CoSENT | hfl/chinese-macbert-base | CoSENT-macbert-base | 50.39 | 72.93 | 79.17 | 60.86 | 80.51 | 68.77 | 3008 |
CoSENT | Langboat/mengzi-bert-base | CoSENT-mengzi-base | 50.52 | 72.27 | 78.69 | 12.89 | 80.15 | 58.90 | 2502 |
CoSENT | bert-base-chinese | CoSENT-bert-base | 49.74 | 72.38 | 78.69 | 60.00 | 80.14 | 68.19 | 2653 |
SBERT | bert-base-chinese | SBERT-bert-base | 46.36 | 70.36 | 78.72 | 46.86 | 66.41 | 61.74 | 3365 |
SBERT | hfl/chinese-macbert-base | SBERT-macbert-base | 47.28 | 68.63 | 79.42 | 55.59 | 64.82 | 63.15 | 2948 |
CoSENT | hfl/chinese-roberta-wwm-ext | CoSENT-roberta-ext | 50.81 | 71.45 | 79.31 | 61.56 | 81.13 | 68.85 | - |
SBERT | hfl/chinese-roberta-wwm-ext | SBERT-roberta-ext | 48.29 | 69.99 | 79.22 | 44.10 | 72.42 | 62.80 | - |
- Chinese matching results for the models released by this project:
Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS |
---|---|---|---|---|---|---|---|---|---|
Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 33.86 | 23769 |
SBERT | xlm-roberta-base | paraphrase-multilingual-MiniLM-L12-v2 | 18.42 | 38.52 | 63.96 | 10.14 | 78.90 | 41.99 | 3138 |
CoSENT | hfl/chinese-macbert-base | shibing624/lmft-base-chinese | 31.93 | 42.67 | 70.16 | 17.21 | 79.30 | 48.25 | 3008 |
Demo
Official Demo: https://www.mulanai.com/product/short_text_sim/
HuggingFace Demo: https://huggingface.co/spaces/shibing624/lmft
Run the example examples/gradio_demo.py to launch the demo locally:
```
python examples/gradio_demo.py
```
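For reference, a similarity demo of this kind can be sketched in a few lines of Gradio. This is an illustration only, not the contents of examples/gradio_demo.py; the ChatGpt class and released model name are taken from the Usage section below.

```python
import gradio as gr
import numpy as np

from lmft import ChatGpt

model = ChatGpt("shibing624/lmft-base-chinese")


def similarity(text1: str, text2: str) -> float:
    # encode both sentences and compare them with cosine similarity
    emb1, emb2 = model.encode([text1, text2])
    return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))


gr.Interface(
    fn=similarity,
    inputs=[gr.Textbox(label="Sentence 1"), gr.Textbox(label="Sentence 2")],
    outputs=gr.Number(label="Cosine similarity"),
).launch()
```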
Install
```
pip install -U lmft
```
or install from source:
```
git clone https://github.com/shibing624/lmft.git
cd lmft
pip install -r requirements.txt
pip install --no-deps .
```
Usage
Computing Embeddings
Example: examples/computing_embeddings_demo.py
```python
import sys

sys.path.append('..')  # allow running from the examples/ directory without installing
from lmft import ChatGpt


def compute_emb(model):
    # Embed a list of sentences
    sentences = [
        '卡',
        '银行卡',
        'The quick brown fox jumps over the lazy dog.'
    ]
    sentence_embeddings = model.encode(sentences)
    print(type(sentence_embeddings), sentence_embeddings.shape)

    # The result is a numpy array of sentence embeddings, one row per sentence
    for sentence, embedding in zip(sentences, sentence_embeddings):
        print("Sentence:", sentence)
        print("Embedding shape:", embedding.shape)
        print("Embedding head:", embedding[:10])
        print()


if __name__ == "__main__":
    model = ChatGpt("shibing624/lmft-base-chinese")
    compute_emb(model)
```
Output:
```
<class 'numpy.ndarray'> (3, 768)
Sentence: 卡
Embedding shape: (768,)
Sentence: 银行卡
Embedding shape: (768,)
...
```
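The embeddings can be compared directly with cosine similarity, e.g. to score how close '卡' and '银行卡' are. A minimal numpy sketch, reusing `sentence_embeddings` from the demo above:

```python
import numpy as np


def cos_sim(a: np.ndarray, b: np.ndarray) -> float:
    # cosine similarity between two 1-D embedding vectors
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# compare the first two sentences from the demo above
print(cos_sim(sentence_embeddings[0], sentence_embeddings[1]))
```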
Dataset
Contact
- Issue (suggestions):
- Email me: xuming624@qq.com
- WeChat me: add my WeChat ID xuming624 with the note "Name-Company-NLP" to join the NLP discussion group.
Citation
If you use lmft in your research, please cite it as follows:
APA:
Xu, M. lmft: Language Model Fine-Tuning toolkit (Version 1.1.2) [Computer software]. https://github.com/shibing624/lmft
BibTeX:
```bibtex
@misc{lmft,
  author = {Xu, Ming},
  title = {lmft: Language Model Fine-Tuning toolkit},
  year = {2023},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/shibing624/lmft}},
}
```
License
Licensed under The Apache License 2.0, free for commercial use. Please include a link to lmft and the license in your product description.
Contribute
The project code is still rough; if you have improvements, you are welcome to submit them back to this project. Before submitting, please note two points:
- add corresponding unit tests in tests
- run `python -m pytest -v` to execute all unit tests and make sure they all pass
Then you can submit a PR.