
Named Entity Recognition Tool, extract entities from text



NERpy

🌈 Implementation of Named Entity Recognition using Python.

nerpy implements several named entity recognition models, including Bert2Tag and Bert2Span, and compares their performance on standard datasets.

Guide

Features

Named entity recognition models

  • CoSENT (Cosine Sentence): CoSENT introduces a ranking-based loss function that brings training closer to the prediction setting; it converges faster and performs better than Sentence-BERT. This project implements CoSENT training and prediction in PyTorch (a loss sketch follows below).
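
The ranking loss above can be written as log(1 + Σ exp(λ·(cos(neg) − cos(pos)))), summed over every pair of examples where one label is lower than the other. Below is a minimal PyTorch sketch of that loss; the function name cosent_loss and the scale value 20 are illustrative assumptions, not the project's exact implementation.

import torch

def cosent_loss(cos_scores: torch.Tensor, labels: torch.Tensor, scale: float = 20.0) -> torch.Tensor:
    # Illustrative sketch of the CoSENT-style ranking loss.
    # cos_scores: cosine similarity per example, shape (batch,)
    # labels: relevance labels, shape (batch,); higher label = more similar
    scores = cos_scores * scale
    # diff[i, j] = scores[i] - scores[j] for all pairs
    diff = scores[:, None] - scores[None, :]
    # keep only pairs where example i is labeled lower than example j,
    # i.e. penalize the lower-labeled pair scoring higher
    diff = diff[labels[:, None] < labels[None, :]]
    # log(1 + sum(exp(diff))): prepend a 0 so logsumexp contributes the "1" term
    zero = torch.zeros(1, dtype=scores.dtype, device=scores.device)
    return torch.logsumexp(torch.cat([zero, diff.flatten()]), dim=0)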

Evaluation

Entity recognition

  • Evaluation results on the English entity recognition dataset:

Arch | Backbone | Model Name | English-STS-B
CoSENT | sentence-transformers/bert-base-nli-mean-tokens | CoSENT-base-nli-first_last_avg | 79.68

  • Evaluation results on the Chinese entity recognition datasets:

Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS
SBERT | hfl/chinese-roberta-wwm-ext | SBERT-roberta-ext | 48.29 | 69.99 | 79.22 | 44.10 | 72.42 | 62.80 | -

  • Chinese matching evaluation results of the models released by this project:

Arch | Backbone | Model Name | ATEC | BQ | LCQMC | PAWSX | STS-B | Avg | QPS
Word2Vec | word2vec | w2v-light-tencent-chinese | 20.00 | 31.49 | 59.46 | 2.57 | 55.78 | 33.86 | 10283

Notes:

  • All reported values are F1 scores.
  • Each result is obtained by training only on that dataset's train split and evaluating on its test split; no external data is used.
  • The CoSENT-macbert-base model, trained with the CoSENT method, achieves SOTA performance among models of comparable parameter count; run examples/training_sup_text_matching_model.py to reproduce the results on each dataset.
  • Every pretrained model can be loaded through transformers, e.g. the MacBERT model: --model_name hfl/chinese-macbert-base
  • Download links for the Chinese matching datasets are given below.
  • Experiments on the Chinese matching tasks show that the best pooling strategy is first_last_avg, i.e. EncoderType.FIRST_LAST_AVG of SentenceModel; its prediction quality differs only marginally from EncoderType.MEAN (see the pooling sketch after these notes).
  • QPS was measured on a Tesla V100 GPU with 32 GB of memory.
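
A minimal sketch of first_last_avg pooling with transformers, under the assumption that "first" means the first transformer layer (hidden_states[1]) and "last" the final layer; whether EncoderType.FIRST_LAST_AVG uses exactly these indices is not confirmed here.

import torch
from transformers import AutoModel, AutoTokenizer

def first_last_avg(hidden_states, attention_mask):
    # Average the first and last transformer layers, then mean-pool over
    # real (non-padding) tokens using the attention mask.
    avg_layers = (hidden_states[1] + hidden_states[-1]) / 2.0
    mask = attention_mask.unsqueeze(-1).float()
    return (avg_layers * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1e-9)

tokenizer = AutoTokenizer.from_pretrained("hfl/chinese-macbert-base")
model = AutoModel.from_pretrained("hfl/chinese-macbert-base")
encoded = tokenizer(["北京大学学生来到水立方"], padding=True, return_tensors="pt")
with torch.no_grad():
    outputs = model(**encoded, output_hidden_states=True)
embeddings = first_last_avg(outputs.hidden_states, encoded["attention_mask"])
print(embeddings.shape)  # (1, hidden_size)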

Demo

Official Demo: http://42.193.145.218/product/short_text_sim/

HuggingFace Demo: https://huggingface.co/spaces/shibing624/nerpy

Install

pip3 install torch # conda install pytorch
pip3 install -U nerpy

or

git clone https://github.com/shibing624/nerpy.git
cd nerpy
python3 setup.py install

Dataset

The Chinese entity recognition dataset has been uploaded to HuggingFace Datasets: https://huggingface.co/datasets/shibing624/nli_zh
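
The data can presumably be loaded with the datasets library; the sketch below assumes the subset names match the table columns above (e.g. STS-B), which is not confirmed by this page.

from datasets import load_dataset

# Assumption: "STS-B" is a valid config name of shibing624/nli_zh
dataset = load_dataset("shibing624/nli_zh", "STS-B")
print(dataset["train"][0])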

Usage

Entity recognition

Run entity recognition with a pretrained model:

>>> from nerpy import Bert2Tag
>>> m = Bert2Tag()
>>> m.ner("University of California is located in California, United States")
{'LOCATION': ['California', 'United States'], 'ORGANIZATION': ['University of California']}

example: examples/ner_demo.py

import sys

sys.path.append('..')
from nerpy import Bert2Tag

def compute_ner(model):
    sentences = [
        '北京大学学生来到水立方观看水上芭蕾表演',
        'University of California is located in California, United States'
    ]
    entities = model.ner(sentences)
    print(entities)


if __name__ == "__main__":
    # Chinese entity recognition model, supports continued fine-tuning
    ner_model = Bert2Tag("shibing624/nerpy-base-chinese")
    compute_ner(ner_model)

    # Multilingual entity recognition model, recommended for English NER tasks, supports continued fine-tuning
    sbert_model = Bert2Tag("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
    compute_ner(sbert_model)

output:

{'LOCATION': ['水立方'], 'ORGANIZATION': ['北京大学']}
{'LOCATION': ['California', 'United States'], 'ORGANIZATION': ['University of California']}
  • The shibing624/nerpy-base-chinese model was trained with the CoSENT method on the Chinese STS-B dataset. It has been uploaded to the HuggingFace model hub (shibing624/nerpy-base-chinese) and is the default model of nerpy.SentenceModel. It can be called through the example above, or through the transformers library as shown below; the model is downloaded automatically to the local path ~/.cache/huggingface/transformers

Usage (HuggingFace Transformers)

Without nerpy, you can use the model like this:

First, pass your input through the transformer model, then apply the appropriate pooling operation on top of the contextualized word embeddings.

example: examples/use_origin_transformers_demo.py

import os
import torch
from transformers import AutoTokenizer, AutoModel

os.environ["KMP_DUPLICATE_LIB_OK"] = "TRUE"


# Mean Pooling - Take attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # First element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    return torch.sum(token_embeddings * input_mask_expanded, 1) / torch.clamp(input_mask_expanded.sum(1), min=1e-9)


# Load model from HuggingFace Hub
tokenizer = AutoTokenizer.from_pretrained('shibing624/nerpy-base-chinese')
model = AutoModel.from_pretrained('shibing624/nerpy-base-chinese')
sentences = ['北京大学学生来到水立方观看水上芭蕾表演']
# Tokenize sentences
encoded_input = tokenizer(sentences, padding=True, truncation=True, return_tensors='pt')

# Compute token embeddings
with torch.no_grad():
    model_output = model(**encoded_input)
print("Sentence Entities:")
print(model_output)
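
Note that mean_pooling divides by the sum of the attention mask rather than the raw sequence length, so padding tokens do not dilute the sentence embedding; the clamp guards against division by zero when a mask is all zeros.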

Bert2Tag model

Sentence-BERT text matching model, a representation-based sentence embedding approach

Network structure: training and inference diagrams (images omitted).

Bert2Tag supervised model

  • Train and evaluate a MacBERT+Bert2Tag model on the Chinese STS-B dataset

example: examples/training_sup_text_matching_model.py

cd examples
python3 training_sup_text_matching_model.py --model_arch sentencebert --do_train --do_predict --num_epochs 10 --model_name hfl/chinese-macbert-base --output_dir ./outputs/STS-B-sbert
  • Train and evaluate a BERT+SBERT model on the English STS-B dataset

example: examples/training_sup_text_matching_model_en.py

cd examples
python3 training_sup_text_matching_model_en.py --model_arch sentencebert --do_train --do_predict --num_epochs 10 --model_name bert-base-uncased --output_dir ./outputs/STS-B-en-sbert
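
If Bert2Tag accepts a local checkpoint directory the same way it accepts a hub model name (an assumption, not confirmed by this page), the trained model could then be loaded for prediction like this:

from nerpy import Bert2Tag

# Hypothetical: load the checkpoint written to --output_dir above
model = Bert2Tag("./outputs/STS-B-en-sbert")
print(model.ner("University of California is located in California, United States"))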

Contact

  • Issues (suggestions): GitHub issues
  • Email me: xuming: xuming624@qq.com
  • WeChat me: add my WeChat ID xuming624 with the note "Name-Company-NLP" to join the NLP discussion group.

Citation

If you use nerpy in your research, please cite it as follows:

APA:

Xu, M. nerpy: Named Entity Recognition Toolkit (Version 0.0.2) [Computer software]. https://github.com/shibing624/nerpy

BibTeX:

@software{Xu_nerpy_Text_to,
  author = {Xu, Ming},
  title = {{nerpy: Named Entity Recognition Toolkit}},
  url = {https://github.com/shibing624/nerpy},
  version = {0.0.2}
}

License

This project is licensed under The Apache License 2.0 and is free for commercial use. Please include a link to nerpy and the license in your product description.

Contribute

The project code is still rough; if you have improvements, you are welcome to submit them back to this project. Before submitting, please note the following two points:

  • Add corresponding unit tests in tests (a minimal example follows below)
  • Run python setup.py test to execute all unit tests and make sure they all pass

Then you can submit a PR.
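
As a starting point, a unit test under tests might look like the following sketch; the test class name and the choice of the multilingual model are illustrative assumptions.

import sys
import unittest

sys.path.append('..')
from nerpy import Bert2Tag

class NerTestCase(unittest.TestCase):
    def test_ner_returns_entity_dict(self):
        # Downloads the model on first run; assumes the output format shown in Usage
        model = Bert2Tag("sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2")
        entities = model.ner("University of California is located in California, United States")
        self.assertIn("ORGANIZATION", entities)

if __name__ == "__main__":
    unittest.main()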
