excelcy

Excel integration with spaCy. Includes entity training and an entity matcher pipe.

Development Status
- 4 - Beta
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Natural Language
- English
Operating System
- OS Independent
Project description

ExcelCy is a spaCy toolkit that makes preparing training data easier. Annotations are written in an Excel file, and helper pipes can pre-annotate entities with phrase and regex matchers.

ExcelCy is Powerful

ExcelCy focuses on getting training data into a spaCy model. The illustration below is based on the "Simple Style Training" section of the spaCy documentation.

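import random
import spacy

# each item pairs a sentence with (start, end, label) character offsets
TRAIN_DATA = [
     ("Uber blew through $1 million a week", {'entities': [(0, 4, 'ORG')]}),
     ("Google rebrands its business apps", {'entities': [(0, 6, "ORG")]})]

nlp = spacy.blank('en')
# a blank model has no "ner" pipe yet, so add one with the label (spaCy v2 API)
ner = nlp.create_pipe('ner')
nlp.add_pipe(ner)
ner.add_label('ORG')
optimizer = nlp.begin_training()
for i in range(20):
    random.shuffle(TRAIN_DATA)
    for text, annotations in TRAIN_DATA:
        nlp.update([text], [annotations], sgd=optimizer)
nlp.to_disk('/model')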

TRAIN_DATA is a list of text sentences, each with the annotated entities to train on. Counting the start and end characters by hand for every entity is cumbersome. With ExcelCy, the (start, end) positions can be omitted.

# this is for illustration only; for now, please use Excel as described below.
self.data_train = {
    '1.0': {
        'text': 'Uber blew through $1 million a week',
        'rows': [{'subtext': 'Uber', 'entity': 'ORG'}]
    }
}
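One way to picture how the offsets can be left out is a plain substring lookup. The sketch below is only an assumption about the idea, not ExcelCy's actual implementation; the helper name to_spacy_entities is hypothetical, and it only finds the first occurrence of each subtext.

```python
def to_spacy_entities(text, rows):
    """Turn subtext/entity rows into spaCy-style (start, end, label) tuples."""
    entities = []
    for row in rows:
        subtext = row['subtext']
        # str.find is case-sensitive and returns the first occurrence only
        start = text.find(subtext)
        if start == -1:
            raise ValueError('subtext %r not found in text' % subtext)
        entities.append((start, start + len(subtext), row['entity']))
    return entities


text = 'Uber blew through $1 million a week'
rows = [
    {'subtext': 'Uber', 'entity': 'ORG'},
    {'subtext': '$1 million', 'entity': 'MONEY'},  # spans several tokens
]
entities = to_spacy_entities(text, rows)
print(entities)  # [(0, 4, 'ORG'), (18, 28, 'MONEY')]
```

The resulting tuples have the same shape as the 'entities' list in TRAIN_DATA above.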

Annotating by hand is also a lot of work when many sentences contain the same subtext, such as Uber. With ExcelCy, the Entity annotation can instead be added automatically by a pipe-matcher, using either an exact match or a regex.

# this is for illustration only; for now, please use Excel as described below.
self.data_train = {
    '1.0': {
        'text': 'Uber blew through $1 million a week',
        'rows': [{'subtext': 'Uber', 'entity': 'ORG'}]
    }
}

Features

Add training data from Excel.
Add custom Entity labels.
Annotate an Entity in a given sentence without (start, end) character positions.
Rule-based phrase matching using PhraseMatcher.
Rule-based matching using regex + Matcher.
Add Entity training data using the pipe-matcher described above.

Install

Either install with pip, or clone this repository and run the setup.py file.

$ pip install excelcy

# ensure the language model is installed before training
$ spacy download en

Train

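To train the spaCy model, first prepare the language model and the example data:

# ensure the language model is installed
$ spacy download en

# download the example data (the "raw" URL fetches the file itself, not the GitHub page)
$ wget https://github.com/kororo/excelcy/raw/master/excelcy/tests/data/test_data_28.xlsx

Then run the training from Python:

from excelcy import ExcelCy

excelcy = ExcelCy()
excelcy.train(data_path='test_data_28.xlsx')

Note: in the repository, the example file is tests/data/test_data_28.xlsx.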

Test the training manually:

import os
import spacy
import tempfile
from excelcy import ExcelCy

# load the base model "en_core_web_sm"; it is saved to "test_data_01" below
base = 'en_core_web_sm'
nlp = spacy.load(base)

# save and reload to verify

# create dir nlp
name = os.path.join(tempfile.gettempdir(), 'nlp/test_data_01')
os.makedirs(name, exist_ok=True)
# save it
nlp.to_disk(name)
nlp = spacy.load(name)

# test the NER
text = 'Uber blew through $1 million a week'
doc = nlp(text)
ents = {(ent.text, ent.label_) for ent in doc.ents}

# the base model saved in test_data_01 does not yet identify "Uber" as ORG
assert ents == {('$1 million', 'MONEY')}

# let's train
excelcy = ExcelCy()
# copy the Excel file from https://github.com/kororo/excelcy/tree/master/excelcy/tests/data/test_data_01.xlsx
# ensure "name" is "nlp/test_data_01" in the config sheet.
# ensure the model directory "nlp/test_data_01" has been created and exists.
excelcy.train(data_path='tests/data/test_data_01.xlsx')

# reload the data model
nlp = spacy.load(name)

# test the NER
doc = nlp(text)
ents = {(ent.text, ent.label_) for ent in doc.ents}

# after training, the model in test_data_01 identifies "Uber" as ORG
assert ents == {('Uber', 'ORG'), ('$1 million', 'MONEY')}

Data

Currently, ExcelCy only supports the Excel format. The DataTrainer needs three pieces of information, each in its own sheet:

Sheet: config

Extra configuration for the training.

base: The initial spaCy model to start from (see the spaCy models documentation).
name: The absolute or relative path where the spaCy model is saved after training. Pointing it at an existing model reads that model and trains on top of it. The path is always relative to the file.
train.iteration: The number of training iterations (see the spaCy training documentation).
train.drop: The dropout rate used during training (see the spaCy training documentation).
train.matcher: Enables adding entity annotations from the pipe-matcher sheet, described below.
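As a rough illustration of how these keys could be read, the sketch below converts key/value rows into typed settings. The parse_config helper, the default values, and the row layout are assumptions made for this example, not ExcelCy's API; only the key names come from the list above.

```python
def parse_config(rows):
    """Convert (key, value) rows from a config sheet into typed settings."""
    raw = dict(rows)
    return {
        'base': raw.get('base'),
        'name': raw.get('name'),
        # the defaults below are hypothetical, chosen only for this sketch
        'iteration': int(raw.get('train.iteration', 20)),
        'drop': float(raw.get('train.drop', 0.2)),
        'matcher': str(raw.get('train.matcher', 'false')).lower() == 'true',
    }


config_rows = [
    ('base', 'en_core_web_sm'),
    ('name', 'nlp/test_data_01'),
    ('train.iteration', '20'),
    ('train.drop', '0.2'),
    ('train.matcher', 'true'),
]
config = parse_config(config_rows)
print(config['iteration'], config['drop'], config['matcher'])
```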

Sheet: train

The list of text sentences to train on, each with the subtexts that annotate its entities. Any Entity label that does not yet exist in the nlp model is added automatically through the “ner” pipe.

id: Follows the format “TEXT_ID.SUBTEXT_ID”.
text: The text sentence to train on.
subtext: The portion of the text to annotate as an Entity.
entity: The Entity label, which can be an existing label or a new one.

Notes:

Matching “subtext” against “text” is case-sensitive.
“subtext” is not affected by tokenisation, so a single Entity label can span multiple tokens.
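To make the “TEXT_ID.SUBTEXT_ID” format concrete, the sketch below groups rows by their TEXT_ID into a structure like the data_train illustration earlier. The row layout (one row holding the text, further rows holding subtexts) and the group_train_rows helper are assumptions for illustration only; the real sheet layout may differ.

```python
def group_train_rows(rows):
    """Group sheet rows by the TEXT_ID part of their 'TEXT_ID.SUBTEXT_ID' id."""
    data_train = {}
    for row in rows:
        text_id = row['id'].split('.')[0]
        entry = data_train.setdefault(text_id, {'text': None, 'rows': []})
        if row.get('text'):
            entry['text'] = row['text']
        if row.get('subtext'):
            entry['rows'].append(
                {'subtext': row['subtext'], 'entity': row['entity']})
    return data_train


# hypothetical rows; ids are kept as strings so that '1.0' is not read as a float
sheet_rows = [
    {'id': '1.0', 'text': 'Uber blew through $1 million a week'},
    {'id': '1.1', 'subtext': 'Uber', 'entity': 'ORG'},
    {'id': '1.2', 'subtext': '$1 million', 'entity': 'MONEY'},
]
grouped = group_train_rows(sheet_rows)
print(grouped['1']['rows'])
```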

Examples:

Sheet: pipe-matcher

This list helps when the same subtext occurs many times in the “train” sheet.

If type is “nlp”:

pattern: The exact phrase match to select subtext
type: nlp
entity: The annotated Entity label

If type is “regex”:

pattern: The regex to select subtext
type: regex
entity: The annotated Entity label

Examples:

{'pattern': '$1 million', 'type': 'nlp', 'entity': 'MONEY'}
{'pattern': 'Ubers?', 'type': 'regex', 'entity': 'ORG'}
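The effect of these two pattern types can be approximated with the standard re module. This is a character-level sketch under stated assumptions, not ExcelCy's code: the source says phrase matching uses spaCy's PhraseMatcher, which works on tokens, whereas here an “nlp” pattern is simply treated as a literal string. The match_patterns helper is hypothetical.

```python
import re


def match_patterns(text, patterns):
    """Return sorted (start, end, label) tuples for every pattern match in text."""
    entities = []
    for p in patterns:
        # 'regex' patterns are used as-is; 'nlp' patterns are escaped so that
        # characters such as '$' match literally
        regex = p['pattern'] if p['type'] == 'regex' else re.escape(p['pattern'])
        for m in re.finditer(regex, text):
            entities.append((m.start(), m.end(), p['entity']))
    return sorted(entities)


patterns = [
    {'pattern': '$1 million', 'type': 'nlp', 'entity': 'MONEY'},
    {'pattern': 'Ubers?', 'type': 'regex', 'entity': 'ORG'},
]
first = match_patterns('Uber blew through $1 million a week', patterns)
second = match_patterns('Two Ubers arrived at once', patterns)
print(first)   # [(0, 4, 'ORG'), (18, 28, 'MONEY')]
print(second)  # [(4, 9, 'ORG')]
```

Because the patterns are applied to every sentence, one row in the pipe-matcher sheet stands in for a separate subtext row on each sentence that mentions the entity.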

TODO

[X] Get started with spaCy
[ ] More features
- [ ] Add special cases for tokenisation, as described in the spaCy documentation
- [ ] Add more file formats such as YML and JSON. Standardise and document the data structure.
- [ ] Add custom tags.
- [ ] Add report outputs such as identified entities and tags
- [ ] Add support for writing sentences to Excel
- [ ] Add more data structure checks in Excel and more warning messages
- [ ] Add text classifier training, as described in the spaCy documentation
- [ ] Add subtext exceptions when a subtext occurs multiple times in a text (for example, "Google Pay is awesome Google product")
- [ ] Add tag annotation in sheet: train
- [ ] Add lists of patterns easily (such as kitten breeds)
[ ] Improve speed and performance
[ ] Create a data standard
[ ] Reach 100% coverage with branch coverage on
[ ] Submit to Prodigy Universe

Acknowledgement

This project uses other awesome projects:

spaCy

Release history

0.4.1: Aug 23, 2020
0.3.3: Mar 13, 2019
0.3.2: Aug 12, 2018
0.3.1: Jul 29, 2018
0.3.0: Jul 29, 2018
0.2.4: Jul 23, 2018
0.1.2 (this version): Jul 19, 2018

Download files

Source Distribution: excelcy-0.1.2.tar.gz (12.7 kB), uploaded Jul 19, 2018
Built Distribution: excelcy-0.1.2-py3-none-any.whl (11.2 kB), uploaded Jul 19, 2018, Python 3

Hashes for excelcy-0.1.2.tar.gz
Algorithm	Hash digest
SHA256	`6723b9509f457005302dcf5bfb531c446a76ecbe4088c07aca346226d665ad90`
MD5	`f0efb63f9ea501591cdba6c0ddece476`
BLAKE2b-256	`a90af5b0667e6b8d21c146de6b618cc1fe71c086aa616b15885879418b881a86`

Hashes for excelcy-0.1.2-py3-none-any.whl
Algorithm	Hash digest
SHA256	`b9d4185e0e5287d8f0bfccb25cfcc1b85c8789919049c78a0860f8d4ed9cda27`
MD5	`608d6025c05c2e0f4c418f3709990131`
BLAKE2b-256	`9761737002f32f769d320d38900b084792aa4c6ae1ad7df9a707078894e2c75f`