Pandas Dataframe integration for spaCy

These details have not been verified by PyPI

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Programming Language
- Python :: 3.6
Topic
- Software Development :: Build Tools

Project description

DframCy

DframCy is a lightweight utility module that integrates Pandas DataFrames with spaCy's linguistic annotation and training tasks. It provides clean APIs to convert spaCy's linguistic annotations, as well as Matcher and PhraseMatcher results, to Pandas DataFrames. It also supports training and evaluation of NLP pipelines from CSV/XLSX/XLS files without any changes to spaCy's underlying APIs.

Getting Started

DframCy is easy to install. You need the following:

Requirements

Python 3.6 or later
Pandas
spaCy >= 3.0.0

You also need to download a spaCy language model:

python -m spacy download en_core_web_sm

For more information, refer to spaCy's Models & Languages documentation.

Installation:

This package can be installed from PyPI by running:

pip install dframcy

To build from source:

git clone https://github.com/yash1994/dframcy.git
cd dframcy
python setup.py install

Usage

Linguistic Annotations

Get linguistic annotations as a dataframe. For the available annotations (used as dataframe column names), refer to spaCy's Token API documentation.

import spacy
from dframcy import DframCy

nlp = spacy.load("en_core_web_sm")

dframcy = DframCy(nlp)
doc = dframcy.nlp(u"Apple is looking at buying U.K. startup for $1 billion")

# default columns: ["id", "text", "start", "end", "pos_", "tag_", "dep_", "head", "ent_type_"]
annotation_dataframe = dframcy.to_dataframe(doc)

# column names (spaCy's linguistic annotation attributes) can also be passed explicitly
annotation_dataframe = dframcy.to_dataframe(doc, columns=["text", "lemma_", "lower_", "is_punct"])

# for separate entity dataframe
token_annotation_dataframe, entity_dataframe = dframcy.to_dataframe(doc, separate_entity_dframe=True)

# custom attributes can also be included
from spacy.tokens import Token
fruit_getter = lambda token: token.text in ("apple", "pear", "banana")
Token.set_extension("is_fruit", getter=fruit_getter)
doc = dframcy.nlp(u"I have an apple")

annotation_dataframe = dframcy.to_dataframe(doc, custom_attributes=["is_fruit"])
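
To illustrate how column names map onto token attributes, here is a minimal sketch that builds a comparable dataframe with plain spaCy and Pandas. It is not DframCy's implementation: the helper name `tokens_to_frame` is hypothetical, and a blank English pipeline (tokenizer only) is assumed so that no model download is needed. With that pipeline only lexical attributes such as `text`, `lower_` and `is_punct` are populated.

```python
import pandas as pd
import spacy


def tokens_to_frame(doc, columns):
    """Hypothetical helper: one row per token, one column per Token attribute name."""
    rows = [{column: getattr(token, column) for column in columns} for token in doc]
    return pd.DataFrame(rows, columns=columns)


# Tokenizer-only pipeline; tagger, parser and NER attributes would be empty here.
nlp = spacy.blank("en")
doc = nlp("I have an apple.")

frame = tokens_to_frame(doc, ["text", "lower_", "is_punct"])
print(frame)
```

Any attribute name documented in spaCy's Token API could be passed in `columns`, provided the loaded pipeline actually sets it.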

Rule-Based Matching

# Token-based Matching
import spacy

nlp = spacy.load("en_core_web_sm")

from dframcy.matcher import DframCyMatcher, DframCyPhraseMatcher, DframCyDependencyMatcher
dframcy_matcher = DframCyMatcher(nlp)
pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
dframcy_matcher.add("HelloWorld", None, pattern)
doc = dframcy_matcher.nlp("Hello, world! Hello world!")
matches_dataframe = dframcy_matcher(doc)

# Phrase Matching
dframcy_phrase_matcher = DframCyPhraseMatcher(nlp)
terms = [u"Barack Obama", u"Angela Merkel",u"Washington, D.C."]
patterns = [dframcy_phrase_matcher.get_nlp().make_doc(text) for text in terms]
dframcy_phrase_matcher.add("TerminologyList", None, *patterns)
doc = dframcy_phrase_matcher.nlp(u"German Chancellor Angela Merkel and US President Barack Obama "
                                u"converse in the Oval Office inside the White House in Washington, D.C.")
phrase_matches_dataframe = dframcy_phrase_matcher(doc)

# Dependency Matching
dframcy_dependency_matcher = DframCyDependencyMatcher(nlp)
pattern = [{"RIGHT_ID": "founded_id", "RIGHT_ATTRS": {"ORTH": "founded"}}]
# register the pattern with dframcy_dependency_matcher.add(...) before matching
doc = dframcy_dependency_matcher.nlp(u"Bill Gates founded Microsoft. And Elon Musk founded SpaceX")
dependency_matches_dataframe = dframcy_dependency_matcher(doc)
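
For comparison, the token-based example above can be sketched with spaCy's own `Matcher` and a hand-built dataframe. This is an illustration under stated assumptions rather than DframCy's code: the column names (`match_id`, `start`, `end`, `text`) are hypothetical, and a blank English pipeline is used because the pattern only relies on tokenizer-level attributes.

```python
import pandas as pd
import spacy
from spacy.matcher import Matcher

# Tokenizer-only pipeline; LOWER and IS_PUNCT do not need a trained model.
nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

pattern = [{"LOWER": "hello"}, {"IS_PUNCT": True}, {"LOWER": "world"}]
# spaCy v3 signature: add(key, patterns), where patterns is a list of patterns.
matcher.add("HelloWorld", [pattern])

doc = nlp("Hello, world! Hello world!")

# Each match is a (match_id, start, end) triple of token indices.
rows = [
    {
        "match_id": nlp.vocab.strings[match_id],  # hash back to the string key
        "start": start,
        "end": end,
        "text": doc[start:end].text,
    }
    for match_id, start, end in matcher(doc)
]
matches_frame = pd.DataFrame(rows, columns=["match_id", "start", "end", "text"])
print(matches_frame)
```

Only the first "Hello, world" is matched, because the second occurrence has no punctuation token between the two words.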

Command Line Interface

DframCy provides a command-line interface for converting a plain text file into linguistically annotated text in CSV or JSON format. Previous versions of DframCy also provided CLI utilities for training and evaluating spaCy models from CSV/XLS files. Since the v3 release, spaCy's training pipeline has become much more flexible and robust, so DframCy no longer adds an extra step just for format conversion (CSV/XLS to spaCy's binary format).

# convert
dframcy dframe -i plain_text.txt -o annotations.csv -f csv
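
Once a CSV file has been produced, it can be inspected with the standard library. The sketch below assumes the file has a header row that includes a `pos_` column, as in the default column list shown earlier. That layout is an assumption, and the inline sample stands in for a real `annotations.csv`. The helper name `count_values` is hypothetical.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of annotations.csv; real output may use different columns.
sample_csv = io.StringIO(
    "id,text,pos_\n"
    "0,Apple,PROPN\n"
    "1,is,AUX\n"
    "2,looking,VERB\n"
)


def count_values(csv_file, column):
    """Count how often each value occurs in one column of an annotation CSV."""
    reader = csv.DictReader(csv_file)
    return Counter(row[column] for row in reader)


pos_counts = count_values(sample_csv, "pos_")
print(pos_counts.most_common())
```

To run this against a real file, replace `sample_csv` with `open("annotations.csv", newline="")`.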

Release history

- 0.1.6 (this version): Feb 16, 2021
- 0.1.5: Apr 28, 2020
- 0.1.4: Apr 9, 2020
- 0.1.3: Nov 4, 2019
- 0.1.2: Oct 14, 2019
- 0.1.1: Oct 14, 2019
- 0.1.0: Oct 14, 2019
- 0.0.1: Oct 14, 2019

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

dframcy-0.1.6.tar.gz (13.4 kB)

Uploaded Feb 16, 2021 Source

Built Distribution

dframcy-0.1.6-py3-none-any.whl (13.3 kB)

Uploaded Feb 16, 2021 Python 3

Hashes for dframcy-0.1.6.tar.gz
Algorithm	Hash digest
SHA256	`23ac9e64430ac5bba51980b99cbdef9585f88c6e4a7bb9d62a65dba4a8241bec`
MD5	`74bf2bfe31732ceb44bf91fd1395c5c3`
BLAKE2b-256	`30936b842ecc160b77d76954b07ad3311f6c039d4718669dc125f88c248e62ff`

Hashes for dframcy-0.1.6-py3-none-any.whl
Algorithm	Hash digest
SHA256	`119ee537697717a7e96a5780cca11b6ed6fa190c3004d8402a88850a9a8b045c`
MD5	`685b6b4540999342f2cde6502fc72750`
BLAKE2b-256	`7cfbf5298c497597d20fe861a8032d56fb78a3f5fc33535a62f3033d1235fc56`