A utility library to assist in parsing natural language text.
Zensols Natural Language Parsing
This framework wraps spaCy and creates lightweight features in a class hierarchy that reflects the structure of natural language. The motivation is to generate features from the parsed text in an object-oriented fashion that is fast and easy to pickle.
- See the full documentation.
- Paper on arXiv.
Other features include:
- Parse and normalize a stream of tokens, with stop word removal, punctuation filtering, up/down casing, Porter stemming and more.
- Detached features that are safe and easy to pickle to disk.
- Configuration-driven parsing and token normalization using configuration factories.
- Pretty-print functionality for easy natural language feature selection.
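The "detached features" idea above can be illustrated outside of this library with plain data classes that copy only the needed attributes out of a parse, so the result pickles without any spaCy objects attached. The class and attribute names below are hypothetical, chosen for illustration only; they are not this library's API:

```python
import pickle
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DetachedToken:
    """A token's features copied out of the parser (illustrative sketch)."""
    norm: str  # normalized token text
    tag: str   # part-of-speech tag
    ent: str   # named entity label


@dataclass(frozen=True)
class DetachedDocument:
    """Holds only plain-data token features, so it pickles cheaply."""
    tokens: Tuple[DetachedToken, ...]


doc = DetachedDocument((
    DetachedToken('George', 'NNP', 'PERSON'),
    DetachedToken('Washington', 'NNP', 'PERSON'),
))
# round-trips through pickle with no parser state attached
restored = pickle.loads(pickle.dumps(doc))
assert restored == doc
```

Because the detached objects carry no references to the parser or model, they can be serialized and reloaded later without spaCy installed at all.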
Documentation
Usage
An example that provides ways to configure the parser is given here. See the makefile or `./run.py -h` for command line usage.
A very simple example is given below:
```python
from io import StringIO
from zensols.config import ImportIniConfig, ImportConfigFactory
from zensols.nlp import FeatureDocument, FeatureDocumentParser

CONFIG = """
[import]
sections = list: imp_conf

# import the `zensols.nlp` library
[imp_conf]
type = importini
config_files = list: resource(zensols.nlp): resources/obj.conf

# override the parser to keep only the ent_ and tag_ features
[doc_parser]
token_feature_ids = set: ent_, tag_
"""

if __name__ == '__main__':
    fac = ImportConfigFactory(ImportIniConfig(StringIO(CONFIG)))
    doc_parser: FeatureDocumentParser = fac('doc_parser')
    sent = 'He was George Washington and first president of the United States.'
    doc: FeatureDocument = doc_parser(sent)
    for tok in doc.tokens:
        tok.write()
```
This uses a resource library to source in the configuration from this package, so minimal configuration is necessary.
See the feature documents for more information.
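For instance, a different set of features could be kept by changing the `token_feature_ids` override in the configuration above. The feature IDs below follow spaCy's attribute naming; treat this particular selection as an illustrative assumption rather than a recommended setting:

```ini
# keep the normalized text, lemma and part-of-speech tag instead
[doc_parser]
token_feature_ids = set: norm, lemma_, tag_
```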
Obtaining / Installing
- The easiest way to install the command line program is via the pip installer:

```shell
pip3 install zensols.nlp
```

- Install at least one spaCy model:

```shell
python -m spacy download en_core_web_sm
```
Binaries are also available on PyPI.
Attribution
This project, or example code, uses:
- spaCy for natural language parsing
- msgpack and smart-open for Python disk serialization
- nltk for the Porter stemmer functionality
Citation
If you use this project in your research please use the following BibTeX entry:
```bibtex
@article{Landes_DiEugenio_Caragea_2021,
  title={DeepZensols: Deep Natural Language Processing Framework},
  url={http://arxiv.org/abs/2109.03383},
  note={arXiv: 2109.03383},
  journal={arXiv:2109.03383 [cs]},
  author={Landes, Paul and Di Eugenio, Barbara and Caragea, Cornelia},
  year={2021},
  month={Sep}
}
```
Changelog
An extensive changelog is available here.
License
Copyright (c) 2020 - 2021 Paul Landes