spacy-ngram
About the Project
SpaCy pipeline component for adding document or sentence-level ngrams.
Getting Started
Prerequisites
- Python 3.10+
Installation
- Install from PyPI:
pip install spacy-ngram
- This will install spacy, but spacy requires a model:
  - E.g., download: python -m spacy download en_core_web_sm
  - Or, manually download and install with pip install ...
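Before loading the pipeline, it can help to confirm that a model package is actually installed. The following is a minimal sketch using only the standard library; `model_installed` is a hypothetical helper (not part of spacy or spacy-ngram), and it assumes the model was installed as a regular Python package, which is how `python -m spacy download` installs it.

```python
import importlib.util


def model_installed(name='en_core_web_sm'):
    """Return True if a package with this name can be imported (hypothetical helper)."""
    # find_spec returns None when no top-level package of that name is found
    return importlib.util.find_spec(name) is not None


if not model_installed('en_core_web_sm'):
    print('Model not found; run: python -m spacy download en_core_web_sm')
```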
Usage
Quick Start
spacy-ngram creates ngrams of any size and adds them at the document level, the sentence level, or both.
import spacy
from spacy_ngram import NgramComponent
nlp = spacy.load('en_core_web_sm') # or whatever model you downloaded
nlp.add_pipe('spacy-ngram') # default to document-level ngrams, removing stopwords
text = 'Quark soup is an interacting localized assembly of quarks and gluons.'
doc = nlp(text)
print(doc._.ngram_1)
# ['quark', 'soup', 'interact', 'localize', 'assembly', 'quark', 'gluon']
print(doc._.ngram_2)
# ['quark_soup', 'soup_interact', 'interact_localize', 'localize_assembly', 'assembly_quark', 'quark_gluon']
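To make the relationship between the two outputs above concrete, here is a minimal pure-Python sketch that builds the bigrams from the unigram list. `make_ngrams` is a hypothetical helper, not the library's implementation; joining tokens with an underscore is inferred from the output shown above.

```python
def make_ngrams(tokens, n):
    """Join each run of n consecutive tokens with '_' (hypothetical helper)."""
    # A list shorter than n yields an empty result because the range is empty
    return ['_'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


unigrams = ['quark', 'soup', 'interact', 'localize', 'assembly', 'quark', 'gluon']
bigrams = make_ngrams(unigrams, 2)
print(bigrams)
# ['quark_soup', 'soup_interact', 'interact_localize', 'localize_assembly', 'assembly_quark', 'quark_gluon']
```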
Quick Reference
spacy-ngram creates new extensions under the Doc and/or Span classes, depending on the parameters (it defaults to Doc). The extension begins with the prefix ngram_ followed by the level of ngram desired (e.g., ngram_1).
- unigram (1 included in ngrams argument): Doc._.ngram_1
- bigram (2 included in ngrams argument): Doc._.ngram_2
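The naming rule can be summarized in a short sketch. `extension_names` is a hypothetical helper written for illustration only; it mirrors the ngram_ prefix described above and uses the defaults listed in the parameter table, but it does not call spacy or register anything.

```python
def extension_names(ngrams=(1, 2), doc_level=True, sentence_level=False):
    """Map each class to the extension names expected on it (hypothetical helper)."""
    names = [f'ngram_{n}' for n in ngrams]
    return {
        'Doc': names if doc_level else [],         # document-level extensions
        'Span': names if sentence_level else [],   # sentence-level extensions
    }


print(extension_names())
# {'Doc': ['ngram_1', 'ngram_2'], 'Span': []}
print(extension_names(ngrams=(2, 3), doc_level=False, sentence_level=True))
# {'Doc': [], 'Span': ['ngram_2', 'ngram_3']}
```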
Pipeline Parameters
The component can be configured as needed. For example, to process at the sentence level:
nlp.add_pipe('spacy-ngram', config={
'sentence_level': True, # initialize sentence-level ngrams
'doc_level': False, # skip processing at document-level
'ngrams': (2, 3), # bi- and trigram only
})
doc = nlp(text)
sentence = list(doc.sents)[0]  # take the first sentence (a Span)
print(sentence._.ngram_1)
# raises AttributeError: unigrams were not requested in 'ngrams'
sentence._.ngram_2 # returns list of bigrams
sentence._.ngram_3 # returns list of trigrams
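The difference between the two levels can be illustrated without spacy. The sketch below assumes that sentence-level ngrams are built within each sentence and never cross a sentence boundary, and that document-level ngrams run over the whole token sequence; neither behavior is confirmed here for the library itself. `make_ngrams` and the token lists are hypothetical.

```python
def make_ngrams(tokens, n):
    """Join each run of n consecutive tokens with '_' (hypothetical helper)."""
    return ['_'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


# Two already-tokenized sentences (illustrative data)
sentences = [['quark', 'soup', 'hot'], ['gluon', 'bind', 'quark']]

# Document level (assumed): one list over all tokens in order
doc_tokens = [tok for sent in sentences for tok in sent]
doc_bigrams = make_ngrams(doc_tokens, 2)

# Sentence level (assumed): one list per sentence
sent_bigrams = [make_ngrams(sent, 2) for sent in sentences]

print(doc_bigrams)
# ['quark_soup', 'soup_hot', 'hot_gluon', 'gluon_bind', 'bind_quark']
print(sent_bigrams)
# [['quark_soup', 'soup_hot'], ['gluon_bind', 'bind_quark']]
```

Under these assumptions, 'hot_gluon' appears only in the document-level list because it straddles the sentence boundary.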
Parameter | Type | Default | Description |
---|---|---|---|
ngrams | tuple[int] | (1, 2) | 1 for unigram, 2 for bigram, etc. |
include_bos | bool | False | include BOS tags at beginning of sentence/document |
include_eos | bool | False | include EOS tags at end of sentence/document |
sentence_level | bool | False | perform ngram-extraction at sentence-level |
doc_level | bool | True | perform ngram-extraction at document-level |
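As a rough illustration of what include_bos and include_eos might do, the sketch below pads a token list with boundary markers before forming bigrams. The marker strings '<BOS>' and '<EOS>' and both helpers are assumptions made for this example; the tags the library actually emits may differ.

```python
def pad_tokens(tokens, include_bos=False, include_eos=False,
               bos='<BOS>', eos='<EOS>'):
    """Add boundary markers around a token list (hypothetical helper and markers)."""
    padded = list(tokens)  # copy so the caller's list is not modified
    if include_bos:
        padded.insert(0, bos)
    if include_eos:
        padded.append(eos)
    return padded


def make_ngrams(tokens, n):
    """Join each run of n consecutive tokens with '_' (hypothetical helper)."""
    return ['_'.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = ['quark', 'soup']
padded = pad_tokens(tokens, include_bos=True, include_eos=True)
print(make_ngrams(padded, 2))
# ['<BOS>_quark', 'quark_soup', 'soup_<EOS>']
```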
Versions
Uses SEMVER.
See https://github.com/kpwhri/spacy-ngram/releases.
Roadmap
See the open issues for a list of proposed features (and known issues).
Contributing
Any contributions you make are greatly appreciated.
- Fork the Project
- Create your Feature Branch (git checkout -b feature/AmazingFeature)
- Commit your Changes (git commit -m 'Add some AmazingFeature')
- Push to the Branch (git push origin feature/AmazingFeature)
- Open a Pull Request
License
Distributed under the MIT License.
See LICENSE or https://kpwhri.mit-license.org for more information.
Contact
Please use the issue tracker.
Acknowledgements
Built Distribution
Hashes for spacy_ngram-0.0.1-py3-none-any.whl
Algorithm | Hash digest |
---|---|
SHA256 | c2d045ab96abbc84b6d73fb1f857d3fc87a9d46ccd2aee8a17508d44dba4f37f |
MD5 | e994dfbb3259db204de708e4860ba9b8 |
BLAKE2b-256 | 3649ad2a24f4d1f4f139300dcdb9e86baa16f164af4ba2a42a6dae120ba50a44 |