Skip to main content

Keyword extraction with spaCy

Project description

spacy_ke: Keyword Extraction with spaCy.

⏳ Installation

pip install spacy_ke

🚀 Quickstart

Usage as a spaCy pipeline component

import spacy

from spacy_ke import Yake

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(Yake(nlp))

doc = nlp(
    "Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence "
    "concerned with the interactions between computers and human language, in particular how to program computers "
    "to process and analyze large amounts of natural language data. "
)

for keyword, score in doc._.extract_keywords(n=3):
    print(keyword, "-", score)

# computer science - 0.020279855002262884
# NLP - 0.035016746977200745
# Natural language processing - 0.04407186487965091

Customization

In the example below, we customize the yake algorithm as follows;

  • Change the candidate selection to chunk (noun phrases). Notice that candidate_selection is a global config property for all keyword extractors, which can be set to either a callable (Doc -> Iterator[Candidate]), a string pointing to instance method (i.e chunk -> ._chunk_selection()), or a dict (i.e {"ngram": 3}).
  • Set lemmatize=True for candidate weighting. Notice that this config property is unique to the Yake implementation.
nlp.add_pipe(Yake(nlp, candidate_selection="chunk", lemmatize=True))

Development

Set up pip & virtualenv

$ pipenv sync -d

Run unit test

$ pipenv run pytest

Run black (code formatter)

$ pipenv run black spacy_ke/ --config=pyproject.toml

References

[1] A Review of Keyphrase Extraction

@article{DBLP:journals/corr/abs-1905-05044,
  author    = {Eirini Papagiannopoulou and
               Grigorios Tsoumakas},
  title     = {A Review of Keyphrase Extraction},
  journal   = {CoRR},
  volume    = {abs/1905.05044},
  year      = {2019},
  url       = {http://arxiv.org/abs/1905.05044},
  archivePrefix = {arXiv},
  eprint    = {1905.05044},
  timestamp = {Tue, 28 May 2019 12:48:08 +0200},
  biburl    = {https://dblp.org/rec/journals/corr/abs-1905-05044.bib},
  bibsource = {dblp computer science bibliography, https://dblp.org}
}

[2] pke: an open source python-based keyphrase extraction toolkit.

@InProceedings{boudin:2016:COLINGDEMO,
  author    = {Boudin, Florian},
  title     = {pke: an open source python-based keyphrase extraction toolkit},
  booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations},
  month     = {December},
  year      = {2016},
  address   = {Osaka, Japan},
  pages     = {69--73},
  url       = {http://aclweb.org/anthology/C16-2015}
}

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

spacy_ke-0.1.0.tar.gz (53.8 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page