
RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm that tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text.
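To make the idea of frequency and co-occurrence concrete, here is a minimal pure-Python sketch of RAKE-style scoring. It is an illustration only, not the code used by rake-nltk: the tiny `STOPWORDS` set, the regex-based splitting, and the function names `candidate_phrases` and `rank_phrases` are all assumptions made for this example.

```python
import re
from collections import defaultdict

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOPWORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def candidate_phrases(text, stopwords=STOPWORDS):
    """Split text into runs of consecutive non-stop words."""
    phrases = []
    # Punctuation ends a phrase, so cut the text into punctuation-free chunks first.
    for chunk in re.split(r"[^\w\s]+", text.lower()):
        current = []
        for word in chunk.split():
            if word in stopwords:
                # A stop word closes the phrase collected so far.
                if current:
                    phrases.append(tuple(current))
                current = []
            else:
                current.append(word)
        if current:
            phrases.append(tuple(current))
    return phrases


def rank_phrases(text, stopwords=STOPWORDS):
    """Return (score, phrase) pairs, highest score first."""
    phrases = candidate_phrases(text, stopwords)
    frequency = defaultdict(int)  # how often each word appears
    degree = defaultdict(int)     # co-occurrence count, including the word itself
    for phrase in phrases:
        for word in phrase:
            frequency[word] += 1
            degree[word] += len(phrase)
    # Word score is the degree-to-frequency ratio; a phrase scores the sum of its words.
    word_score = {word: degree[word] / frequency[word] for word in frequency}
    scored = {" ".join(p): sum(word_score[w] for w in p) for p in set(phrases)}
    return sorted(((score, phrase) for phrase, score in scored.items()), reverse=True)
```

Words that tend to appear inside longer phrases get a higher degree, so multi-word phrases built from them rise to the top of the ranking.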

Project description




Features

  • Ridiculously simple interface.

  • Configurable word and sentence tokenizers, language-based stop words, etc.

  • Configurable ranking metric.
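The configuration options listed above might be used roughly as follows. This is a sketch under assumptions: the keyword arguments (`stopwords`, `punctuations`, `ranking_metric`, `min_length`, `max_length`) and the `Metric` enum reflect my understanding of recent rake-nltk releases, so check them against the version you have installed. The helper name `build_custom_rake` and the specific values are made up for illustration.

```python
def build_custom_rake():
    """Create a Rake instance with non-default settings (illustrative values).

    The import is deferred so this snippet loads even where rake-nltk
    is not installed.
    """
    from rake_nltk import Metric, Rake

    return Rake(
        # A small custom stop word set; when omitted, the NLTK list for the
        # chosen language is used instead.
        stopwords={"a", "an", "and", "in", "is", "of", "the", "to"},
        # Characters to ignore as phrase boundaries (illustrative subset).
        punctuations={".", ",", ";", ":", "!", "?"},
        # Rank by word degree rather than the degree-to-frequency ratio.
        ranking_metric=Metric.WORD_DEGREE,
        # Keep only phrases that are one to four words long.
        min_length=1,
        max_length=4,
    )
```

Passing an explicit stop word set is one way to avoid depending on the NLTK stopwords corpus at construction time.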

Setup

Using pip

pip install rake-nltk

Directly from the repository

git clone https://github.com/csurfer/rake-nltk.git
python rake-nltk/setup.py install

Quick Start

from rake_nltk import Rake

# Uses stopwords for English from NLTK, and all punctuation characters by
# default.
r = Rake()

# Extraction given the text.
r.extract_keywords_from_text(<text to process>)

# Extraction given the list of strings where each string is a sentence.
r.extract_keywords_from_sentences(<list of sentences>)

# To get keyword phrases ranked highest to lowest.
r.get_ranked_phrases()

# To get keyword phrases ranked highest to lowest with scores.
r.get_ranked_phrases_with_scores()
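Once the ranked phrases are available, a common next step is to keep only the strongest few. The sketch below assumes the scored results arrive as a list of `(score, phrase)` tuples already sorted highest to lowest. The helper `top_phrases` and the sample data are hypothetical and are not part of rake-nltk.

```python
def top_phrases(ranked, limit=5, min_score=0.0):
    """Keep at most `limit` phrases whose score is at least `min_score`.

    `ranked` is expected to be (score, phrase) pairs sorted highest first.
    """
    kept = [(score, phrase) for score, phrase in ranked if score >= min_score]
    return kept[:limit]


# Made-up results in the assumed (score, phrase) shape, for illustration only.
sample = [
    (9.0, "rapid automatic keyword extraction"),
    (4.0, "stop words"),
    (1.0, "text"),
]
best = top_phrases(sample, limit=2, min_score=2.0)
```

In real use, `sample` would be replaced by the list returned from `r.get_ranked_phrases_with_scores()`.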

Debugging Setup

If you see a stopwords error, the NLTK stopwords corpus has not been downloaded. You can download it using the command below.

python -c "import nltk; nltk.download('stopwords')"
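The same check can be done from Python so that the corpus is downloaded only when it is missing. This is a minimal sketch: the helper name `ensure_stopwords` is hypothetical, and the optional `nltk_module` parameter exists only so the function can be exercised without NLTK or a network connection. It relies on `nltk.data.find` raising `LookupError` when a resource is absent.

```python
def ensure_stopwords(nltk_module=None):
    """Download the NLTK stopwords corpus if it is not already available.

    Returns True if a download was triggered, False if the corpus was found.
    """
    if nltk_module is None:
        # Imported lazily so the helper can be defined without NLTK installed.
        import nltk as nltk_module
    try:
        nltk_module.data.find("corpora/stopwords")
    except LookupError:
        nltk_module.download("stopwords")
        return True
    return False
```

Calling `ensure_stopwords()` once before creating a `Rake` instance should be enough in a typical setup.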

References

This is a Python implementation of the algorithm described in the paper "Automatic keyword extraction from individual documents" by Stuart Rose, Dave Engel, Nick Cramer and Wendy Cowley.

Why did I choose to implement it myself?

  • It is extremely fun to implement algorithms by reading papers. It is the digital equivalent of DIY kits.

  • There are some rather popular implementations out there, in Python (aneesha/RAKE) and Node (waseem18/node-rake), but neither seemed to use the power of NLTK. By making NLTK an integral part of the implementation, I get the flexibility and power to extend it in other creative ways, if I see fit later, without having to implement everything myself.

  • I plan to use it in my future pet projects and wanted it to be modular and tunable; this way I have complete control.

Contributing

Bug Reports and Feature Requests

Please use the issue tracker for reporting bugs or feature requests.

Development

  1. Check out the repository.

  2. Make your changes and add/update relevant tests.

  3. Install `poetry` using `pip install poetry`.

  4. Run `poetry install` to create the project’s virtual environment.

  5. Run tests using `poetry run tox` (environments for any Python versions you don’t have installed will fail). Fix failing tests and repeat.

  6. Make any relevant documentation changes.

  7. Install `pre-commit` using `pip install pre-commit` and run `pre-commit run --all-files` to do lint checks.

  8. Generate documentation using `poetry run sphinx-build -b html docs/ docs/_build/html`.

  9. Generate `requirements.txt` for automated testing using `poetry export --dev --without-hashes -f requirements.txt > requirements.txt`.

  10. Commit the changes and raise a pull request.

Buy the developer a cup of coffee!

If you found the utility helpful, you can buy me a cup of coffee using the Donate link.
