Annotation tool for NER tasks on Jupyter

These details have been verified by PyPI

Maintainers

antoinegrelety eturc mathildegallois nicolas.gaudin thomaub vmatthys xelnor

These details have not been verified by PyPI

Project links

Homepage

GitHub Statistics

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Project description

PyLighter: Annotation tool for NER tasks

PyLighter is a tool that allows data scientists to annotate a corpus of documents directly on Jupyter for NER (Named Entity Recognition) tasks.

Installation
Basic usage
Advanced usage
Contributing
- Testing
License

Installation

From PyPI: https://pypi.org/project/pylighter/

pip install pylighter
jupyter nbextension enable --py widgetsnbextension

From GitHub: https://github.com/PayLead/PyLighter

git clone git@github.com:PayLead/PyLighter.git
cd PyLighter
python setup.py install
jupyter nbextension enable --py widgetsnbextension

Demos

The demo folder contains working examples of PyLighter in use. To view them, open any of the ipynb files in Jupyter.

Basic usage

PyLighter is meant for annotating a corpus directly in Jupyter. Let's first define a corpus for this example:

corpus = [
    "PyLighter is an annotation tool for NER tasks directly on Jupyter. "
    + "It aims on helping data scientists easily and quickly annotate datasets. "
    + "This tool was developed by Paylead.",
    "PayLead is a fintech company specializing in transaction data analysis. "
    + "Paylead brings retail and banking together, so customers get rewarded when they buy. "
    + "Welcome to the data-for-value economy."
]

Now let's start annotating!

from pylighter import Annotation

annotation = Annotation(corpus)

Running that cell gives you the following output:

You can now start annotating entities using the predefined labels l1, l2, etc.

When your annotation is finished, you can either click on the save button or retrieve the results in the current Notebook.

The save button saves the results in a CSV file named annotation.csv with two columns: the documents and the labels.
You can access the labels of your annotations in annotation.labels.

Note: The given labels are in IOB2 format.
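To make the IOB2 note concrete, here is a minimal sketch that turns IOB2 tags back into entity spans. It assumes one tag per character of the document (an assumption about the layout of annotation.labels, not something stated above), and the helper name iob2_to_spans is hypothetical, not part of PyLighter.

```python
def iob2_to_spans(text, tags):
    """Collect (start, end, label) spans from IOB2 tags.

    Assumption: tags holds exactly one IOB2 tag per character of text.
    """
    if len(text) != len(tags):
        raise ValueError("expected one tag per character")
    spans = []
    start, label = None, None
    for i, tag in enumerate(tags):
        # The open span continues only on an "I-" tag carrying the same label.
        continues = tag.startswith("I-") and label == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, label))
            start, label = None, None
        # A "B-" tag always opens a new span; stray "I-" tags are ignored.
        if tag.startswith("B-"):
            start, label = i, tag[2:]
    if start is not None:
        spans.append((start, len(tags), label))
    return spans


text = "PayLead is a fintech"
tags = ["B-l1"] + ["I-l1"] * 6 + ["O"] * 13
spans = iob2_to_spans(text, tags)
entities = [(text[s:e], label) for s, e, label in spans]
```

Here entities pairs each annotated substring with its label, which is often easier to inspect than the raw tag list.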

Advanced usage

The above example works just fine but PyLighter can be customized to best fit your specific use case.

Using an already annotated corpus

In most cases, you want to use an already annotated corpus or simply continue your annotation.

To do this, you can use the argument named labels with the labels of the corpus. Moreover, if you stopped at the i^th document, you can directly get back to where you stopped with start_index=i.

screenshot_pre_annotated

You can see more on that with this demo.
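If your existing annotations are stored as character offsets rather than tags, a sketch like the following could build the tag lists expected by the labels argument. It assumes one IOB2 tag per character and one tag list per document. Both are assumptions, so compare the result with annotation.labels from a small session first. The helper spans_to_iob2 is hypothetical.

```python
def spans_to_iob2(text, spans):
    """Build one IOB2 tag per character from (start, end, label) spans."""
    tags = ["O"] * len(text)
    for start, end, label in spans:
        if not 0 <= start < end <= len(text):
            raise ValueError(f"span ({start}, {end}) is outside the document")
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters
    return tags


document = "This tool was developed by Paylead."
start = document.index("Paylead")
# One tag list per document of the corpus.
labels = [spans_to_iob2(document, [(start, start + len("Paylead"), "l1")])]

# In a notebook with PyLighter installed (assumed usage):
# annotation = Annotation([document], labels=labels, start_index=0)
```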

Changing label names

PyLighter uses l1, l2, ..., l7 as default label names, but in most cases, you want to have explicit labels such as Noun, Verb, etc.

You can define your own label names with the argument labels_names. You can also define your own colors for your labels with the argument labels_colors in HEX format.

screenshot_labels_changed

You can see more on that with this demo.
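As an illustration, the sketch below prepares labels_names and labels_colors together and checks them before they reach the widget. The checks (one color per label, six-digit HEX values) are assumptions and may be stricter than what PyLighter requires. The helper label_arguments is hypothetical.

```python
import re

# Assumption: six-digit HEX colors such as "#ffd54f".
HEX_COLOR = re.compile(r"#[0-9a-fA-F]{6}")


def label_arguments(labels_names, labels_colors):
    """Return keyword arguments after basic consistency checks."""
    if len(labels_names) != len(labels_colors):
        raise ValueError("expected one color per label name")
    for color in labels_colors:
        if not HEX_COLOR.fullmatch(color):
            raise ValueError(f"not a HEX color: {color!r}")
    return {"labels_names": labels_names, "labels_colors": labels_colors}


label_kwargs = label_arguments(
    ["Person", "Company"],
    ["#ffd54f", "#81d4fa"],
)

# In a notebook with PyLighter installed:
# annotation = Annotation(corpus, **label_kwargs)
```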

Document styling

You can adjust the font size, the minimum distance between two characters, and the width of spaces with the argument char_params.

Default value for char_params is:

# Each field expects css value as a string (ex:"10px", "1em", "large", etc.)
char_params = {
    "font_size": "medium",
    "width_white_space": "10px",
    "min_width_between_chars": "4px",
}
    "min_width_between_chars": "4px",
}
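To change a single value, one cautious approach is to start from the defaults and override only what you need. Whether PyLighter accepts a partial dictionary is not stated here, so this sketch always passes all three fields. The name DEFAULT_CHAR_PARAMS is hypothetical.

```python
# Defaults as documented above; each field is a CSS value given as a string.
DEFAULT_CHAR_PARAMS = {
    "font_size": "medium",
    "width_white_space": "10px",
    "min_width_between_chars": "4px",
}

# Override only the font size and keep the other defaults.
char_params = {**DEFAULT_CHAR_PARAMS, "font_size": "large"}

# In a notebook with PyLighter installed:
# annotation = Annotation(corpus, char_params=char_params)
```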

Adding additional information

In some cases, you may want to see additional information about the current document, such as its source.

To do this, you can use the argument additional_infos. This argument must be a pandas DataFrame of shape (size of the corpus, number of additional information). The i^th row of the DataFrame will be associated with the i^th element of the corpus.

The elements of the given DataFrame need to have a proper string representation to be correctly displayed.

For instance, to add the source to each element of the corpus:

import pandas as pd

# One row per document: the corpus defined above contains 2 documents
additional_infos = pd.DataFrame({"source":["Github", "Paylead.fr"]})
annotation = Annotation(corpus, additional_infos=additional_infos)

The result will be:

screenshot_additional_information

You can see more on that with this demo.

Adding additional outputs

In some cases, you may want to flag a document as difficult to annotate, mark it as wrong, or record a value that estimates your confidence in your annotation. In short, you need to return additional information.

To do this, you can use the argument: additional_outputs_elements. This argument expects a list of pylighter.AdditionalOutputElement.

A pylighter.AdditionalOutputElement is defined like this:

from pylighter import AdditionalOutputElement

AdditionalOutputElement(
    name="name_of_my_element",
    display_type="type_of_display",  # checkbox, int_text, float_text, text, text_area
    description="Description of the element to display",
    default_value="Default value for the element"
)

Here is an example:

screenshot_additional_outputs

Note: Additional outputs are written to the save file, and you can also retrieve them with annotation.additional_outputs_values. To reuse previously returned values, pass them with the argument additional_outputs_values, in the same way as labels.

You can see more on that with this demo.

Using keyboard shortcuts

Annotation tasks are pretty boring. Thus you may want to use keyboard shortcuts to easily change documents or to select another label.

By default, there are only a few shortcuts defined:

next: Alt + n
previous: Alt + p
skip: Alt + s
save: Shift + Alt + s

However, you can fully customize them with the arguments standard_shortcuts and labels_shorcuts. The standard_shortcuts argument redefines shortcuts for the standard buttons, such as the next button, whereas the other argument defines shortcuts for selecting labels.

A shortcut is defined like this:

from pylighter import Shortcut

Shortcut(
    name="skip",  # Name of the button to bind on (ex: "next", "skip") or name of the label (ex: "l1", "l2", or one you defined)
    key="Ò",  # Usually represents the character that is displayed.
    code="KeyS",  # Usually represents the key that is pressed.
    shift_key=False,  # Whether the shift key is pressed
    alt_key=True,
    ctrl_key=False
)

It is hard to know in advance the right values for key and code, since they depend on many factors such as your keyboard layout and your browser.

Thus, you can use the ShortcutHelper to pick the right shortcut. Here is an example of it.

from pylighter import ShortcutHelper

ShortcutHelper()

screenshot_shortcut_helper

You can see more on that with this demo.
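As a sketch of how label shortcuts might be generated in bulk, the helper below builds the keyword arguments for one Alt + digit shortcut per label. The code values follow the browser KeyboardEvent.code names for the top-row digit keys. The key values are a guess that depends on your keyboard layout, so check them with ShortcutHelper. The helper label_shortcut_kwargs is hypothetical.

```python
def label_shortcut_kwargs(labels_names):
    """Return keyword arguments for one Alt + digit shortcut per label."""
    if len(labels_names) > 9:
        raise ValueError("only digits 1 to 9 are available")
    return [
        {
            "name": name,
            # Assumption: layout-dependent, verify with ShortcutHelper.
            "key": str(digit),
            # KeyboardEvent.code of the top-row digit keys.
            "code": f"Digit{digit}",
            "shift_key": False,
            "alt_key": True,
            "ctrl_key": False,
        }
        for digit, name in enumerate(labels_names, start=1)
    ]


shortcut_kwargs = label_shortcut_kwargs(["Noun", "Verb"])

# In a notebook with PyLighter installed (assumed usage):
# shortcuts = [Shortcut(**kwargs) for kwargs in shortcut_kwargs]
```

The resulting Shortcut objects would then presumably be passed, as a list, to the label shortcuts argument described above.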

Contributing

Testing

PyLighter uses pytest. Thus, tests can be run with:

make test

PyLighter uses flake8, isort and check-manifest to control the quality of the code. You can test the quality of the code with:

make test-quality

If you wish to test everything including the packaging, you can run:

make test-all

License

MIT License

Release history

0.0.3 (this version): Sep 23, 2021
0.0.2: Nov 24, 2020
0.0.1: Nov 12, 2020

Download files

Source Distribution

pylighter-0.0.3.tar.gz (32.6 kB)

Uploaded Sep 23, 2021 Source

Built Distribution

pylighter-0.0.3-py2.py3-none-any.whl (25.5 kB)

Uploaded Sep 23, 2021 Python 2 Python 3

Hashes for pylighter-0.0.3.tar.gz
Algorithm	Hash digest
SHA256	`547fa47252b0f63e0a1301facb26b1801f5afc1d1f1bb632a95d6088f5034b00`
MD5	`a78382fd0681655301fecabee63b890f`
BLAKE2b-256	`b37bbf45bebdd13072562782fa2c0eaad4adc1677f5e41d2587bbd144bff08e7`

Hashes for pylighter-0.0.3-py2.py3-none-any.whl
Algorithm	Hash digest
SHA256	`f7e48bf42a8d2a6dc396b5ad36c3148578f931f3cc75bc02c595ccbe27442d3b`
MD5	`3e30f548621b86be6d2d7cfe63af6ee6`
BLAKE2b-256	`fb7539d5f954bc173deefc9a1176884affac32dd259027e33957704932026810`