A multi-lingual approach to AllenNLP CoReference Resolution, along with a wrapper for spaCy.

Crosslingual Coreference

Coreference is amazing, but the data required to train a model is very scarce. In our case, the training data available for non-English languages also proved to be poorly annotated. Crosslingual Coreference therefore assumes that a model trained on English data with cross-lingual embeddings should also work for languages with similar sentence structures.

Install

pip install crosslingual-coreference

Quickstart

from crosslingual_coreference import Predictor

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

# choose minilm for speed/memory and info_xlm for accuracy
predictor = Predictor(
    language="en_core_web_sm", device=-1, model_name="minilm"
)

print(predictor.predict(text)["resolved_text"])
print(predictor.pipe([text])[0]["resolved_text"])
# Note you can also get 'cluster_heads' and 'clusters'
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.

Models

Four models are currently available: "spanbert", "info_xlm", "xlm_roberta", and "minilm". They scored 83, 77, 74, and 74, respectively, on the OntoNotes Release 5.0 English data.

The "minilm" model offers the best trade-off between quality and speed for both multi-lingual and English texts.
The "info_xlm" model produces the best quality for multi-lingual texts.
The AllenNLP "spanbert" model produces the best quality for English texts.
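
The guidance above can be condensed into a small helper. This is only a sketch: `choose_model` and its two flags are hypothetical names that are not part of the package, and the scores are the ones quoted above.

```python
# Scores quoted above for the OntoNotes Release 5.0 English data.
REPORTED_SCORES = {"spanbert": 83, "info_xlm": 77, "xlm_roberta": 74, "minilm": 74}


def choose_model(multilingual: bool, prefer_speed: bool) -> str:
    """Hypothetical helper: pick a model_name following the guidance above."""
    if prefer_speed:
        # "minilm" is described as the best quality/speed trade-off.
        return "minilm"
    # Otherwise pick the best-quality model for the kind of text.
    return "info_xlm" if multilingual else "spanbert"


for multilingual in (False, True):
    for prefer_speed in (False, True):
        name = choose_model(multilingual, prefer_speed)
        print(multilingual, prefer_speed, name, REPORTED_SCORES[name])
```

The returned string could then be passed as `model_name` to `Predictor`.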

Chunking/batching to resolve out-of-memory (OOM) errors

from crosslingual_coreference import Predictor

predictor = Predictor(
    language="en_core_web_sm",
    device=0,
    model_name="minilm",
    chunk_size=2500,
    chunk_overlap=2,
)
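
To give an intuition for what chunking with overlap does, here is a minimal, library-independent sketch. It assumes that `chunk_size` is a character budget per chunk and that `chunk_overlap` counts sentences repeated between neighbouring chunks. The README does not state these units, so treat both as assumptions; `chunk_sentences` is a hypothetical helper, not the package's implementation.

```python
def chunk_sentences(sentences, chunk_size, chunk_overlap):
    """Group sentences into chunks of roughly chunk_size characters.

    Each new chunk starts with the last chunk_overlap sentences of the
    previous chunk, so mentions near a boundary keep some context.
    """
    chunks = []
    current = []
    current_len = 0
    for sentence in sentences:
        # Close the current chunk when adding the sentence would exceed the budget.
        if current and current_len + len(sentence) > chunk_size:
            chunks.append(current)
            # Guard against current[-0:], which would copy the whole list.
            current = current[-chunk_overlap:] if chunk_overlap > 0 else []
            current_len = sum(len(s) for s in current)
        current.append(sentence)
        current_len += len(sentence)
    if current:
        chunks.append(current)
    return chunks


sentences = ["aaaa", "bbbb", "cccc", "dddd", "eeee"]
print(chunk_sentences(sentences, chunk_size=10, chunk_overlap=1))
# [['aaaa', 'bbbb'], ['bbbb', 'cccc'], ['cccc', 'dddd'], ['dddd', 'eeee']]
```

Smaller chunks lower peak memory use, while the overlap trades some repeated work for context across chunk boundaries.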

Use spaCy pipeline

import spacy

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)


nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(
    "xx_coref", config={"chunk_size": 2500, "chunk_overlap": 2, "device": 0}
)

doc = nlp(text)
print(doc._.coref_clusters)
# Output
#
# [[[4, 5], [7, 7], [27, 27], [36, 36]],
# [[12, 12], [15, 16]],
# [[9, 10], [27, 28]],
# [[22, 23], [31, 31]]]
print(doc._.resolved_text)
# Output
#
# Do not forget about Momofuku Ando!
# Momofuku Ando created instant noodles in Osaka.
# At Osaka, Nissin was founded.
# Many students survived by eating instant noodles,
# but Many students don't even know Momofuku Ando.
print(doc._.cluster_heads)
# Output
#
# {Momofuku Ando: [5, 6],
# instant noodles: [11, 12],
# Osaka: [14, 14],
# Nissin: [21, 21],
# Many students: [26, 27]}
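
The clusters above are lists of [start, end] token index pairs. The following sketch shows how such clusters could be turned into resolved text by replacing every later mention with the first mention of its cluster. It assumes inclusive end indices and non-overlapping mentions, and uses a toy whitespace tokenization; `resolve_tokens` is a hypothetical helper and not how the package itself produces `resolved_text`.

```python
def resolve_tokens(tokens, clusters):
    """Replace each non-first mention with the first mention of its cluster.

    Spans are [start, end] token indices with an inclusive end (assumption).
    """
    replacements = {}
    for cluster in clusters:
        head_start, head_end = cluster[0]
        head = tokens[head_start:head_end + 1]
        for start, end in cluster[1:]:
            replacements[start] = (end, head)

    resolved = []
    i = 0
    while i < len(tokens):
        if i in replacements:
            end, head = replacements[i]
            resolved.extend(head)
            i = end + 1  # skip the tokens of the replaced mention
        else:
            resolved.append(tokens[i])
            i += 1
    return resolved


# Toy example with whitespace tokens (not spaCy's tokenization).
tokens = "Momofuku Ando created noodles . He lived in Osaka .".split()
clusters = [[[0, 1], [5, 5]]]
print(" ".join(resolve_tokens(tokens, clusters)))
# Momofuku Ando created noodles . Momofuku Ando lived in Osaka .
```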

Visualize spaCy pipeline

This only works with spaCy >= 3.3.

import spacy
from spacy.tokens import Span
from spacy import displacy

text = (
    "Do not forget about Momofuku Ando! He created instant noodles in Osaka. At"
    " that location, Nissin was founded. Many students survived by eating these"
    " noodles, but they don't even know him."
)

nlp = spacy.load("nl_core_news_sm")
nlp.add_pipe("xx_coref", config={"model_name": "minilm"})
doc = nlp(text)
spans = []
for idx, cluster in enumerate(doc._.coref_clusters):
    for span in cluster:
        spans.append(
            Span(doc, span[0], span[1]+1, str(idx).upper())
        )

doc.spans["custom"] = spans

displacy.render(doc, style="span", options={"spans_key": "custom"})

Release history

0.3.1 (this version): Jun 19, 2023
0.3: Apr 5, 2023
0.2.9: Sep 24, 2022
0.2.8: Jul 14, 2022
0.2.7: Jul 14, 2022
0.2.6: Jun 8, 2022
0.2.5: May 25, 2022
0.2.4: May 10, 2022
0.2.3: May 5, 2022
0.2.2: May 5, 2022
0.2.1: Apr 13, 2022
0.2.0: Apr 3, 2022
0.1.5: Mar 31, 2022
0.1.4: Mar 30, 2022
0.1.3: Mar 29, 2022
0.1.2: Mar 29, 2022
0.1.1: Mar 28, 2022
0.1.0: Mar 28, 2022

Download files

Source distribution: crosslingual-coreference-0.3.1.tar.gz (11.7 kB), uploaded Jun 19, 2023
Built distribution: crosslingual_coreference-0.3.1-py3-none-any.whl (12.7 kB), uploaded Jun 19, 2023, Python 3

Hashes for crosslingual-coreference-0.3.1.tar.gz

Algorithm	Hash digest
SHA256	`cbd46de0afedf75d3315c39e9fecb851112e29cb7d8b3d85fdb7eb39ac63c25e`
MD5	`8f91bf2ff7e8c471dbda8972ce098147`
BLAKE2b-256	`81a07dca701ec4ad2eef0df1de5d5952dbae2c4c86ade79fb9a4e23bd36dd1d4`

Hashes for crosslingual_coreference-0.3.1-py3-none-any.whl

Algorithm	Hash digest
SHA256	`bd44ec22b2a1a02eb03203d04c4c92d819e2ac929baa2b825f21de7106beeedd`
MD5	`a2de1f451d34036e0d1286c88bda50db`
BLAKE2b-256	`2b8de4ad53fd3a0f805658a140bd8e0affee4436a05831141660dc2a4103fcda`