Welcome to Anchorman

Turn your text into hypertext and enrich the content. Anchorman takes a list of terms and a text. It finds the terms in this text and replaces them with another representation.

Replacement is guided by rules such as the following. Each term is checked against the rules and is replaced only if it passes them.

# Maximum number of items marked in the whole text.
replaces_at_all: 5

# The input term must match the text exactly, including case.
case_sensitive: true
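
As a rough illustration of how these two rules could act together, the sketch below finds terms with the standard re module and stops after a fixed number of hits. It is not Anchorman's implementation: find_terms and the rules dict are hypothetical names used only here, and whole-word matching is an assumption.

```python
import re

# Hypothetical rule set mirroring the two settings shown above.
rules = {'replaces_at_all': 5, 'case_sensitive': True}


def find_terms(text, terms, rules):
    """Return (start, end, term) hits in text order, capped by replaces_at_all."""
    flags = 0 if rules['case_sensitive'] else re.IGNORECASE
    hits = []
    for term in terms:
        # Assumption: terms are matched as whole words.
        pattern = r'\b' + re.escape(term) + r'\b'
        for match in re.finditer(pattern, text, flags):
            hits.append((match.start(), match.end(), term))
    hits.sort()
    return hits[:rules['replaces_at_all']]


text = 'A fox met a Fox. The fox ran.'
print(find_terms(text, ['fox'], rules))
print(find_terms(text, ['fox'], dict(rules, case_sensitive=False)))
```

With case_sensitive switched off, the capitalised Fox is picked up as a third hit.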

The text is analysed with an interval tree, and each replacement is decided by its position and context.
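
To make the idea of position-based replacement concrete, here is a minimal sketch that resolves overlapping matches. It deliberately uses a plain sorted list instead of an interval tree, and the policy of preferring the longest match is an assumption, not something this README states. overlaps and drop_overlaps are hypothetical helpers.

```python
def overlaps(a, b):
    """True if the half-open intervals a and b, each (start, end), intersect."""
    return a[0] < b[1] and b[0] < a[1]


def drop_overlaps(spans):
    """Visit longer spans first and discard any span that overlaps a kept one."""
    kept = []
    # Sort by negative length (longest first), then by start position.
    for span in sorted(spans, key=lambda s: (s[0] - s[1], s[0])):
        if not any(overlaps(span, other) for other in kept):
            kept.append(span)
    return sorted(kept)


# Positions of 'Angela Merkel', 'Merkel' and 'CDU' in 'Angela Merkel, CDU'.
spans = [(0, 13), (7, 13), (15, 18)]
print(drop_overlaps(spans))
```

The shorter 'Merkel' span is dropped because it lies inside 'Angela Merkel'. An interval tree answers the same overlap question without comparing every pair.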

Features

  • replacement rules via settings

  • consider text units in the rules (e.g. paragraphs)

  • easily add your own element validator

  • replace only n occurrences of the same item
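
The text-unit and "only n occurrences" features can be pictured with a small sketch like the one below. It assumes paragraphs are separated by a blank line and limits the hits per paragraph. pick_hits_per_unit is a hypothetical helper, not part of Anchorman.

```python
import re


def pick_hits_per_unit(text, term, per_unit=1):
    """Return (start, end) of the first per_unit hits of term in each paragraph."""
    picked = []
    offset = 0
    for paragraph in text.split('\n\n'):
        hits = [m.span() for m in re.finditer(re.escape(term), paragraph)]
        for start, end in hits[:per_unit]:
            # Translate paragraph-local positions back to positions in text.
            picked.append((offset + start, offset + end))
        offset += len(paragraph) + 2  # account for the '\n\n' separator
    return picked


text = 'fox and fox\n\nanother fox'
print(pick_hits_per_unit(text, 'fox'))
```

Only the first 'fox' of each paragraph is selected. Raising per_unit to 2 also selects the second one in the first paragraph.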

Usage

The first element of elements is found in the text and replaced with a link tag.

>>> from anchorman import annotate
>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> elements = [{'fox': {'value': '/wiki/fox', 'data-type': 'animal'}}]
>>> print(annotate(text, elements))
The quick brown <a href="/wiki/fox" data-type="animal">fox</a> jumps over the lazy dog.
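
The mapping from an element to a link tag can be sketched in a few lines of Python 3. This only mirrors the output shown above: the 'value' key becomes the href and every other key becomes an attribute. build_link is a hypothetical helper, and the escaping is an assumption about what a careful implementation would do.

```python
from html import escape


def build_link(term, attributes):
    """Render term as an <a> tag; the 'value' key becomes the href attribute."""
    attrs = dict(attributes)  # copy, so the caller's dict is not modified
    parts = ['href="{}"'.format(escape(attrs.pop('value'), quote=True))]
    for name, value in attrs.items():
        parts.append('{}="{}"'.format(name, escape(value, quote=True)))
    return '<a {}>{}</a>'.format(' '.join(parts), escape(term))


element = {'fox': {'value': '/wiki/fox', 'data-type': 'animal'}}
term, attributes = next(iter(element.items()))
print(build_link(term, attributes))
```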

See etc/link.yaml for options to configure the replacement process and rules.

The item validator

Plug in your own item validator. item is the potential replacement. candidates is a list of already processed, valid items that are ready to be applied to the text. this_unit holds the valid items that are ready to be applied within the current interval or text unit.

>>> from anchorman.generator.candidate import get_data_of
>>> def validator(item, candidates, this_unit, setting):
...    # Accept the item only if it is a city with a score of exactly 42.0.
...    values = get_data_of(item)
...    return values['score'] == 42.0 and values['type'] == 'city'
...
>>> print(annotate(text, elements, own_validator=[validator]))
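
How a list of validators might be applied to candidate items is sketched below. This is an assumption about the general pattern, not Anchorman's code: the item layout ('term' and 'data' keys), filter_items and no_duplicates are hypothetical, and the whole text is treated as a single unit.

```python
def score_validator(item, candidates, this_unit, setting):
    """Accept only cities with a score of 42.0 (same rule as above)."""
    values = item['data']  # hypothetical stand-in for get_data_of(item)
    return values['score'] == 42.0 and values['type'] == 'city'


def no_duplicates(item, candidates, this_unit, setting):
    """Reject an item whose term has already been accepted."""
    return all(item['term'] != c['term'] for c in candidates)


def filter_items(items, validators, setting=None):
    """Keep the items that pass every validator, in input order."""
    candidates = []
    for item in items:
        # Simplification: the accepted candidates double as this_unit.
        if all(v(item, candidates, candidates, setting) for v in validators):
            candidates.append(item)
    return candidates


items = [
    {'term': 'Berlin', 'data': {'score': 42.0, 'type': 'city'}},
    {'term': 'Berlin', 'data': {'score': 42.0, 'type': 'city'}},
    {'term': 'Spree', 'data': {'score': 42.0, 'type': 'river'}},
]
print(filter_items(items, [score_validator, no_duplicates]))
```

Only the first 'Berlin' survives: the second is a duplicate and 'Spree' is not a city.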

Apply schema.org

A less convenient approach is to create nested contexts with multiple annotate calls. The logic for annotating elements around and inside each other is rather hacky, as the following example shows:

>>> s_text = 'Angela Merkel, CDU, Bundeskanzlerin'
>>> s1_elements = [
...     {"Angela Merkel, CDU, Bundeskanzlerin": {
...         'itemtype': 'http://schema.org/Person',
...         'itemscope': None}}
...     ]
...
>>> s11_elements = [
...     {"CDU": {
...         'itemtype': 'http://schema.org/Organization',
...         'itemscope': None}}
...     ]
...
>>> s2_elements = [
...     {"Angela Merkel": {
...         'itemprop': 'name'}},
...     {"CDU": {
...         'itemprop': 'name'}},
...     {"Bundeskanzlerin": {
...         'itemprop': 'jobTitle'}}
...     ]
...
>>> from anchorman import get_config
>>> cfg = get_config()
>>> unit = {'key': 't', 'name': 'text'}
>>> cfg['setting']['text_unit'].update(unit)
>>> cfg['markup'] = {'tag': {'tag': 'div'}}
>>> annotated = annotate(s_text, s1_elements, config=cfg)
>>> annotated2 = annotate(annotated, s11_elements, config=cfg)
>>> cfg3 = cfg.copy()
>>> cfg3['markup'] = {'tag': {'tag': 'span'}}
>>> annotated3 = annotate(annotated2, s2_elements, config=cfg3)

The text in annotated3 then looks like this:

<div itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">Angela Merkel</span>,
    <div itemscope itemtype="http://schema.org/Organization">
        <span itemprop="name">CDU</span>
    </div>,
    <span itemprop="jobTitle">Bundeskanzlerin</span>
</div>
</div>
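
One way to sanity-check nested markup like this is to read the itemprop values back out. The sketch below uses only the Python 3 standard library (html.parser) on a trimmed copy of the markup. ItempropCollector is a hypothetical helper, and it assumes that elements carrying an itemprop contain nothing but text.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect (itemprop, text) pairs from elements carrying an itemprop."""

    def __init__(self):
        super().__init__()
        self.pairs = []
        self._current = None  # itemprop of the element being read, if any

    def handle_starttag(self, tag, attrs):
        itemprop = dict(attrs).get('itemprop')
        if itemprop is not None:
            self._current = itemprop

    def handle_data(self, data):
        if self._current is not None and data.strip():
            self.pairs.append((self._current, data.strip()))

    def handle_endtag(self, tag):
        self._current = None


# Trimmed version of the markup shown above.
markup = (
    '<div itemscope itemtype="http://schema.org/Person">'
    '<span itemprop="name">Angela Merkel</span>, '
    '<div itemscope itemtype="http://schema.org/Organization">'
    '<span itemprop="name">CDU</span>'
    '</div>'
    '</div>'
)
collector = ItempropCollector()
collector.feed(markup)
print(collector.pairs)
```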

Installation

To install Anchorman, simply run:

pip install anchorman

Credits and contributions

We published this on GitHub and PyPI to share our solution with you. Feedback and contributions are welcome.

Thanks to Tarn Barford for the inspiration and the first steps.

Todo

  • add a sentence splitter, or add a README example with <s></s>

  • check whether positions already exist in the input and skip the extra processing

  • check the context of a replacement: do not add links inside links or inside overlapping elements

  • replace only one item per entity, e.g. A. Merkel, Mum Merkel, …

  • implement a replacement logic for coreference chains

  • add more schema.org examples

  • html.parser vs lxml in bs4 - think about config

  • ValueError: IntervalTree: Null Interval objects

  • validate text and elements

Feedback is welcome, and thanks for reading.

Source distribution: anchorman-0.3.1.tar.gz (6.6 kB)
