Skip to main content

UNKNOWN

Project description

Anchorman

Turn your text into hypertext and enrich the content. Anchorman takes a list of terms and a text. It finds the terms in your text and replaces them with an html-element representation.

The replacement is guided by rules like in the following. Each term is checked against the rules and will be applied if valide.

# How many items will be marked at all in the text.
replaces_at_all: 5

# Input term has to be exact match in text.
case_sensitive: true

The text is analysed via intervalltree and the replacement happens on position and context.

Features

  • replacement rules via settings

  • consider text units in the rules (e.g. paragraphs)

  • add your own element validator made easy

Usage

The first element of elements is find in text and replaced with a link tag.

>>> from anchorman.main import annotate
>>> text = 'The quick brown fox jumps over the lazy dog.'
>>> elements = [{'fox': {'value': '/wiki/fox', 'data-type': 'animal'}}]
>>> print annotate(text, elements)
'The quick brown <a href="/wiki/fox" data-type="animal">fox</a> jumps over the lazy dog .'

See etc/link.yaml for options to configure the replacement process or the rules.

The item validator

Inherit your own item validator. Item is the potential replacement. Candidates is a list of processed and valide items ready to apply to text. This unit bears valide items ready to apply to text in this intervall or unit.

>>> from anchorman.generator.candidate import data_val
>>> def validator(item, candidates, this_unit, setting):
...    values = data_val(item, None)
...    if values['score'] == 42.0 and values['type'] == 'city':
...        return True
...    else:
...        return False
...
>>> print annotate(text, elements, own_validator=[validator])

Apply schema.org

Not so handy approach is to create contexts with multiple annotation calls. But the logic to annotate data around and in each other is pretty hacky as the following example shows:

>>> s_text = 'Angela Merkel, CDU, Bundeskanzlerin'
>>> s1_elements = [
...     {"Angela Merkel, CDU, Bundeskanzlerin": {
...         'itemtype': 'http://schema.org/Person',
...         'itemscope': None}}
...     ]
...
>>> s11_elements = [
...     {"CDU": {
...         'itemtype': 'http://schema.org/Organization',
...         'itemscope': None}}
...     ]
...
>>> s2_elements = [
...     {"Angela Merkel": {
...         'itemprop': 'name'}},
...     {"CDU": {
...         'itemprop': 'name'}},
...     {"Bundeskanzlerin": {
...         'itemprop': 'jobtitle'}}
...     ]
...
>>> cfg = get_config()
>>> unit = {'key': 't', 'name': 'text'}
>>> cfg['setting']['text_unit'].update(unit)
>>> cfg['markup'] = {'tag': {'tag': 'div'}}
>>> annotated = annotate(s_text, s1_elements, config=cfg)
>>> annotated2 = annotate(annotated, s11_elements, config=cfg)
>>> cfg3 = cfg.copy()
>>> cfg3['markup'] = {'tag': {'tag': 'span'}}
>>> annotated3 = annotate(annotated2, s2_elements, config=cfg3)

The text annotated3 looks like this:

<div itemscope itemtype="http://schema.org/Person">
    <span itemprop="name">Angela Merkel</span>,
    <div itemscope itemtype="http://schema.org/Organization">
        <span itemprop="name">CDU</span>
    </div>,
    <span itemprop="jobtitle">Bundeskanzlerin</span>
</div>

Installation

To install Anchorman, simply:

pip install anchorman

Credits and contributions

We published this at github and pypi to provide our solution to you. Pleased for feedback and contributions.

Thanks Tarn Barford for inspiration and first steps.

Todo

  • check if position is in input, take care and save some processing

  • more schema.org examples

  • implement an original text/key replacement logic (kicked value, value_key)

  • check context of replacement: do not add links in links, or inline of overlapping elements, …

  • replace only one item of an entity > e.g. A. Merkel, Mum Merkel, …

  • implement a replacement logic for coreference chains

Bitdeli badge

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

anchorman-0.1.4.tar.gz (5.6 kB view hashes)

Uploaded Source

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page