
shoten

Helper functions to find word trends (i.e. extract tokens, lemmatize and filter).

Installation

pip/pip3 install -U git+https://github.com/adbar/shoten.git

Usage

Input

Two possibilities for input data:

  • XML-TEI files as generated by trafilatura:

    from shoten import gen_wordlist
    myvocab = gen_wordlist(mydir, ['de', 'en'])

  • TSV file containing a word list: word form + TAB + date (YYYY-MM-DD format) + optional third column (source)

    from shoten import load_wordlist
    myvocab = load_wordlist(myfile, ['de', 'en'])

Language codes: an optional list of languages to consider for lemmatization, ordered by relevance. Use ISO 639-1 codes; see the list of supported languages.

Optional argument maxdiff: maximum number of days to consider (default: 1000, i.e. going back up to 1000 days from today).
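
Both optional settings can be combined; a minimal sketch (passing maxdiff as a keyword argument is assumed from the description above):

from shoten import gen_wordlist

# lemmatize with German first, then English,
# and only go back 365 days from today
myvocab = gen_wordlist(mydir, ['de', 'en'], maxdiff=365)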

Filters

from shoten.filters import *

  • hapax_filter(myvocab, freqcount=2): discard rare words (default: frequency <= 2)

  • shortness_filter(myvocab, threshold=20): discard the shortest words (length threshold as a percentage of word lengths)

  • frequency_filter(myvocab, max_perc=50, min_perc=.001): maximum and minimum frequencies in percent

  • oldest_filter(myvocab, threshold=50): discard the oldest words (threshold in percent)

  • freshness_filter(myvocab, percentage=10): keep the given percentage of freshest words

  • ngram_filter(myvocab, threshold=90, verbose=False): retain a given percentage of words based on character n-gram frequencies; can run out of memory if the vocabulary is too large (8 GB RAM recommended)

  • sources_freqfilter(myvocab, threshold=2): remove words that occur in fewer sources than the threshold

  • sources_filter(myvocab, myset): only keep words whose source contains a string from the input set

  • wordlist_filter(myvocab, mylist, keep_words=False): keep or discard words present in the input list (see the sketch after this list)
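
The set- and list-based filters take plain Python collections; a minimal sketch (the source strings and word list below are purely illustrative):

# keep only words whose source matches one of these strings
myvocab = sources_filter(myvocab, {'spiegel.de', 'zeit.de'})
# discard words found in a custom stopword list
myvocab = wordlist_filter(myvocab, ['und', 'oder'], keep_words=False)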

Reduce vocabulary size with a filter:

myvocab = oldest_filter(myvocab)

They can be chained:

myvocab = oldest_filter(shortness_filter(myvocab))
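
Filter parameters can be adjusted along the way (the values below are illustrative, reusing the defaults listed above):

myvocab = frequency_filter(myvocab, max_perc=50, min_perc=.001)
myvocab = freshness_filter(myvocab, percentage=10)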

Output

# print one-by-one
for word in sorted(myvocab):
    print(word)
# transfer to a list
results = [w for w in myvocab]
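
To keep the remaining words for later inspection, a plain write suffices (a minimal sketch building on the iteration shown above):

# save the sorted words to a text file, one per line
with open('results.txt', 'w', encoding='utf-8') as outfile:
    for word in sorted(myvocab):
        outfile.write(word + '\n')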

CLI

A command-line interface is also available:

shoten --help

Additional information

Shoten is Japanese for "focal point" (焦点).

Project webpage: Webmonitor.
