
shoten

Helper functions to find word trends (i.e. extract tokens, lemmatize and filter).

Installation

pip/pip3 install -U git+https://github.com/adbar/shoten.git

Usage

Input

Two possibilities for input data:

  • XML-TEI files as generated by trafilatura:

    from shoten import gen_wordlist
    myvocab = gen_wordlist(mydir, ['de', 'en'])

  • TSV file containing a word list: word form + TAB + date (YYYY-MM-DD format) + optional third column (source)

    from shoten import load_wordlist
    myvocab = load_wordlist(myfile, ['de', 'en'])

Language codes: an optional list of languages to consider for lemmatization, ordered by relevance. Use ISO 639-1 codes; see the list of supported languages.

Optional argument maxdiff: maximum number of days to consider (default: 1000, i.e. going back up to 1000 days from today).
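
Both optional settings can be combined; a minimal sketch (passing maxdiff as a keyword argument is assumed from the description above):

from shoten import gen_wordlist

# lemmatize with German first, then English,
# and only go back 365 days from today
myvocab = gen_wordlist(mydir, ['de', 'en'], maxdiff=365)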

Filters

from shoten.filters import *

  • hapax_filter(myvocab, freqcount=2): discard rare words (default: frequency <= 2)

  • shortness_filter(myvocab, threshold=20): discard the shortest words (length threshold as a percentage of word lengths)

  • frequency_filter(myvocab, max_perc=50, min_perc=.001): maximum and minimum frequencies in percent

  • oldest_filter(myvocab, threshold=50): discard the oldest words (threshold in percent)

  • freshness_filter(myvocab, percentage=10): keep the given percentage of freshest words

  • ngram_filter(myvocab, threshold=90, verbose=False): retain a given percentage of words based on character n-gram frequencies; can run out of memory if the vocabulary is too large (8 GB RAM recommended)

  • sources_freqfilter(myvocab, threshold=2): remove words that occur in fewer sources than the threshold

  • sources_filter(myvocab, myset): only keep words whose source contains a string from the input set

  • wordlist_filter(myvocab, mylist, keep_words=False): keep or discard words present in the input list (see the sketch after this list)
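
The set- and list-based filters take plain Python collections; a minimal sketch (the source strings and word list below are purely illustrative):

# keep only words whose source matches one of these strings
myvocab = sources_filter(myvocab, {'spiegel.de', 'zeit.de'})
# discard words found in a custom stopword list
myvocab = wordlist_filter(myvocab, ['und', 'oder'], keep_words=False)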

Reduce vocabulary size with a filter:

myvocab = oldest_filter(myvocab)

They can be chained:

myvocab = oldest_filter(shortness_filter(myvocab))
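
Filter parameters can be adjusted along the way (the values below are illustrative, reusing the defaults listed above):

myvocab = frequency_filter(myvocab, max_perc=50, min_perc=.001)
myvocab = freshness_filter(myvocab, percentage=10)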

Output

# print one-by-one
for word in sorted(myvocab):
    print(word)
# transfer to a list
results = [w for w in myvocab]
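
To keep the remaining words for later inspection, a plain write suffices (a minimal sketch building on the iteration shown above):

# save the sorted words to a text file, one per line
with open('results.txt', 'w', encoding='utf-8') as outfile:
    for word in sorted(myvocab):
        outfile.write(word + '\n')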

CLI

A command-line interface is also available:

shoten --help

Additional information

Shoten is Japanese for "focal point" (焦点).

Project webpage: Webmonitor.
