
Topic Context Model (TCM)

Calculates the surprisal of a word given a context.
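
Here "surprisal" is meant in the information-theoretic sense: the negative log-probability that the trained topic model (LDA or LSA) assigns to a word given its surrounding text. A minimal sketch of the underlying quantity (the logarithm base is an assumption; the help text below does not state it):

    surprisal(w | context) = -log P(w | context)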


Requirements

  • Python >= 3.10
  • scipy
  • scikit-learn
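
Installation

The release files are published under the distribution name topic_context_model, so a typical installation (assuming pip and the normalized PyPI project name) would be:

$ pip install topic-context-model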

Usage

$ python tcm.py -h
usage: tcm [-h] [-V] [-m {lda,lsa}] [--model-file MODEL_FILE] [--data DATA [DATA ...]]
           [--fields FIELDS [FIELDS ...]] [--words WORDS] [--n-components N_COMPONENTS]
           [--doc-topic-prior DOC_TOPIC_PRIOR] [--topic-word-prior TOPIC_WORD_PRIOR]
           [--learning-method LEARNING_METHOD] [--learning-decay LEARNING_DECAY] [--learning-offset LEARNING_OFFSET]
           [--max-iter MAX_ITER] [--batch-size BATCH_SIZE] [--evaluate-every EVALUATE_EVERY] [--perp-tol PERP_TOL]
           [--mean-change-tol MEAN_CHANGE_TOL] [--max-doc-update-iter MAX_DOC_UPDATE_ITER] [--n-jobs N_JOBS]
           [--random-state RANDOM_STATE] [-v] [--log-format LOG_FORMAT] [--log-file LOG_FILE]
           [--log-file-format LOG_FILE_FORMAT]
           {train,surprisal} [{train,surprisal} ...]

positional arguments:
  {train,surprisal}     what to do, train lda/lsa or calculate surprisal.

options:
  -h, --help            show this help message and exit
  -V, --version         show program's version number and exit
  -m {lda,lsa}, --model {lda,lsa}
                        which model to use. (default: lda)
  --model-file MODEL_FILE
                        file to load model from or save to, if path exists tries to load model. (default: lda.jl.z)
  --data DATA [DATA ...]
                        file(s) to load texts from, either txt or csv optionally gzip compressed. (default: None)
  --fields FIELDS [FIELDS ...]
                        field(s) to load texts when using csv data. (default: None)
  --words WORDS         file to load words from and/or save to, either txt or json optionally gzip compressed. (default: words.txt.gz)
  -v, --verbose         verbosity level; multiple times increases the level, the maximum is 3, for debugging. (default: 0)
  --log-format LOG_FORMAT
                        set logging format. (default: %(message)s)
  --log-file LOG_FILE   log output to a file. (default: None)
  --log-file-format LOG_FILE_FORMAT
                        set logging format for log file. (default: [%(levelname)s] %(message)s)

LDA config:
  --n-components N_COMPONENTS
                        number of topics. (default: 10)
  --doc-topic-prior DOC_TOPIC_PRIOR
                        prior of document topic distribution `theta`. If the value is None, defaults to `1 / n_components`. (default: None)
  --topic-word-prior TOPIC_WORD_PRIOR
                        prior of topic word distribution `beta`. If the value is None, defaults to `1 / n_components`. (default: None)
  --learning-method LEARNING_METHOD
                        method used to update `_component`. (default: batch)
  --learning-decay LEARNING_DECAY
                        it is a parameter that control learning rate in the online learning method. The value should be set between (0.5, 1.0] to guarantee asymptotic convergence. When the value is 0.0 and batch_size is `n_samples`, the update method is same as batch learning. In the literature, this is called kappa. (default: 0.7)
  --learning-offset LEARNING_OFFSET
                        a (positive) parameter that downweights early iterations in online learning.  It should be greater than 1.0. In the literature, this is called tau_0. (default: 10.0)
  --max-iter MAX_ITER   the maximum number of passes over the training data (aka epochs). (default: 10)
  --batch-size BATCH_SIZE
                        number of documents to use in each EM iteration. Only used in online learning. (default: 128)
  --evaluate-every EVALUATE_EVERY
                        how often to evaluate perplexity. Set it to 0 or negative number to not evaluate perplexity in training at all. Evaluating perplexity can help you check convergence in training process, but it will also increase total training time. Evaluating perplexity in every iteration might increase training time up to two-fold. (default: -1)
  --perp-tol PERP_TOL   perplexity tolerance in batch learning. Only used when `evaluate_every` is greater than 0. (default: 0.1)
  --mean-change-tol MEAN_CHANGE_TOL
                        stopping tolerance for updating document topic distribution in E-step. (default: 0.001)
  --max-doc-update-iter MAX_DOC_UPDATE_ITER
                        max number of iterations for updating document topic distribution in the E-step. (default: 100)
  --n-jobs N_JOBS       the number of jobs to use in the E-step. `None` means 1. `-1` means using all processors. (default: None)
  --random-state RANDOM_STATE
                        pass an int for reproducible results across multiple function calls. (default: None)
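
As an example, a plausible train-then-score workflow using only the options listed above (all file names here are placeholders, not files shipped with the package):

$ python tcm.py train -m lda --data corpus.txt.gz --n-components 50 --model-file lda.jl.z --words words.txt.gz
$ python tcm.py surprisal -m lda --model-file lda.jl.z --words words.txt.gz --data test.csv --fields text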
