EAST

Text analysis library based on the Annotated Suffix Tree method

These details have not been verified by PyPI

Project description

EAST stands for Enhanced Annotated Suffix Tree method for text analysis.

How to

Keyphrases table

The basic use case for the AST method is to calculate matching scores for a set of keyphrases against a set of text files (the so-called keyphrase table). To do that with east, launch it as follows:

python -m east.main [-s] [-d] [-a <ast_algorithm>] keyphrases table <keyphrases_file> <directory_with_txt_files>

The -s option stands for synonyms and determines whether the matching score should be computed taking into account the synonyms extracted from the text file.
The -d option stands for denormalized and specifies whether the matching score should be computed in the denormalized form (normalized by default, see [Mirkin, Chernyak & Chugunova, 2012]).
The -a option stands for algorithm and defines the actual AST method implementation to be used. Possible arguments are “easa” (Enhanced Annotated Suffix Arrays) and “ast_linear” (Linear-time and -memory implementation of Annotated Suffix Trees).
Note that you can also pass the path to a single text file instead of a directory. If a directory is given, only its .txt files are processed.
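For scripted runs, the command line above can be assembled programmatically. The following is a minimal sketch, not part of east itself: the helper names build_table_command and run_table are hypothetical, and run_table assumes that east is installed and prints its result to standard output.

```python
import subprocess
import sys

# Algorithm names as listed for the -a option.
KNOWN_ALGORITHMS = ("easa", "ast_linear")


def build_table_command(keyphrases_file, texts_path, synonyms=False,
                        denormalized=False, algorithm=None):
    """Assemble the documented `keyphrases table` invocation as an argv list."""
    cmd = [sys.executable, "-m", "east.main"]
    if synonyms:
        cmd.append("-s")
    if denormalized:
        cmd.append("-d")
    if algorithm is not None:
        if algorithm not in KNOWN_ALGORITHMS:
            raise ValueError("unknown AST algorithm: %r" % (algorithm,))
        cmd += ["-a", algorithm]
    # texts_path may be a directory of .txt files or a single text file.
    cmd += ["keyphrases", "table", keyphrases_file, texts_path]
    return cmd


def run_table(*args, **kwargs):
    """Run the command and return its output (assumed to go to stdout)."""
    cmd = build_table_command(*args, **kwargs)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


# Building the command does not require east to be installed.
print(build_table_command("keyphrases.txt", "texts", synonyms=True,
                          algorithm="easa")[1:])
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.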

The output is in an XML-like format:

<keyphrase value="KEYPHRASE_1">
  <text name="TEXT_1">0.250</text>
  <text name="TEXT_2">0.234</text>
</keyphrase>
<keyphrase value="KEYPHRASE_2">
  <text name="TEXT_1">0.121</text>
  <text name="TEXT_2">0.000</text>
</keyphrase>
<keyphrase value="KEYPHRASE_3">
  <text name="TEXT_1">0.539</text>
  <text name="TEXT_3">0.102</text>
</keyphrase>
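
Because the output is a sequence of keyphrase elements without a single root, a standard XML parser will not accept it directly. The sketch below wraps it in an artificial root element (the name table is an arbitrary choice) and converts it into nested dictionaries. The function name parse_keyphrase_table is hypothetical, and the sketch assumes that keyphrases and file names are written in a form that is valid inside XML attributes.

```python
import xml.etree.ElementTree as ET


def parse_keyphrase_table(output):
    """Turn the XML-like output into {keyphrase: {text_name: score}}."""
    # Wrap the fragment in a root element so that it is well-formed XML.
    root = ET.fromstring("<table>" + output + "</table>")
    table = {}
    for keyphrase in root.findall("keyphrase"):
        table[keyphrase.get("value")] = {
            text.get("name"): float(text.text)
            for text in keyphrase.findall("text")
        }
    return table


sample = """
<keyphrase value="KEYPHRASE_1">
  <text name="TEXT_1">0.250</text>
  <text name="TEXT_2">0.234</text>
</keyphrase>
<keyphrase value="KEYPHRASE_2">
  <text name="TEXT_1">0.121</text>
  <text name="TEXT_2">0.000</text>
</keyphrase>
"""

scores = parse_keyphrase_table(sample)
print(scores["KEYPHRASE_1"]["TEXT_2"])
```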

Keyphrases graph

The east software can also construct a keyphrases relation graph, which indicates implications between keyphrases according to the text corpus being analysed. The graph construction algorithm is based on the analysis of co-occurrences of keyphrases in the text corpus. A keyphrase is considered to imply another one if the second phrase occurs frequently enough in the same texts as the first one (the required frequency is controlled by the significance level parameter). A keyphrase counts as occurring in a text if its presence score for that text exceeds some threshold [Mirkin, Chernyak & Chugunova, 2012].

python -m east.main [-s] [-d] [-a <ast_algorithm>] [-l <significance_level>] [-t <score_threshold>] keyphrases graph <keyphrases_file> <directory_with_txt_files>

The -s, -d and -a options configure how the matching scores are computed (exactly as for the keyphrases table command).
The -l option stands for level of significance and controls the significance level above which the implications between keyphrases are considered to be strong enough to be added as graph arcs. The significance level should be a float in [0; 1] and is 0.6 by default.
The -t option stands for threshold of the matching score and sets the minimum matching score at which a keyphrase starts to be counted as occurring in the corresponding text. It should be a float in [0; 1] and is 0.25 by default.
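
To make the roles of the two parameters concrete, here is a minimal sketch of the co-occurrence idea described above. It is an illustration, not the implementation used by east. It assumes that a keyphrase occurs in a text when its score is at least the threshold, and that A implies B when the share of A's texts in which B also occurs is at least the significance level. The function name build_keyphrase_graph and the example scores are hypothetical.

```python
def build_keyphrase_graph(scores, level=0.6, threshold=0.25):
    """Return arcs (a, b) meaning "a implies b".

    scores maps each keyphrase to {text_name: matching_score}.
    """
    # Texts in which each keyphrase counts as occurring.
    occurs = {
        keyphrase: {name for name, score in texts.items() if score >= threshold}
        for keyphrase, texts in scores.items()
    }
    arcs = []
    for a, texts_a in occurs.items():
        if not texts_a:
            continue  # a keyphrase that occurs nowhere implies nothing
        for b, texts_b in occurs.items():
            if a == b:
                continue
            # Share of the texts containing a that also contain b.
            share = len(texts_a & texts_b) / len(texts_a)
            if share >= level:
                arcs.append((a, b))
    return arcs


example_scores = {
    "A": {"t1": 0.5, "t2": 0.4, "t3": 0.1},
    "B": {"t1": 0.3, "t2": 0.3, "t3": 0.3},
    "C": {"t1": 0.0, "t2": 0.0, "t3": 0.9},
}

arcs = build_keyphrase_graph(example_scores)
for source, target in arcs:
    print(source, "->", target)
```

With the default parameters, A occurs in t1 and t2, B in all three texts and C only in t3, so the sketch yields the arcs A -> B, B -> A and C -> B. Raising the level to 0.7 drops B -> A, because A covers only two of B's three texts.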

The output is the set of arcs of the graph, which are essentially implications between keyphrases:

KEYPHRASE_1 -> KEYPHRASE_3
KEYPHRASE_2 -> KEYPHRASE_3
KEYPHRASE_2 -> KEYPHRASE_4
KEYPHRASE_4 -> KEYPHRASE_1
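
The arc list is easy to load into an adjacency mapping for further processing. The sketch below assumes one arc per line in exactly the format shown above and that keyphrases do not themselves contain the "->" separator. The function name parse_arcs is hypothetical.

```python
from collections import defaultdict


def parse_arcs(output):
    """Turn "A -> B" lines into {source: [targets, ...]}."""
    graph = defaultdict(list)
    for line in output.splitlines():
        if "->" not in line:
            continue  # skip blank or unrelated lines
        source, target = (part.strip() for part in line.split("->", 1))
        graph[source].append(target)
    return dict(graph)


sample_arcs = """\
KEYPHRASE_1 -> KEYPHRASE_3
KEYPHRASE_2 -> KEYPHRASE_3
KEYPHRASE_2 -> KEYPHRASE_4
KEYPHRASE_4 -> KEYPHRASE_1
"""

graph = parse_arcs(sample_arcs)
print(graph["KEYPHRASE_2"])
```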

Project details

Release history

0.3.8 (Apr 22, 2015)
0.3.7 (Apr 2, 2015)
0.3.6 (Apr 2, 2015)
0.3.5 (Apr 2, 2015)
0.3.4 (Mar 29, 2015)
0.3.3 (Mar 29, 2015)
0.3.2 (Feb 10, 2015)
0.3.1 (Nov 19, 2014)
0.3.0 (Nov 19, 2014)
0.2.9 (Nov 17, 2014)
0.2.8 (Nov 10, 2014)
0.2.7 (Nov 10, 2014)
0.2.6 (Nov 10, 2014)
0.2.5 (Oct 23, 2014)
0.2.4 (Oct 22, 2014)
0.2.3 (Oct 22, 2014)
0.2.2 (Jun 17, 2014)
0.2.1 (Jun 17, 2014)
0.2 (Jun 6, 2014)
0.1.2 (May 27, 2014)
0.1.1 (May 27, 2014)
0.1 (May 27, 2014), this version

Download files

Source Distribution

EAST-0.1.tar.gz (17.8 kB)

Uploaded May 27, 2014 Source

Hashes for EAST-0.1.tar.gz

Algorithm	Hash digest
SHA256	`74642fd3194841d9a6d379ed06237980fd5391b95bbf4a3c1520ef9d6302783f`
MD5	`a57bdb7dd4ff4cb0ec75cdcdcdf7d2cc`
BLAKE2b-256	`d3661f6ec7b02eb82dcc976e045f71d4ffde59f96fbaab2feb0c318449947174`
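
A downloaded archive can be checked against the SHA256 digest listed above. This is a minimal sketch using only the Python standard library. The helper names sha256_of and verify_download are hypothetical, and the archive is assumed to have been saved locally as EAST-0.1.tar.gz.

```python
import hashlib

# SHA256 digest listed for EAST-0.1.tar.gz.
EXPECTED_SHA256 = "74642fd3194841d9a6d379ed06237980fd5391b95bbf4a3c1520ef9d6302783f"


def sha256_of(path, chunk_size=65536):
    """Compute the SHA256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected=EXPECTED_SHA256):
    """Return True if the file at path matches the expected digest."""
    return sha256_of(path) == expected


# Example call (requires the archive to be present locally):
# verify_download("EAST-0.1.tar.gz")
```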