Skip to main content

Single Amino Acids Polymorphism Statistics

Project description

SAAPS

Single Amino Acids Polylimorphism Statistics (SAAPS) is a Python3-based character interface program that can be used and deployed on multiple systems, including Windows, Linux, and MacOS. The software is mainly used to quickly calculate single amino acid polymorphism (SAP), Shannon information entropy at each site of the sequence, and to analyze sequence features and cluster analysis by using Onehot encoding and dimensionality reduction algorithms.

The raw input file of the SAAPS software is very simple, as long as the sequence file after alignment (nucleotide or amino acid), it can be used for processing and analysis. The main workflow of SAAPS is as follows: Firstly, by analyzing the characteristics of each sequence, the SAPs and Shannon information entropy of each point are calculated, and the corresponding result figures can be drawn at the same time. Then SAAPS screens the key sites according to the specified information entropy range, and transforms them into quantificable data by using Onehot encoding. Finally, the certain dimensionality reduction algorithm (PCA or t-SNE) is used to calculate the similarity level between sequences and draw the clustering scatter plot.

1. Download and install

SAAPS is developed via Python 3, and you can get and install in following ways.

1.1 Pip method

SAAPS has been distributed to the standard library pf PYPI, and can be easily installed by the tool pip.

pip install saaps
saaps -h

1.2 Local installation

In addition to the pip method, you can download and install it using the file setup.py. You need to download the repository, and then run:

python setup.py install
saaps -h

2. Getting help

You can use "-h" or "--help" to get the help document. The following is a brief introduction to the each parameter of SAAPS:

Input Options

Parameter Description
-i, --input the path of input file
-sr, --shannon_result the path of shannon result

Basic Options

Parameter Description
-dg, --delete_gap delete the columns containing gaps in sequences
-t, --translate translate nucleotides to amino acids
-cs, --cut_start the start fragment/site of target sequence
-ce, --cut_end the end fragment/site of target sequence
-osn, --output_seq_name output sequence names
-csn, --change_seq_name change the sequence names

Polymorphism Options

Parameter Description
-cp, --compute_ppm compute polymorphism
-pp, --plot_ppm plot polymorphism figure

Plotting Options

Parameter Description
-fw, --figure_width the width of figure
-fh, --figure_height the height of figure
-fname, --figure_name the name of figure
-c, --color the palette of figure
-pac, --print_all_colors print all sipported palettes
-dpi, --DPI the dpi of the figure
-fm, --figure_format the format of figure
-tp, --transparent make the background of the figure transparent
-sl, --scatter_label display scatter plot labels

Shannon Entropy Options

Parameter Description
-sha, --shannon compute shannon entropy
-smax, --shannon_max the maximum of the shannon value (IC)
-smin, --shannon_min the minimum of the shannon value (IC)
-pic, --plot_ic plot IC scatter figure

Dimension Reduction Options

Parameter Description
-oh, --onehot OneHot Encoding
-pca, --PCA dimensionality reduction by PCA
-tsne, --TSNE dimensionality reduction by t-SNE
-tpp, --tSNE_perplexity the perplexity in t_SNE, default is 3
-tlr, --tSNE_learn_rate the learning rate in t_SNE, default is 200
-tni, --tSNE_n_iter the n_iter in t_SNE, default is 5000
-trs, --tSNE_random_state the random state in t_SNE, default is 1
-pc, --plot_cluster plot clustering scatter plot

Output Options

Parameter Description
-o, --output_dir the directory path of output file, the default is the current user document folder
-pre, --prefix the prefix of the output file name

3. Examples

This chapter uses the test data ('testSeq.txt' in the test folder) provided by the software as an example to describe the main functions of SAAPS. Note that all sample files in this manual are generated by the software using test data (" testSeq.txt "), and users can follow the tutorial to learn it before exploring other features on their own.

3.1 Preparation for raw materials

Users needs to prepare an alignment sequence file (nucleotide or amino acid file), the alignment process can be completed by MAFFT and other software, and the format of the sequence file after alignment should be "sequence name + (line/multiple lines) sequence body" structure. The aligned sequence file should look like this:

>TestSeqNo.1 Sequence1  
GPQSGAIYVG……LWLDEEAMEQ
>TestSeqNo.2 Sequence2
GPLGQQSGAV……LWLDDEVMEQ
...

3.2 Sequence preprocessing

Before the subsequent analysis, it is usually necessary to format the sequence, such as whether the nucleotide sequence needs to be translated into amino acids, whether the sequence name needs to be changed, and so on. This allows us to get the best results in the subsequent analysis and drawing. Here we use the example of modifying the sequence name:

3.2.1 Output sequence names

saaps -i your/seq/path -osn -o your/output/path

Code explaination: -i your/seq/path indicates that the matched sequence file is passed. -osn indicates the sequence name in the output sequence file. By default, the output is in a .csv table. -o your/output/path indicates the output path. If this parameter is not specified, the result will be output in the default path.After the program is executed, a file named SeqName.csv will be generated, which can be used for users to modify the sequence names individually. It is recommended to save the modified new names in the second column, so that subsequent programs can modify the sequence names in batches. The format of the file is as follows:

Old_Name
>TestSeqNo.1 Sequence1
>TestSeqNo.1 Sequence2
...

3.2.2 Change sequence names

saaps -i your/seq/path -csn your/names/file/path -o your/output/path

Code explaination:In this step, -csn your/names/file/path represents the matching table of the old and new names in the sequence, which is the table output in the previous step. The pairing table format for the sequence names is as follows:

Old_Name New_Name
>TestSeqNo.1 Sequence1 >Seq1
>TestSeqNo.1 Sequence2 >Seq2
... ...

The modified sequence file is named "NewNameSeq.txt" by default. The comparison between the original sequence file and the modified sequence file is as follows:

Old Seq File New Seq File
>TestSeqNo.1 Sequence1 >Seq1
GPQSGAIYVG……LWLDEEAMEQ GPQSGAIYVG……LWLDEEAMEQ
... ...

3.3 Polymorphism analysis

Although SAAPS was developed only for the analysis of SAP, the basis of polymorphism analysis is statistical analysis of the symbols and frequencies of each site in the sequence, so SAAPS can also be applied to sequences such as nucleotides.

saaps -i your/seq/path -cp -pp -o your/output/path

Code explaination:-csn your/names/file/path represents the matching table of the old and new names in the sequence, which is the table output in the previous step. -cp represents to activate the computational polymorphism function. -pp represents that the polymorphism result is plotted.

The statistical result of polymorphism analysis will generate three files named Concise_Result.txt, Detail_Result.txt, and Seq_Matrix.csv. If the drawing parameter (-pp) is set, A resulting graph (default) named SAPs.pdf is generated accordingly. The main result files are explained as follows:

Concise_Result.txt: a concise result of polymorphism analysis, which mainly records the frequency of amino acids at each site, and is also an input file for the subsequent calculation of information entropy at each point. The format is shown as follows (part) :

Site 1	(G, 15, 100.00%)			
Site 2	(K, 7, 46.67%)	(R, 1, 6.67%)	(C, 1, 6.67%)	…
…	…	…	…	…

Detail_Result.txt: detailed result of polymorphism analysis, which records the sequence name of each amino acid at each site. Its file content is shown below (part) :

col1 col2 col3 coln
Site1 (G: All Sequences)
Site2 (P: >Seq2,>Seq1) (K: >Seq10,>Seq15,>Seq7,>Seq13,>Seq9,>Seq12,>Seq14) ...
Site3 (F: Most Sequences) (L: >Seq2) ...
... ... ... ...

Seq_Matrix.csv: indicates the sequence amino acid matrix. The main contents of the table are as follows (part) :

Seq1 Seq2 Seq3 Seq4 Seq5 Seq6 Seq7 ...
G G G G G G G ...
P P A A A A K ...
... ... ... ... ... ... ... ...

SAPs.pdf: Polymorphism distribution map. The example figure is shown below, and users can customize the resulting graph according to the drawing parameters provided by the software:

SAPs.png

3.4 Shannon entropy calculation

In the SAAPS calculation results, information content (IC value) is used to replace information entropy, and the higher the IC value, the more conservative the amino acid at the site. During SAAPS information entropy analysis, two result files, Shannon_log.txt and Shannon_IC_Result.csv, are generated by default.

saaps -i your/Concise/result/path -sha -pic -o your/output/path

Code explaination: When calculating information entropy, we need to pass the Concise_Result.txt file in the previous step of polymorphism analysis, -sha means to activate the information entropy calculation function, -pic means to draw the information entropy result, and a picture file named IC-Plot.pdf is generated in the output path by default.

Shannon_log.txt: indicates the simple statistical result of IC analysis. The contents of the document are as follows (part) :

Basic Statistic Information of the Shannon Entropy (IC)
 IC
Count  150
mean  3.8647165333333335
...  ...

Shannon_IC_Result.csv: IC values of all sites in sequence. The table contains all site IC values, which are convenient for the subsequent drawing of IC figure. The contents of the table are as follows (part) :

SITE IC
Site1 4.45943
Site2 2.52919
Site3 3.75947
Site4 3.75947
... ...

IC-Plot.pdf: Sequence IC scatter figure. The figure is as follows:

IC-plot

3.5 Polymorphism site screening and Onehot encoding

saaps -i your/Seq/Matrix/path -sr your/IC/Result/path -smin ICmin -smax ICmax -o your/output/path

Code explaination:During the Onehot encoding process, the Seq_Matrix.csv file from the polymorphism analysis in the previous step and the Shannon_IC_Result.csv file from the information entropy calculation need to be passed in. If only the sites within the threshold range need to be converted, You can use -smin and -smax to set the IC threshold range. In the Onehot encoding step, the software automatically generates 4 files: OneHot-Original.csv, One-Hot-ConciseMatrix.csv, OneHot-Transform.csv, and One-Hot-ForIntersect.csv.

OneHot-Original.csv: a table of polymorphic loci selected according to the IC threshold range and used for onehot encoding. The file content is shown as follows (part) :

3 4 5 6 7 8 ...
>Seq1 Q S G A I Y ...
>Seq2 L G Q Q S G ---
... ... ... ... ... ... ... ...

One-Hot-ConciseMatrix.csv: the process file after the Onehot encoding of OneHot-Original.csv, where each amino acid is represented by the corresponding binary number. The file content is shown as follows (part) :

3 ...
Seq1 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0] ...
Seq2 [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0] ...
... ... ...

OneHot-Transform.csv: the result of upgrading the dimensions according to the One-Hot-ConciseMatrix.csv file, in which the binary list of each point is expanded into the corresponding number dimensions, and each dimension only contains the corresponding values in the binary list. The main contents of the file are shown as follows (part) :

3_1 3_2 3_3 ..
Seq1 0 0 0 ...
Seq2 0 0 0 ...
... ... ... ... ...

One-Hot-ForIntersect.csv: indicates the record list of sequence polymorphism loci. The table records the amino acids of each sequence at the polymorphic site, which can be used for differential complement analysis of subsequent sets, etc. The contents of the table are as follows (part) :

3 4 5 6 7 ...
>Seq1 3Q 4S 5G 6A 7I ...
>Seq2 3L 4G 5Q 6Q 7S ...
>Seq3 3F 4G 5Q 6Q 7S ...
... ... ... ... ... ... ...

3.6 Dimensionality reduction and clustering

The function of Onehot encoding is to convert classified data into numerical data that can be quantitatively analyzed. After the onehot encoding conversion, the information of each location of each sequence is converted into high-dimensional binary data. Therefore, in order to better analyze the sequence features, we usually adopt a certain dimensionality reduction algorithm to analyze it and observe the relationship between each sequence. Here we take the most commonly used PCA algorithm as an example.

saaps -i your/Transform/result/path -pca -pc -o your/output/path

Code explaination:In the process of dimensionality reduction analysis, the OneHot-Transform.csv file in the previous step needs to be passed in; -pca indicates that the PCA dimensionality reduction function is activated. -pc means to draw a scatter plot based on the dimensionality reduction results. This process produces two result files called PCA.csv and PCA-Plot.pdf.

PCA.csv: PCA calculation result of sequence file. The contents of the document are as follows (part) :

0 1 2 3 ...
Seq1 6.254129 -2.01068 -0.27758 -0.33611 ...
Seq2 -0.74108 -0.47749 -0.1452 -1.75072 ...
... ... ... ... ... ...

PCA-plot.pdf: Based on the PCA analysis results, the scatter Plot is generated according to the default parameters, and the example figure is as follows:

PCA-plot.png

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

saaps-1.4.1.tar.gz (892.5 kB view hashes)

Uploaded Source

Built Distribution

saaps-1.4.1-py3-none-any.whl (888.1 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page