A Python package for analysing protein Sequence Activity Relationships

pySAR

Introduction
Requirements
Installation
Usage
Directories
Tests
Contact

Introduction

pySAR is a Python library for analysing Sequence Activity Relationships (SARs) of protein sequences. It lets you numerically encode a dataset of protein sequences using a wide range of methods and features: physicochemical and biochemical properties from the Amino Acid Index (AAI) database, as well as a range of calculated structural protein descriptors.
Once you have found the encoding technique and feature set that work best for your dataset, pySAR can build a predictive regression model in which the encoded sequences are the training data and the experimentally determined activity values of each protein sequence are the training labels. The model can then be used to predict the activity/fitness value of a new, unseen sequence.
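
To make the idea of numerical encoding concrete, here is a minimal sketch that replaces each residue of a sequence with its value from a single amino acid scale. It uses the well-known Kyte-Doolittle hydropathy values; the `encode_sequence` helper is a hypothetical name chosen for illustration and is not part of the pySAR API.

```python
# Kyte-Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def encode_sequence(sequence, scale=KYTE_DOOLITTLE):
    """Hypothetical helper: replace each residue with its value in `scale`."""
    return [scale[residue] for residue in sequence.upper()]


encoded = encode_sequence("MKVL")
print(encoded)  # [1.9, -3.9, 4.2, 3.8]
```

Each sequence becomes a list of numbers of the same length as the sequence, which is the kind of input a regression model can be trained on.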

Requirements

Python >= 3.6
numpy >= 1.16.0
pandas >= 1.1.0
scikit-learn >= 0.24
scipy >= 1.4.1
tqdm >= 4.55.0
seaborn >= 0.11.1

Installation

Install the latest version of pySAR using pip:

pip3 install pySAR

Or install from source by cloning the repository:

git clone https://github.com/amckenna41/pySAR.git
cd pySAR
python3 setup.py install

Usage

Building a predictive model from AAI indices and/or protein descriptors:

For example, the code below builds a PlsRegression model using the AAI index CIDH920105 and the 'amino acid composition' descriptor. The index is passed through a DSP pipeline and transformed into its informational protein spectrum using the power spectrum, with a Hamming window function applied to the output of the FFT. The features from the AAI index and the descriptor are concatenated and used as the feature data for building the PLS model.

#import pySAR package
import pySAR as pysar

#create instance of PySAR class
pySAR = pysar.PySAR(dataset="dataset.txt",activity="activity",algorithm="PlsRegression")
"""
PySAR parameters:

dataset : str (default = "")
    full path to dataset or name of dataset if it is stored in DATA_DIR.
seq_col : str (default = "sequence")
    name of column in dataset that stores the protein sequences. By default
    the class will look for a column called 'sequence'.
activity : str (default = "")
    name of activity column in dataset.
algorithm : str (default = "")
    name of regression model to use for building the predictive models, class
    will accept full name or approximate name of model e.g "PLSReg", "plsregg" and
    "PLSRegression" will all build a PlsRegression model.
parameters : dict (default = {})
    dictionary of parameters to use for the predictive model. By default the
    default parameters of the model will be used.
test_split : float (default = 0.2)
    specifies the proportion of the dataset to use for testing. By default a
    80:20 split will be used, meaning 80% of the data will be used for training
    and 20% for testing.
descriptors_csv : str (default = "descriptors.csv")
    csv file storing the pre-calculated descriptor values for the sequences
    in the dataset. By default the class will look for a file named
    "descriptors.csv" in the DATA_DIR and will use its contents as the
    descriptor features, instead of having to recalculate all descriptors for the dataset.
"""
#encode protein sequences using both the CIDH920105 index + aa_composition descriptor.
results_df = pySAR.encode_aai_desc(indices="CIDH920105", descriptors="aa_composition",
  spectrum="power", window="hamming")

Output results showing AAI index and its category as well as all the associated metric values for each predictive model:

	Index	Category	R2	RMSE	MSE	RPD	MAE	Explained Var
0	CHOP780206	sec_struct	0.62737	3.85619	14.8702	1.63818	3.16755	0.713467
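
The DSP step described above can be sketched in plain Python. This is an illustration under stated assumptions, not pySAR's implementation: a naive DFT stands in for the FFT, the helper names are hypothetical, and the window is applied to the encoded signal before the transform (the conventional order), which may differ from how pySAR applies it internally.

```python
import cmath
import math


def hamming(n):
    """Hamming window of length n: 0.54 - 0.46*cos(2*pi*i/(n-1))."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def dft(signal):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT)."""
    n = len(signal)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * i / n) for i, x in enumerate(signal))
        for k in range(n)
    ]


def power_spectrum(signal, window=None):
    """Optionally window the signal, transform it, and return |X[k]|^2."""
    if window is not None:
        weights = window(len(signal))
        signal = [x * w for x, w in zip(signal, weights)]
    return [abs(coefficient) ** 2 for coefficient in dft(signal)]


# A sequence already encoded with one amino acid scale (illustrative values).
encoded = [1.9, -3.9, 4.2, 3.8, 1.8, -0.4, 2.8, -3.5]
features = power_spectrum(encoded, window=hamming)
print(len(features))  # 8
```

The resulting list of spectral values is what would be concatenated with the descriptor features before model fitting.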

Encoding using all 566 AAIndex indices

This example encodes the protein sequences in the dataset using all 566 indices in the AAI database. At each iteration, the sequences encoded with one AAI index are used to generate a protein spectrum, here the imaginary spectrum with a Blackman window function applied. The spectrum is then used as feature data to build a predictive model for the sought activity value of unseen protein sequences. The output shows the metric values calculated from the predicted vs observed activity values of the test sequences.

from pySAR.encoding import *

#create instance of Encoding class, which inherits from the PySAR class, using the RandomForest algorithm
#   with 200 estimators and a max_depth of 50 (passed as integers).
encoding = Encoding(dataset="dataset.txt", activity="activity_col",
  algorithm="RandomForest", parameters={"n_estimators":200,"max_depth":50})

#encode sequences using all indices in the AAI
aai_encoding = encoding.aai_encoding(spectrum='imaginary', window='blackman')

Output results showing AAI index and its category as well as all the associated metric values for each predictive model:

	Index	Category	R2	RMSE	MSE	RPD	MAE	Explained Var
0	CHOP780206	sec_struct	0.62737	3.85619	14.8702	1.63818	3.16755	0.713467
1	QIAN880131	sec_struct	0.626689	3.90576	15.255	1.63668	3.09849	0.631582
2	QIAN880118	sec_struct	0.625156	3.99581	15.9665	1.63333	3.32038	0.625897
3	PRAM900104	sec_struct	0.615866	3.90389	15.2403	1.61346	3.24906	0.617799
..	..........	..........	........	.......	.......	.......	.......	...............
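
The metric columns in the table can be reproduced from observed and predicted activity values with the standard definitions. The sketch below is an assumption about how the metrics are defined, not pySAR's own code: in particular RPD is taken as the population standard deviation of the observed values divided by the RMSE, which is consistent with the rows above (for example, R2 = 0.62737 gives 1/sqrt(1 - R2) ≈ 1.638). The function name is hypothetical.

```python
import math


def regression_metrics(observed, predicted):
    """Compute the table's metrics from observed vs predicted activity values."""
    n = len(observed)
    residuals = [o - p for o, p in zip(observed, predicted)]
    mean_observed = sum(observed) / n
    ss_res = sum(r ** 2 for r in residuals)                    # residual sum of squares
    ss_tot = sum((o - mean_observed) ** 2 for o in observed)   # total sum of squares
    mse = ss_res / n
    rmse = math.sqrt(mse)
    mean_residual = sum(residuals) / n
    var_residual = sum((r - mean_residual) ** 2 for r in residuals) / n
    return {
        "R2": 1 - ss_res / ss_tot,
        "RMSE": rmse,
        "MSE": mse,
        "RPD": math.sqrt(ss_tot / n) / rmse,
        "MAE": sum(abs(r) for r in residuals) / n,
        "Explained Var": 1 - var_residual / (ss_tot / n),
    }


metrics = regression_metrics([1, 2, 3, 4], [1.5, 2.5, 2.5, 3.5])
print(round(metrics["R2"], 3), round(metrics["RMSE"], 3), round(metrics["RPD"], 3))
# 0.8 0.5 2.236
```

R2 and Explained Var coincide whenever the residuals have zero mean, as in this toy example; they differ in the tables above because real model residuals are usually biased.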

Encoding using list of 4 AAI indices, with no DSP functionalities

Same procedure as above, except that 4 specific AAI indices are passed into the function and no DSP is applied. The encoded sequences from each index are concatenated and used as feature data to build the predictive PlsRegression model with its default parameters.

from pySAR.encoding import *

encoding = Encoding(dataset="dataset.txt", activity="activity_col",
  algorithm="PLSRegression", parameters={})

#encode sequences using 4 indices specified by user
aai_encoding = encoding.aai_encoding(use_dsp=False, aai_list=["PONP800102","RICJ880102","ROBB760107","KARS160113"])
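
The concatenation of several index encodings into one feature row can be sketched as follows. The scale names and values are toy placeholders rather than real AAI records, and `encode_with_indices` is a hypothetical helper, not a pySAR function.

```python
# Two toy scales over a reduced alphabet; names and values are hypothetical
# and do not correspond to real AAI records.
TOY_SCALES = {
    "TOY_INDEX_1": {"A": 0.1, "C": 0.5, "G": -0.2},
    "TOY_INDEX_2": {"A": 1.0, "C": -1.0, "G": 0.0},
}


def encode_with_indices(sequence, index_names, scales=TOY_SCALES):
    """Encode `sequence` once per index and concatenate the results."""
    features = []
    for name in index_names:
        features.extend(scales[name][residue] for residue in sequence)
    return features


row = encode_with_indices("ACG", ["TOY_INDEX_1", "TOY_INDEX_2"])
print(row)  # [0.1, 0.5, -0.2, 1.0, -1.0, 0.0]
```

With this layout the feature row has length (sequence length) x (number of indices), so 4 indices yield rows four times as long as a single-index encoding.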

Encoding protein sequences using their calculated protein descriptors

Calculate the protein descriptor values for a dataset of protein sequences using the 15 descriptors available in the descriptors module. Each descriptor is used in turn as a feature set for building a predictive model of the activity value of unseen sequences. By default, the function looks for a file called 'descriptors.csv' that contains the pre-calculated descriptor values for the dataset; this filename can be changed with the descriptors_csv input parameter. If the file is not found, all descriptor values are calculated for the dataset.

from pySAR.encoding import *
#create instance of Encoding class using AdaBoost algorithm, using 100 estimators & a learning rate of 1.5
encoding = Encoding(dataset="dataset.txt", activity="activity_col", algorithm="AdaBoost",
       parameters={"n_estimators":100,"learning_rate":1.5}, descriptors_csv="descriptors.csv")

#building predictive models using all available descriptors
#   calculating evaluation metrics values for models and storing into desc_results_df DataFrame
desc_results_df = encoding.descriptor_encoding()

Output results showing the protein descriptor and its group as well as all the associated metric values for each predictive model:

	Descriptor	Group	R2	RMSE	MSE	RPD	MAE	Explained Var
0	_distribution	CTD	0.721885	3.26159	10.638	1.89621	2.60679	0.727389
1	_geary_autocorrelation	Autocorrelation	0.648121	3.67418	13.4996	1.68579	2.82868	0.666745
2	_tripeptide_composition	Composition	0.616577	3.3979	11.5457	1.61496	2.53736	0.675571
3	_aa_composition	Composition	0.612824	3.37447	11.3871	1.60711	2.79698	0.643864
4	......	......	......	......	......	......	......	......
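
As an illustration of what a single descriptor computes, here is a minimal sketch of amino acid composition: the fraction of each of the 20 standard amino acids in a sequence. The function is a hypothetical stand-in; pySAR's own descriptor may scale or round the values differently (for example as percentages).

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aa_composition(sequence):
    """Fraction of each of the 20 standard amino acids in `sequence`."""
    counts = Counter(sequence.upper())
    length = len(sequence)
    return {aa: counts[aa] / length for aa in AMINO_ACIDS}


composition = aa_composition("AACD")
print(composition["A"], composition["C"], composition["W"])  # 0.5 0.25 0.0
```

This descriptor always yields 20 features per sequence regardless of sequence length, which is why descriptors can be used directly as model input without any alignment or padding.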

Encoding using AAI + protein descriptors

This example encodes the protein sequences in the dataset using all 566 indices in the AAI database combined with protein descriptors. All 566 indices can be used in concatenation with 1, 2 or 3 descriptors. For example, at each iteration the sequences encoded with an AAI index are used to generate a protein spectrum, here the power spectrum with no window function applied. The spectrum is then combined with the feature set generated from the dataset's descriptor values and used to build a predictive model for the sought activity value of unseen protein sequences. The output shows the metric values calculated from the predicted vs observed activity values of the test sequences.

from pySAR.encoding import *
#create instance of Encoding class using AdaBoost algorithm, using 100 estimators with a learning rate of 1.5
encoding = Encoding(dataset="dataset.txt", activity="activity_col", algorithm="AdaBoost",
       parameters={"n_estimators":100,"learning_rate":1.5}, descriptors_csv="descriptors.csv")

#building predictive models using all available aa_indices + combination of 2 descriptors,
#   calculating evaluation metrics values for models and storing into aai_desc_results_df DataFrame
aai_desc_results_df = encoding.aai_descriptor_encoding(desc_combo=2, spectrum='power', window=None)

Output results showing AAI index and its category, the protein descriptor and its group as well as the R2 and RMSE values for each predictive model:

	Index	Category	Descriptor	Descriptor Group	R2	RMSE
0	ARGP820103	composition	_conjoint_triad	Conjoint Triad	0.72754	3.22135
1	ARGP820101	hydrophobic	_quasi_seq_order	Quasi-Sequence-Order	0.722284	3.30995
2	ARGP820101	hydrophobic	_seq_order_coupling_number	Quasi-Sequence-Order	0.722158	3.34926
3	ANDN920101	observable	_seq_order_coupling_number	Quasi-Sequence-Order	0.70826	3.25232
4	.....	.....	.....	.....	.....	.....
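
The size of this search grows quickly with desc_combo. The sketch below counts the models that would be built, assuming every AAI index is paired with every combination of descriptors of the chosen size; the descriptor names are placeholders, and whether pySAR enumerates the combinations exactly this way is an assumption.

```python
from itertools import combinations

NUM_INDICES = 566      # AAI indices, as stated in the text
NUM_DESCRIPTORS = 15   # available descriptors, as stated in the text

# Placeholder names; the real names live in pySAR's descriptors module.
descriptors = [f"descriptor_{i}" for i in range(1, NUM_DESCRIPTORS + 1)]

model_counts = {}
for combo_size in (1, 2, 3):
    combos = list(combinations(descriptors, combo_size))
    model_counts[combo_size] = NUM_INDICES * len(combos)
    print(combo_size, len(combos), model_counts[combo_size])
# 1 15 8490
# 2 105 59430
# 3 455 257530
```

Under that assumption, desc_combo=2 means roughly 59,000 model fits, so pre-calculating the descriptor values once is worthwhile.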

Generate all protein descriptors

Functionality to calculate all 15 descriptor values for a dataset of protein sequences. The output values are stored in the file set by the desc_dataset input parameter. The output has the shape N x 9920, where N is the number of protein sequences in the dataset and 9920 is the total number of features calculated from all 15 descriptors. From then on, pySAR pulls the descriptor values from this file rather than recalculating them.

from pySAR.descriptors import *

#calculating all descriptor values and storing in file named 'descriptors.csv'
#     all_desc = True means that all descriptors will be calculated, it is False by default.
#     data is assumed to already hold the protein sequences from the dataset.
desc = Descriptor(protein_seqs = data, desc_dataset = "descriptors.csv",
    all_desc=True)
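
The "calculate once, then reuse the file" behaviour described above can be sketched with the standard library. This is a simplified illustration, not pySAR's code: `load_or_compute` and `compute_row` are hypothetical names, and the CSV is written without a header row.

```python
import csv
import os
import tempfile


def load_or_compute(csv_path, sequences, compute_row):
    """Return one feature row per sequence, reusing `csv_path` if it exists."""
    if os.path.isfile(csv_path):
        with open(csv_path, newline="") as handle:
            return [[float(value) for value in row] for row in csv.reader(handle)]
    rows = [compute_row(sequence) for sequence in sequences]
    with open(csv_path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return rows


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "descriptors.csv")
    # First call: the file does not exist, so the features are computed and saved.
    first = load_or_compute(path, ["AACD"], lambda s: [len(s), s.count("A")])
    # Second call: the file exists, so the compute function is never used.
    second = load_or_compute(path, ["AACD"], lambda s: [0, 0])
    print(first == second)  # True
```

The second call returns the cached values even though a different compute function is passed, which mirrors why a stale descriptors file should be deleted if the dataset changes.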

Get record from AAIndex database

The AAIndex class provides functions for retrieving any element of any record in the database. Each record is stored in JSON format in a class attribute called aaindex_json. You can search for a record by its index code, description or reference, and you can also get an index's category and its associated amino acid values.

import pySAR.aaindex as aaindex

#create AAIndex object
aai = aaindex.AAIndex()

record = aai.get_record_from_code('CHOP780206')   #get full AAI record
category = aai.get_category_from_record('CHOP780206') #get record's category
values = aai.get_values_from_record('CHOP780206')    #get amino acid values from record
refs = aai.get_ref_from_record('CHOP780206')      #get references from record
num_record = aai.get_num_records()                #get total number of records
record_names = aai.get_record_names()             #get list of all record names
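
To show the kind of lookup this involves, here is a small sketch of searching JSON-like records by description. The record codes, field names and values are all toy assumptions made for illustration; they do not reflect the actual schema of aaindex_json or any real AAI record.

```python
# Toy records keyed by index code; field names are assumptions, not
# pySAR's actual aaindex_json schema.
TOY_RECORDS = {
    "TOYA000001": {
        "description": "Toy hydrophobicity scale",
        "category": "hydrophobic",
        "values": {"A": 1.8, "R": -4.5},
    },
    "TOYB000002": {
        "description": "Toy helix propensity scale",
        "category": "sec_struct",
        "values": {"A": 1.4, "R": 1.0},
    },
}


def find_codes_by_description(keyword, records=TOY_RECORDS):
    """Return the codes whose description contains `keyword` (case-insensitive)."""
    keyword = keyword.lower()
    return [
        code for code, record in records.items()
        if keyword in record["description"].lower()
    ]


print(find_codes_by_description("helix"))      # ['TOYB000002']
print(TOY_RECORDS["TOYA000001"]["category"])   # hydrophobic
```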

Directories

/pySAR/PyBioMed - package partially forked from https://github.com/gadsbyfly/PyBioMed, used in the calculation of the protein descriptors.
/Results - stores all calculated results that were generated for the research article, studying the SAR for a thermostability dataset.
/pySAR/tests - unit and integration tests for pySAR.
/pySAR/data - all required data and datasets are stored in this folder.

Tests

To run all tests, from the main pySAR folder run:

python3 -m unittest discover

To run the tests for a specific module, from the main pySAR folder run:

python3 -m unittest tests.MODULE_NAME -v

Contact

If you have any questions or comments, please contact amckenna41@qub.ac.uk or raise an issue on the Issues tab.
