Skip to main content

A set of scripts to ease the process of manipulating fasta files

Project description

fastaUtils

A set of scripts to ease the process of manipulating fasta files and building Multiple Sequence Alignments

Most functionalities have now been tested

Installation

pip install fastaUtils

Alternatively, clone the repository and simply add the following lines to your ~/.bashrc file

export PATH=$PATH:'<path_to_fastaUtils>/bin'

export PYTHONPATH=$PYTHONPATH:'<path_to_fastaUtils>'

Additional requirements

  • fst-download can be used to download the current version of uniprot (sprot+trembl) and metaclust databases. Even if folder where to save the databases can be passed to the program as an optional field, it is recommended to define the following environment variables: UNIPROTFOLDER and METACLUSTFOLDER, e.g. by adding export UNIPROTFOLDER='path fo folder' and export METACLUSTFOLDER='path fo folder' to your /.bashrc file.

  • fst-grep requires the regex python package for matches with substitutions (imperfect matches)

Usage

Most commands try to copy the functionality of their equivalent unix function. In all cases run the command with -h to list a detailed help page

For all commands that requires an input fasta file, please note that is this input is not provided the program will try to read from stdin. In this way multiple commands can be concatenated to perform complex operations on fasta files.

fst-awk : run a command on every sequence of the input fasta file.

The language in which the command must be provided is different from the one employed by awk. Here are the main features:

  • BEGIN and END blocks become \BEGIN and \END in fst-awk

  • curly backets {...} always indentify a code block. Code blocks can be nested together

  • \BEGIN{statement1} if statement2 {do something} \END{statement3} therefore is composed of a BEGIN code block, a main block that is executed only if statement2 is True and an END block. All blocks are optionals: by default the BEGIN and END blocks are empty, statement2 is True and the main block simply print the sequence in fasta format

  • all statements should be valid python code, with few exceptions: dictionaries cannot be used (as they would be interpreted as code blocks), curly brackets cannot appear in strings (as they would be interpreted as code blocks), instead of indenting code blocks curly brackets must be used, do not use colons before a code block, use semicolons to separate lines.

  • new variables are used: db,uid,name,beg,end,descr,pe,sv,gn,ox (for each field in the fasta header), NR (current sequence index), seq (header and sequence), seq.seq (sequence) and seq.header (fasta header). If information is not provided in the fasta header, pe,sv and ox will default to -1, and all other variables to -

Examples:

fst-awk '{print(len(seq.seq))}' fastafile will print the length of each sequence

fst-awk 'len(seq.seq)>100 {print(seq)}' fastafile will print a fasta file containing all sequences longer than 100 amino acids

fst-awk 'seq.seq.count("-")<10 {print(seq)}' fastafile will print a fasta file containing all sequences with no more than 10 gaps

fst-grep: search for substrings in a fasta file and return matching sequences.

Similarly to grep it accept the following options:

  • -c : return the number of matching sequences

  • -o : return a fasta file with all the matching parts of all sequences. Multiple matches for protein are possible. The fasta header is modified in order to add the information about the position of the match in the original sequence

  • -v : invert the match, aka return non matching sequences only. Non compatible with -o

fst-grep also accept -s option, specifying how many mismatching characters are acceptable. By default this number is zero.

Some rules/shortcut are available.

  • '_' matches all aminoacids, but not gaps

  • '.' matches gaps only

  • '+' and '-' match positive and negative amino acids

  • '@' matches hydrophobic amino acids

  • '#' matches polar amino acids

Example:

fst-grep 'HPD' fastafile return all sequences from fastafile with an HPD motif

fst-grep '-@@@@-' -c count the number of sequences with at least one negative-hydrofobic*4-negative motif

fst-grep 'HPD' -o fastafile -s 1 return all HPD motifs in fastafile, but also motifs in which at most one substitution is present (e.g. NPD)

fst-grep 'HPD' -v fastafile return all sequences from fastafile without an HPD motif

fst-grep 'HPD|NPD' fastafile return all sequences from fastafile with an HPD or NPD motif

fst-paste: append multiple sequences from different fasta files

The number of sequeces in the final fasta file is equal to the minimum number of sequences in the input files

Example:

fst-paste fasta1 fasta2 ...

fst-cut: remove columns from a fasta file

Similarly to cut, you can remove selected columns by passing -c beg-end col1 .. colN. Also accept -c beg- : in this case all columns from beg to the very last in the sequence are removed. The counting starts at 1, so fst-cut -c 1 fastafile will remove the first residue. If a range of column is used, the extrema are both removed. To remove all columns except for selected ones, use the -r flag.

fst-cut accept a rule to dynamically decide which column to remove. The rule is written in python and passed with -R flag. A profile/logo is required in order to have access to aminoacid frequencies.

Example:

fst-cut -c 1-3 5 fastafile print a fasta file in which columns 1,2,3 and 5 have been omitted

fst-cut -c 1-3 5 -r fastafile print a fasta file with only columns 1,2,3 and 5

fst-cut -R 'freq["-"]>0.5' -p profile fastafile print a fasta file where all columns with a frequency of gaps greater than 0.5 have been removed

fst-sort: reorder sequences

Reorder sequences (possibly in reverse order) according to a field in fasta header. Optionally the sequences can be shuffled (random order) or be ordered according to a predefined order (provided through an input file)

Example:

fst-sort -k uid fastafile reorder sequences in alphabetical order of uid

fst-sort -k uid -F orderfile fastafile reorder sequences to match orderfile. Orderfile must contain a list of uids (one per line). The output fasta will not contain sequences with an uid not in orderfile

fst-sort -R fastafile shuffle input fasta file

fst-sort -k uid -r fastafile reorder sequences in reverse alphabetical order of uid

fst-shuf: shuffle columns independently

Example:

fst-shuf -c 1 3 5 fastafile will return a fasta file in which columns 1,3 and 5 have been shuffled independently from the others

fst-shuf -r 1:3 fastafile will return a fasta file in which column 1,2 and 3 have been shuffled coherently with each others but independently from the rest of the sequence

fst-random: generate random sequences that match a given profile

Generated sequences can be completely random or build by modifying real ones.

Example:

fst-random profile 100 generate 100 random sequences with the single site frequencies defined in profile

fst-random profile 100 -t fastafile -d 0.9 generate 100 sequences by randomly mutate a fraction equal to 0.1 (1.-0.9) of amino acids. Mutations are such that the single site frequencies are the same as in profile. The mutated residues are randomly chosen for each sequence. The real sequences are chosen randomly among those provided in fastafile, possibly more than once

fst-profile: generate the profile/logo of a multiple sequence alignment

A profile is simply a list of all possible symbols that appear for each column together with its frequency

Example:

fst-profile fastafile will print the profile to screen

fst-encode:

Encode a Multiple Sequence Alignment in binary vectors by concatenating the one-hot encoding of each amino acid.

Example:

fst-encode fastafile profile encoding.npy

fst-fromstockholm: convert from stockholm format to fasta format

Example:

fst-fromstockholm stockholmfile will print the corresponding fasta file to screen

fst-download: download sequences and metadata from uniprot and/or metaclust

This program defines three commands:

  • fst-download uniprot will download and extract uniprot_trembl and uniprot_sprot to $UNIPROTFOLDER/date/ and create a symlink to it in $UNIPROTFOLDER/current_release. Currently the download is from https://ftp.expasy.org

  • fst-download metaclust will download and extract the latest metaclust release in $METACLUSTFOLDER/date/ and create a symlink to it in $METACLUSTFOLDER/current_release.

  • fst-download uniprot-metadata will take as input a file with a list of uids, and will download all metadata for the corresponding sequences from the current uniprot release.

fst-split:

Split an input fasta file in multiple fasta files containing no more than a specified number of entries.

Example:

fst-split fastafile records -l 100

will create new fasta files named records_*.fasta each containing no more than 100 entries from fastafile.

fst-distance:

Return the pairwise hamming distance (or sequence identity) between sequences in a MSA or between sequences in two MSAs. Optionally the distances/identities can be averaged.

Example:

fst-distance fastafile1 --aggregate -to fastafile2 --seqid

fst-logo:

Generates a logo from a profile thanks to logomaker, a package for generating publication quality sequence logos. There are options to control aesthetics.

Example:

fst-logo profile output.png

generates something like this:

logo

Project details


Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

fastaUtils-1.0.0.tar.gz (21.6 kB view hashes)

Uploaded Source

Built Distribution

fastaUtils-1.0.0-py3-none-any.whl (25.2 kB view hashes)

Uploaded Python 3

Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page