
Reformat and condense multiple sequence alignments to highlight variability


alnvu makes a multiple alignment of biological sequences more easily readable by condensing it and highlighting variability.

dependencies

Required:

  • Python 2.7

Optional:

  • biopython (required for -T/--sort-by-tree)
  • reportlab (required for PDF output)

installation

Using setup.py:

cd alnvu
python setup.py install

examples

All of these examples can be run from within the package directory:

% cd alnvu
% ./av --help

usage: av [-h] [-v] [-q] [-w NUMBER] [-L NUMBER] [-x] [-g] [-r INTERVAL]
          [-s NUMBER] [-c] [-d NUMBER] [-D] [-C CASE] [-G] [-i] [-n NUMBER]
          [-N CHARACTER] [-S FILE] [-T FILE] [-o OUTFILE] [-F NUMBER]
          [-O ORIENTATION] [-b NUMBER]
          [infile]

Create formatted sequence alignments with optional pdf output.

positional arguments:
  infile                Input file in fasta format (reads stdin if missing)

optional arguments:
  -h, --help            show this help message and exit
  -v, --version         show program's version number and exit
  -q, --quiet           Suppress output of alignment to screen.

Layout:
  -w NUMBER, --width NUMBER
                        Width of sequence to display in each block in
                        characters [115]
  -L NUMBER, --lines-per-block NUMBER
                        Sequences (lines) per block. [75]

Column selection:
  -x, --exclude-invariant
                        Show only columns with at least N non-consensus bases
                        (set N using the '-s/--min-subs')
  -g, --include-gapcols
                        Show columns containing only gap characters.
  -r INTERVAL, --range INTERVAL
                        Range of columns to display (eg '-r start,stop')
  -s NUMBER, --min-subs NUMBER
                        Minimum NUMBER of substitutions required to define a
                        position as variable. [1]

Consensus display and sequence appearance:
  -c, --consensus       Show the consensus sequence [False]
  -d NUMBER, --compare-to NUMBER
                        Identify the reference sequence. Nucleotide positions
                        identical to the reference will be shown as a '.' The
                        default behavior is to use the consensus sequence as a
                        reference. Use the -i option to display the sequence
                        numbers for reference.
  -D, --no-comparison   Show all bases (ie, suppress comparison with the
                        reference sequence).
  -C CASE, --case CASE  Convert all characters to a uniform case
                        ('upper','lower')
  -G, --ignore-gaps     Ignore gaps in the calculation of a consensus.

Sequence annotation:
  -i, --number-sequences
                        Show sequence number to left of name.
  -n NUMBER, --name-max NUMBER
                        Maximum width of sequence name in characters [35]
  -N CHARACTER, --name-split CHARACTER
                        Specify a character delimiting sequence names. By
                        default, the name of each sequence is the first
                        whitespace-delimited word. '--name-split=none' causes
                        the entire line after the '>' to be displayed.
  -S FILE, --sort-by-name FILE
                        File containing sequence names defining the sort-order
                        of the sequences in the alignment.
  -T FILE, --sort-by-tree FILE
                        File containing a newick-format tree defining the
                        sort-order of the sequences in the alignment (requires
                        biopython).

PDF output:
  These options require reportlab.

  -o OUTFILE, --outfile OUTFILE
                        Write output to a pdf file.
  -F NUMBER, --fontsize NUMBER
                        Font size for pdf output [7]
  -O ORIENTATION, --orientation ORIENTATION
                        Set page orientation; choose from portrait, landscape
                        [portrait]
  -b NUMBER, --blocks-per-page NUMBER
                        Number of aligned blocks of sequence per page [1]

The default output. Note that columns are numbered (column 8 is the first shown, column 122 is the last):

% ./av testfiles/10patients_aln.fasta | head -n 15
         # 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
         # 0000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000011111111111111111111111
         # 0011111111112222222222333333333344444444445555555555666666666677777777778888888888999999999900000000001111111111222
         # 8901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012
    H59735 AGAGTTTGATCCTGGCTCAGGACGAACGC.......GT.......................A.G..GCGGT....GCACCGTGGATT..........................T.
    T70875 ...........................---------------------------------------------------.----..--......--------------......T.
    F58095 AGAGTTTGATCCTGGCTCAGAGCGAACGC.......AT...................C....GTGGTTTCG..CATC-.----..--.............G.............G
    T70854 ...........................--.......AG..C.................G...ATG.CGGG.....GCTCCTTGATTC........C....G............TG
    F62024 AGAGTTTGATCCTGGCTCAGGACGAACGC.......GT.......................A.G..GCCTTT.GGGGTGGATT..--............................
    H59895 ...........................------------------------............G..AGAG.....AGCTCTCTGGATC...........................
    F57728 ...........................--------------------------------TT-----------------.----..--...........................G
    M10734 ...........................GC..A....GT........................GATCCATT...GCTTTTGTGTTTTTGGTGAG......................
    T71041 ..........................CGC.......AG.......................A.G..GTCT.....GCTAGACGGATT..........................TG
    M6161O ...........................--......T-G..C.....................ATCCTTCGG.A..---.----..--.............G..............
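
The header is read top to bottom: each column's number is written vertically, one digit per header line. As a rough illustration only (this is not alnvu's own code; the helper name `column_header` and the fixed width of four digits are assumptions made for this sketch), such a header could be produced like this:

```python
def column_header(start, stop, digits=4, prefix="# "):
    """Return header lines with each column number written vertically.

    Hypothetical helper for illustration; not part of alnvu.
    `digits` (zero-padded width of each number) is an assumption.
    """
    labels = [str(n).zfill(digits) for n in range(start, stop + 1)]
    # zip(*labels) transposes the labels: row k holds digit k of every label.
    return [prefix + "".join(row) for row in zip(*labels)]


for line in column_header(8, 20):
    print(line)
# 0000000000000
# 0000000000000
# 0011111111112
# 8901234567890
```

Reading the first column downward gives 0008 (column 8) and the last gives 0020 (column 20), the same scheme used in the output above.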

The input file can be provided via stdin:

% cat testfiles/10patients_aln.fasta | ./av

Exercising some of the options (show sequence numbers and a consensus, show differences from sequence number 1, and restrict to columns 200-300):

% ./av testfiles/10patients_aln.fasta --number-sequences --consensus --compare-to 1 --range 200,300
               # 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000
               # 22222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222222223
               # 00000000001111111111222222222233333333334444444444555555555566666666667777777777888888888899999999990
               # 01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
 1 -ref-> H59735 TGGGGtG-TTGGTgGAAAGCgttatgga------------GTGGTTTTAGATGGGCTCACGGCCTATCAGCTTGTTGGTGAGGTAATGGCTTACCAAGGCG
 2        T70875 G..T---.------.....T.GGGGACCGCAAGGCCTC..AC.CAGCAG..GC...CG.T.T.TG..T....A.......G.....A...CC.........
 3        F58095 G.CC---.------C.....CGA.A.--.............C.CC...G..GC...CTG..T..G..T..G.A.......G.....A...C.......C.T
 4        T70854 G..A---.------......AGGGGACCTTCGGGCCTT...C.C.A.C.....A..CT.G.T.GG..T....A.......G..........C.........
 5        F62024 ....A-C.GG...TA.....TCCG----.............C...GAAG....A..C.G.....................G.........C..........
 6        H59895 .CTTCA..CA.C.......AA..-----............TC...CAGG....A....G................................C.........
 7        F57728 .C.A.-.A.A.A.-.....GTGGCCTCTACATGTAAGCTATCAC.GAAG..G...A.TG..T.TG..T....A.....A.G.....C...CC.........
 8        M10734 .....-T..GTTG......GT..T.T--............C...A..GG.........G....T................G...G...............T
 9        T71041 GA.A---.------.....G.GGC.TTTAGCTC.......TC.C.AA......A..CT.A.T.GG..T....A.......G.....A...C..........
10        M6161O G...---.------.....AT...----............TC.CCA..G..GC...C.G..T.TG..T....A.......G.....A....C.........
11     CONSENSUS X..X.X.A.X.X.......XXXXXXXCXXXXXGXXXXXTAXC.C.XXXG.......CXG..T.XG..T....A.......G.....X...XX.........
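
In this display, a '.' marks a position identical to the reference sequence. The idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not alnvu's implementation: the function name `mask_identical` is hypothetical, the comparison is a plain case-sensitive character match, and the short strings are toy data taken from the first five columns shown above.

```python
def mask_identical(seq, reference, match_char="."):
    """Replace every character of seq that equals the reference with match_char.

    Hypothetical helper for illustration; not part of alnvu.
    Assumes seq and reference are aligned (same length).
    """
    return "".join(
        match_char if s == r else s
        for s, r in zip(seq, reference)
    )


reference = "TGGGG"  # toy reference (columns 200-204 of H59735 above)
print(mask_identical("GGGT-", reference))  # G..T-
print(mask_identical("TCTTC", reference))  # .CTTC
```

Only the positions that differ from the reference remain visible, which is what makes variable regions stand out.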

Write a single-page PDF file:

% ./av testfiles/10patients_aln.fasta --outfile=test.pdf --quiet --blocks-per-page=5

The same command using short options:

% ./av testfiles/10patients_aln.fasta -o test.pdf -q -b 5

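And do you know about seqmagick? If not, run, don’t walk, to https://github.com/fhcrc/seqmagick and check it out, so that you can do this (convert a Stockholm-format alignment to FASTA and pipe it to av, showing the consensus and only the variable columns):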

% seqmagick convert testfiles/ae_like.sto --output-format=fasta - | ./av -cx
               # 000000000000000000000000000000000
               # 445555555555566666666666666667777
               # 990111111155813445566778888991122
               # 791123678914209568907050235891215
  GA05AQR01D2ULR ...............TTGGT.GT..AG...A..
  GA05AQR01DFGSE ........................T.TAAGT..
  GA05AQR01CI0QB ...........A.....................
  GA05AQR01DW22X .TC..G.T.T.......................
  GA05AQR01A5WF4 ....................A........-T..
  GA05AQR01BUV2U ---..............................
  GA05AQR01B1R8I .............T...............CT..
  GA05AQR02JASPX ........A........................
  GCX02B001AYSTJ .............................-TA.
  GCX02B001DP9EQ ............A..........CA.......T
  GCX02B001AFAY1 ..............G..................
  GCX02B002J489C ...-......A......................
  GLKT0ZE01EDLCP AT...ATT.T.......................
  GLKT0ZE02I8LRD ---GA............................
-ref-> CONSENSUS TCTAGCGCGCGGGGACGAACGAGGCGCGCTGGA
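
The -x option keeps only columns with at least N non-consensus characters (N is set with -s/--min-subs). A simplified sketch of that selection follows. The function names are hypothetical, the most common character in a column stands in for the consensus, and gaps are counted like any other character; alnvu itself may handle these details differently.

```python
from collections import Counter


def variable_columns(seqs, min_subs=1):
    """Return 0-based indices of columns with >= min_subs non-consensus characters.

    Hypothetical helper for illustration; not part of alnvu.
    Assumes all sequences are aligned (same length).
    """
    keep = []
    for index, column in enumerate(zip(*seqs)):
        # Count of the most common character in this column (the consensus here).
        consensus_count = Counter(column).most_common(1)[0][1]
        if len(column) - consensus_count >= min_subs:
            keep.append(index)
    return keep


def select_columns(seqs, indices):
    """Return the sequences restricted to the given column indices."""
    return ["".join(seq[i] for i in indices) for seq in seqs]


toy = ["ACGT", "ACGA", "ATGA"]  # toy alignment, not real data
keep = variable_columns(toy)
print(keep)                       # [1, 3]
print(select_columns(toy, keep))  # ['CT', 'CA', 'TA']
```

Raising `min_subs` to 2 drops both columns in this toy alignment, since each contains only one non-consensus character.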


