skip to navigation
skip to content

Not Logged In

pysamstats 0.7

A Python utility for calculating statistics per genome position based on pileups from a SAM or BAM file.

Latest Version: 0.23

pysamstats
==========

A Python utility for calculating statistics against genome positions
based on sequence alignments from a SAM or BAM file.

* Source: https://gihub.com/alimanfoo/pysamstats
* Download: http://pypi.python.org/pypi/pysamstats

Installation
------------

```
$ pip install --upgrade pysam pysamstats
```

N.B., pysamstats depends on [pysam]http://code.google.com/p/pysam/
which needs to be installed *before* attempting to install
pysamstats. The command above should do it.

Pysamstats also depends on [numpy]http://www.numpy.org/ but this
*should* be installed automatically if you run the command above.

If you have any problems, try installing pysam and numpy separately
first.

Alternatively, clone the git repo and build in-place (requires
cython):

```
$ git clone git://github.com/alimanfoo/pysamstats.git
$ cd pysamstats
$ python setup.py build_ext --inplace
```

Usage
-----

From the command line:

```
$ pysamstats --help
Usage: pysamstats [options] FILE

Calculate statistics against genome positions based on sequence alignments
from a SAM or BAM file and print them to stdout.

Options:
-h, --help show this help message and exit
-t TYPE, --type=TYPE type of statistics to print: coverage,
coverage_strand, coverage_ext, coverage_ext_strand,
coverage_normed, coverage_gc, coverage_normed_gc,
variation, variation_strand, tlen, tlen_strand, mapq,
mapq_strand, baseq, baseq_strand, baseq_ext,
baseq_ext_strand, coverage_binned, coverage_ext_binned
-c CHROMOSOME, --chromosome=CHROMOSOME
chromosome name
-s START, --start=START
start position (1-based)
-e END, --end=END end position (1-based)
-z, --zero-based use zero-based coordinates (default is false, i.e.,
use one-based coords)
-f FASTA, --fasta=FASTA
reference sequence file, only required for some
statistics
-o, --omit-header omit header row from output
-p N, --progress=N report progress every N rows
--window-size=N size of window for binned statistics [300]
--window-offset=N window offset to use for deciding which genome
position to report binned statistics against [150]

Supported statistics types:

* coverage - number of reads aligned to each genome position
(total and properly paired)
* coverage_strand - as coverage but with forward/reverse strand counts
* coverage_ext - various additional coverage metrics, including
coverage for reads not properly paired (mate
unmapped, mate on other chromosome, ...)
* coverage_ext_strand - as coverage_ext but with forward/reverse strand counts
* coverage_normed - depth of coverage normalised by median or mean
* coverage_gc - as coverage but also includes a column for %GC
* coverage_normed_gc - as coverage_normed but also includes columns for normalisation
by %GC
* variation - numbers of matches, mismatches, deletions,
insertions, etc.
* variation_strand - as variation but with forward/reverse strand counts
* tlen - insert size statistics
* tlen_strand - as tlen but with statistics by forward/reverse strand
* mapq - mapping quality statistics
* mapq_strand - as mapq but with statistics by forward/reverse strand
* baseq - baseq quality statistics
* baseq_strand - as baseq but with statistics by forward/reverse strand
* baseq_ext - extended base quality statistics, including qualities
of bases matching and mismatching reference
* baseq_ext_strand - as baseq_ext but with statistics by forward/reverse strand
* coverage_binned - as coverage but binned
* coverage_ext_binned - as coverage_ext but binned

Examples:

pysamstats --type coverage example.bam > example.coverage.txt
pysamstats --type coverage --chromosome Pf3D7_v3_01 --start 100000 --end 200000 example.bam > example.coverage.txt

Version: 0.6 (pysam 0.7.4)
```

From Python:

```python
import pysam
import pysamstats

mybam = pysam.Samfile('/path/to/your/bamfile.bam')
for rec in pysamstats.stat_coverage(mybam, chrom='Pf3D7_01_v3', start=10000, end=20000):
print rec['chrom'], rec['pos'], rec['reads_all'], rec['reads_pp']
...

```

For convenience, functions are provided for loading data directly into numpy arrays, e.g.:

```python
import pysam
import pysamstats
import matplotlib.pyplot as plt

mybam = pysam.Samfile('/path/to/your/bamfile.bam')
a = pysamstats.load_coverage(mybam, chrom='Pf3D7_01_v3', start=10000, end=20000)
plt.plot(a.pos, a.reads_all)
plt.show()
```

Field Definitions
-----------------

The suffix **_fwd** means the field is restricted to reads mapped to
the forward strand, and **_rev** means the field is restricted to
reads mapped to the reverse strand. E.g., **reads_fwd** means the
number of reads mapped to the forward strand.

The suffix **_pp** means the field is restricted to reads flagged as
properly paired.

* **chrom** - Chromosome name.

* **pos** - Position within chromosome. One-based by default when
using the command line, zero-based by default when using the
python API.

* **reads_all** - Number of reads aligned at the position. N.b., this
is really the total, i.e., includes reads where the mate is
unmapped or otherwise not properly paired.

* **reads_pp** - Number of reads flagged as properly paired by the
aligner.

* **reads_mate_unmapped** - Number of reads where the mate is
unmapped.

* **reads_mate_other_chr** - Number of reads where the mate is mapped
to another chromosome.

* **reads_mate_same_strand** - Number of reads where the mate is
mapped to the same strand.

* **reads_faceaway** - Number of reads where the read and its mate are
mapped facing away from each other.

* **reads_softclipped** - Number of reads where there is some
softclipping at some point in the read's alignment (not
necessarily at this position).

* **reads_duplicate** - Number of reads that are flagged as duplicate.

* **dp_normed_median** - Number of reads divided by the median number
of reads over all positions in the specified region, or whole
genome if no region specified.

* **dp_normed_mean** - Number of reads divided by the mean number of
reads over all positions in the specified region, or whole genome
if no region specified.

* **dp_percentile** - Percentile within which the number of reads falls
considering all positions in the specified region, or whole genome
if no region specified.

* **gc** - Percentage GC content in the reference at this position
(depends on window length and offset specified).

* **dp_normed_median_gc** - As *dp_normed_median* but normalised by
positions with the same percent GC composition.

* **dp_normed_mean_gc** - As *dp_normed_mean* but normalised by
positions with the same percent GC composition.

* **dp_percentile_gc** - As *dp_percentile* but only considering
positions with the same percent GC composition.

* **matches** - Number of reads where the aligned base matches the
reference.

* **mismatches** - Number of reads where the aligned base does not
match the reference (but is not a deletion).

* **deletions** - Number of reads where there is a deletion in the
alignment at this position.

* **insertions** - Number of reads where there is an insertion in the
alignment at this position.

* **A/C/T/G/N** - Number of reads where the aligned base is an A/C/T/G/N.

* **mean_tlen** - Mean value of outer distance between reads and their
mates for paired reads aligned at this position. N.B., leftmost
reads in a pair have a positive tlen, rightmost reads have a
negative tlen, so if there is no strand bias, this value should be
0.

* **rms_tlen** - Root-mean-square value of outer distance between
reads and their mates for paired reads aligned at this position.

* **std_tlen** - Standard deviation of outer distance between reads
and their mates for paired reads aligned at this position.

* **reads_mapq0** - Number of reads where mapping quality is zero.

* **rms_mapq** - Root-mean-square mapping quality for reads aligned at
this position.

* **max_mapq** - Maximum value of mapping quality for reads aligned at
this position.

* **rms_baseq** - Root-mean-square value of base qualities for bases
aligned at this position.

* **rms_baseq_matches** - Root-mean-square value of base qualities for
bases aligned at this position where the base matches the
reference.

* **rms_baseq_mismatches** - Root-mean-square value of base qualities
for bases aligned at this position where the base does not match
the reference.

 
File Type Py Version Uploaded on Size
pysamstats-0.7.tar.gz (md5) Source 2013-05-18 17KB
  • Downloads (All Versions):
  • 163 downloads in the last day
  • 606 downloads in the last week
  • 3864 downloads in the last month