
# bcbio-nextgen 0.4a

Automated, distributed next-gen sequencing pipeline; includes Galaxy interaction


## Overview

Python scripts and modules for automated next gen sequencing analysis.
These provide a fully automated pipeline for taking sequencing results
from an Illumina sequencer, converting them to standard Fastq format,
aligning to a reference genome, doing SNP calling, and producing a
summary PDF of results.

The pipeline runs on single multicore machines, in compute clusters
managed by LSF or SGE, or on the Amazon cloud. [This tutorial][o5]
describes running the pipeline on Amazon with [CloudBioLinux][o6] and
[CloudMan][o7].

The scripts are tightly integrated with the [Galaxy][o1]
web-based analysis tool. Tracking of samples occurs via a web-based LIMS
system, and processed results are uploaded into Galaxy Data Libraries for
researcher access and additional analysis. See the
[installation instructions for the front end][o2] and a
[detailed description of the full system][o3].

![Overview of workflow][o4]

[o1]: http://galaxy.psu.edu/
[o2]: https://bitbucket.org/galaxy/galaxy-central/wiki/LIMS/nglims
[o3]: http://bcbio.wordpress.com/2011/01/11/next-generation-sequencing-information-management-and-analysis-system-for-galaxy/
[o4]: http://chapmanb.github.com/bcbb/nglims_organization.png
[o5]: http://bcbio.wordpress.com/2011/08/19/distributed-exome-analysis-pipeline-with-cloudbiolinux-and-cloudman/
[o6]: http://cloudbiolinux.org
[o7]: http://wiki.g2.bx.psu.edu/Admin/Cloud

## Code structure

The main scripts that handle automation of the analysis and storage
are:

* `scripts/illumina_finished_msg.py` -- Sits on the sequencer
  output machine; run via a cron job every hour to check for new
  output runs.

* `scripts/nextgen_analysis_server.py` -- Main server script;
  runs specific servers for top level analysis management, storage, or
  distributed processing of individual analysis steps:

    * Called with `-q toplevel` -- Coordinate sample processing,
      running the full automated analysis and optionally uploading
      results to Galaxy. `illumina_finished_msg.py` and a Galaxy
      graphical front end both send messages on this queue for starting
      processing.

    * Called with no queue arguments -- Run individual steps in the
      analysis pipeline. Start multiple servers on distributed machines
      connected with a shared filesystem to allow scaling on a
      cluster or Amazon EC2.

    * Called with `-q storage` -- Manage long term storage of larger
      files, like qseq and images.

Specify system-specific information in configuration files:

* `config/transfer_info.yaml` -- Configuration on the sequencing
  machine, specifying where to check for new sequencing data.
* `config/post_process.yaml` -- Configuration for analysis and
  storage. This contains links to Galaxy, program locations and
  customization for processing algorithms.
* `config/universe_wsgi.ini` -- Variables used from your Galaxy
  server configuration, including RabbitMQ details for communication
  between the sequencing and analysis machines.
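
As an orientation, a stripped-down `post_process.yaml` might look like the
sketch below. The key names shown are illustrative only; the example file
shipped in the `config` directory is the authoritative template:

    # illustrative sketch -- start from config/post_process.yaml and adjust
    galaxy_url: http://your.galaxy.server:8080
    program:                  # locations of third-party tools
      bwa: bwa
      samtools: samtools
      picard: /usr/local/share/java/picard
    algorithm:                # processing customization
      aligner: bwa
      num_cores: 8            # or "messaging" for distributed runs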

Scripts involved in the processing:

* `scripts/automated_initial_analysis.py` -- Drives the high level analysis of
  sequencing lanes based on information specified through the Galaxy LIMS
  system or in a YAML configuration file. Also produces a PDF summary file with
  statistics on alignments, duplicates, GC distribution, quality scores,
  and other metrics of interest.
* `scripts/upload_to_galaxy.py` -- Handles storing and uploading Fastq,
  alignment, analysis and summary files to Galaxy.
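
When run directly, `automated_initial_analysis.py` takes the
`post_process.yaml` configuration, the directory of fastq files and a
`run_info.yaml` file describing the samples (see Parallel execution below).
Paths in this sketch are placeholders for your own layout:

    automated_initial_analysis.py post_process.yaml /path/to/fastq_dir run_info.yaml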

## Installation

### Required libraries and data files

The code drives a number of next-generation sequencing analysis
tools; install these on any machines involved in the processing. The
[CloudBioLinux][i2] and [Galaxy CloudMan][i3] projects contain
automated scripts to help with installation:

* [Install bioinformatics software][i4]
* Install data files, like genome builds, in association with
  Galaxy: see [my script][i5] and the scripts [from the Galaxy team][i6].

Alternatively, they can be installed manually; the Requirements section below
lists all software used by the pipeline.

### Scripts and configuration

Clone a copy of the code from GitHub:

      git clone https://github.com/chapmanb/bcbb.git

Use a recent version of Python 2 (2.6 or 2.7), and install with:

      cd bcbb/nextgen && python setup.py install

The setup script installs the required Python library dependencies.

Copy the YAML and ini files in `config` and adjust them to match your
environment. It is also a good idea to add any third-party binaries you
are using to your `$PATH`.
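
For example (directory and tool locations here are placeholders for your own
environment):

    mkdir -p ~/my-nextgen-config
    cp config/post_process.yaml config/universe_wsgi.ini ~/my-nextgen-config/
    # make third-party binaries (bwa, samtools, Picard wrappers, ...) visible
    export PATH=/usr/local/bioinf/bin:$PATH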

### Parallel execution

The pipeline runs in parallel in two different ways:

* multiple cores -- Analyses will run in parallel using multiple cores
  on a single machine. This requires only the `multiprocessing`
  Python library, included by default with most Python
  installations. Change `num_cores` in your `post_process.yaml` file
  to specify the number of parallel cores to use.

* parallel messaging -- This allows scaling beyond the cores on a
  single machine. It requires multiple machines with a shared
  filesystem, with communication handled using RabbitMQ messaging.
  This is ideally suited for clusters.

To enable parallel messaging:

1. Configure RabbitMQ as described below. Ensure all processing
   machines can talk to the RabbitMQ server on port 5672. Update
   `universe_wsgi.ini` to contain the server details.

2. Change `num_cores` in `post_process.yaml` to `messaging`
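
The relevant setting in `post_process.yaml` (matching whatever nesting your
existing file uses for `num_cores`) then becomes:

    num_cores: messaging   # previously an integer such as 8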

The `distributed_nextgen_pipeline.py` helper script runs pipelines in a
distributed cluster environment. It takes care of starting worker
nodes, running the processing, and then cleaning up after jobs:

1. Edit your `post_process.yaml` file to set parameters in the
  `distributed` section corresponding to your environment: this
  includes the type of cluster management, arguments to start jobs,
  and the number of workers to start.

2. Execute `distributed_nextgen_pipeline.py` using the same arguments as
   `automated_initial_analysis.py`: the `post_process.yaml`
   configuration file, the directory of fastq files, and a
   `run_info.yaml` file specifying the fastq details, barcodes,
   and the types of analyses to run.
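
A typical invocation therefore looks like this, with the fastq directory path
as a placeholder:

    distributed_nextgen_pipeline.py post_process.yaml /path/to/fastq_dir run_info.yaml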

If you have a different architecture, you can run the distributed
processing by hand:

1. Start the processing server, `nextgen_analysis_server.py` on each
   processing machine. This takes one argument, the
   `post_process.yaml` file (which references the `universe_wsgi.ini`
   configuration).

2. Run the analysis script `automated_initial_analysis.py`. This will
   offload parallel work to the workers started in step 1, but it also
   does processing itself, so it should also be submitted as a job in
   cluster environments.
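
A minimal sketch of this manual setup (how you submit these as jobs depends on
your scheduler; paths are placeholders):

    # on each processing machine, or submitted as a cluster job
    nextgen_analysis_server.py post_process.yaml &

    # then drive the analysis, also submitted as a job on clusters
    automated_initial_analysis.py post_process.yaml /path/to/fastq_dir run_info.yaml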

### Testing

The test suite exercises the scripts driving the analysis, so it is
a good starting point for ensuring a correct installation.
Run tests from the main code directory using [nose][i7]. To test the main
variant calling pipeline:

     $ cd tests
     $ nosetests -v -s test_automated_analysis.py:AutomatedAnalysisTest.test_1_variantcall


To run the full test suite:

     $ nosetests -v -s

`tests/test_automated_analysis.py` exercises the full framework using
an automatically downloaded test dataset. It runs through barcode
deconvolution, alignment and full SNP analysis. Tweak the test
configuration for your environment:

* `tests/data/automated/post_process.yaml` -- May need adjustment to
  point to installed software in non-standard locations. Change the
  `num_cores` parameter to test multi-processor and parallel
  execution.
* `tests/data/automated/universe_wsgi.ini` -- Defines the location of
  the RabbitMQ server for messaging based parallel execution during
  tests.
* `tests/data/automated/run_info.yaml` -- The `analysis` variable can be
  changed to 'Standard' if SNP calling is not required in your environment;
  this runs a smaller pipeline of alignment and analysis.
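
For example, a minimal entry using the lighter pipeline would mirror the
`run_info.yaml` format shown in the Variant calling section below:

    - analysis: Standard
      algorithm:
        quality_format: Standard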

### RabbitMQ messaging server

RabbitMQ messaging manages communication between the sequencing
machine and the analysis machine. This allows complete separation
between all of the machines. The RabbitMQ server can run anywhere; an
easy solution is to install it on the Galaxy and analysis server:

        (yum or apt-get) install rabbitmq-server

Set up RabbitMQ for passing Galaxy and processing messages:

        rabbitmqctl add_user <username> <password>
        rabbitmqctl add_vhost bionextgen
        rabbitmqctl set_permissions -p bionextgen <username> ".*" ".*" ".*"

Then adjust the `[galaxy_amqp]` section of your `universe_wsgi.ini`
Galaxy configuration file. An example configuration is available in
the config directory; you'll need to specifically change these three
values:

        [galaxy_amqp]
        host = <host you installed the RabbitMQ server on>
        userid = <username>
        password = <password>

### ssh keys

The sequencing, analysis and storage machines transfer files using
secure copy. This requires that you can securely copy files between
machines without passwords, using [ssh public key authentication][i1].
You want to enable password-less ssh for the following machine
combinations:

* Analysis server to `illumina_finished_msg` machine
* Storage server to `illumina_finished_msg` machine
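
A standard recipe for setting this up (user and host names are placeholders)
is:

    # on the analysis or storage server
    ssh-keygen -t rsa                          # accept defaults, empty passphrase
    ssh-copy-id seq_user@sequencer.example.org
    ssh seq_user@sequencer.example.org true    # should not prompt for a password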

### Sequencing machines

The sequencer automation has been fully tested using Illumina GAII and
HiSeq sequencing machines. The framework is general and supports other
platforms; we welcome feedback from researchers with different
machines at their institutions.

Illumina machines produce run directories that include the date, machine
identifier, and flowcell ID:

    110216_HWI-EAS264_00035_FC638NPAAXX

A shortened name, with just date and flowcell ID, is used to uniquely
identify each flowcell during processing.

### Development environment

The installation instructions assume that you have full root access to install
Python modules and packages (a production environment). If this is not the
case, you can install a Python virtualenv and other tools automatically in
your `$HOME` to ease development, using the following script:

        wget https://bitbucket.org/brainstorm/custom_env/raw/1cd4f4ae27d5/pyHost.sh && bash pyHost.sh

[i1]: http://macnugget.org/projects/publickeys/
[i2]: http://cloudbiolinux.org
[i3]: http://www.biomedcentral.com/1471-2105/11/S12/S4
[i4]: https://github.com/chapmanb/cloudbiolinux/blob/master/fabfile.py
[i5]: https://github.com/chapmanb/cloudbiolinux/blob/master/data_fabfile.py
[i6]: https://bitbucket.org/afgane/mi-deployment/src
[i7]: http://somethingaboutorange.com/mrl/projects/nose/

## Requirements

### Next gen analysis

* [Picard][3]
* [bowtie][5] or [bwa][5b]
* [FastQC][6]
* [GATK][4] (version 1.2; only for variant calling)
* [snpEff][16] (version 2.0.2; only for variant calling)

[3]: http://picard.sourceforge.net/
[4]: http://www.broadinstitute.org/gsa/wiki/index.php/The_Genome_Analysis_Toolkit
[5]: http://bowtie-bio.sourceforge.net/
[5b]: http://bio-bwa.sourceforge.net/
[6]: http://www.bioinformatics.bbsrc.ac.uk/projects/fastqc/
[16]: http://sourceforge.net/projects/snpeff/

### Processing infrastructure

* RabbitMQ for communication between machines
* LaTeX and pdflatex for report generation

### Optional software for generating report graphs

* R with the ggplot2, plyr and sqldf libraries.
* [rpy2][11]. Build R with shared libraries available: `./configure --enable-R-shlib`.

### Python modules installed with the package

* [biopython][10]
* [pysam][12]
* [mako][13]
* [PyYAML][14]
* [logbook][17]
* [celery][18]
* [numpy][19]

[10]: http://biopython.org
[11]: http://rpy.sourceforge.net/rpy2.html
[12]: http://code.google.com/p/pysam/
[13]: http://www.makotemplates.org/
[14]: http://pyyaml.org/
[17]: http://packages.python.org/Logbook
[18]: http://celeryproject.org/
[19]: http://www.numpy.org/

## Variant calling

Broad's [GATK][4] pipeline drives variant (SNP and indel) analysis. This
requires some associated data files, and also has some configurable options. The
relevant section from the `post_process.yaml` file is:

    coverage_depth: "low" # other options: high
    coverage_interval: "exome" # other options: genome, regional
    dbsnp: variation/dbsnp_132.vcf
    train_hapmap: variation/hapmap_3.3.vcf
    train_1000g_omni: variation/1000G_omni2.5.vcf
    train_indels: variation/Mills_Devine_2hit.indels.vcf

The dbSNP and training files are from the [GATK resource bundle][v1]. These are
inputs into the training models for recalibration. The automated
[CloudBioLinux][o6] data scripts will download and install these in the
variation subdirectory relative to the genome files.

The `coverage_depth` and `coverage_interval` defaults can be adjusted within
individual `run_info.yaml` files describing the samples. For example, a fastq
file with standard Phred quality scores, full genome coverage and high
sequencing depth would be described as:

    - analysis: SNP calling
      algorithm:
        quality_format: Standard
        coverage_interval: genome
        coverage_depth: high

[v1]: http://www.broadinstitute.org/gsa/wiki/index.php/GATK_resource_bundle

## Configuration options

The YAML configuration file provides a number of hooks to customize analysis.
Place these under the `algorithm` keyword. For variant calling:

- `aligner` Aligner to use: [bwa, bowtie, bowtie2, mosaik, novoalign, false]
- `trim_reads` Whether to trim off 3' B-only ends from fastq reads [false, true]
- `align_split_size`: Split FASTQ files into specified number of records per
  file. Allows parallelization at the cost of increased temporary disk space usage.
- `variantcaller` Variant calling algorithm [gatk, freebayes]
- `quality_format` Quality format of fastq inputs [illumina, standard]
- `coverage_interval` Regions covered by sequencing. Influences GATK
  options for filtering [exome, genome, regional]
- `coverage_depth` Depth of sequencing coverage. Influences GATK variant
  calling [high, low]
- `hybrid_target` BED file with target regions for hybrid selection experiments.
- `variant_regions` BED file of regions to call variants in.
- `ploidy` Ploidy of called reads. Defaults to 2 (diploid).
- `recalibrate` Perform variant recalibration [true, false]
- `mark_duplicates` Identify and remove duplicate reads [false, true]
- `realign` Do variant realignment [true, false]
- `write_summary` Write a PDF summary of results [true, false]
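
Combining several of these options, a per-sample entry in `run_info.yaml`
might look like the following; the values are illustrative only:

    - analysis: SNP calling
      algorithm:
        aligner: bwa
        quality_format: Standard
        variantcaller: gatk
        coverage_interval: exome
        coverage_depth: low
        mark_duplicates: true
        recalibrate: true
        realign: true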

Global reference files for variant calling and assessment:

- `train_hapmap`, `train_1000g_omni`, `train_indels` Training files for GATK
  variant recalibration.
- `call_background` Background VCF to use for calling.

## Internals: files generated by this pipeline

### Initial Fastq files (pre-analysis)

After basecalling, a number of fastq files are generated via the [solexa_qseq_to_fastq.py](https://github.com/brainstorm/bcbb/blob/master/nextgen/scripts/solexa_qseq_to_fastq.py) script:

    1_081227_B45GT6ABXX_1_fastq.txt
    1_081227_B45GT6ABXX_2_fastq.txt
    (...)

Their naming template is:

    lane_date_fcid_<1|2>_fastq.txt

where 1 and 2 are the forward and reverse reads, respectively. Only reads that
pass the quality filter (PF) are kept. Quality scores are not converted to
Sanger, but are left in the format produced by the instrument (currently
Illumina 1.3+). Dots in the input are converted into Ns in the fastq output.

## Post-processing generated files

These are distributed across three locations: the alignments directory, the barcode directories and the top level run directory.

### Top level directory

These comprise plain text files, images and structured data, useful both
for automated processing and for human inspection of different aspects of
the sequencing results.

* `run_summary.yaml`

Contains a structured view of the global parameters of the run.

    - barcode_id: '2'
      barcode_type: SampleSheet
      lane: '8'
      metrics:
        Aligned: '1204752'
        Pair duplicates: '881332'
        Read length: '101'
        Reads: '11594437'
      request: ''
      researcher: ''
      sample: ''

* PDF files

Looking at a specific sample in a lane, we find the different
sub-components that make up the final summary report (`*-summary.pdf`).
The summary contains GC content, read quality figures and an insert size
histogram.

    6_110126_B816J0ABXX_5_1_fastq_txt_quality.pdf
    6_110126_B816J0ABXX_5_2_fastq.txt.segments.hist.pdf
    6_110126_B816J0ABXX_5-sort-dup-gc.pdf
    6_110126_B816J0ABXX_5_1_fastq.txt.segments.hist.pdf
    6_110126_B816J0ABXX_5-sort_1_fastq_qual.pdf
    6_110126_B816J0ABXX_5-sort-dup-insert.pdf
    6_110126_B816J0ABXX_5_2_fastq_txt_quality.pdf
    6_110126_B816J0ABXX_5-sort_2_fastq_qual.pdf
    6_110126_B816J0ABXX_5-sort-summary.pdf

* BigWig files

Per-sample coverage files converted with `wigToBigWig`.

* `*_metrics` files

These files contain plain text metric values from Picard, namely:

    1_110126_B816J0ABXX_5-sort-dup.align_metrics
    1_110126_B816J0ABXX_5-sort-dup.gc_metrics
    1_110126_B816J0ABXX_5-sort-dup.dup_metrics
    1_110126_B816J0ABXX_5-sort-dup.insert_metrics

The metrics are parsed later and used to generate plots. For instance,
`sort-dup.align_metrics` stores the output from
`net.sf.picard.analysis.CollectAlignmentSummaryMetrics`.
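
The files follow Picard's standard metrics layout: comment lines starting
with `#`, then a tab-delimited header row followed by one or more value rows.
A minimal Python sketch of reading such a file (not the pipeline's own parser,
just an illustration) is:

    def parse_picard_metrics(fname):
        """Parse the first metrics table from a Picard *_metrics file."""
        header, rows = None, []
        with open(fname) as in_handle:
            for line in in_handle:
                line = line.rstrip("\n")
                if line.startswith("#"):
                    continue
                if not line.strip():
                    if header:  # a blank line after the table ends it
                        break
                    continue
                fields = line.split("\t")
                if header is None:
                    header = fields
                else:
                    rows.append(dict(zip(header, fields)))
        return rows

    # metrics = parse_picard_metrics("1_110126_B816J0ABXX_5-sort-dup.dup_metrics")
    # print(metrics[0].get("PERCENT_DUPLICATION"))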

### alignments directory

Contains the alignment results for each sample. As the listing below
shows, lane 1, barcode id 5 has been aligned and stored in [SAM][in1] and
BAM formats. For convenience, for example to facilitate SNP calling, a
sorted BAM file is also generated.

    1_110126_B816J0ABXX_5.sam
    1_110126_B816J0ABXX_5.bam
    1_110126_B816J0ABXX_5-sort.bam
    1_110126_B816J0ABXX_5_1_fastq.bam

### *_barcode directories

These contain fastq files following the naming scheme described above. They are the result of the demultiplexing process; the "unmatched" files contain reads that did not pass the approximate barcode matching algorithm:

    4_110126_B816J0ABXX_1_1_fastq.txt
    4_110126_B816J0ABXX_5_2_fastq.txt
    4_110126_B816J0ABXX_1_2_fastq.txt
    4_110126_B816J0ABXX_6_1_fastq.txt
    4_110126_B816J0ABXX_5_1_fastq.txt
    4_110126_B816J0ABXX_6_2_fastq.txt

    4_110126_B816J0ABXX_unmatched_1_fastq.txt
    4_110126_B816J0ABXX_unmatched_2_fastq.txt

    4_110126_B816J0ABXX_bc.metrics
    SampleSheet-barcodes.cfg

`*-barcodes.cfg` contains a simple mapping between barcode IDs and the actual barcode sequences:

    3 ATCACGA
    2 ACTTGAA
    9 TAGCTTA
    (...)

The `_bc.metrics` file gives the read count for each barcode:

    2   11594437
    3   20247932
    9   14390566
    unmatched   908420

Barcodes are added to the 3' end of the first sequence; that way, the
handling remains platform-independent and is easy to manage downstream.
This [GitHub discussion][in2] explains how demultiplexing works. The
demultiplexing is performed by the [barcode_sort_trim.py][in3] script.
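
As a simplified illustration of the approximate matching idea (this is not
the actual `barcode_sort_trim.py` implementation), matching the 3' end of a
read against the configured barcodes might look like:

    def hamming(a, b):
        """Number of mismatching positions between two equal-length strings."""
        return sum(1 for x, y in zip(a, b) if x != y)

    def assign_barcode(read_seq, barcodes, max_mismatch=1):
        """Return (barcode_id, trimmed_read), or ("unmatched", read_seq).

        barcodes maps barcode ids to sequences, as in SampleSheet-barcodes.cfg.
        """
        for bc_id, bc_seq in barcodes.items():
            tail = read_seq[-len(bc_seq):]
            if hamming(tail, bc_seq) <= max_mismatch:
                return bc_id, read_seq[:-len(bc_seq)]
        return "unmatched", read_seq

    # assign_barcode("ACGTACGTACTTGAA", {"2": "ACTTGAA", "3": "ATCACGA"})
    # -> ("2", "ACGTACGT")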

[in1]: http://bioinformatics.oxfordjournals.org/content/early/2009/06/08/bioinformatics.btp352.short
[in2]: https://github.com/chapmanb/mgh_projects/commit/3387d82f3496025ad13b69e8d9cbb47cf6ee2af9#nglims_paper/nglims_galaxy.tex-P57
[in3]: https://github.com/brainstorm/bcbb/blob/master/nextgen/scripts/barcode_sort_trim.py
 