SimLoRD is a read simulator for long reads from third generation sequencing and is currently focused on the Pacific Biosciences SMRT error model.

Project description

SimLoRD is a read simulator for third generation sequencing reads and is currently focused on the Pacific Biosciences SMRT error model.

Reads are simulated from both strands of a provided or randomly generated reference sequence.

Features

  • The reference can be read from a FASTA file or randomly generated with a given GC content. It can consist of several chromosomes, whose structure is respected when drawing reads. (Simulation of genome rearrangements may be incorporated at a later stage.)

  • The read lengths can be determined in four ways: drawing from a log-normal distribution (typical for genomic DNA), sampling from an existing FASTQ file (typical for RNA), sampling from a text file with integers (RNA), or using a fixed length.

  • Quality values and number of passes depend on fragment length.

  • Provided subread error probabilities are modified according to the number of passes.

  • Outputs reads in FASTQ format and alignments in SAM format.
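
As an illustration of the first read-length option listed above, here is a minimal sketch of drawing read lengths from a log-normal distribution using only the Python standard library. It is not SimLoRD's code: the function name `draw_read_lengths` and the values of `mu`, `sigma` and `min_length` are hypothetical and are not SimLoRD's defaults.

```python
import random


def draw_read_lengths(n, mu=8.5, sigma=0.6, min_length=50, seed=None):
    """Draw n integer read lengths from a log-normal distribution.

    mu and sigma are the mean and standard deviation of the underlying
    normal distribution. All default values here are illustrative
    assumptions, not SimLoRD's defaults.
    """
    rng = random.Random(seed)
    lengths = []
    while len(lengths) < n:
        length = int(rng.lognormvariate(mu, sigma))
        # Redraw lengths that are too short to be useful reads.
        if length >= min_length:
            lengths.append(length)
    return lengths


lengths = draw_read_lengths(1000, seed=42)
print(min(lengths), max(lengths))
```

A log-normal distribution is right-skewed, so most reads cluster around a typical length while a few are much longer.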

System requirements

We recommend using miniconda and creating an environment for SimLoRD:

# Create and activate a new environment called simlord
conda create -n simlord python=3 pip numpy scipy cython
source activate simlord

# Install packages that are not available with conda from pip
pip install pysam
pip install dinopy
pip install simlord

# You now have a 'simlord' script; try it:
simlord --help

# In case of a new version update as follows:
pip install simlord --upgrade

# To switch back to your normal environment, use
source deactivate

Platform support

SimLoRD is a pure Python program. This means that it runs on any operating system (OS) for which Python 3 and the other packages are available.

Example usage

Example 1: Simulate 10000 reads for the reference ref.fasta, use the default options for simulation and store the reads in myreads.fastq and the alignment in myreads.sam.

simlord  --read-reference ref.fasta -n 10000  myreads

Example 2: Generate a reference with 10 million bases and GC content 0.6 (i.e., probability 0.3 each for C and G, and thus 0.2 each for A and T), store the reference as random.fasta, and simulate 10000 reads with default options. Store the reads as myreads.fastq and do not store alignments.

simlord --generate-reference 0.6 10000000 --save-reference random.fasta\
        -n 10000 --no-sam  myreads
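
The relation between GC content and per-base probabilities in Example 2 can be sketched in a few lines of Python. This is an illustrative sketch, not SimLoRD's implementation; `base_probabilities` and `random_reference` are hypothetical helper names.

```python
import random


def base_probabilities(gc_content):
    """Split the GC content evenly between C and G, the rest between A and T."""
    gc = gc_content / 2
    at = (1 - gc_content) / 2
    return {"A": at, "C": gc, "G": gc, "T": at}


def random_reference(length, gc_content, seed=None):
    """Generate a random sequence whose bases follow the given GC content."""
    rng = random.Random(seed)
    probs = base_probabilities(gc_content)
    bases = list(probs)
    weights = [probs[base] for base in bases]
    return "".join(rng.choices(bases, weights=weights, k=length))


probs = base_probabilities(0.6)   # C and G get 0.3 each, A and T get 0.2 each
reference = random_reference(1000, 0.6, seed=0)
observed_gc = (reference.count("G") + reference.count("C")) / len(reference)
print(probs, observed_gc)
```

The observed GC fraction of a finite sequence only approximates the requested value and gets closer as the sequence grows.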

Example 3: Simulate reads from the given reference.fasta, using a fixed read length of 5000 and custom subread error probabilities (12% insertion, 12% deletion, 2% substitution). As before, save reads as myreads.fastq and myreads.sam.

simlord --read-reference reference.fasta  -n 10000 -fl 5000\
        -pi 0.12 -pd 0.12 -ps 0.02  myreads

A full list of parameters, as well as their documentation, can be found here.

Last Changes

Version 1.0.4 (2020-01-07)

Bugs fixed

  • Added missing else for parameter sam_output.

Other Changes

  • Changed read names.

  • New read name format: ‘m{read_number}/{read_length}/CCS read_information’

  • Added parameter --old-read-names for old read names where all information is encoded in one large string delimited by ‘;’.

Version 1.0.3 (2019-05-20)

New Features

  • Added new parameter --gzip to gzip the output reads FASTQ file.

  • If “-” is given instead of a filename, the reads are printed to stdout.

  • In this case, unless specified otherwise, the SAM file is named “reads.sam” and written to the current working directory.

Other Changes

  • Changed coverage parameter from int to float allowing fractional coverage values.

  • Changed delimiter in read id from _ to ;

  • Added chromosome name to read id

  • Changed id of mate read in sam file to result in “*” instead of “=”.

  • Changed fastq writing from text to byte writing to speed up I/O

Version 1.0.2 (2017-03-17)

New Features

  • Chromosomes for reads are now drawn with probability proportional to their length instead of uniformly. This leads to uniform read coverage across the chromosomes. The previous behaviour, with equal probability for each chromosome, can be activated with the parameter --uniform-chromosome-probability.

  • Parameter --coverage: Determine the number of reads from the desired read coverage of the whole reference genome.

  • Parameter --without-ns: Sample the reads only from regions that contain no Ns.

Warning: Using --without-ns may lead to biased read coverage, depending on the size of the contigs without Ns and the expected read length.
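
The two sampling ideas above can be sketched as follows. This is a hedged illustration, not SimLoRD's code: the chromosome names and lengths are made up, `reads_for_coverage` and `draw_chromosomes` are hypothetical helpers, and the read count uses the usual definition of coverage (total read bases divided by genome length), which may differ from how SimLoRD derives it internally.

```python
import math
import random

# Hypothetical chromosome lengths, for illustration only.
chromosomes = {"chr1": 6000, "chr2": 3000, "chr3": 1000}


def reads_for_coverage(coverage, genome_length, mean_read_length):
    """Estimate how many reads are needed to reach a target coverage."""
    return math.ceil(coverage * genome_length / mean_read_length)


def draw_chromosomes(chrom_lengths, n, uniform=False, seed=None):
    """Pick a source chromosome for each of n reads.

    By default chromosomes are weighted by length; uniform=True gives
    every chromosome the same probability.
    """
    rng = random.Random(seed)
    names = list(chrom_lengths)
    weights = None if uniform else [chrom_lengths[name] for name in names]
    return rng.choices(names, weights=weights, k=n)


genome_length = sum(chromosomes.values())
n_reads = reads_for_coverage(2.5, genome_length, mean_read_length=500)
picks = draw_chromosomes(chromosomes, n_reads, seed=1)
print(n_reads, {name: picks.count(name) for name in chromosomes})
```

With length-weighted drawing, a chromosome that is six times longer receives about six times as many reads, so the per-base coverage is similar across chromosomes. With uniform drawing, short chromosomes would be covered much more deeply than long ones.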

Bugs fixed

  • The CIGAR string sometimes had a wrong count for the last match because of a false extension after a deletion.

Version 1.0.1 (2017-01-03)

Bugs fixed

  • Removed nargs=1 at parameter --probability-threshold leading to an error when changing the parameter.

Version 1.0.0 (2016-07-13)

API Changes

  • Changed SEQ in SAM file to reverse complemented read instead of the original read for reads mapping to the reverse complement of the reference.

Example:

reference       ATCG     read   CAAT
true alignment  ||X|
                ATTG

Before: SEQ CAAT and CIGAR string 2=1X1=
Now:    SEQ ATTG and CIGAR string 2=1X1=
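
The example above can be reproduced with a short Python sketch. It is illustrative only and not taken from SimLoRD: `reverse_complement` and `extended_cigar` are hypothetical helpers, and the CIGAR function handles only ungapped alignments of equal length using the `=` and `X` operations.

```python
from itertools import groupby

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def extended_cigar(reference, read):
    """Build a CIGAR string with '=' (match) and 'X' (mismatch) only.

    Assumes an ungapped alignment of two sequences of equal length.
    """
    if len(reference) != len(read):
        raise ValueError("sequences must have the same length")
    ops = ("=" if r == q else "X" for r, q in zip(reference, read))
    # Collapse runs of identical operations into count + operation.
    return "".join(f"{len(list(group))}{op}" for op, group in groupby(ops))


seq = reverse_complement("CAAT")      # "ATTG", the SEQ stored since 1.0.0
cigar = extended_cigar("ATCG", seq)   # "2=1X1="
print(seq, cigar)
```

Storing the reverse-complemented read keeps SEQ on the forward strand of the reference, so it can be compared position by position with the CIGAR string.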

License

SimLoRD is Open Source and licensed under the MIT License.
