Skip to main content

Oxford Nanopore Technologies - ModBAM to bedMethyl

Project description

.

modbam_to_bedmethyl

modbam_to_bedmethyl is a tool to aggregate aligned, basecalled data, annotated with modifications, (a bam file with an MM+ML tag) into a reference space summary showing the reference coverage and fraction of reads with base modifications at those sites. These are output as a bedmethyl file.

Getting Started

The modbam_to_bedmethyl is available on PyPI and can be installed via pip:

pip install modbam-to-bedmethyl **this is not yet implemented as not released yet**

Alternatively, it is available on github where it can be built from source:

git clone https://github.com/nanoporetech/modbam-to-bedmethyl
pip install ./modbam_to_bedmethyl

To validate your installation, source the modbam_to_bedmethyl virtual environment and execute:

python -m unittest discover

Dependencies

modbam_to_bedmethyl is a pure python project that requires python>=3.6.

It also requires:

Interface

To covert from a modbam file to bedmethyl files use the modbam_to_bedmethyl module:

modbam_to_bedmethyl
[required]
      -i, --input_bam
                path to a sam/bam file.
      -m, --modified_threshold
                percentage threshold, equal to or above which a
                modification is considered to be present.
      -u, --unmodified_threshold
                percentage threshold, equal to or below which a
                modification is considered to not be present.
      -s, --save_path
                save path for bedmethyl file.
[optional]
      -t, --type_selector
                Output only a single modification type (default:
                output all modifications). Modification type specified
                by either character (e.g. m, h), abbreviation (5mC,
                6mA), or ChEBI code (e.g. 27551).
      -w, --max_workers
                Specify the maximum number of sub-process workers
                to launch (def. 40).

Example usage

Process a bam file and output a bedmethyl files for only the ‘m’ (a.k.a 5mC, 27551) modification to the current working directory. This will create the bedmethyl file input_bam_m.bed:

modbam_to_bedmethyl
    --input_bam input_bam.bam           # Specify the input file
    --save_path .                       # Specify the output directory
    --modified_threshold 0.66           # Modification thresholds
    --unmodified_threshold 0.33
    --type_selector m                   # Modification Selection
    --max_workers 40                    # Number of sub-process workers

Equivalently:

modbam_to_bedmethyl -i <path_to_bam> -s . -m <threshold> -u <threshold> -t m -w 40

Further Information

Filtering Reads

Only primary aligned reads with solvable modbase modification tags are processed, therefore any read which is secondary or supplementary is skipped.

If the aligned read has hard-clipped bases, there is sufficient loss of context that the modification annotation cannot be resolved accurately. Any read with hard-clipped bases is also skipped.

Strandedness

The modbase tag specification annotations state the strandedness of the modifications. This information allows for modifications on 2d reads to be annotated by indicating on which strand the modification was observed. The table below shows how forward and reverse complement reads with top (+) and bottom (-) modifications are resolved in reference space.

For example, a reverse complement aligned read (-) with modifications on the bottom-strand (-) will be resolved to the forward strand (+) aligned in the reference direction.

Strandedness in output file

SEQ direction according to FLAG

0x10 not set (i.e. forward alignment)

0x10 set (i.e. reverse-complement alignment)

Modification Tag Strandedness

+

+

-

-

-

+

Modification Types and ChEBI Codes

This application can process all modifications noted in the modbam specification in both the ‘bam character’ or ChEBI (Chemical Entities of Biological Interest) styles. (e.g. C+m; and/or C+27551;). The ChEBI identities and modification definitions of all supported modifications are found in modbam_to_bedmethyl/ChEBI/ChEBI.yaml.

Note that the molecule name spellings are those used by the modbam specification and not the ChEBI specification although they should be similar. The modification abbreviations used (e.g. 5mC) are defined by the modbam tag specification only and are not supported by ChEBI. These can be found at: https://github.com/samtools/hts-specs/pull/418/files#diff-e765c6479316309f56b636f88189cdde8c40b854c7bdcce9ee7fe87a4e76febcR534

The ChEBI identities are used under the Creative Commons License (CC BY 4.0).

Multiple Modifications

If there are multiple modifications at a single site, only the modification with the highest likelihood is recorded at that site, and so the lower scoring modification does not count towards the coverage or percent aligned.

Output file name

The output file will be of the form <bam_name>_<bam_character>.bed. Where ChEBI Identities are used, these uphold the output naming convention by translating the ChEBI identity to its bam character counterpart.

For example: ‘alignment_1.bam’ gives [‘alignment_1_m.bed’, ‘alignment_1_h.bed’]

Indexing in basecall space

MM tags list the number of occurrences of the named base in the basecall sequence between each modified base of that name. So a tag of C+m,1,0,1 indicates modifications on the 2nd, 3rd and 5th C-base in the sequence. All indexing in the MM tags is done in basecall-space in the original direction of the read, i.e. before a reference alignment. In the case of reverse complement alignments the basecall_sequence is _not_ the query_sequence listed in the bam file, which is given in the reference direction, and a reverse complement is required to recover the original basecall sequence from the bam file.

In the case where reads have been incorrectly annotated you will commonly see an IndexError as it will attempt to identify bases which may be beyond the scope of the basecall sequence.

Viewing the Sequence with Annotated Modifications

To view the basecalled sequence with annotated modifications the print_sequence_modifications script is available:

print_sequence_modifications --help

This script simply writes to stdout the basecalled sequence in the following format:

base{{modification}{percentage}...}.

For example:

Single Modifications:   Cm99
Multiple Modifications: Cm62h12
ChEBI Modifications:    C(27551)55

The output is written as two columns with the basecalled sequence base on the left and the complement base on the right. The output is always in the as-basecalled direction.

For example, given a forward read with:

query_sequence = 'AGGTT'
Modifications tags = 'Mm:Z:T+g,0;G-m,1; Ml:B:C,196,251'

We would observe:

A         T
G         C
G         Cm98
Tg76      A
T         A

This example is tested in test/data/synthetic_reads/print_sequence_examples/docstr.sam

Project details


Supported by

AWS AWS Cloud computing and Security Sponsor Datadog Datadog Monitoring Fastly Fastly CDN Google Google Download Analytics Microsoft Microsoft PSF Sponsor Pingdom Pingdom Monitoring Sentry Sentry Error logging StatusPage StatusPage Status page