No project description provided

These details have not been verified by PyPI

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- MacOS :: MacOS X
- POSIX
Programming Language
- Python
- Rust

Project description

pyd4 - Python Binding for D4 File Format

This python module allows python read and write D4 file. It provides API for Python user effeicnet way to handle genomic quantatitive data, provides effecient routines to summarize, load and dump data from and to D4 format.

Also it provides a very effecient way to profile the covarage from a BAM file and load it to numpy array (typically the entire process only takes less than 2 minutes).

Installation

Note: PyD4 doesn't support Python2 or earlier, please use Python3 or later.

Install through pip is recommended

pip install pyd4

or you can also install from source with the setup.py:

git clone https://github.com/38/d4-format.git
cd d4-format/pyd4
./setup.py install

Quick start by Example

Here's some basic example to use the package.

from pyd4 import D4File

# Open a D4 File
file = D4File("test.d4")

# Print the chrom list
print(file.chroms())

# Get the mean cov for region chr1:10000000-20000000
print(file.mean("chr1:10000000-20000000"))

# Get a iterator over values
for i in file.value_iter("chr1", 0, 10000):
	print(i)

# Load the values to numpy 
data = file["chr1:0-10000"]

Use PyD4 with NumPy

PyD4 can be use with NumPy effeciently. It can load data from a D4 file as a numpy array for further analysis. For example

from pyd4 import D4File

# Open a D4 File
file = D4File("test.d4")

# Load chr1 as np array (this will take < than 1s)
per_base_depth = file.load_to_np("1")

# Then we can count the number of locus that is greater than 30 with numpy API
print((per_base_depth > 30).sum())

Alternatively, you can also use the index operator for that

per_base_depth_2 = file["2"]

It's possible to load a region from chromosome instead of the entire chromosome.

per_base_from_1m = file["3:1000000-"]
per_base_first_1m = file["3:-1000000"]
per_base_12345_to_22345 = file["3:12345-22345"]

Use PyD4 as a Bam Coverage Profiler

It's possible that we use PyD4 to get per-base coverage of a BAM within < 2min!

import pyd4

# Create D4 file from a BAM input
d4_file = pyd4.bam_to_d4("input.bam")

chr1_per_base = d4_file.load_to_np("1")

# Print number of locus that is > 30
print((chr1_per_base > 30).sum())

Dump NumPy array as D4File

import pyd4

input_file = pyd4.D4File("input.d4")

chr1_data = input_file["1"]

chr1_flags = chr1_data > 64

# create_on_same_genome will create a new D4 file that copies the same genome size from input_file and the list ["1"] tells the API only copy the chromosome 1
# for_bit_array tells PyD4 that this output should be optimized for a boolean array
# and finally we call get_writer to get the writer
output_file = input_file.create_on_same_genome("output.d4", ["1"]).for_bit_array().get_writer()

# Then we can dump the numpy array to the D4 file
# The first parameter specifies the chromosome we want to write
# The second parameter specifies the locus in the genome to write the first value of the np array
# The last parameter is the actual np array
output_file.write_np_array("1", 0, chr1_flags)

Fast Summarize

One of the key advantage of D4 is it provide a highly effecient way to summarize the data on multi-core CPUs. D4Py also provides the API that exposes those feature to Python users. Although most of the summarize task can be done with load_to_np API and numpy routines, but numpy doesn't support multicore CPU effeciently. Thus the summarize API is a faster way to summarize data.

To get the mean depth of chromosome 1

import pyd4

input_file = pyd4.D4File("input.d4")

# Slower way (single threaded) to compute the mean depth with numpy
np_array = input_file["1"]
print(np_array.mean())

# Faster way (parallel) to compute the same summary
print(input_file.mean("1"))

D4 also provides a high effecient way to perform batch summarize (For example down sample chromosome one per 1000 base pair window).

import pyd4

input_file = pyd4.D4File("input.d4")

down_sampled_chr1 = input_file.resample("1", bin_size = 1000)
print(down_sampled_chr1)

Changelog

0.3.1.1

Added documentation to the pypi page
Fixed minor bugs

Project details

These details have not been verified by PyPI

View statistics for this project via Libraries.io, or by using our public dataset on Google BigQuery

Development Status
- 3 - Alpha
Intended Audience
- Developers
License
- OSI Approved :: MIT License
Operating System
- MacOS :: MacOS X
- POSIX
Programming Language
- Python
- Rust

Release history Release notifications | RSS feed

This version

0.3.9

Jul 29, 2023

0.3.6.2

Jun 30, 2022

0.3.6.1

Jun 30, 2022

0.3.6

Apr 7, 2022

0.3.5

Mar 17, 2022

0.3.1.2

Mar 1, 2022

0.3.1.1

Mar 1, 2022

0.3.1

Jan 18, 2022

0.3.0

Dec 7, 2021

0.1.15

Oct 7, 2021

0.1.14

Oct 6, 2021

0.1.13

Oct 23, 2020

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

pyd4-0.3.9.tar.gz (14.5 kB view hashes)

Uploaded Jul 29, 2023 Source

Hashes for pyd4-0.3.9.tar.gz

Hashes for pyd4-0.3.9.tar.gz
Algorithm	Hash digest
SHA256	`52021139315c65d63b68cc1805a107971ba70aee600e324c9565da1b69318c36`
MD5	`312a9d369531c4615021be2e202ae854`
BLAKE2b-256	`5508d8848894ab55a0e541f556324fd84fa33641ecda8708d07f747fb62d258b`