Convert bioinformatics data to Zarr

Development Status
- 3 - Alpha
Intended Audience
- Science/Research
License
- OSI Approved :: Apache Software License
Operating System
- OS Independent
Programming Language
Topic
- Scientific/Engineering

Project description

bio2zarr

Convert bioinformatics file formats to Zarr

Initially supports converting VCF to the sgkit vcf-zarr specification

This is early alpha-status code: everything is subject to change, and it has not been thoroughly tested

Install

$ python3 -m pip install bio2zarr

This will install the programs vcf2zarr, plink2zarr and vcf_partition into your local Python path. You may need to update your $PATH to call the executables directly.

Alternatively, calling

$ python3 -m bio2zarr vcf2zarr <args>

is equivalent to

$ vcf2zarr <args>

and will always work.

vcf2zarr

Convert a VCF to zarr format:

$ vcf2zarr convert <VCF1> <VCF2> <zarr>

This converts the given VCF file(s) to Zarr in a single step. Do not use this for anything but the smallest files.

The recommended approach is to use a multi-stage conversion.

First, convert the VCF into the intermediate format:

vcf2zarr explode tests/data/vcf/sample.vcf.gz tmp/sample.exploded

Then (optionally) inspect this representation to get a feel for your dataset:

vcf2zarr inspect tmp/sample.exploded

Then, (optionally) generate a conversion schema to describe the corresponding Zarr arrays:

vcf2zarr mkschema tmp/sample.exploded > sample.schema.json

View and edit the schema, deleting any columns you don't want, or tweaking dtypes and compression settings to your taste.

Finally, encode to Zarr:

vcf2zarr encode tmp/sample.exploded tmp/sample.zarr -s sample.schema.json

Use the -p, --worker-processes argument to control the number of workers used in the explode and encode phases.
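
For scripted use, the explode and encode stages above could be driven from Python. This is only a sketch, not part of bio2zarr: `build_pipeline` and `run_pipeline` are hypothetical helper names, and it assumes that `-p` takes an integer worker count and can follow the positional arguments in the same way `-s` does in the encode example. The optional inspect and mkschema steps are left out because mkschema writes the schema to standard output.

```python
import shlex
import subprocess


def build_pipeline(vcf, exploded, zarr_path, schema=None, workers=None):
    """Return argv lists for the explode and encode stages, in order."""
    # Assumption: -p takes an integer worker count and may follow the
    # positional arguments, like -s in the encode example above.
    worker_args = [] if workers is None else ["-p", str(workers)]
    explode = ["vcf2zarr", "explode", vcf, exploded, *worker_args]
    encode = ["vcf2zarr", "encode", exploded, zarr_path, *worker_args]
    if schema is not None:
        # Only pass -s when a schema file was generated and edited.
        encode += ["-s", schema]
    return [explode, encode]


def run_pipeline(commands):
    """Run each stage in turn, stopping at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)


commands = build_pipeline(
    "tests/data/vcf/sample.vcf.gz",
    "tmp/sample.exploded",
    "tmp/sample.zarr",
    schema="sample.schema.json",
    workers=4,
)
for argv in commands:
    # Print only; call run_pipeline(commands) to actually execute the stages.
    print(shlex.join(argv))
```

Building the commands as argument lists rather than shell strings avoids quoting problems with unusual file names.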

Shell completion

To enable shell completion for a particular session in Bash do:

eval "$(_VCF2ZARR_COMPLETE=bash_source vcf2zarr)"

If you add this to your .bashrc, vcf2zarr shell completion should be available in all new shell sessions.

See the Click documentation for instructions on how to enable completion in other shells.

plink2zarr

Convert a plink .bed file to zarr format. This is incomplete.

vcf_partition

Partition a given VCF file into (approximately) a given number of regions:

vcf_partition 20201028_CCDG_14151_B01_GRM_WGS_2020-08-05_chr20.recalibrated_variants.vcf.gz -n 10

gives

chr20:1-6799360
chr20:6799361-14319616
chr20:14319617-21790720
chr20:21790721-28770304
chr20:28770305-31096832
chr20:31096833-38043648
chr20:38043649-45580288
chr20:45580289-52117504
chr20:52117505-58834944
chr20:58834945-

These region strings can then be used to split computation on the VCF into chunks for parallelisation.
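
Before handing the regions to worker processes it can help to parse them into numbers. The following is a minimal sketch, not part of bio2zarr: `parse_region` and `is_contiguous` are hypothetical helpers, and it assumes the strings are `contig:start-end` with inclusive coordinates, where an empty end (as in the last line of the output above) means "to the end of the contig".

```python
import re

# contig:start-end, where end may be empty (open-ended final region).
REGION_RE = re.compile(r"^(?P<contig>[^:]+):(?P<start>\d+)-(?P<end>\d*)$")


def parse_region(region):
    """Split a region string into (contig, start, end); end is None if open."""
    match = REGION_RE.match(region.strip())
    if match is None:
        raise ValueError(f"not a region string: {region!r}")
    end = int(match["end"]) if match["end"] else None
    return match["contig"], int(match["start"]), end


def is_contiguous(regions):
    """True if each region starts one base after the previous one ends."""
    parsed = [parse_region(r) for r in regions]
    for (contig1, _, end1), (contig2, start2, _) in zip(parsed, parsed[1:]):
        if contig1 != contig2 or end1 is None or start2 != end1 + 1:
            return False
    return True


# The vcf_partition output shown above.
regions = """\
chr20:1-6799360
chr20:6799361-14319616
chr20:14319617-21790720
chr20:21790721-28770304
chr20:28770305-31096832
chr20:31096833-38043648
chr20:38043649-45580288
chr20:45580289-52117504
chr20:52117505-58834944
chr20:58834945-
""".splitlines()

for region in regions:
    print(parse_region(region))
print("contiguous:", is_contiguous(regions))
```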

TODO give a nice example here using xargs

WARNING: this does not take into account that indels may overlap partition boundaries, so you may count such variants twice or more.
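
One possible way to guard against this double counting is to let each partition keep only the records whose start position falls inside it. The sketch below is an illustration under assumptions, not something bio2zarr does for you: it assumes that a region query returns every record overlapping the region, that coordinates are inclusive, and the deletion used as an example is hypothetical.

```python
def overlaps(pos, ref_len, start, end):
    """True if a record spanning [pos, pos + ref_len - 1] touches the region."""
    last = pos + ref_len - 1
    return last >= start and (end is None or pos <= end)


def owns(pos, start, end):
    """True if the record's start position lies inside the region."""
    return pos >= start and (end is None or pos <= end)


# The first two partitions from the example output above.
first = (1, 6799360)
second = (6799361, 14319616)

# Hypothetical 5-base deletion starting just before the partition boundary.
pos, ref_len = 6799358, 5

for name, (start, end) in [("first", first), ("second", second)]:
    print(name, "overlaps:", overlaps(pos, ref_len, start, end),
          "owns:", owns(pos, start, end))
```

Under these assumptions the deletion is returned by queries for both partitions, but only the first one owns it, so counting owned records alone counts it once.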

Release history

- 0.0.9: May 2, 2024 (this version)
- 0.0.8: May 1, 2024
- 0.0.6: Apr 24, 2024
- 0.0.5: Apr 17, 2024
- 0.0.4: Apr 8, 2024
- 0.0.3: Mar 28, 2024
- 0.0.2: Mar 27, 2024
- 0.0.1: Mar 6, 2024

Download files

Source Distribution

bio2zarr-0.0.9.tar.gz (161.6 kB)

Uploaded May 2, 2024 Source

Built Distribution

bio2zarr-0.0.9-py3-none-any.whl (46.5 kB)

Uploaded May 2, 2024 Python 3

Hashes for bio2zarr-0.0.9.tar.gz
Algorithm	Hash digest
SHA256	`db2f909046610eb551170fd08671e12060c5b6626668f21b5972d1cc22aab46c`
MD5	`cb7f66cac252014cd1ea53572054e421`
BLAKE2b-256	`6ec9e0918f8b72d9e88ac86e68e7106aa27d4dc6fc72116f1080e0a91b4070a7`

Hashes for bio2zarr-0.0.9-py3-none-any.whl
Algorithm	Hash digest
SHA256	`da6d25eed7d59ec7c93668f9dc72c3e37441c43a8a35f2c395c9e1c2f65a8870`
MD5	`ec16e928621a577258329dfde9113993`
BLAKE2b-256	`045357f2b9d4afb38e9f75a5bb5bd499a098204144ddd721d7a49f3323792953`