Efficient calculation of phylogenetic distance matrices.

PhyloDM

Efficient calculation of pairwise phylogenetic distance matrices.

Installation

  • PyPI: pip install phylodm
  • conda: conda install -c bioconda phylodm

Note: You must have a C++ compiler.

Usage

The leaf nodes in the tree must have unique names; otherwise, a DuplicateIndex exception is raised.
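
Because duplicate leaf names raise an exception, it can help to check the labels before building a PDM. The following is a minimal sketch using only the standard library. find_duplicate_labels is a hypothetical helper, not part of PhyloDM, and the labels are assumed to have been extracted from the tree already.

```python
from collections import Counter


def find_duplicate_labels(labels):
    """Return the sorted list of labels that occur more than once."""
    counts = Counter(labels)
    return sorted(label for label, n in counts.items() if n > 1)


# Hypothetical leaf labels; 'B' appears twice.
leaf_labels = ['A', 'B', 'C', 'B']
duplicates = find_duplicate_labels(leaf_labels)
print(duplicates)
```

An empty result means every leaf name is unique and the tree should not trigger the exception for this reason.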

Python library

Creating a phylogenetic distance matrix

A phylogenetic distance matrix (PDM) object can be created from either a DendroPy tree or a Newick file:

import dendropy
from phylodm.pdm import PDM

# Load from DendroPy
t = dendropy.Tree.get_from_string('(A:4,(B:3,C:4):1);', 'newick')
pdm = PDM.get_from_dendropy(tree=t, method='pd')

# Load from Newick
with open('/tmp/newick.tree', 'w') as fh:
    fh.write('(A:4,(B:3,C:4):1);')
pdm = PDM.get_from_newick_file('/tmp/newick.tree', method='pd')

Once created, a PDM can be cached to disk and loaded again later:

import dendropy
from phylodm.pdm import PDM

# Create a PDM.
t = dendropy.Tree.get_from_string('(A:4,(B:3,C:4):1);', 'newick')
pdm_a = PDM.get_from_dendropy(tree=t, method='pd')

# Write to cache.
pdm_a.save_to_path('/tmp/pdm.mat')

# Load from cache.
pdm_b = PDM.get_from_path('/tmp/pdm.mat')

Accessing data

The PDM.as_matrix method returns a tuple of two items: the leaf labels in row/column order, and the symmetric NumPy distance matrix itself:

import dendropy
from phylodm.pdm import PDM

# Load from DendroPy
t = dendropy.Tree.get_from_string('(A:4,(B:3,C:4):1);', 'newick')
pdm = PDM.get_from_dendropy(tree=t, method='pd')
labels, mat = pdm.as_matrix(normalised=False)
"""
/------------[4]------------ A
+
|          /---------[3]--------- B
\---[1]---+
           \------------[4]------------- C
           
labels = ('A', 'B', 'C')
mat = [[0. 8. 9.]
       [8. 0. 7.]
       [9. 7. 0.]]
"""

# Retrieving a specific value
pdm.get_value('A', 'A')  # 0
pdm.get_value('A', 'C')  # 9
pdm.get_value('C', 'A')  # 9

Method

The method parameter selects the distance measure: patristic distance (pd), which is the sum of branch lengths on the path between two leaves, or the count of edges between two leaves (node).
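
To illustrate the two measures, the sketch below computes both by hand for the example tree (A:4,(B:3,C:4):1); without using PhyloDM. The PARENT table, the internal node names 'root' and 'n1', and the functions path_to_root and distance are all made up for this illustration; it is not how PhyloDM computes distances internally.

```python
from itertools import combinations

# The example tree (A:4,(B:3,C:4):1); as child -> (parent, branch length).
# 'root' and 'n1' are made-up names for the unlabelled internal nodes.
PARENT = {
    'A': ('root', 4.0),
    'n1': ('root', 1.0),
    'B': ('n1', 3.0),
    'C': ('n1', 4.0),
}


def path_to_root(node):
    """Return the (child, length) edges walked from node up to the root."""
    edges = []
    while node in PARENT:
        parent, length = PARENT[node]
        edges.append((node, length))
        node = parent
    return edges


def distance(a, b, method='pd'):
    """Distance between leaves a and b for the given method ('pd' or 'node')."""
    edges_a = dict(path_to_root(a))
    edges_b = dict(path_to_root(b))
    # Edges on exactly one of the two root paths form the path between a and b.
    unique = set(edges_a) ^ set(edges_b)
    lengths = {**edges_a, **edges_b}
    if method == 'pd':
        return sum(lengths[e] for e in unique)
    if method == 'node':
        return len(unique)
    raise ValueError(f'unknown method: {method}')


for a, b in combinations('ABC', 2):
    print(a, b, distance(a, b, 'pd'), distance(a, b, 'node'))
```

The pd values (8, 9 and 7) agree with the matrix shown under Accessing data; the edge counts are 3, 3 and 2.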

Normalisation

If normalised is true, the distances are normalised by a total that depends on the method:

  • pd = sum of all edge lengths in the tree
  • node = count of all edges in the tree
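
As a worked example for the pd method, the sketch below normalises the matrix from Accessing data by the total edge length of the example tree. It assumes that normalising means dividing each distance by that total, which the text above implies but does not state outright; the variable names are illustrative only.

```python
# Unnormalised 'pd' matrix for the tree (A:4,(B:3,C:4):1);
labels = ('A', 'B', 'C')
mat = [
    [0.0, 8.0, 9.0],
    [8.0, 0.0, 7.0],
    [9.0, 7.0, 0.0],
]

# Sum of all edge lengths in the tree: 4 + 3 + 4 + 1.
total_length = 4.0 + 3.0 + 4.0 + 1.0

# Assumption: normalising divides each distance by that total.
normalised = [[value / total_length for value in row] for row in mat]

for label, row in zip(labels, normalised):
    print(label, [round(v, 3) for v in row])
```

Under the same assumption, the node method would divide each edge count by 4, the number of edges in this tree.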

CLI

The CLI creates a phylogenetic distance matrix from a Newick tree file. The arguments are the input tree, the method, and the output path, e.g.:

python -m phylodm /path/to/newick.tree pd /path/to/matrix.mat

Performance

Tests were executed using the scripts/phylodm_perf.py script with 10 trials.

These tests demonstrate that PhyloDM is more efficient than DendroPy's phylogenetic distance matrix when there are over 500 taxa in the tree. If there are fewer than 500 taxa, then use DendroPy for all of the great features it provides.

With 10,000 taxa in the tree, each program uses approximately:

  • PhyloDM = 4 seconds / 2 GB memory
  • DendroPy = 17 minutes / 90 GB memory

[Figure: DendroPy vs. PhyloDM PDM Construction Time]

[Figure: DendroPy vs. PhyloDM PDM Maximum Memory Usage]

Changelog

1.1.0
  - Significant improvement in PDM construction time using C.
1.0.0
  - Initial release.
