pysradb

Python package for interacting with SRAdb and downloading datasets from SRA.

Installation

To install the stable version:

pip install pysradb

This step will install all the dependencies except aspera-client. Both Python 2 and Python 3 are supported.

Dependencies

pandas>=0.23.4
tqdm>=4.28
aspera-client
SRAmetadb.sqlite

SRAmetadb

SRAmetadb can be downloaded as:

wget -c https://starbuck1.s3.amazonaws.com/sradb/SRAmetadb.sqlite.gz && gunzip SRAmetadb.sqlite.gz

Alternatively, you can also download it using pysradb:

from pysradb import download_sradb_file
download_sradb_file()

SRAmetadb.sqlite.gz: 2.44GB [01:10, 36.9MB/s]
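If gunzip is not available, the decompression half of the wget command above can be done with the Python standard library. This is a minimal sketch; gunzip_file is a hypothetical helper name and is not part of pysradb.

```python
import gzip
import shutil
from pathlib import Path


def gunzip_file(src, dst=None, chunk_size=1024 * 1024):
    """Decompress a .gz file to dst (defaults to src without the .gz suffix)."""
    src = Path(src)
    dst = Path(dst) if dst is not None else src.with_suffix("")
    # Stream in chunks so a multi-gigabyte file never has to fit in memory.
    with gzip.open(src, "rb") as fin, open(dst, "wb") as fout:
        shutil.copyfileobj(fin, fout, chunk_size)
    return dst
```

Calling gunzip_file('SRAmetadb.sqlite.gz') would write SRAmetadb.sqlite next to the archive. Unlike the gunzip command, this sketch leaves the .gz file in place.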

aspera-client

We strongly recommend using aspera-client (which uses UDP), since it gives faster downloads than FTP- or HTTP-based transfers.

PDF instructions are available here: https://downloads.asperasoft.com/connect2/.

Direct download links:

Once you have downloaded the tarball for your OS (Linux in this example), follow these steps to install aspera:

tar -zxvf ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.tar.gz
bash ibm-aspera-connect-3.8.1.161274-linux-g2.12-64.sh
Installing IBM Aspera Connect
Deploying IBM Aspera Connect (/home/saket/.aspera/connect) for the current user only.
Install complete.
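After installing, it can be useful to confirm that the ascp executable is actually reachable. The sketch below first checks PATH and then falls back to the per-user directory shown in the installer output. The bin/ascp sub-path and the helper name find_ascp are assumptions made for illustration, not something pysradb provides.

```python
import shutil
from pathlib import Path


def find_ascp(connect_dir=None):
    """Return the path to the ascp executable as a string, or None if not found."""
    # First choice: whatever is already on PATH.
    on_path = shutil.which("ascp")
    if on_path:
        return on_path
    # Fallback: per-user Aspera Connect directory (assumed layout: <dir>/bin/ascp).
    base = Path(connect_dir) if connect_dir else Path.home() / ".aspera" / "connect"
    candidate = base / "bin" / "ascp"
    return str(candidate) if candidate.is_file() else None


if __name__ == "__main__":
    print(find_ascp() or "ascp not found; ftp downloads would be the fallback")
```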

Installing pysradb in development mode

pip install -U pandas tqdm
git clone https://github.com/saketkc/pysradb.git
cd pysradb
pip install -e .

Interacting with SRA

Fetch the metadata table (SRA-runtable)

from pysradb import SRAdb
db = SRAdb('SRAmetadb.sqlite')
df = db.sra_metadata('SRP098789')
df.head()

study_accession | experiment_accession | experiment_title                                                    | run_accession | taxon_id | library_selection | library_layout | library_strategy | library_source | library_name | bases      | spots    | adapter_spec | avg_read_length
SRP098789       | SRX2536403           | GSM2475997: 1.5 µM PF-067446846, 10 min, rep 1; Homo sapiens; OTHER | SRR5227288    | 9606     | other             | SINGLE -       | OTHER            | TRANSCRIPTOMIC |              | 2104142750 | 42082855 |              | 50
SRP098789       | SRX2536404           | GSM2475998: 1.5 µM PF-067446846, 10 min, rep 2; Homo sapiens; OTHER | SRR5227289    | 9606     | other             | SINGLE -       | OTHER            | TRANSCRIPTOMIC |              | 2082873050 | 41657461 |              | 50
SRP098789       | SRX2536405           | GSM2475999: 1.5 µM PF-067446846, 10 min, rep 3; Homo sapiens; OTHER | SRR5227290    | 9606     | other             | SINGLE -       | OTHER            | TRANSCRIPTOMIC |              | 2023148650 | 40462973 |              | 50
SRP098789       | SRX2536406           | GSM2476000: 0.3 µM PF-067446846, 10 min, rep 1; Homo sapiens; OTHER | SRR5227291    | 9606     | other             | SINGLE -       | OTHER            | TRANSCRIPTOMIC |              | 2057165950 | 41143319 |              | 50
SRP098789       | SRX2536407           | GSM2476001: 0.3 µM PF-067446846, 10 min, rep 2; Homo sapiens; OTHER | SRR5227292    | 9606     | other             | SINGLE -       | OTHER            | TRANSCRIPTOMIC |              | 3027621850 | 60552437 |              | 50
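Since SRAmetadb.sqlite is an SQLite file, a metadata lookup like the one above is essentially a filtered SELECT. The sketch below shows the idea on a small in-memory database filled with three rows from the output above. The table name sra, the reduced schema, and the helper runs_for_study are assumptions for illustration; this is not how pysradb is necessarily implemented.

```python
import sqlite3

# Toy stand-in for SRAmetadb.sqlite: one table with a few of the columns shown above.
# The table name "sra" and this reduced schema are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sra (study_accession TEXT, experiment_accession TEXT, "
    "run_accession TEXT, bases INTEGER, spots INTEGER, avg_read_length INTEGER)"
)
rows = [
    ("SRP098789", "SRX2536403", "SRR5227288", 2104142750, 42082855, 50),
    ("SRP098789", "SRX2536404", "SRR5227289", 2082873050, 41657461, 50),
    ("SRP098789", "SRX2536405", "SRR5227290", 2023148650, 40462973, 50),
]
conn.executemany("INSERT INTO sra VALUES (?, ?, ?, ?, ?, ?)", rows)


def runs_for_study(conn, study_accession):
    """Return (run_accession, bases, spots, avg_read_length) tuples for one study."""
    cur = conn.execute(
        "SELECT run_accession, bases, spots, avg_read_length "
        "FROM sra WHERE study_accession = ? ORDER BY run_accession",
        (study_accession,),
    )
    return cur.fetchall()


for run, bases, spots, read_len in runs_for_study(conn, "SRP098789"):
    # For these rows, bases equals spots * avg_read_length.
    print(run, bases == spots * read_len)
```

The parameterized query (the ? placeholder) keeps the accession out of the SQL string itself, which is the usual way to pass user-supplied values to sqlite3.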

Downloading an entire project, arranged experiment-wise

from pysradb import SRAdb
db = SRAdb('SRAmetadb.sqlite')
df = db.sra_metadata('SRP017942')
db.download(df)
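To make "arranged experiment-wise" concrete, here is a sketch of one plausible on-disk layout, with each run stored under its study and experiment accession. The out_dir/study/experiment/run.sra pattern, the default directory name, and the helper plan_download_paths are all assumptions for illustration; check the output of db.download for the layout it actually produces.

```python
from collections import defaultdict
from pathlib import Path


def plan_download_paths(records, out_dir="pysradb_downloads"):
    """Map each run to out_dir/<study>/<experiment>/<run>.sra, grouped by experiment."""
    plan = defaultdict(list)
    for study, experiment, run in records:
        plan[experiment].append(Path(out_dir) / study / experiment / (run + ".sra"))
    return dict(plan)


# Accessions taken from the SRP098789 metadata table shown earlier.
records = [
    ("SRP098789", "SRX2536403", "SRR5227288"),
    ("SRP098789", "SRX2536404", "SRR5227289"),
]
for experiment, paths in plan_download_paths(records).items():
    print(experiment, [str(p) for p in paths])
```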

Downloading a subset of experiments

df = db.sra_metadata('SRP000941')
print(df.library_strategy.unique())
# ['ChIP-Seq' 'Bisulfite-Seq' 'RNA-Seq' 'WGS' 'OTHER']
# Keep only the RNA-Seq experiments, then download just those runs.
df_rna = df[df.library_strategy == 'RNA-Seq']
db.download(df=df_rna, out_dir='/pysradb_downloads')
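The subsetting step is ordinary pandas boolean indexing, so it can be tried without the database. The sketch below uses a toy DataFrame; the run accessions are placeholders, not real SRA records.

```python
import pandas as pd

# Toy metadata frame; accessions here are placeholders, not real SRA records.
df = pd.DataFrame(
    {
        "run_accession": ["RUN_A", "RUN_B", "RUN_C", "RUN_D"],
        "library_strategy": ["ChIP-Seq", "RNA-Seq", "WGS", "RNA-Seq"],
    }
)

# A boolean mask keeps only the rows whose strategy matches.
df_rna = df[df["library_strategy"] == "RNA-Seq"]

# isin() selects several strategies at once.
df_multi = df[df["library_strategy"].isin(["RNA-Seq", "WGS"])]

print(df_rna["run_accession"].tolist())
print(len(df_multi))
```

Any filtered frame built this way could then be passed to db.download in place of the full metadata table.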

Demo

https://nbviewer.jupyter.org/github/saketkc/pysradb/blob/master/notebooks/demo.ipynb

Citation

Pending.

A lot of functionality in pysradb is based on ideas from the original SRAdb package. Please cite the original SRAdb publication:

Zhu, Yuelin, Robert M. Stephens, Paul S. Meltzer, and Sean R. Davis. “SRAdb: query and use public next-generation sequencing data from within R.” BMC bioinformatics 14, no. 1 (2013): 19.

History

0.2.0 (12-03-2018)

Renamed methods

The following methods have been renamed; these changes are not backward compatible with the 0.1.0 release:

  • get_query() -> query().

  • sra_convert() -> sra_metadata().

  • get_table_counts() -> all_row_counts().
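For scripts written against 0.1.0, the renames above can be applied mechanically. The sketch below is a rough text-level helper: it rewrites method calls of the form .old_name( in a source string. The names RENAMED_METHODS and migrate_calls are hypothetical, and a plain regex like this will not catch every usage (for example, methods passed around without being called).

```python
import re

# Old name -> new name, taken from the list above.
RENAMED_METHODS = {
    "get_query": "query",
    "sra_convert": "sra_metadata",
    "get_table_counts": "all_row_counts",
}


def migrate_calls(source):
    """Rewrite `.old_name(` method calls in a source string to their 0.2.0 names."""
    pattern = re.compile(r"\.(%s)\(" % "|".join(map(re.escape, RENAMED_METHODS)))
    return pattern.sub(lambda m: ".%s(" % RENAMED_METHODS[m.group(1)], source)


print(migrate_calls("df = db.sra_convert('SRP098789')"))
```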

New methods/functionality

  • download_sradb_file() makes fetching SRAmetadb.sqlite file easy; wget is no longer required.

  • The ftp protocol is now supported in addition to fasp, so aspera-client is now optional. However, we still strongly recommend aspera-client for faster downloads.

Bug fixes

  • Silenced SettingWithCopyWarning by explicitly operating on a copy of the dataframe instead of the original.

Besides these, all methods now have numpydoc-compatible documentation.

0.1.0 (12-01-2018)

  • First release on PyPI.
