A fast 23andMe raw genome file parser

arv — a fast 23andMe parser for Python
======================================
|travis-status| |versions| |license| |pypi|

Arv (Norwegian; "heritage" or "inheritance") is a Python module for parsing raw
23andMe genome files. It lets you look up SNPs from RSIDs.

.. code:: python

    from arv import load, unphased_match as match

    genome = load("genome.txt")

    print("You are a {gender} with {color} eyes and {complexion} skin.".format(
        gender = "man" if genome.y_chromosome else "woman",
        complexion = "light" if genome["rs1426654"] == "AA" else "dark",
        color = match(genome["rs12913832"], {"AA": "brown",
                                             "AG": "brown or green",
                                             "GG": "blue"})))

For my genome, this little program produces::

    You are a man with blue eyes and light skin.
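
The example imports ``unphased_match`` to map a genotype to a label. As a rough
illustration of what matching "unphased" genotypes can mean, the pure-Python
sketch below treats the two alleles as an unordered pair, so ``"GA"`` and
``"AG"`` hit the same rule. The helper name ``unphased_match_sketch`` is
hypothetical and the behaviour is an assumption made for illustration; it is
not arv's own implementation.

```python
def unphased_match_sketch(genotype, rules):
    """Look up a genotype in ``rules`` without regard to allele order.

    Hypothetical helper for illustration only; not part of arv.
    """
    if genotype is None:
        # A missing genotype falls back to the optional ``None`` rule.
        return rules.get(None)

    # Normalize both sides by sorting the allele letters, so "GA" == "AG".
    key = "".join(sorted(str(genotype)))
    normalized = {
        ("".join(sorted(k)) if k is not None else None): v
        for k, v in rules.items()
    }
    return normalized[key]


eye_color = {"AA": "brown", "AG": "brown or green", "GG": "blue"}
print(unphased_match_sketch("GA", eye_color))  # brown or green
```

Sorting the letters is just one simple way to ignore allele order; a genotype
with no matching rule raises ``KeyError`` in this sketch.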

The parser is insanely fast, having been written in finely tuned C++ and
exposed via Cython. On a 2013 Xeon machine I've tested on, it parses a 24 MB
file into a hash table in about 78 ms. The newer 23andMe files are smaller, and
parse in a mere 62 ms!
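
To give an idea of the work involved, here is a minimal pure-Python sketch of
parsing such a file into a hash table. It assumes the usual 23andMe raw layout
(``#`` comment lines followed by tab-separated ``rsid``, ``chromosome``,
``position`` and ``genotype`` columns). The function name ``parse_raw_genome``
is hypothetical, and this is not how arv's C++ parser is implemented.

```python
import io


def parse_raw_genome(lines):
    """Build {rsid: (chromosome, position, genotype)} from raw-file lines."""
    snps = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and header comments
        rsid, chromosome, position, genotype = line.split("\t")
        snps[rsid] = (chromosome, int(position), genotype)
    return snps


# Tiny in-memory stand-in for a real file. The rs123 row reuses the values
# shown in this README; the second row is made up for illustration.
sample = io.StringIO(
    "# rsid\tchromosome\tposition\tgenotype\n"
    "rs123\t7\t24966446\tAA\n"
    "rs_hypothetical\t1\t100\t--\n"
)
snps = parse_raw_genome(sample)
print(snps["rs123"])  # ('7', 24966446, 'AA')
```

A plain ``dict`` keeps lookups by RSID constant-time, which mirrors the
hash-table approach described above.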

Works with Python 2.7+ and 3+. Installable with pip!

.. code:: bash

    $ pip install --upgrade arv

See below for software requirements.

Important disclaimer
====================

It's very important to tell you that I, the author of arv, am merely a
hobbyist! I am a professional software developer, but not a geneticist,
biologist, medical doctor or anything like that.

Because of that, this software may not only look weird to people in the field,
it may also contain serious errors. If you find any problem whatsoever, please
submit a GitHub issue.

This is a slightly modified version of what I wrote for the original software
called "dnatraits", and the same goes for this software:

In addition to the GPL v3 licensing terms, and given that this code deals with
health-related issues, I want to stress that the provided code most likely
contains errors, or invalid genome reports. Results from this code must be
interpreted as HIGHLY SPECULATIVE and may even be downright INCORRECT. Always
consult an expert (medical doctor, geneticist, etc.) for guidance.

I take NO RESPONSIBILITY whatsoever for any consequences of using this code,
including but not limited to loss of life, money, spouses, self-esteem and so
on. Use at YOUR OWN RISK.

The intended use is for casual, educational purposes. If this code is used for
research purposes, please cross-check key results with other software: The
parser code may contain serious errors, for example.

An interesting story about the research part: I once released a pretty good
Mersenne Twister PRNG for C++ that ended up being used in research. Turned out
the engine had bugs, and by the time I had fixed them, a poor researcher had
already produced results with it (hopefully not published; I don't know). The
guy had to go back and fix his stuff, and I felt terribly bad about it.

So beware!

Installation
============

The recommended way is to install from PyPI.

.. code:: bash

    $ pip install arv

This will most likely build Arv from source. The package will automatically
install Cython, but it doesn't check if you have a C++11 compiler. Furthermore,
it passes some additional compilation flags that are specific to clang/gcc.

If you have problems running ``pip install arv``, please open an issue on
GitHub with as much detail as possible (``g++/clang++ --version``, ``uname
-a``, ``python --version`` and so on).
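
If it helps, the details for such an issue can be gathered with a short script.
The sketch below uses only the Python standard library (Python 3.5 or later)
and collects the Python version, the ``uname`` fields and the first line of
each compiler's ``--version`` output. The helper names ``first_line`` and
``build_report`` are hypothetical and not part of arv.

```python
import platform
import shutil
import subprocess
import sys


def first_line(command):
    """Return the first output line of ``command``, or None if unavailable."""
    if shutil.which(command[0]) is None:
        return None  # program is not on PATH
    result = subprocess.run(command, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, universal_newlines=True)
    lines = result.stdout.splitlines()
    return lines[0] if lines else None


def build_report():
    """Collect the environment details that are useful in a bug report."""
    return {
        "python": sys.version.split()[0],
        "platform": " ".join(platform.uname()),
        "g++": first_line(["g++", "--version"]),
        "clang++": first_line(["clang++", "--version"]),
    }


for name, value in build_report().items():
    print("{}: {}".format(name, value))
```

Compilers that are not installed are reported as ``None`` rather than causing
an error.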

If you set the environment variable ``ARV_DEBUG``, it will build with full
warnings and debug symbols.

You can also install it locally through ``setup.py``. The following builds and
tests, but does not install, arv:

.. code:: bash

    $ python setup.py test

If you set the environment variable ``ARV_BENCHMARK`` to a genome filename and
run the tests, it will perform a short benchmark, reporting the best parsing
time on it. You can also set ``ARV_BENCHMARK_COUNT=<number>`` to change how
many times it should parse the given file.

Usage
=====

First you need to dump the raw genome file from 23andMe. You'll find it under
the raw genome browser, where you can download the file. You may have to unzip
it first: The parser works on the pure text files.

Then you load the genome in Python with

.. code:: python

    >>> genome = arv.load("filename.txt")
    >>> genome
    <Genome: SNPs=960613, name=filename.txt>

To see if there are any Y-chromosomes present in the genome,

.. code:: python

    >>> genome.y_chromosome
    True

The genome provides a dict-like interface. To get a given SNP, just enter the
RSID.

.. code:: python

    >>> snp = genome["rs123"]
    >>> snp
    <SNP: chromosome=7 position=24966446 genotype=AA>
    >>> snp.chromosome
    7
    >>> snp.position
    24966446
    >>> snp.genotype
    <Genotype AA>

The ``Genotype`` object can be converted to a string with ``str``, but it also
allows rich comparisons with strings directly:

.. code:: python

    >>> snp.genotype == "AA"
    True

You can get its complement with the ``~`` operator.

.. code:: python

    >>> type(snp.genotype)
    <class 'arv.Genotype'>
    >>> ~snp.genotype
    <Genotype TT>

The complement is important due to each SNP's orientation. All of 23andMe's
SNPs are oriented towards the positive ("plus") strand, based on the `GRCh37
<https://www.ncbi.nlm.nih.gov/grc/human>`_ reference human genome assembly
build. But some SNPs on SNPedia are given with the `minus orientation
<http://snpedia.com/index.php/Orientation>`_.

For example, to determine if the human in question is likely lactose tolerant
or not, we can look at `rs4988235 <http://snpedia.com/index.php/Rs4988235>`_.
SNPedia reports its Stabilized orientation to be minus, so we need to use the
complement:

.. code:: python

    >>> genome["rs4988235"].genotype
    <Genotype AA>
    >>> ~genome["rs4988235"].genotype
    <Genotype TT>

By reading a few `GWAS
<https://en.wikipedia.org/wiki/Genome-wide_association_study>`_ research
papers, we can build a rule to determine a human's likelihood for lactose
tolerance:

.. code:: python

    >>> arv.unphased_match(~genome["rs4988235"].genotype, {
    ...     "TT": "Likely lactose tolerant",
    ...     "TC": "Likely lactose tolerant",
    ...     "CC": "Likely lactose intolerant",
    ...     None: "Unable to determine (genotype not present)"})
    'Likely lactose tolerant'

Note that reading GWAS papers can be a bit tricky for hobbyists. If you are a
hobbyist, be sure to spend some time reading the paper closely, checking up
SNPs on places like `SNPedia <http://snpedia.com>`_, `dbSNP
<https://www.ncbi.nlm.nih.gov/projects/SNP/>`_ and `OpenSNP
<https://opensnp.org/genotypes>`_.

Finally, have fun, but be extremely careful about drawing conclusions from your
results.

Command line interface
======================

You can also invoke ``arv`` from the command line:

.. code:: bash

    $ python -m arv --help

For example, you can drop into a Python REPL like so:

.. code:: bash

    $ python -m arv --repl genome.txt
    genome.txt ... 960614 SNPs, male
    Type genome to see the parsed 23andMe raw genome file
    >>> genome
    <Genome: SNPs=960614, name=genome.txt>
    >>> genome["rs123"]
    <SNP: chromosome=7 position=24966446 genotype=<Genotype AA>>

If you specify several files, you can access them through the variable
``genomes``.

The example at the top of this document can be run with ``--example``:

.. code:: bash

    $ python -m arv --example genome.txt
    genome.txt ... 960614 SNPs, male

    genome.txt ... A man with blue eyes and light skin
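
Rules like the ones in the example depend on strand orientation, which the
Usage section handles by taking the complement of a genotype. As a rough
illustration of what that means for plain strings, here is a pure-Python
sketch. The function name ``complement`` and the pass-through of non-base
characters are assumptions made for illustration, not arv's implementation.

```python
# Watson-Crick base pairing: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(genotype):
    """Return the opposite-strand genotype, e.g. "AA" -> "TT".

    Characters without a complement (for instance a "-" marking a
    missing call) are passed through unchanged.
    """
    return "".join(COMPLEMENT.get(base, base) for base in genotype)


print(complement("AA"))  # TT
print(complement("AG"))  # TC
```

Applying the function twice gives back the original genotype, which is a handy
sanity check when switching between plus and minus orientation.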

License
=======

Copyright 2017 Christian Stigen Larsen

Distributed under the GNU GPL v3 or later. See the file COPYING for the full
license text. This software makes use of open source software; see LICENSES for
details.

.. |travis-status| image:: https://travis-ci.org/cslarsen/arv.svg?branch=master
    :alt: Travis build status
    :scale: 100%
    :target: https://travis-ci.org/cslarsen/arv

.. |license| image:: https://img.shields.io/badge/license-GPL%20v3%2B-blue.svg
    :target: http://www.gnu.org/licenses/old-licenses/gpl-3.en.html
    :alt: Project License

.. |versions| image:: https://img.shields.io/badge/python-2%2B%2C%203%2B-blue.svg
    :target: https://pypi.python.org/pypi/arv/
    :alt: Supported Python versions

.. |pypi| image:: https://badge.fury.io/py/arv.svg
    :target: https://badge.fury.io/py/arv
