Post on 08-Jul-2020


Millions of Genes with Python and Jython

Clint Howarth

Janet Dewar, Maia Hansen, Jeffrey Larimer, Matthew Pearson, Andrew Roberts

Shailaja Gargeya, Jennifer Wortman, Cheryl Murphy, Bruce Birren

Analysis and Annotation Engineering

Genome Sequencing and Analysis Program

Broad Institute

Annotation and Analysis

● Malaria
● Tuberculosis
● HIV
● West Nile
● Dengue fever
● E. coli
● Streptococcus
● Staphylococcus aureus

Human Microbiome Project

Table of Contents: Applications

Internal
● java/jython analysis and publication platform
● olive: web publication

Open Source
● toothpick: data abstraction layer
● genepidgin: gene names
● accordion: genetics over time

Annotation sample problemCCCCAAGCTCACTGATTGACGGTGCTCTGATTGCGCAACCAGACAACGACGACAATGAGGGTGCTACTGTCTTTTCTCATTATAGGGCTGCTAGGCATTGCCAGTGCCCTCAGCACCAGTGGAAACAGATTACTGGTGGTACTGGAGGAGCTTGCGGAGAAGGACAAGTACTCCAAGTTCTTTGGGGATCTGAAAGGCAAGCGCACGGCTGGGAGATAGAACTGGCAGGGACAACGGGATTCTAGCTAACTTTTTGAGGATTGTAGGTCGGGGCTTTGATATCACATTTGAATCTCCAAAGAGTGATAGTTTGGCGCTGTTCGAGCTTGGCGAGAGAGCTTATGATCACCTTCTCATCCTCCCATCTAAGTCGAAAGGTCAGCATCCATTGAACGACATTGGACTGTCGTGTGCTAATATAATAGGCCTGGGACCAAACCTCACGCCCCAAACCTTATTGAAGTTCATCAATACCGAGGGCAACATCCTGCTCACCCTATCTTCATCCAACCCGATACCATCAGCTCTCGTATCAGTTCTGCTTGAGCTTGACATCCATCTCCCCACCGACCGCAACTCGATAGTGGTCGACCACTTCAACTACGACAGCCTCTCGGCCCCCGAATCCCACGATGTCGTTCTTGTTCCCCGCCCAAGCGCTGTGCGCCCCGGTGTTCGCAACTTCTTCGGCGGCATCCTCAAGAACGAGGTTATCGCGTTCCCCCACGGCGTGGGCCAGACTCTAGGCAACGATAGCCCGTACTTGACACCGGTCCTTCGCGCCCCCGGCACGGCGTACTCATACAACCCCAAGGAGGAGGCCGAGGCTGTGGAGGACCCGTTCGCGGTTGGCCAGCAACTGTCCCTCGTGACCGCCATGCAGGCTCGCAACTCAGCGAGGTTCACTGTCTTGGGCGCAGCGGAAATGCTTGAGGATAAGTGGTTCAAGGGGAAGGTCCAAGTTGCTGGCGGCAAGGTTGCTGCGGCTGCGAATGAGGCGTTTGCGAAGGAGATCTCCGGATGGACTTTCAATGAGGCTGGAGTCCTGAAGGTGAAGTCTGTTACGCATTTCCTCAACGAAGAGGGGTTGAAAACACCCAATGCTTCATTGACGAACCCCAAGATCTACCGTGTCAAGAACACTGTTGTAAGTGGATTTTCTGTGCCAAATGTGAGAATCATATGCTAACTATGTCTAGACTTACTCGATTGAGCTATCTGAATGGTCGTGGAAGGAGTATGTACCCTTCGTACCCGCCACCGGTGATGATGTGCAGCTTGAGTTCTCTATGCTCTCGCCCTTCCACCGACTGAATCTGGAGCGCACTCAGACGAATCCTTCATCTAGCGTCTTCAGCACCACATTCAAGCTTCCAGATCAGCATGGAATCTTCAACTTCCTGGTCGAGTTCCGCCGCCCCTTCCTCTCGAACATCGAGGACAAGAAAACGGTCACCGTACGCCACTTTGCACACGATGAATGGCCACGCAGTTGGGTCATCAGCGCCGCGTGGCCATGGATCTCTGGCGTTGCGGTCACTGTTGTTGGATGGATTATATTCGTGGGATTGTGGTTGTACAGTGCCCCACCGACAGTGAAGGGAAAGAAGTCGCGATGAGAAGAGCTAGATGTTGCATTTGAGACGTAAACGGGACTGTATGAATACCAAAATCGTGTATAGATGATATAGTAGGCAGGAACACATGGCATGTCTGATTCCGAATAAATCGCTCGTATTGCCTTGCGCGCTTGTTGATTGTACACGGTTGTG

Annotation sample solution

[same sequence as in the sample problem above]

Jython Interpreter Access

>>> transcript1 = s.get("from Transcript t where t.locus='EUKG_05092'")
>>> transcript1.length  # not just a field, but a live object
798
>>> transcript1.containsInFrameStop()
1
>>> overlaps(transcript1, transcript2)
0
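To make "live object" concrete, here is a minimal pure-Python sketch of what such an object might look like. The `Transcript` class, its fields, and the coordinate convention are assumptions for illustration only; they are not the platform's actual Java classes.

```python
from dataclasses import dataclass

STOP_CODONS = {"TAA", "TAG", "TGA"}


@dataclass
class Transcript:
    """Hypothetical stand-in for the platform's Transcript object."""
    locus: str
    start: int      # 0-based, inclusive genomic coordinate (assumed convention)
    end: int        # exclusive genomic coordinate
    sequence: str   # coding sequence, 5' to 3'

    @property
    def length(self) -> int:
        return len(self.sequence)

    def contains_in_frame_stop(self) -> bool:
        """True if a stop codon occurs in frame before the final codon."""
        seq = self.sequence.upper()
        codons = [seq[i:i + 3] for i in range(0, len(seq) - 2, 3)]
        return any(codon in STOP_CODONS for codon in codons[:-1])


def overlaps(a: Transcript, b: Transcript) -> bool:
    """True if the two transcripts share at least one genomic position."""
    return a.start < b.end and b.start < a.end


# Invented example data, not real loci.
t1 = Transcript("GENE_A", 100, 112, "ATGAAATAGTAA")  # TAG sits in frame
t2 = Transcript("GENE_B", 200, 209, "ATGCCCTGA")
print(t1.length, t1.contains_in_frame_stop(), overlaps(t1, t2))
```

The point is the one made on the slide: the query returns an object with behavior, so checks such as in-frame stops can be run interactively instead of being stored as fields.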

Analysis and Annotation scale

2004
manual annotation and publication
four genomes / year

2012
high-throughput process, manually iterated
thirty-six genomes in one day

Long-term Success

Over the past eight years, this Java/Jython analysis platform has:
● 10K+ genomes annotated
● 2M+ genes published
● 2M+ jobs distributed across 1000+ nodes
● 5TB+ of genomic data and analyses

previous publication platform

deployed individual / small groups of genomes
each one very customizable
common data duplicated via cut / paste
two hundred settings per genome
five hundred settings per publication

web publishing changing scale

annual use
● 250k researchers
● 3M pageviews

our Java/Tapestry stack couldn't keep up
● slow response, render time :(
● restart Tomcat a lot

genomes en masse

olive.broadinstitute.org

● Oracle-to-Java data model via Hibernate

● RESTful Java data service

● Python data model layer (Toothpick)

● Python/Flask web service (olive)

olive navigation

toothpick: data abstraction layer

author: Andrew Roberts
Modeling data from separate data sources
Single models with live references to multiple sources
open source coming 2012Q3-Q4

An analysis project is composed of
● genomes: annotations, analyses, etc.
● initiatives: grant info, sample tracking, status, etc.

toothpick: multiple sources

toothpick: models with friends

@cache.cached_model(ttl=86400)
@TopspinAdapter.collection("all", path="projects.json")
@TopspinAdapter.resource("id", path="projects/%s.json")
class Project(toothpick.Base):
    genomes = toothpick.has_many("Genome", data_field="genome_edition_ids")
    initiative = toothpick.belongs_to("Initiative", "squid_id", soft=True)

    def _display_title(self):
        return self.short_name
    ...
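The idea of one model holding live references into several sources can be sketched with plain Python descriptors. This is not toothpick's implementation: the `Model`, `has_many`, and `belongs_to` classes below are hypothetical, the in-memory dictionaries stand in for the real adapters, and treating `soft=True` as "return None when the reference cannot be resolved" is an assumption.

```python
# Hypothetical in-memory "sources"; all records are invented for illustration.
PROJECTS = {1: {"short_name": "Demo project",
                "genome_edition_ids": [10, 11],
                "squid_id": "S1"}}
GENOMES = {10: {"name": "genome v1"}, 11: {"name": "genome v2"}}
INITIATIVES = {"S1": {"title": "Demo initiative"}}


class Model:
    """Wraps one record; unknown attributes fall through to its fields."""
    def __init__(self, data):
        self._data = data

    def __getattr__(self, name):
        data = self.__dict__.get("_data", {})
        if name in data:
            return data[name]
        raise AttributeError(name)


class has_many:
    """Lazy one-to-many reference, resolved against another source on access."""
    def __init__(self, source, data_field):
        self.source, self.data_field = source, data_field

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        return [Model(self.source[key]) for key in obj._data[self.data_field]]


class belongs_to:
    """Lazy many-to-one reference; soft=True tolerates a missing target (assumed)."""
    def __init__(self, source, key_field, soft=False):
        self.source, self.key_field, self.soft = source, key_field, soft

    def __get__(self, obj, objtype=None):
        if obj is None:
            return self
        key = obj._data.get(self.key_field)
        if key in self.source:
            return Model(self.source[key])
        if self.soft:
            return None
        raise LookupError(f"no record for {self.key_field}={key!r}")


class Project(Model):
    genomes = has_many(GENOMES, data_field="genome_edition_ids")
    initiative = belongs_to(INITIATIVES, "squid_id", soft=True)


project = Project(PROJECTS[1])
print(project.short_name,
      [g.name for g in project.genomes],
      project.initiative.title)
```

Because the references are resolved on attribute access, a template can write something like `project.initiative.title` without knowing that the two records come from different systems.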

olive and toothpick: simple use

A genome is aware of what project it's part of:

views/genome_views.py

@app.route("/genomes/<project_url>.<int:version>")
def show_genome(project_url, version):
    genome = toothpick.fetch_model(
        genome_url_and_version=(project_url, version))

models/genomes/show.html.jinja

...
{{ funding_via_initiative(genome.project.initiative) }}

naming genes

Naming genes is hard.
Naming genes based on what people attach to the description field of other genes is harder.

naming example

"BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]"

genepidgin example

>>> orig_name = "BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]"
>>> (cleaned_name, etymology) = gpidg.cleanup(orig_name)
>>> cleaned_name
"glycine/betaine/L-proline ABC transporter"
>>> etymology
filtered name in 4 steps:
...
4) reason: delete notes after commas, dashes, semicolon--except when followed by family or superfamily
   pattern: [-,;]\s+(?!family)(?!superfamily).*
   filtered: glycine/betaine/L-proline ABC transporter
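The one rule visible in the etymology output can be tried on its own with Python's `re` module. This is only a sketch of that single filter, not genepidgin's code: the function name is hypothetical, the earlier three steps are elided on the slide and are not reproduced, and the input strings are invented intermediate names.

```python
import re

# Pattern copied from step 4 of the etymology output above.
NOTE_AFTER_SEPARATOR = re.compile(r"[-,;]\s+(?!family)(?!superfamily).*")


def strip_trailing_notes(name: str) -> str:
    """Delete a comma, dash, or semicolon followed by whitespace, plus
    everything after it, unless what follows starts with 'family' or
    'superfamily'."""
    return NOTE_AFTER_SEPARATOR.sub("", name).strip()


# Hypothetical intermediate name: the note after the comma is dropped.
# The dashes inside "L-proline" and "periplasmic-binding" are not followed
# by whitespace, so they never trigger the rule.
print(strip_trailing_notes(
    "glycine/betaine/L-proline ABC transporter, periplasmic-binding protein"))

# A family designation after the comma is kept.
print(strip_trailing_notes("glycosyl transferase, family 2"))
```

The negative lookaheads are what protect names whose meaning lives after the comma, such as family and superfamily designations.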

genepidgin.sf.net

open source and freely available since 2010
named millions of genes
being used to rename TIGRfam, one of the core hand-curated protein libraries

accordion: annotations over time

Navigating genomic data over time is challenging
● Scientists refer to individual genes (loci) in studies
● MCBG_00123.1
● Loci are database identifiers, but nobody owns the primary key index

Genes can be removed, added, split, merged:
● 1st: MCBG_00123.1
● 2nd: MCBG_00123.2
● 3rd: MCBG_00123.3 MCBG_02786.3

Loci are kind of the wild west

accordion: cooperative identifiers

accordion: annotations over time

Can't fix loci, so fix the mechanical stuff that follows
Match sequence to sequence, gene to gene, let people walk over genomes, over time
Nobody's reference to shiga toxin will be lost or confused
open source coming 2012Q4ish
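One way to picture "gene to gene" matching across annotation versions is a coordinate-overlap walk. This is a simplified sketch and not accordion's method: it assumes both versions are annotated on the same assembly so that coordinates are comparable, the function name and all coordinates are invented, and reading the third version in the earlier example as a split of one gene into two loci is an interpretation.

```python
from typing import Dict, List, Tuple

# One annotation version: locus -> (start, end), 0-based half-open,
# all on the same assembly (an assumption of this sketch).
Annotation = Dict[str, Tuple[int, int]]


def walk_forward(old: Annotation, new: Annotation) -> Dict[str, List[str]]:
    """Map each old locus to the new loci whose coordinates overlap it.

    An empty list means the gene was removed; more than one entry
    suggests a split; two old loci sharing a new locus suggests a merge.
    """
    mapping: Dict[str, List[str]] = {}
    for old_locus, (o_start, o_end) in old.items():
        mapping[old_locus] = [
            new_locus
            for new_locus, (n_start, n_end) in sorted(new.items())
            if o_start < n_end and n_start < o_end
        ]
    return mapping


# Invented coordinates for the loci named on the slide.
second = {"MCBG_00123.2": (1000, 4000)}
third = {"MCBG_00123.3": (1000, 2200), "MCBG_02786.3": (2500, 4000)}
print(walk_forward(second, third))
```

Chaining such mappings from version to version is what would let a reference to an old locus be followed to its current descendants.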

Thanks!
