TRANSCRIPT
by Data Fellas, Spark London Meetup, July 1st ‘15
Share and analyse genomic data at scale with Spark, Adam, Tachyon and the Spark Notebook
Outline
PART I
● Adam: genomics on Spark
● 1K Genomes in Adam on S3
● Explore: Compute Stats
● Learn: train a model
PART II
● GA4GH: Standard for Genomics
● med-at-scale project
● Explore: using Standards
● Create custom micro services
Andy Petrella (@noootsab)
Maths, Scala, Apache Spark
Spark Notebook, Trainer, Data Banana
Xavier Tordoir (@xtordoir)
Physics, Bioinformatics
Scala, Spark
PART I: Spark & Genomics
Adam: genomics on Spark
1K Genomes in Adam on S3
Explore: Compute Stats
Learn: train a model
So that’s the thing that separates us?
Adam: What is genomics data?
Okay, sounds good. Give me two of them!
The genome is an important factor in health:
● Medical diagnostics
● Drug response
● Disease mechanisms
● …
Adam: What is genomics data?
You mean devs are slacking off?
On the data production:
Fast biotech progress
Not so fast IT progress?
Adam: What is genomics data?
No! They’re just sticky bubbles...
On the data production:
Sequence {A, T, G, C}
3 billion bases
Adam: What is genomics data?
Okay, a lot of bubbles.
… x 30 (x 60?)
Adam: What is genomics data?
C’mon, a big mess of plenty of lil’ bubbles then.
On the data production: massively parallel
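To put the scale above in numbers, a quick back-of-envelope sketch (the 3 billion bases and 30x/60x coverage figures are from the slides):

```scala
// Back-of-envelope on the slide's numbers: ~3 billion bases per genome,
// each base read ~30 (or ~60) times by the sequencer.
val bases = 3000000000L
val at30x = bases * 30  // ~9e10 base observations per genome at 30x
val at60x = bases * 60
println(s"30x: $at30x bases, 60x: $at60x bases")
```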
Adam: What is genomics data?
Ah, that explains why the black bars are different
Adam: What is genomics data?
Dude... Tens of millions
Adam: What is genomics data?
Staaaaaaph Tens of millions
1,000’s, 1,000,000’s, …
Adam: What is genomics data?
‘coz it makes sparkling bubbles, right?
Ok, looks like Apache Spark makes a lot of sense here …
Adam: An understandable model
Well done, a spec as text in a PDF…
Adam: An understandable model
Take that
Adam: An understandable model
Dunno what a Genotype is, but it contains a Variant. Apparently.
Adam: An understandable model
Yeaaah: generated client == more slack
Adam provides an Avro schema
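As a rough sketch of what the slide's nesting means (toy field names only; the real bdg-formats Avro records carry many more fields):

```scala
// Illustrative only: toy versions of the Avro-generated classes;
// the slide's point is simply that a Genotype contains a Variant.
case class Variant(contig: String, position: Long, ref: String, alt: String)
case class Genotype(sampleId: String, variant: Variant, alleles: Seq[String])

val v = Variant("1", 10177L, "A", "AC")
val g = Genotype("HG00096", v, Seq("Ref", "Alt"))
println(g.variant.position)  // the Variant travels inside the Genotype
```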
Adam: An efficient storage
Machism in I.T., what a flaw!
● Distribute data
● Schema based
● Read/query efficient
● Compact
Adam: An efficient storage
That’s a quick step
PARQUET!
Adam: An efficient storage
Is Eve okay to use the parquet for that?
PARQUET!
Adam provides Parquet as storage format
Adam: A clean API
Object Wrappedy
ADAMContext
Adam: A clean API
I could have done this as a one liner
ADAMContext
IO methods
Adam: A clean API
At least, it’s going to be simpler than the chemistry
● Scala classes generated from Avro
● Data loaded as RDDs
● Functions on RDDs:
○ write to HDFS
○ genomic object manipulations
○ primitives to query genomics datasets
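A toy sketch of such a query primitive, with a plain Seq standing in for the RDD (`filterByRegion` and these fields are made up for illustration; they are not the actual ADAM API):

```scala
// Toy genotype record and region filter; in ADAM this would be an RDD
// transformation over the Avro-generated classes.
case class Geno(contig: String, position: Long, sampleId: String)

def filterByRegion(gts: Seq[Geno], contig: String,
                   start: Long, end: Long): Seq[Geno] =
  gts.filter(g => g.contig == contig && g.position >= start && g.position < end)

val gts = Seq(
  Geno("1", 100L, "HG00096"),
  Geno("1", 250L, "HG00097"),
  Geno("2", 100L, "HG00096"))

println(filterByRegion(gts, "1", 0L, 200L))  // keeps only the genotype at 1:100
```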
Adam: Part of a pipeline
human | Seq | SNAP | Avocado | Adam | GA4GH
ADAM is a JVM library leveraging:
- Spark
- Avro
- Parquet
It still needs to be combined with sources (SNAP).
ADAM data is part of processes (Avocado).
It can also be the source for external processing and learning (like MLlib).
1000 Genomes: Open Data Set
Games without Frontiers
1000 genomes: http://www.1000genomes.org/
Produces BAMs, VCFs, ...
1000 Genomes
Why do you complain, they are compressed …
1000 Genomes: Where are the data?
DNA Russian roulette: which is fastest?
● EBI FTP: ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/
● NCBI FTP: ftp://ftp-trace.ncbi.nih.gov/1000genomes/ftp/
● S3: http://aws.amazon.com/1000genomes/
● GS: gs://genomics-public-data/ftp-trace.ncbi.nih.gov/1000genomes/ftp
1000 Genomes: Adam that shit on S3
Hmmm like in the good old days of HPC
The bad part …
● Get the vcf.gz file on local disk (& time for a coffee)
● Uncompress (& go for lunch)
● Put in HDFS (& take dessert)
1000 Genomes: Adam that shit on S3
what? No grappa?
The good part …
the Notebook (this one)
1000 Genomes: Adam that shit on S3
Okay, good enough to wait a bit…
What did we gain?
● Before: 152 GB (gzipped) in 23 files
● After: 71 GB in 9,172 partitions (43,372,735,220 genotypes)
Explore Genomics: Access the data
Just in case, you don’t believe us -_-’
Access data from this notebook
Explore Genomics: Compute statistics
We’re there to compute, right?
Compute freqs from this Spark Notebook
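At its core the frequency computation is a grouped count; a minimal sketch with a plain collection standing in for the RDD (the 0 = ref / 1 = alt allele encoding and the toy data are assumptions for illustration):

```scala
// Per (variant, sample): the two observed alleles, 0 = reference, 1 = alternate.
val calls: Seq[(String, Seq[Int])] = Seq(
  ("1:10177", Seq(0, 1)),
  ("1:10177", Seq(1, 1)),
  ("1:10177", Seq(0, 0)))

// Alternate-allele frequency per variant: alt count / total alleles observed.
val altFreq: Map[String, Double] =
  calls.groupBy(_._1).map { case (variant, obs) =>
    val alleles = obs.flatMap(_._2)
    variant -> alleles.sum.toDouble / alleles.size
  }

println(altFreq("1:10177"))  // 3 alt alleles out of 6 -> 0.5
```

In the notebook the same shape of computation runs as Spark aggregations over the full 43-billion-genotype dataset.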
Learn Genomics: The problem
Insane, you’ll have hard time with me |:-[
How to deal with heterogeneous data?
● Population stratification
● Identify natural clusters
● Assign genomes to these clusters
Learn Genomics: The dimensions
Wiiiiiiiiiiiiiiiiide rows
● 1,000 samples (rows)
● 30,000,000 variants (columns or variables)
Hard to explore such a feature space…
Learn Genomics: The dimensions
*LDA for Latent Dirichlet Allocation…
Dimensionality reduction?
● Ideal would be a “genetic mixture” measure (LDA* would do that…)
● Or a genetic distance (edit distance)
KMeans & distances to centroids
Learn Genomics: The model
Reduce, train, validate, infer
● Split training/validation set
● Train KMeans with 25 clusters
● Compute distances to each centroid as new features
● Train Random Forest
● Validate
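The "distances to each centroid as new features" step can be sketched like this, with plain arrays instead of MLlib vectors (the toy 2-D centroids are illustrative; the real model uses the 25 centroids learned by KMeans):

```scala
// Euclidean distance between two points.
def euclidean(a: Array[Double], b: Array[Double]): Double =
  math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

// The K centroids learned by KMeans turn each sample into K new features:
// its distance to every centroid.
def centroidFeatures(point: Array[Double],
                     centroids: Seq[Array[Double]]): Array[Double] =
  centroids.map(c => euclidean(point, c)).toArray

val centroids = Seq(Array(0.0, 0.0), Array(3.0, 4.0))
val features = centroidFeatures(Array(0.0, 0.0), centroids)
println(features.toSeq)  // distance 0 to the first centroid, 5 to the second
```

The Random Forest then trains on these K-dimensional vectors instead of the raw 30-million-column rows.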
Learn Genomics: The notebook
Define and train the model in this Notebook
The whole shebang?
Adam: Our pipeline
I am a Llama
● Convert VCFs to ADAM
● Store ADAM to S3
● Compute allele frequencies
● Store allele frequencies to S3
● Compute minor allele frequency distribution
● Train a model for stratification
Hmmm… quite some missing pieces, right?
PART II: Standards & Micro Services
Wake up!
GA4GH: Standard for Genomics
med-at-scale project
Explore: using Standards
Create custom micro services
GA4GH: Let’s fix the baseline
In I.T. it’s easy, everything is standardized…
Global Alliance for Genomics and Health
http://genomicsandhealth.org/
http://ga4gh.org/
Framework for responsible data sharing:
● Define schemas
● Define services
Along with ethical, legal, security, and clinical aspects
GA4GH: Models
… everybody has its own standard
GA4GH: Services
But a shared schema is a bit better!
GA4GH: Metadata
The data of my data is also my data
Work In Progress
● Individual
● Sample
● Experiment
● Dataset
● IndividualGroup
● Analysis
But still very young and too much centered on data
Beacon ⁽*⁾
Tells the world you have data. Clearly not enough.
Med At Scale: By Data Fellas
Existing scalable implementation: Google Genomics
Uses:
● BigQuery
● Google Cloud computing
● Dremel
● …
That’s what happens when you think you have…
Med At Scale: By Data Fellas
Google Genomics is pushing Hard
…
Med At Scale: Scalability first
BIG
There is another scalable implementation: Med At Scale, by Data Fellas
Uses:
● Apache Spark
● ADAM
● S3
● HDFS
● …
Med At Scale: Scalability first
Data Fellas is pushing TOO
BIG
Med At Scale: Composability
very BIG
GA4GH defines quite a few methods, or services.
They don’t all have the same requirements in terms of exposure and data processing.
→ micro services for the Win
Allows granular deployment and composition/chaining of methods to answer a global question
Med At Scale: Customization
Data Fellas is a data science company
Thus our goal is to expose data analyses
A data analysis is:
● elaborated in a notebook
● validated on a cluster
● deployed as a micro service itself
Still defining a Schema and Service
VERY VERY BIG
Med At Scale: Ready for the load
Balls!
We saw that one row has
30,000,000 columns
The queries are slicing and dicing those columns → views are huge
Hence Tachyon, via RDD.persist/save, will optimize the collocated queries in space and time.
The hard part will be to size the Tachyon cluster.
Med At Scale: Ad Hoc Analytics
Who left the rats out?
Standards are very important
However, they cannot define everything, especially OLAP.
Ad-Hoc analytics are thus allowed on the raw data using Apache Spark directly.
Of course, interactivity is a key to performance… hence the Spark-Notebook is involved.
Med At Scale: How it works
Finally…
Med At Scale: ADAM (and Spark)
Med At Scale: MLlib (and Spark)
Med At Scale: Efficient binary data
Med At Scale: Micro Service
Med At Scale: Cache and Collaboration
Explore: Using GA4GH endpoints
notebook TIME!
Use the Scala/Java Avro client from the browser.
I give you BananasYou give me Ananas
Customize: Create and use micro services (WIP)
Planning the next gear
Remember the frequencies use case? There is a custom endpoint, manually created.
We’re working on an Integrated Workflow
In a notebook:
● Create the process
● Create the Cassandra schema
● Persist (using the connector)
● Define the service Avro IDL
● Generate the project for DCOS
● Log usage (see next)
Optimization: Query mining (Roadmap)
Always look at the bright side
Back to the high dimensionality problem
Caching beforehand is a good solution, but it is not optimal.
Plan: Analyse the Request/Response objects and the gathered runtime metrics to adapt the caching policies -- query mining processes
References
Adam: https://github.com/bigdatagenomics/adam
bdg-formats: https://github.com/bigdatagenomics/bdg-formats
GA4GH website: http://genomicsandhealth.org/
GA4GH data working group: http://ga4gh.org/
Spark-Notebook: https://github.com/andypetrella/spark-notebook/
Med-At-Scale: https://github.com/med-at-scale/high-health
Data Fellas: http://data-fellas.guru/
Q/A ⁽*⁾ THANKS!
⁽*⁾ or head to the pub (at least beers…)