Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?
![Page 1: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/1.jpg)
by Data Fellas, Data Enthusiasts v 4.0 (July 13th, ’15)
Scalable and Interoperable data services
Applied to Genomics
![Page 2: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/2.jpg)
Young Belgian Startup
The Data Fellas Startup
Data Science
Xavier Tordoir @xtordoir
Andy Petrella @noootsab
Data Processing
Scalable Machine Learning
Micro Services oriented
![Page 3: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/3.jpg)
Data Fellas Ecosystem
We’ve worked with
![Page 4: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/4.jpg)
Data Fellas: Evangelizing
Training
Scala
Apache Spark (BE, in September) http://spark4devs.data-fellas.guru/
Distributed Machine Learning Pipeline (Oakland, August) http://bigdatascala.bythebay.io/training.html
Apache Spark (SFO with BoldRadius, August)
Talks
Scala IO, Devoxx Belgium, Devoxx France, Scala Days, KTH, KUL, Spark Meetup London, …
more to come (Italy, …)
PMC Member at Strata NY
PMC Member at Devoxx
PMC Member at Foss4G
![Page 5: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/5.jpg)
First: Data Science
Analysis
Spark Notebook
![Page 6: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/6.jpg)
First: Data Science
Analysis
Production
Project Generator
Mesos / C* / DCOS
![Page 7: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/7.jpg)
First: Data Science
Analysis
Production
Distribution
Micro Service / Binary format
Marathon
![Page 8: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/8.jpg)
First: Data Science
Analysis
Production
Distribution
Rendering
Schema for output
GG / D3 …
![Page 9: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/9.jpg)
First: Data Science
Analysis
Production
Distribution
Rendering
Discovery
Service Metadata
SOLR, …
![Page 10: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/10.jpg)
First: Data Science
Analysis
Production
Distribution
Rendering
Discovery
Catalog
Spark Notebook using Services too
![Page 11: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/11.jpg)
First: Data Science
Analysis
Production
Distribution
Rendering
Discovery
Share Analyses
Share Results
Share Datasets
![Page 12: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/12.jpg)
First: Data Science
Project Code Name:
Shar3
![Page 13: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/13.jpg)
Next: Applied TO Genomics
Genomics data is pretty big
● 100,000’s genomes in 2015
● 1,000,000’s …
● 100,000,000’s …
● …
![Page 14: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/14.jpg)
Next: Applied TO Genomics
Genomics data is pretty big and of High dimensionality
One genome:
○ a sequence of 3 billion bases (the basic DNA component)
○ 30 - 60x coverage for quality
○ 10’s to 100’s of millions of variants (bases that vary from one individual to the next)
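
To make these orders of magnitude concrete, here is a minimal back-of-the-envelope sketch. The genome length and coverage range come from the slide; the one-byte-per-base figure is an assumption made only for illustration (real sequencing files also carry quality scores and metadata, and are usually compressed), and the function names are hypothetical.

```python
# Rough sizing of the raw sequence data behind ONE genome.
GENOME_BASES = 3_000_000_000   # ~3 billion bases (from the slide)
COVERAGE_RANGE = (30, 60)      # 30x to 60x coverage (from the slide)
BYTES_PER_BASE = 1             # ASSUMPTION: 1 byte per base call, no quality scores, no compression


def sequenced_bases(genome_bases: int, coverage: int) -> int:
    """Total number of bases read when the genome is covered `coverage` times."""
    return genome_bases * coverage


def raw_size_gb(genome_bases: int, coverage: int, bytes_per_base: int = BYTES_PER_BASE) -> float:
    """Uncompressed size in gigabytes (10^9 bytes) under the assumption above."""
    return sequenced_bases(genome_bases, coverage) * bytes_per_base / 1e9


if __name__ == "__main__":
    for coverage in COVERAGE_RANGE:
        bases = sequenced_bases(GENOME_BASES, coverage)
        size = raw_size_gb(GENOME_BASES, coverage)
        print(f"{coverage}x coverage: {bases:,} bases, about {size:.0f} GB uncompressed")
```

Under this simplified assumption a single genome already lands around the hundred-gigabyte mark before any quality information is stored, which is why the population sizes on the previous slide call for a distributed stack.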
![Page 15: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/15.jpg)
Next: Applied TO Genomics
e.g. 1000genomes project:
● 200TB compressed data
● organised in files/directories
● data formatted following specs in a … PDF
Data and services schemas are required
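
As an illustration of what a machine-readable schema buys over a spec in a PDF, here is a minimal sketch of a typed, validated variant record. The `Variant` class and its field names are hypothetical and chosen only for this example; this is not the bdg-formats or GA4GH schema listed in the references, just the general idea expressed with a plain Python dataclass.

```python
import json
from dataclasses import asdict, dataclass

VALID_BASES = set("ACGTN")  # ASSUMPTION: alleles are plain strings over these letters


@dataclass(frozen=True)
class Variant:
    """Hypothetical variant record; field names are illustrative only."""
    contig: str
    position: int          # 1-based position on the contig (assumption)
    reference_allele: str
    alternate_allele: str

    def __post_init__(self) -> None:
        # The schema is enforced in code, so every producer and consumer
        # applies the same rules instead of re-reading a document.
        if self.position < 1:
            raise ValueError("position must be >= 1")
        for allele in (self.reference_allele, self.alternate_allele):
            if not allele or set(allele) - VALID_BASES:
                raise ValueError(f"invalid allele: {allele!r}")


def to_json(variant: Variant) -> str:
    """Serialize a record so another service can consume it."""
    return json.dumps(asdict(variant), sort_keys=True)


def from_json(text: str) -> Variant:
    """Parse and validate a record received from another service."""
    return Variant(**json.loads(text))


if __name__ == "__main__":
    v = Variant(contig="1", position=12345, reference_allele="A", alternate_allele="G")
    print(to_json(v))
```

Because validation happens on construction, a malformed record fails at the service boundary rather than deep inside an analysis.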
![Page 16: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/16.jpg)
What do we do with genomics data?
Lots of Querying and Learning:
E.G.
● Population structure is a fundamental basis
● Querying relationships between genomes and other biological features
Hey… no one has all data!
Metadata
![Page 17: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/17.jpg)
What do we do with genomics data?
Lots of Querying and Learning:
E.G.
● We do some specific Modelling on some data…
Hey… no two serve the same computations!
Service Discovery
![Page 18: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/18.jpg)
Interoperability
So, no one has all data … BUT all should be able to talk…
Interoperability (GA4GH)
![Page 19: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/19.jpg)
Interoperable…
Analysis
Production
Distribution
Rendering
Discovery
Share Analyses
Share Results
Share Datasets
![Page 20: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/20.jpg)
Interoperable & scalable…
GA4GH + Shar3 = Med@Scale
+ ADAM & Spark
+ In Memory optimization (Tachyon)
+ Deployment (e.g. DCOS)
![Page 21: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/21.jpg)
Wrap-UP
Follow us @DataFellas and get notified about our
+ sharing platform at scale: Shar3
+ Google Genomics At Home (^.^): Med@Scale
+ future plans: modules for Trading, Geospatial, other medical data, …
![Page 22: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/22.jpg)
References
ADAM: https://github.com/bigdatagenomics/adam
Bdg-Formats: https://github.com/bigdatagenomics/bdg-formats
GA4GH website: http://genomicsandhealth.org/
GA4GH data working group: http://ga4gh.org/
@Spark-Notebook: https://github.com/andypetrella/spark-notebook/
Med-At-Scale: https://github.com/med-at-scale/high-health
Data Fellas: http://data-fellas.guru/
Training: http://spark4devs.data-fellas.guru/
![Page 23: Andy Petrella_Med@Scale by Data Fellas: Scalable and Interoperable Genomics data services, what stack to rely on?](https://reader030.vdocuments.us/reader030/viewer/2022032504/55c42b43bb61eb13038b4717/html5/thumbnails/23.jpg)
Q/A
THANKS!