h2o world - sparkling water on the spark notebook: interactive genomes clustering - xavier tordoir

Sparkling Water on the Spark Notebook: Interactive Genomes Clustering

Why you must care, by Data Fellas

Xavier Tordoir

xtordoir@data-fellas.guru

@xtordoir

Lineup

● Apache Spark
● Interactivity: Spark notebook
● Genomics on Spark: ADAM
● Data exploitation
● H2O w/ Spark: Sparkling Water
● Show time
● Streamlining dev/deployment

Can’t wait!

Data Fellas

Andy Petrella

Maths, Geospatial, Distributed Computing

Spark Notebook, Trainer Spark/Scala, Machine Learning

Xavier Tordoir

Physics, Bioinformatics, Distributed Computing

Scala (& Perl), Trainer Spark, Machine Learning

Distributed computing framework

Large Scale Data Processing engine

I play BIG!

What is Apache Spark?

Large Scale Data Processing engine

● SQL & DataFrames
● Streaming
● Graph Processing
● Machine Learning

With all colors!

● Optimize memory usage (FAST)
● Optimize computation execution (Complex tasks)
● Easy programming model
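To make the memory and execution bullets concrete, here is a minimal sketch of lazy evaluation with caching. `LazyDataset` is a hypothetical toy class written for this illustration; it is not Spark's API and only mimics the idea that transformations are recorded first, run on demand, and can be kept in memory for reuse.

```python
class LazyDataset:
    """Toy stand-in for a lazily evaluated, cacheable dataset (not Spark's API)."""

    def __init__(self, compute):
        self._compute = compute        # zero-argument callable producing a list
        self._cache = None             # materialized result, if cached
        self._cache_enabled = False
        self.evaluations = 0           # how many times the computation really ran

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    def map(self, fn):
        # Nothing runs here: we only describe how to derive the new dataset.
        return LazyDataset(lambda: [fn(x) for x in self.collect()])

    def filter(self, predicate):
        return LazyDataset(lambda: [x for x in self.collect() if predicate(x)])

    def cache(self):
        # Ask for the result to be kept in memory after the first evaluation.
        self._cache_enabled = True
        return self

    def collect(self):
        if self._cache_enabled and self._cache is not None:
            return self._cache
        self.evaluations += 1
        result = self._compute()
        if self._cache_enabled:
            self._cache = result
        return result


# Genotypes encoded as alternate-allele counts (an assumption of this sketch).
genotypes = LazyDataset.from_list([0, 1, 2, 1, 0, 2])
non_ref = genotypes.filter(lambda g: g > 0).cache()

total = sum(non_ref.collect())   # first action: runs the filter
count = len(non_ref.collect())   # second action: served from the cache
print(total, count, non_ref.evaluations)
```

The second action reuses the cached result, which is the behaviour that makes repeated interactive queries on the same data cheap.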

Checking in cache If I remember...

● Interactive
● @ any scale

http://spark-notebook.io

Laurel? Hardy? Anyone?

● Scala (types, production quality)
● Reactive & pluggable charts API (scala = no.js)
● easy install, no deps.
● multiple sparkContext out of the box.

http://bdgenomics.org/

ADAM Project (UC Berkeley):

● Data format (schema, compact, distributed): avro + parquet

● API (Reads, Variants, Genotypes, …)
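As a rough illustration of why a schema plus a columnar file format gives a compact layout, the sketch below turns schema-typed rows into per-field columns. The `GenotypeRecord` fields are hypothetical names chosen for this example; they are not ADAM's actual Avro schema, and the code does not read or write Avro or Parquet files.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class GenotypeRecord:
    """Hypothetical record type; field names are assumptions, not ADAM's schema."""
    sample_id: str
    contig: str
    position: int
    alt_allele_count: int


def to_columns(records):
    """Pivot row-oriented records into one list per field (a columnar layout)."""
    return {
        f.name: [getattr(r, f.name) for r in records]
        for f in fields(GenotypeRecord)
    }


rows = [
    GenotypeRecord("s1", "chr1", 100, 0),
    GenotypeRecord("s1", "chr1", 200, 2),
    GenotypeRecord("s2", "chr1", 100, 1),
]

columns = to_columns(rows)
print(columns["position"])
print(columns["alt_allele_count"])
```

Values of one field sit next to each other in a columnar layout, so a query that needs only `alt_allele_count` can skip the other columns, and repeated values such as `contig` compress well.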

I, ADAM

Genomics with Spark?

1000 genomes: http://www.1000genomes.org/

~1000 samples

~30M Genotypes per sample (features)

Genomics: The data

Please, don’t mind the colors...

Genomics: The data

So… that’s what separates us huh?

~1000 samples

Few samples => Machine Learning

Genomics: The data

Woooow, really, you must be kidding me… ahahahahah

~1000 samples

~30M Genotypes per sample (features)

Few samples => Machine Learning

Lots of Data => Distributed computing
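A quick back-of-the-envelope calculation shows why these two numbers pull in different directions. The sample and genotype counts come from the slide; the one-byte-per-genotype encoding is an assumption made only for this estimate.

```python
samples = 1_000                      # "~1000 samples"
genotypes_per_sample = 30_000_000    # "~30M genotypes per sample"
bytes_per_genotype = 1               # assumption: one byte per genotype call

# Size of the dense samples x genotypes matrix.
matrix_cells = samples * genotypes_per_sample
dense_gib = matrix_cells * bytes_per_genotype / 2**30

# How many features there are for every single sample.
features_per_sample_ratio = genotypes_per_sample // samples

print(f"cells: {matrix_cells:,}")
print(f"dense size: {dense_gib:.1f} GiB")
print(f"features per sample: {features_per_sample_ratio:,}")
```

With tens of thousands of features for each sample the learning problem is very wide, and tens of billions of cells are more than a single machine handles comfortably, hence the distributed setting.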

Genomics: The data

Oh… damned… hum huh

Population stratification

w/ Deep Learning? H2O

From the Spark Notebook? Sparkling Water
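To show what grouping samples by population can look like on genotype vectors, here is a minimal sketch using plain k-means on toy data. This is a deliberately simpler technique than the deep learning approach named on the slide, and it does not use H2O or Sparkling Water; the sample names, genotype values, and starting centroids are all made up for the illustration.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def kmeans(points, centroids, iterations=10):
    """Plain k-means with caller-supplied initial centroids (deterministic)."""
    centroids = [list(c) for c in centroids]
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest centroid.
        assignments = [
            min(range(len(centroids)),
                key=lambda k: squared_distance(p, centroids[k]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignments) if a == k]
            if members:
                centroids[k] = mean_vector(members)
    return assignments, centroids


# Toy genotype vectors: alternate-allele counts (0, 1 or 2) at four variants.
samples = {
    "s1": [0, 0, 1, 0],
    "s2": [0, 1, 0, 0],
    "s3": [2, 2, 1, 2],
    "s4": [2, 1, 2, 2],
}
names = list(samples)
points = [samples[name] for name in names]

assignments, centroids = kmeans(points, centroids=[points[0], points[2]])
clusters = dict(zip(names, assignments))
print(clusters)
```

On real data the vectors would have millions of components per sample, which is where a distributed, in-memory engine becomes necessary.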

Genomics: The problem

Here I need some water.

Memory implementation of “Map-Reduce”

Highly optimised structures for the JVM

blazing fast convergent models

Higher API
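The in-memory map-reduce idea can be sketched in a few lines: each chunk of data is summarised independently, then the partial results are combined. This is a generic illustration in plain Python, not H2O's implementation or API; the chunking and the diploid genotype encoding are assumptions of the example.

```python
from functools import reduce


def map_chunk(chunk):
    """Map step: summarise one chunk as (alt alleles seen, alleles observed)."""
    alt_alleles = sum(chunk)
    return alt_alleles, 2 * len(chunk)   # assumption: diploid, two alleles each


def combine(left, right):
    """Reduce step: merge two partial summaries."""
    return left[0] + right[0], left[1] + right[1]


# Genotypes for one variant, split into chunks as if held by different workers.
chunks = [[0, 1, 2], [2, 2], [0, 0, 1]]

alt_count, allele_count = reduce(combine, map(map_chunk, chunks))
alt_frequency = alt_count / allele_count
print(alt_count, allele_count, alt_frequency)
```

Because `combine` is associative, the partial summaries can be merged in any order, which is what lets the map step run in parallel over chunks held in memory.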

H2O

Sparkling: in-memory data exchange

I remember things better with two copies in memory.

http://h2o.ai/product/sparkling-water/

Showtime!

press play...

There’s a notebook for that

“Create” Cluster

Find sources (context, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

Shar3 (Data Fellas)

[Diagram labels: ops, data, sci, web]

Shar3 (Data Fellas)

Analysis

Production

Distribution

Rendering

Discovery

Catalog

Project Generator

Micro Service / Binary format

Schema for output

Metadata

Summary

Spark and the Spark Notebook are interactive and leverage distributed computing infrastructure

ADAM provides an optimized storage format and API for massive genomic data

Spark provides tools to manipulate data and works with other libraries like H2O

Data scientists and application developers can work together

Wake up, we’re back!

Acknowledgements

Frank Nothaft

Matt Massie

Neil Fergusson

Vinod & Michal

Thank you for your attention!

Questions?

And now let’s talk.
