H2O World - Sparkling Water on the Spark Notebook: Interactive Genomes Clustering - Xavier Tordoir

Post on 24-Jan-2018


Sparkling Water on the Spark Notebook: Interactive Genomes Clustering

Why you must care, by Data Fellas

Xavier Tordoir

xtordoir@data-fellas.guru

@xtordoir

● Apache Spark
● Interactivity: Spark notebook
● Genomics on Spark: ADAM
● Data exploitation
● H2O w/ Spark: Sparkling water
● Show time
● Streamlining dev/deployment

Lineup

Can’t wait!

Data Fellas

Andy Petrella

Maths, Geospatial, Distributed Computing

Spark Notebook, Trainer Spark/Scala, Machine Learning

Xavier Tordoir

Physics, Bioinformatics, Distributed Computing

Scala (& Perl), trainer Spark, Machine Learning

Distributed computing framework

Large Scale Data Processing engine

I play BIG!

What is Apache Spark?

Distributed computing framework

Large Scale Data Processing engine

● SQL & Dataframes
● Streaming
● Graph Processing
● Machine Learning

With all colors!

What is Apache Spark?

Distributed computing framework

Large Scale Data Processing engine

● Optimize memory usage (FAST)
● Optimize computation execution (Complex tasks)
● Easy programming model

Checking in cache If I remember...

What is Apache Spark?

Distributed computing framework

Large Scale Data Processing engine

● Interactive
● @ any scale

http://spark-notebook.io

Laurel? Hardy? Anyone?

What is Apache Spark?

● Scala (types, production quality)
● Reactive & pluggable charts API (scala = no.js)
● easy install, no deps.
● multiple sparkContext out of the box.

What is Apache Spark?

http://bdgenomics.org/

ADAM Project (UC Berkeley):

● Data format (schema, compact, distributed): avro + parquet

● API (Reads, Variants, Genotypes, …)
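To make the "schema, compact" point concrete, here is a minimal sketch of a typed record pivoted into a column-oriented layout, which is the general idea behind Parquet-style storage. The `GenotypeRecord` class and its fields are hypothetical and much simpler than the real ADAM schemas; nothing here uses the ADAM, Avro, or Parquet APIs.

```python
from dataclasses import dataclass, fields


# Hypothetical, simplified record for illustration only;
# the real ADAM schemas carry many more fields.
@dataclass(frozen=True)
class GenotypeRecord:
    sample_id: str
    contig: str
    position: int
    alt_allele_count: int  # 0, 1 or 2 copies of the alternate allele


def to_columns(records):
    """Pivot row-oriented records into one list per field (a columnar layout)."""
    return {
        f.name: [getattr(r, f.name) for r in records]
        for f in fields(GenotypeRecord)
    }


rows = [
    GenotypeRecord("sampleA", "chr1", 1000, 0),
    GenotypeRecord("sampleA", "chr1", 2000, 2),
    GenotypeRecord("sampleB", "chr1", 1000, 1),
]
columns = to_columns(rows)

# A query that needs a single field only has to touch that one column.
print(columns["alt_allele_count"])
```

Storing values of the same field together is what lets a columnar format compress well and skip unneeded fields when reading.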

I, ADAM

Genomics with Spark?

1000 genomes: http://www.1000genomes.org/

~1000 samples

~30M Genotypes per sample (features)

Genomics: The data

Please, don’t mind the colors...

Genomics: The data

So… that’s what separates us huh?

1000 genomes: http://www.1000genomes.org/

~1000 samples

Few samples => Machine Learning

Genomics: The data

Woooow, really, you must be kidding me… ahahahahah

1000 genomes: http://www.1000genomes.org/

~1000 samples

~30M Genotypes per sample (features)

Few samples => Machine Learning

Lots of Data => Distributed computing
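A quick back-of-the-envelope calculation shows why this shape of data (few rows, very many features) calls for distributed computing. The sample and genotype counts are the approximate figures from the slide; the one byte per genotype is purely an assumption for illustration.

```python
# Approximate figures from the slide.
samples = 1_000
genotypes_per_sample = 30_000_000

# Total number of genotype values across the whole cohort.
total_genotypes = samples * genotypes_per_sample

# Assumption for illustration: one byte per encoded genotype.
bytes_per_genotype = 1
size_gb = total_genotypes * bytes_per_genotype / 1e9

print(f"{total_genotypes:,} genotype values, about {size_gb:.0f} GB at 1 byte each")
```

Even under that optimistic encoding, the feature matrix is tens of gigabytes wide while having only about a thousand rows.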

Genomics: The data

Oh… damned… hum huh

Population stratification

w/ Deep Learning? H2O

From the spark notebook? Sparkling water

Genomics: The problem
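As a rough illustration of what clustering samples by genotype means, the sketch below groups a handful of toy samples with plain k-means. This is a stand-in technique chosen for brevity, not the H2O Deep Learning model used in the talk, and the genotype vectors (alternate-allele counts of 0, 1 or 2 at four variants) are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def kmeans(points, centroids, iterations=10):
    """Plain k-means with caller-supplied starting centroids."""
    assignment = []
    for _ in range(iterations):
        # Assign every point to its nearest centroid.
        assignment = [
            min(range(len(centroids)),
                key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members (keep it if empty).
        updated = []
        for c in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == c]
            updated.append(mean(members) if members else centroids[c])
        centroids = updated
    return assignment, centroids


# Made-up samples: alternate-allele counts at four variants.
genotypes = [
    (0, 0, 2, 2),
    (0, 1, 2, 2),
    (2, 2, 0, 0),
    (2, 1, 0, 0),
    (0, 0, 2, 1),
]

# Deterministic start: use the first and third samples as initial centroids.
assignment, centroids = kmeans(genotypes, [genotypes[0], genotypes[2]])
print(assignment)
```

Samples that share the same cluster index have similar allele patterns, which is the intuition behind recovering population structure from genotypes.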

Here I need some water.

In-memory implementation of “Map-Reduce”

Highly optimised structures for the JVM

blazing fast convergent models

H2O
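The in-memory map-reduce pattern mentioned above can be sketched with the Python standard library alone. This is not H2O's API; the partitions, variant keys, and helper names are hypothetical, and the example simply sums alternate-allele counts per variant across data partitions.

```python
from functools import reduce

# Hypothetical partitions of (variant, alternate-allele count) records,
# as if the data were split across workers.
partitions = [
    [("chr1:1000", 0), ("chr1:2000", 2)],
    [("chr1:1000", 1), ("chr1:2000", 1)],
    [("chr1:1000", 2)],
]


def map_partition(records):
    """Map step: compute partial sums per variant within one partition."""
    partial = {}
    for variant, alt_count in records:
        partial[variant] = partial.get(variant, 0) + alt_count
    return partial


def merge(left, right):
    """Reduce step: combine two partial results into one."""
    merged = dict(left)
    for variant, count in right.items():
        merged[variant] = merged.get(variant, 0) + count
    return merged


# Each partition is summarised independently, then the summaries are merged.
totals = reduce(merge, map(map_partition, partitions), {})
print(totals)
```

Because the merge is associative, partial results can be combined in any grouping, which is what allows the work to be spread across machines while the data stays in memory.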

Higher API

H2O

Sparkling: in-memory data exchange

I remember things better with two copies in memory.

http://h2o.ai/product/sparkling-water/

Showtime!

press play...

There’s a notebook for that

“Create” Cluster

Find sources (context, quality, semantic, …)

Connect to sources (structure, schema/types, …)

Create distributed data pipeline/Model

Tune accuracy

Tune performances

Write results to Sinks

Access Layer

User Access

Shar3 (Data Fellas)

[Diagram labels: ops, data, sci, web, in varying combinations]

Shar3 (Data Fellas)

[Diagram labels: Analysis, Production, Distribution, Rendering, Discovery, Catalog, Project Generator, Micro Service / Binary format, Schema for output, Metadata]

Spark and the Notebook are interactive and leverage distributed computing infrastructure

ADAM is an optimized storage format for massive genomic data

Spark provides tools to manipulate data and works w/ other libraries like H2O

Data scientists and application developers can work together

Summary

Wake up, we’re back!

Acknowledgements

Frank Nothaft

Matt Massie

Neil Fergusson

Vinod & Michal

Thank you for your attention!

Questions?

And now let’s talk.
