BiobankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale


Scalable genomic data processing and interoperable systems with ADAM/Spark

Andy Petrella, Xavier Tordoir

2015-02-19

Lineup

Intro
● who we are
● we do distributed computing

Abstract
● Content: distributed machine learning on genome data
● Distributed data and processing (S3, Spark, Tachyon)
● Distributed machine learning (MLlib, H2O)
● Spark Notebook

Context
● 1000 Genomes in VCF
● Distributed genomic data in ADAM
● Size matters (VCF → ADAM + partitioned)
● Data available on S3 (s3://med-at-scale/1000genomes)
● Stratification

Procedure
● Deploy Spark on EC2
● Deploy Spark Notebook
● Load data
● Clean data
● Transform data
● Train K-Means

Results
● Prediction (confusion matrix)
● Performance

On the bench
● GA4GH compliant and scalable server
● Ad hoc analyses and sharing (through Tachyon)

Andy

● @Noootsab
● @SparkNotebook creator
● @Devoxx4Kids organizer
● Maths & CS
● Scalable systems, machine learning

Med@Scale

Xavier

● @xtordoir
● Physics
● Data analysis
● Genomics
● Distributed computing

Products (OSS)
● SparkNotebook
● GA4GH server

What do we do? Distributed computing consultancy in

● Internet of Things
● Finance
● Geospatial
● Marketing

Training and coaching in
● Scala
● Spark
● Distributed architecture
● Distributed machine learning

Research and development
● Distributed machine learning models
● Genomics and health

Data: 1000 Genomes (genotypes + sample population labels)

- Quite a lot of data → a real scalability test

- Machine learning:
  - Genotype inference
  - Population classification (supervised learning)
  - Population stratification (unsupervised learning)

Distributed Machine Learning on Genotype Data

The era of distributed computing

Strong open-source ecosystem, industrial development and research

- Infrastructure can be elastic (e.g. EC2/S3)

- Data storage: HDFS (large blocks…), S3 (remote...)

- Processing: Beefed up MapReduce: Spark

- Escaping the IOPS: Tachyon in-memory filesystem (see the sketch after this list)

- Scheduling, HA (Mesos, Marathon)
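Where Tachyon fits in practice: in Spark 1.x, persisting an RDD at the OFF_HEAP storage level stores its blocks in Tachyon (the cluster is pointed to via the spark.tachyonStore.url setting), keeping data in memory for sharing across jobs without pressuring the JVM heap. A minimal sketch, assuming a running Tachyon cluster; the path is a placeholder:

    import org.apache.spark.storage.StorageLevel

    // With Spark 1.x, OFF_HEAP blocks live in Tachyon, so other jobs and
    // notebooks attached to the same Tachyon cluster can reuse them.
    val lines = sc.textFile("s3n://med-at-scale/1000genomes/...")
    lines.persist(StorageLevel.OFF_HEAP)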

Distributed Data Processing

Berkeley Data Analytics Stack (more here)

Distributed Data Processing

SparkNotebook

Interactive Distributed Computing

[Figure: repeated "Dev' time" cycles — interactive development with the notebook]

Distributed Genomic Data

1000 Genomes
● 1,092 samples
● 43,372,735,220 genotypes

Original data
● VCF: non-partitioned files on FTP or S3, 152 GB (gzipped)
● the VCF format is not easily parallelizable, even worse with compression

ADAM / med-at-scale
● ADAM files on S3: 70.75 GB (Parquet, compressed)
● 9,172 partitions (~7 MB each)
● @see http://med-at-scale.s3.amazonaws.com/1000genomes/counts.html

Eggo project: https://github.com/bigdatagenomics/eggo

Data
We have the 1000 Genomes data, hence
- we have genotypes
- we have the samples' population labels

Exploration
We can cluster the samples and compare the clusters with the samples' actual populations.

Model
We can run simple stratification algorithms, such as K-Means.

Technology assessment

K-Means

MLlib provides K-Means (not hierarchical) → we limit the test to 3 populations

MLlib uses the Breeze linalg library → only the Euclidean metric (at that time)

Genotype encoding (A = reference allele):

AA → 0
AT → 1
TT → 2
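In other words, each genotype becomes the number of non-reference alleles it carries. A minimal illustration of the encoding (the function name and signature are ours, not code from the talk):

    // 0/1/2 = how many of the two alleles differ from the reference allele.
    def encode(alleles: Seq[String], ref: String): Int = alleles.count(_ != ref)

    encode(Seq("A", "A"), ref = "A")  // AA -> 0
    encode(Seq("A", "T"), ref = "A")  // AT -> 1
    encode(Seq("T", "T"), ref = "A")  // TT -> 2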

Procedure: Spark on EC2 cluster

- spark-ec2 script

- 2 to 40 workers (13 GB / 4 cores each)

- 10 to 40 minutes to launch

[Diagram: one Driver and several Workers on EC2]

$ ./spark-ec2 -k <key-pair> -i <key-file> -s <num-slaves> launch <cluster-name>

Procedure: Spark Notebook on EC2 cluster

- access from your browser

- configure Spark

- control computations on the cluster

[Diagram: the Driver (notebook) controlling the Workers on the cluster]

Procedure: Load data

- Read ADAM data from the S3 repository

- Read the samples' populations

[Diagram: Workers reading the data from S3, coordinated by the Driver]
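A minimal sketch of this load step, assuming an ADAM 0.x build whose ADAMContext implicits expose loadGenotypes on the SparkContext (method names vary across ADAM versions), and a tab-separated sample/population panel file whose name and layout are placeholders:

    import org.apache.spark.rdd.RDD
    import org.bdgenomics.adam.rdd.ADAMContext._
    import org.bdgenomics.formats.avro.Genotype

    // Genotypes straight from the partitioned ADAM/Parquet files on S3.
    val gts: RDD[Genotype] =
      sc.loadGenotypes("s3n://med-at-scale/1000genomes/chr22.adam")

    // sampleId -> population (e.g. "HG00096" -> "GBR"), collected to the driver.
    val panel: Map[String, String] =
      sc.textFile("s3n://med-at-scale/1000genomes/samples.panel")
        .map { line => val f = line.split("\t"); f(0) -> f(1) }
        .collect().toMap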

Procedure: Filter and clean data

- Sample: chromosome slice (chr22), 3 populations (GBR, ASW, CHB)

- Missing genotypes (remove incomplete variants)

          Variant1  Variant2  Variant3  Variant4  Variant5  Variant6  Variant7
Sample1      0         0         1         0         1         0         1
Sample2      2        NA         1         2         1         0         0
Sample3      2         0         1         2         2         0         2
Sample4      1         1         0         0         0        NA         0
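A minimal sketch of the cleaning step, continuing from gts above: encode each call as Some(0|1|2), or None for a no-call, then drop every variant with at least one missing call so that all sample vectors end up the same length. The Avro accessors follow the 2015-era bdg-formats model (GenotypeAllele.Ref / NoCall) and may differ in other versions:

    import scala.collection.JavaConverters._
    import org.bdgenomics.formats.avro.GenotypeAllele

    // (variantId, sampleId, Option[0|1|2]); None encodes the NAs in the table above.
    val calls = gts.map { g =>
      val alleles = g.getAlleles.asScala
      val enc =
        if (alleles.contains(GenotypeAllele.NoCall)) None
        else Some(alleles.count(_ != GenotypeAllele.Ref).toDouble)
      val variantId = g.getVariant.getContig.getContigName + ":" + g.getVariant.getStart
      (variantId, g.getSampleId.toString, enc)
    }

    // Remove incomplete variants (any variant with at least one missing call).
    val incomplete = calls.filter(_._3.isEmpty).map(_._1).distinct().collect().toSet
    val complete   = calls.filter { case (v, _, _) => !incomplete.contains(v) }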

Procedure: Transform data

- Flat Genotype collection → Sample collection

- Each Sample is a Vector of Genotypes (0, 1, 2)

- Vector is ordered consistently

Genotype = (Variant, Sample ID, Alleles)

Sample = (Sample ID, Vector[Genotype] ordered by Vector[Variant])
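A minimal sketch of this transform, continuing from complete above: fix one global ordering of the variants, then group the flat calls by sample and lay out each sample's genotypes in that order, so vectors line up component by component (broadcasting variantIndex would be preferable at scale):

    // One consistent variant ordering, shared by every sample vector.
    val variantIndex: Map[String, Int] =
      complete.map(_._1).distinct().collect().sorted.zipWithIndex.toMap

    // sampleId -> dense array of 0/1/2 values, ordered by variantIndex.
    val samples = complete
      .map { case (v, s, enc) => (s, (variantIndex(v), enc.get)) }
      .groupByKey()
      .mapValues(_.toArray.sortBy(_._1).map(_._2))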

Procedure: Train K-Means

- 10 iterations

- 3 clusters

[Diagram: each Sample (ID, Vector[Genotype]) flattened into a numeric Vector for K-Means]
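A minimal sketch of the training step with MLlib's K-Means, using the slide's settings and the samples built above:

    import org.apache.spark.mllib.clustering.KMeans
    import org.apache.spark.mllib.linalg.Vectors

    // One MLlib vector per sample; cached because K-Means makes repeated passes.
    val data = samples.map { case (_, vals) => Vectors.dense(vals) }.cache()

    // 3 clusters, 10 iterations, as on the slide.
    val model = KMeans.train(data, k = 3, maxIterations = 10)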

Results

~ 100,000 variants

        #0   #1   #2
GBR      0    0   89
ASW     54    0    7
CHB      0   97    0

The procedure reconstructs the actual populations.
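A minimal sketch of how such a confusion matrix can be tallied: predict a cluster for every sample and count (population, cluster) pairs, using the panel map loaded earlier:

    // (population, predicted cluster) -> count; e.g. ("GBR", 2) -> 89 above.
    val confusion = samples
      .map { case (id, vals) => (panel(id), model.predict(Vectors.dense(vals))) }
      .countByValue()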

Results

Performance (cluster size)

                               2 NODES    20 NODES (*)
Cluster launch                 10 min     30.0 min
Count chr22 genotypes (S3)      6 min      1.1 min
Save chr22 from S3 to HDFS     26 min      3.5 min
Count chr22 genotypes (HDFS)   10 min      1.4 min

(*) Cluster size / number of partitions not optimal here: 80 cores / 114 partitions

Results

Performance (cluster size)

121,023 Variants          2 NODES    20 NODES
Missing data (collect)    7.8 min    33 sec
Train (10 iter)           2.1 min    28 sec
Predict (collect)         8 sec      2 sec

Results

Performance, 20 NODES (data size)

                          121,023 Variants    491,222 Variants
Missing data (collect)    33 sec              3.7 min
Train (10 iter)           28 sec              1.6 min
Predict (collect)         2 sec               25 sec

On the bench: Global Alliance for Genomics and Health (GA4GH)

http://genomicsandhealth.org/
http://ga4gh.org/

- Framework for responsible data sharing
- Defines schemas
- Defines services for interoperability

On the bench: GA4GH schemas

On the bench: GA4GH Google implementation

On the bench: GA4GH compliant & scalable server

Open source and available on GitHub:
https://github.com/med-at-scale/high-health

PRs are welcome!

On the bench

Methods grouped into microservices

GA4GH & Custom methods

Thank you

● BiobankCloud, KTH (Jim Dowling)
● UC Berkeley AMPLab, bdgenomics.org team (Frank Nothaft, Matt Massie)
● Cloudera (Uri Laserson)

Hey…

Come back tomorrow morning → for demos

And afternoon → to hack on it!
