BiobankCloud: Machine Learning on Genomics + GA4GH @ Med at Scale
Posted on 16-Jul-2015
Scalable genomic data processing and interoperable systems with ADAM/Spark
Andy Petrella, Xavier Tordoir
2015-02-19
Lineup
Intro
● who we are
● we do distributed computing
Abstract
● Content: distributed machine learning on genome data
● Distributed data and processing (S3, Spark, Tachyon)
● Distributed machine learning (MLlib, H2O)
● Spark Notebook
Context
● 1000 Genomes in VCF
● Distributed genomic data in ADAM
● Size matters (VCF → ADAM + partitioned)
● Data available on S3 (s3://med-at-scale/1000genomes)
● Stratification
Procedure
● Deploy Spark on EC2
● Deploy Spark Notebook
● Load data
● Clean data
● Transform data
● Train K-Means
Results
● Prediction (confusion matrix)
● Performance
On the bench
● GA4GH compliant and scalable server
● Ad hoc analyses and sharing (through Tachyon)
Andy
● I am @Noootsab
● @SparkNotebook creator
● @Devoxx4Kids organizer
● Maths & CS
● Scalable systems, machine learning
Med@Scale
Xavier
● @xtordoir
● Physics
● Data analysis
● Genomics
● Distributed computing
Products (OSS)
● SparkNotebook
● GA4GH server
What we do?
Distributed computing consultancy in:
● Internet of Things
● Finance
● Geospatial
● Marketing
Training and coaching in:
● Scala
● Spark
● Distributed architecture
● Distributed machine learning
Research and development:
● Distributed machine learning models
● Genomics and health
Data: 1000 Genomes (genotypes + sample population labels)
- Quite a lot of data → a real scalability test
- Machine learning:
  - Genotype inference
  - Population classification (supervised learning)
  - Population stratification (unsupervised learning)
Distributed Machine Learning on Genotypes Data
The era of distributed computing
Strong open-source ecosystem, with industrial development and research
- Infrastructure can be elastic (e.g. EC2/S3)
- Data storage: HDFS (large blocks…), S3 (remote…)
- Processing: a beefed-up MapReduce: Spark
- Escaping the IOPS limit: Tachyon in-memory filesystem
- Scheduling, HA (Mesos, Marathon)
Distributed Data Processing
Berkeley Data Analytics Stack (more here)
Distributed Data Processing
SparkNotebook
Interactive Distributed Computing
Distributed Genomic Data
1000 Genomes: 1092 samples, 43,372,735,220 genotypes
Original data
● VCF, non-partitioned files on FTP or S3: 152 GB (gzipped)
● The VCF format is not easily parallelizable, even worse with compression
ADAM / med-at-scale
● ADAM files on S3: 70.75 GB (Parquet, compressed)
● 9172 partitions (7 MB each)
● See http://med-at-scale.s3.amazonaws.com/1000genomes/counts.html
Eggo project: https://github.com/bigdatagenomics/eggo
Data
We have the 1000 Genomes data, hence:
- we have genotypes
- we have sample population labels
Exploration
We can cluster samples and compare the clusters with the sample populations.
Model
We can run a simple stratification algorithm: K-Means.
Technology assessment
K-Means
MLlib provides K-Means (not hierarchical) → limited to 3 populations
MLlib uses the Breeze linalg library → only the Euclidean metric (at that time)
Genotypes are encoded as alternate-allele counts (here A is the reference allele):
AA → 0
AT → 1
TT → 2
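This encoding is just the number of alleles that differ from the reference. A minimal illustration in plain Python (the talk's pipeline computes this on Spark/ADAM; the function name here is ours):

```python
def allele_count(genotype, ref):
    """Encode a genotype as the number of alleles differing from the reference."""
    return sum(1 for allele in genotype if allele != ref)

# With A as the reference allele:
print(allele_count("AA", "A"))  # 0
print(allele_count("AT", "A"))  # 1
print(allele_count("TT", "A"))  # 2
```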
Procedure: Spark on EC2 cluster
- spark-ec2 script
- 2 to 40 workers (× 13 GB / 4 cores)
- 10 to 40 minutes to launch
[diagram: one driver, several workers]
$ ./spark-ec2 launch
Procedure: SparkNotebook on EC2 cluster
- access from your browser
- configure Spark
- control computations on the cluster
[diagram: one driver, several workers]
Procedure: Load data
- Read ADAM data from the S3 repo
- Read the sample populations
[diagram: one driver, several workers]
Procedure: Filter and clean data
- Sample: a chromosome slice (chr22), 3 populations (GBR, ASW, CHB)
- Missing genotypes (remove incomplete variants)

          Variant1  Variant2  Variant3  Variant4  Variant5  Variant6  Variant7
Sample1   0         0         1         0         1         0         1
Sample2   2         NA        1         2         1         0         0
Sample3   2         0         1         2         2         0         2
Sample4   1         1         0         0         0         NA        0
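The cleaning step keeps only variants that are fully called across all samples. A small sketch of that filter on the table's data (plain Python; the talk does this on a distributed Spark collection):

```python
# Genotype calls per variant, one entry per sample (Sample1..Sample4);
# None stands for a missing (NA) genotype.
calls = {
    "Variant1": [0, 2, 2, 1],
    "Variant2": [0, None, 0, 1],
    "Variant3": [1, 1, 1, 0],
    "Variant4": [0, 2, 2, 0],
    "Variant5": [1, 1, 2, 0],
    "Variant6": [0, 0, 0, None],
    "Variant7": [1, 0, 2, 0],
}

# Remove incomplete variants: keep only those called in every sample.
complete = {v: g for v, g in calls.items() if None not in g}

print(sorted(complete))
# ['Variant1', 'Variant3', 'Variant4', 'Variant5', 'Variant7']
```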
Procedure: Transform data
- Flat Genotype collection → Sample collection
- Each Sample is a vector of genotypes (0, 1, 2)
- The vector is ordered consistently

Genotype: { Variant, Sample (ID), Alleles }
    ↓
Sample: { Sample (ID), Vector[Genotype], Vector[Variant] }
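A toy sketch of that pivot in plain Python (the real collection is distributed; the record shape below is our simplification of ADAM's Genotype, and the variant IDs are made up):

```python
from collections import defaultdict

# Flat genotype records: (sample id, variant id, alternate-allele count).
records = [
    ("Sample1", "chr22:17000", 0), ("Sample1", "chr22:18000", 1),
    ("Sample2", "chr22:17000", 2), ("Sample2", "chr22:18000", 1),
]

# Fix one variant ordering, shared by every sample, so vectors are comparable.
variants = sorted({v for _, v, _ in records})

# Group genotypes by sample.
by_sample = defaultdict(dict)
for sample, variant, count in records:
    by_sample[sample][variant] = count

# Emit one consistently ordered vector per sample.
vectors = {s: [g[v] for v in variants] for s, g in by_sample.items()}

print(vectors)
# {'Sample1': [0, 1], 'Sample2': [2, 1]}
```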
Procedure: Train K-Means
- 10 iterations
- 3 clusters

Sample { Sample (ID), Vector[Genotype] } → feature Vectors
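The talk trains MLlib's K-Means (Euclidean metric) on these sample vectors. For illustration only, here is a minimal pure-Python version of Lloyd's algorithm on small genotype-like vectors; all names and data below are ours:

```python
import random

def kmeans(points, k, iterations=10, seed=0):
    """Minimal Lloyd's K-Means with the (squared) Euclidean metric."""
    centers = random.Random(seed).sample(points, k)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # Update step: move each center to the mean of its cluster.
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers

def predict(centers, p):
    d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
    return d.index(min(d))

# Two well-separated groups of 0/1/2 genotype vectors.
samples = [(0, 0, 1), (0, 1, 1), (2, 2, 1), (2, 2, 2)]
centers = kmeans(samples, k=2)
print(predict(centers, (0, 0, 1)) == predict(centers, (0, 1, 1)))  # True
```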
Results
~ 100,000 variants
Confusion matrix (populations × clusters):
       #0   #1   #2
GBR     0    0   89
ASW    54    0    7
CHB     0   97    0
The procedure reconstructs the actual populations.
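The confusion matrix cross-tabulates known population labels against predicted cluster ids. A sketch of how such a table can be computed (toy assignments, not the real 1000 Genomes results):

```python
from collections import Counter

def confusion(labels, clusters):
    """Count (population label, cluster id) co-occurrences."""
    return Counter(zip(labels, clusters))

# Toy assignments for five samples.
labels   = ["GBR", "GBR", "ASW", "ASW", "CHB"]
clusters = [2, 2, 0, 2, 1]

matrix = confusion(labels, clusters)
print(matrix[("GBR", 2)])  # 2
print(matrix[("ASW", 0)])  # 1
```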
Results
Performance (cluster size)
                               2 NODES   20 NODES (*)
Cluster launch                 10 min    30.0 min
Count chr22 genotypes (S3)     6 min     1.1 min
Save chr22 from S3 to HDFS     26 min    3.5 min
Count chr22 genotypes (HDFS)   10 min    1.4 min

(*) Cluster size / number of partitions not optimal here: 80 cores / 114 partitions
Results
Performance (cluster size)
121,023 variants          2 NODES   20 NODES
Missing data (collect)    7.8 min   33 sec
Train (10 iter)           2.1 min   28 sec
Predict (collect)         8 sec     2 sec
Results
Performance, 20 nodes (data size)

                          121,023 variants   491,222 variants
Missing data (collect)    33 sec             3.7 min
Train (10 iter)           28 sec             1.6 min
Predict (collect)         2 sec              25 sec
On the bench: Global Alliance for Genomics and Health (GA4GH)
http://genomicsandhealth.org/
http://ga4gh.org/
- A framework for responsible data sharing
- Defines schemas
- Defines services for interoperability
On the bench: GA4GH schemas
On the bench: GA4GH Google implementation
On the bench: GA4GH compliant & scalable server
Open source and available on GitHub: https://github.com/med-at-scale/high-health
PRs are welcome!
On the bench
- Methods grouped in microservices
- GA4GH & custom methods
Thank you
BiobankCloud, KTH (Jim Dowling); UC Berkeley AMPLab, bdgenomics.org team (Frank Nothaft, Matt Massie); Cloudera (Uri Laserson)
Hey…
Come back tomorrow morning → for demos
And afternoon → to hack on it!