© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Ben Langmead, PhD, Johns Hopkins
Angel Pizarro, AWS Scientific Computing
June 20, 2016
Genomics at Scale: Using the AWS Cloud for Population-Scale Analysis of Genomics and Life Science Data
Agenda
• Overview of Amazon Elastic MapReduce (Amazon EMR)
• Review of Rail-RNA
• More EMR for Science!
• Q&A
Challenges with in-house infrastructure
• Fixed cost: capacity is always on, static, and not scalable
• Slow deployment cycle: outages impact production during upgrades, and capacity is not self-serve
• Tightly coupled storage and compute: storage grows along with compute, but compute requirements vary
• The result: underutilized or scarce resources
[Chart: flat provisioned capacity vs. fluctuating utilized capacity over time, showing steady state, weekly peaks, and reprocessing bursts, with large pockets of underutilized capacity.]
Amazon EMR
• Managed Apache Hadoop platform
• MapReduce, Apache Spark, Presto
• Launch a cluster in minutes
• Open source distribution and MapR distribution
• Leverage the elasticity of the cloud
• Baked-in security features
• Pay by the hour and save with Spot
Why Amazon EMR?
• Easy to use: launch a cluster in minutes
• Low cost: pay an hourly rate
• Elastic: easily add or remove capacity
• Reliable: spend less time monitoring
• Secure: manage firewalls
• Flexible: customize the cluster
Decouple storage and compute
Amazon S3 is your persistent data store
• 11 9’s of durability
• $0.03 / GB / month in US-East
• Lifecycle policies
• Versioning
• Distributed by default
• EMRFS
Amazon S3
Why is Amazon S3 good for Genomics Data?
• No limit on the number of objects
• Object size up to 5 TB
• Pay only for exactly what you use
• Very high bandwidth
• Durable
• Fine-grained and time-bounded security
• Supports versioning & lifecycle policies
• Storage tiers for better cost, based on access patterns
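To make the pay-for-what-you-use point concrete, here is a quick cost sketch using the $0.03/GB-month US-East figure quoted earlier in this deck (actual prices vary by region and storage tier; the 50 TB dataset size is purely illustrative):

```scala
// Rough S3 cost sketch: storage is billed per GB-month with no up-front
// capacity purchase. The rate below is the US-East figure from this deck;
// check current pricing and storage tiers before relying on these numbers.
object S3CostSketch {
  val ratePerGBMonth = 0.03

  def monthlyCost(gigabytes: Double): Double = gigabytes * ratePerGBMonth

  def main(args: Array[String]): Unit = {
    // 50 TB of aligned reads = 51,200 GB
    val tb50 = 50 * 1024.0
    println(f"50 TB for one month: $$${monthlyCost(tb50)}%.2f") // ≈ $1536.00
  }
}
```

Contrast this with in-house storage, where that 50 TB must be provisioned (and paid for) whether or not it is in use.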
Amazon EMR File System (EMRFS)
• Allows you to leverage Amazon S3 as a file system
• Streams data directly from Amazon S3
• Uses HDFS for intermediates
• Better read/write performance and error handling than open source components
• Consistent view: read-after-write consistency
• Support for encryption
• Fast listing of objects
Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 'samples/pig-apache/input/';
Going from HDFS to Amazon S3
CREATE EXTERNAL TABLE serde_regex(
  host STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
LOCATION 's3://elasticmapreduce/samples/pig-apache/input/';
Decoupled compute and storage in practice
• Amazon S3 as the central data store
• Analyst clusters running Zeppelin (m2 and r3 instance types)
• Transient clusters running single data cleanup or ETL jobs at off-peak times (Spot Instances)
• Amazon Redshift data warehouse
• Machine learning cluster running R (c1 and c3 instance types)
http://www.langmead-lab.org
http://rail.bio
[Chart: worldwide sequencing throughput in terabases, approaching 1 Pbp, with an 18-month doubling time.]
rail-rna go elastic \
  --manifest URLsOf500Samples.txt \
  --assembly hg38 \
  --output s3://your-bucket/output_folder \
  --core-instance-type c3.2xlarge \
  --core-instance-count 20

The flags map to: input (--manifest), species (--assembly), output (--output), instance type (--core-instance-type), and instance count (--core-instance-count).
http://docs.rail.bio/dbgap/
NIH has security requirements and recommendations for analyzing “controlled access” genomic data. These protect the privacy of research subjects and are particularly concerned with data where sensitive phenotypes (e.g., disease) can ultimately be linked to a subject’s identity.
Detailed instructions on how to run your own dbGaP-compliant EMR app: docs.rail.bio/dbgap
Why is horizontal scale important?
Stages are written separately
Handoff between steps is through files
Everyone has their own “flavor” of pipeline
A variant-calling pipeline
Parallelization in the cloud
.bam files define a custom .bai index format
User-defined attributes
Typically in coordinate-sorted order
Lingua franca: file formats
(This is taken from the Picard library.) Why are we managing file handles and spilling reads to disk inside our bioinformatics methods?
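The pattern being criticized here, where each tool hand-rolls coordinate sorting with on-disk spills, looks roughly like this. This is a simplified plain-Scala sketch, not Picard's actual code; the record type, buffer size, and file format are made up for illustration:

```scala
import java.io.File
import scala.io.Source
import scala.collection.mutable.ArrayBuffer

// Minimal external-sort sketch: buffer records in memory, spill sorted runs
// to temp files when the buffer fills, then merge the runs. This is exactly
// the kind of plumbing a data platform should own, rather than every
// bioinformatics tool re-implementing it internally.
object ExternalSortSketch {
  case class Read(contig: String, pos: Int) // stand-in for a BAM record
  implicit val byCoord: Ordering[Read] = Ordering.by(r => (r.contig, r.pos))

  def sortWithSpills(reads: Iterator[Read], maxInMemory: Int): Seq[Read] = {
    val runs = ArrayBuffer[File]()
    val buffer = ArrayBuffer[Read]()

    def spill(): Unit = {
      val f = File.createTempFile("run", ".txt"); f.deleteOnExit()
      val w = new java.io.PrintWriter(f)
      buffer.sorted.foreach(r => w.println(s"${r.contig}\t${r.pos}"))
      w.close(); runs += f; buffer.clear()
    }

    reads.foreach { r =>
      buffer += r
      if (buffer.size >= maxInMemory) spill()
    }
    if (buffer.nonEmpty) spill()

    // Naive merge: re-read every run and sort again. A real implementation
    // would stream a k-way merge to keep memory bounded.
    runs.toSeq
      .flatMap { f =>
        Source.fromFile(f).getLines().map { line =>
          val Array(c, p) = line.split("\t"); Read(c, p.toInt)
        }
      }
      .sorted
  }
}
```

Multiply this boilerplate by every stage of every pipeline "flavor" and the case for a shared platform becomes clear.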
Where is “The Platform”?
Things fall apart when our computation changes
Flat files are a blocker to population-scale genomics
[Diagram: a Spark DAG with RDDs A–F across Stages 1–3, connected by map, join, filter, and groupBy transformations; cached partitions highlighted.]
Apache Spark
• A fast and general engine for large-scale data processing
• Massively parallel
• Uses DAGs instead of map-reduce for execution
• Minimizes I/O by storing data in Resilient Distributed Datasets (RDD) in memory
• Partitioning-aware to avoid network-intensive shuffle
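The transformation vocabulary in the DAG above (map, filter, groupBy, join) can be sketched with plain Scala collections. The hypothetical pipeline below is only a single-machine stand-in: with Spark, the same chain would be lazy, partitioned across the cluster, and cacheable between stages:

```scala
// Plain-Scala stand-in for an RDD pipeline: map -> filter -> groupBy.
// Input: (contig, depth) pairs; output: summed depth per contig for
// positions passing a coverage threshold. Values are illustrative only.
object DagSketch {
  def pipeline(depths: Seq[(String, Int)]): Map[String, Int] =
    depths
      .map { case (contig, d) => (contig, d + 1) } // map: adjust depth
      .filter { case (_, d) => d >= 10 }           // filter: drop low coverage
      .groupBy(_._1)                               // groupBy: per-contig
      .map { case (contig, xs) => contig -> xs.map(_._2).sum }
}
```

On an RDD, each of these calls would add a node to the execution DAG rather than materialize an intermediate collection, which is what lets Spark minimize I/O between stages.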
Bioinformaticians ❤️ probabilistic models
Many bioinformatics methods are just large sums
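For example, a naive genotype-likelihood calculation reduces to a sum of per-read log probabilities, an operation that parallelizes trivially. The error rates and match/mismatch model below are hypothetical simplifications for illustration, not any specific caller's method:

```scala
// A toy "large sum": log-likelihood of a candidate allele given
// independent reads. Each read either matches the allele or not, with a
// per-read error rate (as might be derived from base quality).
object LikelihoodSketch {
  // log P(read | allele): a match contributes log(1 - e), a mismatch log(e)
  def readLogLik(matches: Boolean, errorRate: Double): Double =
    if (matches) math.log(1 - errorRate) else math.log(errorRate)

  // The whole computation is a sum over reads, so reads can be scored on
  // different partitions and the partial sums combined.
  def logLik(reads: Seq[(Boolean, Double)]): Double =
    reads.map { case (m, e) => readLogLik(m, e) }.sum
}
```

Because summation is associative, frameworks like Spark can compute such quantities with a map followed by a reduce, with no shared state between workers.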
Spark + Genomics = ADAM
• Hosted at Berkeley and the AMPLab
• Apache 2 License
• Contributors from both research and commercial organizations
• Core spatial primitives, variant calling
• Avro and Parquet for data models and file formats
What is ADAM?
ADAM is a genomics analysis platform with specialized file formats.
Built using Apache Avro, Apache Spark, and Parquet.
GitHub repository: https://github.com/bigdatagenomics/adam
adam-submit
Process VCF from 1000 Genomes Public Data Set
VCF files are located in the public S3 bucket s3.amazonaws.com/1000genomes.
Use vcf2adam to convert a single VCF into multiple ADAM files (gzipped Apache Parquet):

$ adam-submit vcf2adam <vcf file on HDFS> <target HDFS folder>

Example: a single VCF file generates more than 690 .gz.parquet files.
Process VCF files from 1000 Genomes Cont.
Use Scala to query the genome data in the ADAM Parquet files:

...
val gnomeDF = sqlContext.read.parquet("/user/hadoop/adamfiles/part-r-00000.gz.parquet")
gnomeDF.printSchema()
gnomeDF.registerTempTable("gnome")
val gnome_data = sqlContext.sql("select count(*) from gnome")
gnome_data.show()
...