AWS re:Invent 2016: Building a Platform for Collaborative Scientific Research on AWS (LFS301)
TRANSCRIPT
© 2016, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Stephen Terrell, Senior Cloud Engineer, Human Longevity Inc.
Ryan Ulaszek, Cloud Architect, Human Longevity Inc.
Lance Smith, IT Director, Celgene
Patrick Combes, Principal Solution Architect, AWS
November 28, 2016
Building a Platform for Collaborative
Scientific Research on AWS
LFS301
Collaborative Scientific Research
The past decade has seen tremendous growth in collaborative research.
"New collaboration patterns are changing the global balance of science. Established superpowers need to keep up or be left behind…" (Jonathan Adams, Nature 490, 335–336, 18 October 2012, doi:10.1038/490335a, published online 17 October 2012)
"…in the last decade, thousands of researchers, pharmaceutical and biotechnology companies, government regulators, payers, clinicians and patients have come together in more than 100 multi-stakeholder collaborations to solve some specific shared problem…" (Carol Cruzan Morton, The Science of Collaboration, Center for Biomedical Innovation, August 29, 2013)
AWS as a Home for Collaborative Research
• Onsite
• Meet in the middle
• Multi-site, distributed collaborations
• DevOps: coordinated development, deployment, and operation
What to Expect from the Session
• About Human Longevity Inc.
• Challenges we faced
• The solution
• Our journey
• Summary
The Next Frontier in Medicine
OUR MISSION: Changing the practice of medicine, making it predictive, preventative, and genomics-based. Ultimately, our goal is to extend the healthy, high-performance, and productive life span.
Medicine as Data Science
• Descriptive medicine: traditional medical record, ~3.5 GB; digital health record, ~150 GB
• Deep representation of systems (N-of-1): 10,000 genomes, ~1 PB
• N-of-thousands: human virome, 1 trillion similarity searches
• Future: 10 million genomes, ~1 EB
Challenges
• Redundant infrastructure consuming time and resources
• Complexity and scale of data compound the problem
• Commercializing bioinformatics innovation is painful and slow
• Significant storage and compute costs
The Solution is a Common Platform
• Create a pluggable common platform for genomics pipelines using AWS managed services
• Leverage the platform to simplify pipelines and optimize for cost
• Ease and accelerate the transition to production by standardizing on a common platform
• Move to continuous delivery to go faster with quality
The Journey
Building a common genomics platform was an iterative process.
There were five key steps along the way.
Step 1: Up and running in a sprint
Get a simple genomics pipeline up and running in two weeks so downstream teams can start iterating.
Up and Running in a Sprint
Poll a queue for a sample and run the bioinformatics tools in sequence on an instance, using the tools baked into the AMI.
(Architecture: AWS OpsWorks Stack, Sample Queue, Amazon S3)
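The worker loop described above can be sketched as follows. The queue contents, sample names, and tool names here are hypothetical stand-ins; the real worker polled an SQS queue and shelled out to tools baked into the AMI.

```python
# Minimal sketch of the step-1 worker: poll a queue for a sample and run
# the bioinformatics tools in a fixed sequence on the instance.
from collections import deque

def run_tool(tool, sample, results):
    """Stand-in for invoking a tool binary baked into the AMI."""
    results.append(f"{tool}({sample})")

def worker(queue, tools):
    """Drain the sample queue, running every tool in order on each sample."""
    results = []
    while queue:
        sample = queue.popleft()   # in production: SQS ReceiveMessage
        for tool in tools:         # fixed sequence, no per-step optimization
            run_tool(tool, sample, results)
    return results

processed = worker(deque(["sample-001", "sample-002"]),
                   ["align", "call-variants"])
```

The fixed `for tool in tools` sequence is exactly the drawback noted below: every step runs on the same instance type, so the workload cannot be optimized per step.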
Up and Running in a Sprint
Benefits
✓ Easy to implement and worked well as a starting point
✓ Easy to auto-scale instances
Drawbacks
✗ Pain around manually building and updating the AMI
✗ Writing our own workflow engine
✗ Can't optimize for workload at each step
✗ No cost optimization since using on-demand instances
The Journey
Step 2: Adapting to tool change
Difficult to accommodate constant bioinformatics tool changes in the pipeline.
Adapting to Tool Change
Migrate to tools running in Docker containers on Amazon EC2 instances in AWS OpsWorks.
(Architecture: OpsWorks Stack, Sample Queue, S3, ECR)
Adapting to Tool Change
Benefits
✓ Easy to auto-scale instances
✓ Easy to accommodate tool changes
Drawbacks
✗ Pain around manually building and updating the AMI
✗ Writing our own workflow engine
✗ Can't optimize for workload at each step
✗ No cost optimization since using on-demand instances
✗ Painful to support Docker on EC2 instances ourselves
✗ Pain around versioning and deploying images
The Journey
Step 3: Report pipeline platform
Need to accommodate additional report pipelines and their increasing complexity.
Report Pipeline Platform
The AWS Flow Framework for Ruby is a collection of convenience libraries that make it faster and easier to build applications with Amazon SWF.
SWF serves as a fully managed state tracker and task coordinator in the cloud.
Migrate the workflow solution to SWF and the AWS Flow Framework for Ruby.
(Architecture: OpsWorks Stack, S3, ECR, SWF, Launcher Topic)
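The talk's implementation used the Ruby Flow framework; purely as an illustration of the pattern SWF provides (a decider inspects the workflow's event history and schedules the next activity, while workers execute activities), here is a minimal in-memory sketch in Python. The step names are hypothetical.

```python
# In-memory illustration of SWF's decider/activity split.
# Real deciders and activity workers poll Amazon SWF for tasks; this only
# models the control flow that the Flow framework hides.

STEPS = ["extract", "annotate", "report"]   # hypothetical report-pipeline steps

def decide(history):
    """Given completed events, pick the next activity (the decider's job)."""
    done = {e["activity"] for e in history if e["status"] == "completed"}
    for step in STEPS:
        if step not in done:
            return step          # schedule this activity next
    return None                  # all steps done: close the workflow

def run_workflow():
    history = []
    while (step := decide(history)) is not None:
        # An activity worker would execute the task and report completion;
        # SWF durably records that event in the history.
        history.append({"activity": step, "status": "completed"})
    return [e["activity"] for e in history]

order = run_workflow()
```

Because SWF owns the history, a crashed worker loses no state: the next decision replays the recorded events and schedules only what remains.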
Report Pipeline Platform
Benefits
✓ Easy to accommodate tool changes
✓ Easy to auto-scale instances
✓ Easy to accommodate new pipelines and run steps in parallel
✓ Easy to handle failures at each step
Drawbacks
✗ Can't optimize for workload at each step
✗ No cost optimization since using on-demand instances
✗ Painful to support Docker on EC2 instances ourselves
✗ Pain around versioning and deploying images
✗ Complex workflows could not be supported
The Journey
Step 4: Docker pipeline platform
Many redundant workflow systems that are suboptimal for cost and performance.
Bioinformatics scientists are bogged down in infrastructure development.
Docker Pipeline
• Amazon SWF: track steps within a workflow
• Flow: orchestrate steps within a workflow
• Spot Fleet: provision spot instances
• Amazon ECS: run steps on a cluster
• Amazon DynamoDB: register pipelines and tasks
• AWS Lambda: glue everything together
Docker Pipeline
$ dpl task register
Define resource requirements for a tool and register that tool, which makes it available for use in pipelines.
$ dpl pipeline register
Define a pipeline that uses tasks already registered within the system.
$ dpl pipeline run
Run a pipeline by name, with any arguments that are needed.
ancestry-user$ dpl task register task.json
{
  task_id: ancestry_1.0.5_container
}
pipeline-user$ dpl pipeline register
{
  pipeline_id: demo_1.0.1
}
pipeline-user$ dpl run demo
{
  "domain": "dpl-ind-a",
  "pipelineId": "demo_1.0.0",
  "runId": "23BFVdVVBvOvGOebScSv7…",
  "workflowId": "5079851a-8802-4141-b9…"
}
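As a rough model of what sits behind these commands, the sketch below mimics the three dpl operations with in-memory dictionaries. The real platform stored registrations in DynamoDB and `dpl run` started an SWF workflow; the field names and resource parameters here are assumptions, while the `name_version` ID pattern follows the outputs shown above.

```python
# Toy model of the dpl task/pipeline registries and run command.
import uuid

tasks, pipelines = {}, {}   # DynamoDB tables in the real platform

def register_task(name, version, cpu, memory_mib):
    """`dpl task register`: store a tool's resource requirements."""
    task_id = f"{name}_{version}_container"
    tasks[task_id] = {"cpu": cpu, "memory_mib": memory_mib}
    return task_id

def register_pipeline(name, version, task_ids):
    """`dpl pipeline register`: a pipeline may only use registered tasks."""
    missing = [t for t in task_ids if t not in tasks]
    if missing:
        raise ValueError(f"unregistered tasks: {missing}")
    pipeline_id = f"{name}_{version}"
    pipelines[pipeline_id] = task_ids
    return pipeline_id

def run(pipeline_id):
    """`dpl run`: the real command starts an SWF workflow execution."""
    if pipeline_id not in pipelines:
        raise KeyError(pipeline_id)
    return {"pipelineId": pipeline_id, "runId": uuid.uuid4().hex}

ancestry = register_task("ancestry", "1.0.5", cpu=4, memory_mib=8192)
demo = register_pipeline("demo", "1.0.1", [ancestry])
result = run(demo)
```

Registering resource requirements per task is what lets the platform place each step on an appropriately sized instance, addressing the per-step optimization drawback of the earlier iterations.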
Docker Pipeline
(Architecture: OpsWorks Stack, SWF, ECR, ECS, Spot Fleet, Lambda, SQS, S3)
• Pipeline Definition: $ dpl pipeline register → Pipeline Registry
• Task Definition: $ dpl task register → Task Registry
• Tool: $ docker push → Image (ECR)
• $ dpl run kicks off the workflow
Docker Pipeline
Benefits
✓ Easy to accommodate tool changes
✓ Easy to auto-scale instances
✓ Easy to accommodate new workflows and run steps in parallel
✓ Easy to optimize for workload and handle failures at each step
✓ Easy to accommodate complex workflows
✓ Easy to share tools across pipelines; greatly simplifies pipeline definition
Drawbacks
✗ Pain around versioning and deploying images
✗ Pain around versioning and deploying the platform
The Journey
Step 5: Go faster with continuous delivery
Deploying changes frequently was very time consuming and risky.
Go Faster with Continuous Delivery
Continuous delivery at HLI: automation, integration testing, push-button to prod.
Go Faster with Continuous Delivery
Continuous integration produces the latest build, which AWS CodePipeline promotes through three stages:
• Dev: 1. Deploy (AWS CodeDeploy); 2. Smoke Test (AWS Lambda)
• Int: 1. Deploy (AWS CodeDeploy); 2. Integration Test (AWS Lambda)
• Stage: 1. Blue/Green Deploy (AWS CodeDeploy); 2. Integration Test (AWS Lambda); 3. Approval to Prod (Amazon SNS)
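The test gates above can be modeled as a function that CodePipeline invokes via Lambda and that reports a verdict back, which determines whether the build is promoted to the next stage. The check names and verdict shape below are hypothetical; the real functions exercised the pipeline platform's own APIs.

```python
# Sketch of a post-deploy test gate in the CD pipeline. A smoke test is a
# small set of named checks against the freshly deployed stage; any failure
# blocks promotion.

def smoke_test(checks):
    """Evaluate named checks; return a promote/block verdict."""
    failures = sorted(name for name, passed in checks.items() if not passed)
    if failures:
        # In the real pipeline the Lambda would call PutJobFailureResult.
        return {"result": "FAILURE", "failedChecks": failures}
    # ...and PutJobSuccessResult here, letting CodePipeline promote.
    return {"result": "SUCCESS"}

verdict = smoke_test({"queue-reachable": True, "worker-healthy": True})
```

Running the same gate shape at Dev, Int, and Stage (with progressively deeper checks) is what makes the final promotion to prod a push-button, SNS-approved step rather than a risky manual deploy.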
Go Faster with Continuous Delivery
Benefits
✓ Easy to accommodate tool changes
✓ Easy to auto-scale instances
✓ Easy to accommodate new workflows and run steps in parallel
✓ Easy to optimize for workload and handle failures at each step
✓ Easy to accommodate complex workflows
✓ Easy to deploy platform changes into production daily
Drawbacks
✗ Pain around versioning and deploying images
✗ Pain around versioning and deploying the platform
Summary
• Dramatic simplification in pipeline complexity: from 2 KLOC to 20 LOC + config
• Significant reduction in time to generate reports: from weeks to hours, fully automated
• Significant cost savings with spot: from $32 to $6 per report
• Daily deployments of platform changes to production environments: from weeks or months to daily
• Dramatically easier handoff between bioinformatics and engineering: from code to configuration
Next Steps
• Create framework to simplify tool development
• Support running step workload on Spark cluster
Celgene Research
Collaboration Environments
Agenda
• Company & Industry Trends
• Collaboration Models
• Configuration & Security
• Lessons Learned & Tips
About Celgene
Biotech focused on cancer and inflammatory diseases with 300+ clinical
trials in progress. Major products include Otezla, Revlimid, Thalomid, and
Pomalyst.
Scope
• Discovery research
• Clinical development
• Drug manufacturing
• Sales/Distribution
Scale
• ~7000 employees
• ~60 sites globally
Industry Trends
• Collaborations and Partnerships
• Even faster paced R&D
• Scale
• Cloud native solutions
Myeloma Genome Project on AWS
For more information, please contact [email protected]
Many Collaboration Systems, 2 Models
Managed Software / SaaS
• COTS platform for end users
• Web GUI
• Lab data
HPC Collaboration
• Raw IaaS for developer-users
• API / shell access
• Petabyte scale
Collaboration Structure
(Diagram: CRO and vendor partners across many collaborations)
• Two collaboration models
• Multi-AWS-account
• AWS + management is the same
Example MSaaS Collaboration Architecture
(Architecture: users, WAF, web servers, SSO, plugin pipeline, data/extraction/transaction processing, Amazon SQS, Amazon S3, RDS, logging, platform services)
Example HPC Collaboration Architecture
(Architecture: VPC subnets, Auto Scaling, CloudWatch, SQS, CloudTrail, EFS, bastion, S3, VPC endpoint)
Connectivity
• Multi-account model + VPC
• Connectivity options
• Big decision factors
Multi-Account Model
• Isolation of workloads
• Ease of management
• Guardrails tool: TurbotHQ.com
Hardened AWS Environment
• Network controls
• Object storage controls
• Credentials
• Auditing
Services Used
All collaborations use:
• EC2 / ECS
• S3 / Glacier
• EFS
• VPC + Direct Connect
Regions used:
• us-east-1
• us-west-1
• eu-central-1
• us-west-2*
Primary Reasons for AWS
• Speed of deployment
• Security / isolation
• Elastic nature supports unknown requirements
Access for MSaaS Collaborations
• COTS software
• Vendor API access
• User access via app/SSO
• Roles for app
• Account isolation
IAM Policy: SQS access
{
  "Effect": "Allow",
  "Action": [
    "sqs:GetQueueAttributes",
    "sqs:GetQueueUrl",
    "sqs:PurgeQueue",
    "sqs:SendMessage",
    "sqs:ReceiveMessage",
    "sqs:DeleteMessage"
  ],
  "Resource": "arn:aws:sqs:us-west-1:111122223333:a-queue"
}
IAM Policy: Not allowed!
{
  "Effect": "Allow",
  "Action": "ec2:*",
  "Resource": "*"
}
User Access for HPC/IaaS Collaborations
• Software "type"
• User access
• IAM policies
• AWS Console
User Access for S3 Buckets
• Automated security
• Business rules
• Data sciences / management
Bucket Policy: Server-Side Encryption
For AWS-provided keys (SSE-S3):
{
  "Sid": "ObjectsMustBeEncryptedAtRest",
  "Effect": "Deny",
  "Principal": "*",
  "Action": "s3:*",
  "Resource": [
    "arn:aws:s3:::example-bucket/*",
    "arn:aws:s3:::example-bucket"
  ],
  "Condition": {
    "StringNotEquals": {
      "s3:x-amz-server-side-encryption": "AES256"
    }
  }
}
For a KMS-provided key (SSE-KMS):
"s3:x-amz-server-side-encryption": "aws:kms"
For a customer-provided key (SSE-C):
"s3:x-amz-server-side-encryption-customer-algorithm": "AES256"
Bucket Policy: Require HTTPS
{
  "Sid": "ObjectsMustBeEncryptedInTransit",
  "Effect": "Deny",
  "Principal": "*",
  "Action": "s3:*",
  "Resource": [
    "arn:aws:s3:::example-bucket/*",
    "arn:aws:s3:::example-bucket"
  ],
  "Condition": {
    "Bool": {
      "aws:SecureTransport": false
    }
  }
}
Only from select IPs:
"Condition": {
  "NotIpAddress": {
    "aws:SourceIp": [
Or from an endpoint:
"Condition": {
  "StringNotEquals": {
    "aws:SourceVpce": "vpce-1a2b3c4d"
Or a VPC (if multiple endpoints):
"aws:SourceVpc": "vpc-111bbb22"
IAM Policy: Limit keys (aka "folder" location)
{
  "Sid": "LimitListBucketForUsers",
  "Effect": "Allow",
  "Action": "s3:ListBucket",
  "Resource": "arn:aws:s3:::example-bucket",
  "Condition": {"StringLike": {"s3:prefix": "${aws:username}/*"}}
},
{
  "Sid": "ObjectsActionsForUsersInHome",
  "Effect": "Allow",
  "Action": [
    "s3:PutObject",
    "s3:GetObject",
    "s3:DeleteObject",
    "s3:Abort*"
  ],
  "Resource": "arn:aws:s3:::example-bucket/${aws:username}/*"
}
Collaboration Data / IP / Code
Code
• Enterprise GitHub
Data
• Data retention policy
• Data science team
Collaboration environment
• Long term
• Open source
Lessons Learned / Tips
Use cloud best practices:
• Expect failure
• Automate
• Use services (as intended)
• Data transfer / errors
Soft lessons:
• Past enterprise experience
• Vendors
• Users
• Get buy-in
AWS Summary
Similar advantages:
• Rapid infrastructure deployment
• Isolated work areas
• Common components drawn into a larger reusable framework
• Elastic resources: accommodate any size workload
• Accessible: reach the infrastructure from anywhere in the world
…driving toward reliable and *reproducible* collaborative science at a scale previously unachievable.
Thank you!
Remember to complete
your evaluations!