National Cancer Institute October 6, 2009 Jack R. Collins, Ph.D. Director, Advanced Biomedical Computing Center National Cancer Institute Frederick, Maryland, USA NCI-Frederick/SAIC-Frederick, Inc. Applying HPC to Biology: The Digital Age


National Cancer Institute

October 6, 2009
Jack R. Collins, Ph.D.
Director, Advanced Biomedical Computing Center
National Cancer Institute
Frederick, Maryland, USA

NCI-Frederick/SAIC-Frederick, Inc.

Applying HPC to Biology: The Digital Age

ABCC Mission Summary

Provide high performance computational resources to the NCI/NIH biomedical community

Provide storage, backup, network, access control, system administration and security functionalities to NCI/NCI-F

Support NCI initiatives including imaging, bioinformatics, proteomics, and nanobiology

Provide new computational technologies for application to biomedical problems

Time to solution must be measured in heartbeats

In 2008, one person is expected to die from cancer every 56 seconds in the United States. HPC must enable scientists to impact cancer treatment.

Paradigm Shift in Biology

Computers are getting fast enough and we are now collecting enough data that we can begin to generate reasonable models that can be tested and refined to better mimic reality.

If an approximate model can help “refine” 10% of the high-throughput (HT) experiments at NCI, it could save over $1M per year in consumables and accelerate scientific understanding.

Computer Science is starting to notice: (Many recent articles in ACM/IEEE journals.)

NCI Vision for Translational Research

Figure labels (diagram of NCI translational research components): Functional Biology Ctr.; High-throughput target screening; Chemical Biology Consort.; GMP production; Preclinical testing; Academic res. labs; Private sector; CLOUD (Patient data, Science data); TCGA; TARGET; CGEMS; caHUB; caBIG; BigHEALTH; Ca eHR; Characterization center; University ca ctr.; NCCCP; SPOREs; CCOPs; Coop. Grps.; Clinical Ctr.; Causal pathways; Tissue; Patient selection; Grantee consortia; Imaging; Sequencing and Microarray; Nanotechnology; Proteomics; HIV Drug Resistance; CAPR; GEMs; Molecular Structure; caBIG®

Data Driven Computation
Integration and Understanding is key

Next-Gen Sequencing

Metabolomics Structural Biology

Epigenomics Regulatory Networks

Nanotechnology

Micro-array Protein Pathways

Drug Design (traditional)

Comparative Genomics GWAS

Systems Biology

Data Analytics Pattern Recognition

Proteomics Image Analysis / Visualization

Clinical Outcome

Examples

• Genomics
• Imaging
• Nano-structures and Properties

“Next-Generation” Sequencing Technologies

Not just one.

But “farms” in multiple labs.

One Illumina paired-end run generates ~7 TB of raw data.

NextGen Data “Tsunami”

Chart: projected NextGen sequencing data volume by year, 2009 to 2015. Y-axis: TB of data, 0 to 10,000. Throughput annotations: 20, 50, 100, 300, 500, 1000, and 2500 Mb/hr.

SOLiD™

Aligning Traces to reference

ABCC—NextGen Sequencing Support

Polymorphism Analysis

The Cancer Genome Atlas

• Cancer Genome Atlas Gets $275M Funding from Stimulus, NCI and NHGRI (October 01, 2009)

• NEW YORK (GenomeWeb News) – The Cancer Genome Atlas project will receive a total of $275 million over the next two years to fund genomic mapping of more than 20 types of cancer.

• The $175 million in ARRA funding announced Wednesday by President Barack Obama will be buttressed by an additional $100 million from the National Cancer Institute and the National Human Genome Research Institute.

• Obama said in a speech at the National Institutes of Health's campus on Wednesday that genomics and human genetics research have begun to generate hope for cancer treatments, but, "We've only scratched the surface of these kinds of treatments, because we've only begun to understand the relationship between our environment and genetics in causing and promoting cancer."

• Over the two-year period, TCGA plans to collect more than 20,000 tissue samples from more than 20 cancer types, complete maps of the genomic changes in 10 of those cancers, and sequence and characterize at least 100 tumors of as many as 15 additional cancers. These maps will be deposited into public databases for use by the worldwide research community in research programs aimed at finding new ways to diagnose, treat, and prevent cancer.

TCGA Storage / Compute Requirements

− 600 GB per patient per disease
− 500 patients per disease
   • 300 TB of data per disease
− 20 cancer types
   • 6 PB of primary data

• Data Annotation
• Data Integration
• Analysis of high-dimensional data for patterns
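The storage figures above follow from simple multiplication. A minimal sketch of that arithmetic, assuming decimal units (1 TB = 1000 GB, 1 PB = 1000 TB); the function name `tcga_storage` is illustrative and not from the talk:

```python
# Back-of-the-envelope check of the TCGA storage figures quoted on the slide.
# Assumption: decimal units (1 TB = 1000 GB, 1 PB = 1000 TB).

GB_PER_PATIENT = 600        # from the slide
PATIENTS_PER_DISEASE = 500  # from the slide
CANCER_TYPES = 20           # from the slide


def tcga_storage(gb_per_patient=GB_PER_PATIENT,
                 patients=PATIENTS_PER_DISEASE,
                 cancer_types=CANCER_TYPES):
    """Return (TB of data per disease, PB of primary data overall)."""
    tb_per_disease = gb_per_patient * patients / 1000
    pb_total = tb_per_disease * cancer_types / 1000
    return tb_per_disease, pb_total


if __name__ == "__main__":
    tb, pb = tcga_storage()
    print(f"{tb:.0f} TB per disease, {pb:.0f} PB of primary data")
```

This covers primary data only; annotation, integration, and derived analysis products would add to the total.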

Google as an HPC model

But I don’t want just results, I want relationships between my results based on ontologies and other metrics.

Analyzing High-dimensional Data

o Complex Task: Computer Scientists / Mathematicians Needed!
o Non-intuitive properties (e.g. Mario Valle, 2008) … so Efficient Methods/Algorithms Needed!
o Appropriate Computing Platforms (memory, multi-core, cell, FPGA, GPGPU, ?)
o An Interface to Utilize the Compute Platform (Programming Model for mortals with finite time)
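One commonly cited example of a non-intuitive property of high-dimensional spaces (not necessarily the one shown on the slide) is that a ball inscribed in a cube occupies a vanishing fraction of the cube's volume as the dimension grows, so almost all of the volume sits in the corners. A small sketch using the standard formula for the volume of a unit ball; the function name is illustrative:

```python
import math


def ball_to_cube_volume_ratio(dim):
    """Volume of the unit-radius ball divided by the volume of the
    enclosing cube of side 2, in `dim` dimensions."""
    ball = math.pi ** (dim / 2) / math.gamma(dim / 2 + 1)
    cube = 2.0 ** dim
    return ball / cube


if __name__ == "__main__":
    for dim in (2, 3, 10, 20, 50):
        ratio = ball_to_cube_volume_ratio(dim)
        print(f"d = {dim:2d}  ball/cube volume ratio = {ratio:.3e}")
```

In two dimensions the ratio is pi/4; by ten dimensions it is already well below one percent, which is one reason intuition from 2D and 3D plots can mislead when analyzing high-dimensional data.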

NCI In Silico Research Centers

Supporting investigator-initiated, hypothesis-driven research into the etiology, treatment, and prevention of cancer using in silico methods
• Generating and publishing novel cancer research findings by mining existing data resources such as TCGA
• Identifying novel bioinformatics processes and tools to exploit existing data resources

Advocacy for and input into caBIG enhancements
• Integration and interoperability of data and analytical services
• Infrastructure

• NCI investing in in silico research pilot over next 3 years
• Five extramural and one intramural award

Non-B DNA and Chromosomal Rearrangement
Moving from “What (SNPs)” to “Why”

Robert D. Wells

Telomerase Targeted Anti-cancer Agents
Telomerase Discovery -> Nobel Prize in Medicine 2009

Mol. Cancer Therapeutics Vol. 1:103 (Dec 2001)

Examples

• Genomics
• Imaging
• Nano-structures and Properties

Imaging

Tumor Angiogenesis Modeling
Tumor Segmentation (GBM)

New Fluorescent Markers
Confocal Image Analysis

Cellular Imaging - Biomedical Uses
(Emphasis on oncology and personalized medicine)

• FRET studies of protein dynamics and function
• Single cell molecular profiling via antibody labeling of different proteins in different tissue sections
• Accurate delineation of the edges of tumors
• Assessment of vascularization of tumors
• Assessment of immune cell infiltration
• Localization of proliferating and apoptotic cells
• Determine sites of extra-cellular matrix degradation and cell invasion
• Investigate metastasis
• Analysis of genomic instability and gene organization using FISH labeling
• FRAP investigation of protein diffusion kinetics within cells

Green Fluorescent Protein
Teal Fluorescent Protein
Yellow Fluorescent Protein
Red Fluorescent Protein

Fluorescent Proteins - NCI/ABCC focus

Copyright 2004-2009 OLYMPUS CORPORATION All Rights Reserved

Approximately 3000 fluorescent probes for biology.
http://probes.invitrogen.com/handbook/

Protein Engineering / rational design

A priori calculation of spectroscopic characteristics due to different chromophores

A priori calculation of spectroscopic shifts due to mutations in protein.

Accurately estimate quantum yields

A priori calculation of maturation kinetics

Calculate factors in thermal stability and protein-protein interactions

Typical errors of standard quantum chemistry calculations for such systems, even in the gas phase, may amount to 0.2-0.5 eV

Errors of 50 nm for the optical range ~ 500 nm are too large

Accuracy of calculations

GREEN: S0 -> S1, 2.5 eV ~ 500 nm

Let us assume the error +0.25 eV ~ 50 nm
BLUE: S0 -> S1, 2.75 eV ~ 450 nm

Let us assume the error -0.25 eV ~ 50 nm
YELLOW: 2.25 eV ~ 550 nm
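The energy-to-wavelength figures above can be checked with the photon relation lambda = h*c / E. A minimal sketch, assuming the rounded constant h*c ≈ 1239.84 eV·nm; the function names are illustrative and not from the talk:

```python
# Convert excitation energies (eV) to wavelengths (nm) via lambda = h*c / E.
# Assumption: h*c rounded to 1239.84 eV*nm.
HC_EV_NM = 1239.84


def ev_to_nm(energy_ev):
    """Wavelength in nm of a photon with the given energy in eV."""
    return HC_EV_NM / energy_ev


def shifted_wavelength(energy_ev, error_ev):
    """Wavelength predicted when the computed energy is off by error_ev."""
    return ev_to_nm(energy_ev + error_ev)


if __name__ == "__main__":
    reference = 2.5  # eV, the green example on the slide
    for error in (0.0, +0.25, -0.25):
        nm = shifted_wavelength(reference, error)
        print(f"error {error:+.2f} eV -> {reference + error:.2f} eV ~ {nm:.0f} nm")
```

An error of a quarter of an electron volt therefore moves the predicted color by roughly 45 to 55 nm, which is enough to turn green into blue or yellow, as the slide rounds it.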

S0-S1 excitation energies for the free neutral (N) and anionic (A) chromophore of GFP

Method                         N-Chromophore              A-Chromophore
                               ΔE (eV)  λ (nm)  f         ΔE (eV)  λ (nm)  f
TDDFT/B3LYP//B3LYP/6-31+G**    3.46     358.8   0.69      4.18     296.3   0.10
                                                          3.06     405.5   0.98
TDDFT/BP86//B3LYP/6-31+G**     3.19     389.0   0.55      3.65     340.0   0.15
                                                          2.94     422.1   0.86
CIS//B3LYP/6-31+G**            4.43     279.8   1.14      6.99     177.4   0.52
                                                          3.75     330.3   1.55
ZINDO//B3LYP/6-31+G**          3.45     359.9   0.96
                                                          2.59     479.1   1.22

A-Chromophore: λ(exp) = 479 nm

GFP

mTFP1

*Hui-wang Ai et al., BMC Biology, 2008, 6:13

Crystal structure of mTFP1 (2HQK)

Absorption Fluorescence

Chromophore in GFP and mTFP1

Ultra-high Resolution Digital Pathology

Computational Cost

Chart: computational storage / cost, O(N), with increasing pixel size (intensity palette). Axis labels as extracted: 2K; Log2; 4 8 16 32 64 2568x32.

2K x 2K x 16 bit = 8 MB

200K x 200K x 48 bit = 240 GB
50 angles = 12 TB per image
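The image sizes quoted above can be reproduced with a short calculation. A sketch assuming uncompressed pixels, decimal units, and "2K" and "200K" read as 2,000 and 200,000 pixels; the function name `image_bytes` is illustrative:

```python
# Uncompressed storage for a digital pathology image.
# Assumptions: no compression, decimal units, 2K = 2,000 and 200K = 200,000.

def image_bytes(width_px, height_px, bits_per_pixel, n_views=1):
    """Bytes needed for n_views uncompressed images of the given size."""
    return width_px * height_px * bits_per_pixel // 8 * n_views


if __name__ == "__main__":
    small = image_bytes(2_000, 2_000, 16)
    large = image_bytes(200_000, 200_000, 48)
    stack = image_bytes(200_000, 200_000, 48, n_views=50)
    print(f"2K x 2K x 16 bit      = {small / 1e6:.0f} MB")
    print(f"200K x 200K x 48 bit  = {large / 1e9:.0f} GB")
    print(f"50 angles             = {stack / 1e12:.0f} TB per image")
```

Storage grows linearly with the number of pixels, so a 100-fold increase in each image dimension alone accounts for a factor of 10,000.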

Analysis of Images (antibodies)

Replica + Steered Dynamics

Ga = 7.3 kcal/mol

50 ns / trajectory

MD1, MD2, MD3, MD4, MD5, MD6

HPC Required to Differentiate Structure/Properties

Overall Computational Cost

Chart: computational storage / cost, O(N), together with modeling cost. Axis labels as extracted: 2K; Log2; 4 8 16 32 64 2568x32.

Examples

• Genomics
• Imaging
• Nano-structures and Properties

Nano-scale Modeling

Nanoparticles: DF1 vs. DF1-mini

Would you expect the biological properties to be similar?

DF1    DF1-Mini

Hemolysis, Cell Lysis
Membrane Interaction
Kidney Failure (salt, pH)
Aggregation (salt, pH)
Liver Failure (hydrophobic)
Aggregation (hydrophobic)
Immune Resp.
IgG
Anti-drug or Protective
Reactivity (OH Capture)

Critical Biological/Toxicity Differences

Structure: Non-intuitive Results Explain Toxicity

Note the large differences in exposed fullerene surface between the two 3D model structures

DF1

DF1-Mini

HPC “Compute Cloud”

NIH currently gathering requirements across all of the Institutes

Virtualization may have a bigger impact in the near term.

Data Tsunami – What do we need?
“Bioinformatics is modern biology”

Not just more data: data is more complex

Storage
• High capacity (PB) storage farms
• High-speed access to PB of storage
• Automated MetaData Extraction
• Relational Data Integration
• Distribution / Security

Network
• Data Transfer
• Collaborative Interaction
• Access to National Resources, Both Compute and Experimental

Compute
• Not Just Floating Point!
• Data Analysis / Mining for High-Dimensional PB Datasets
• New Algorithms/Software -> Hypothesis Generation
• Software/Languages to implement the algorithms in parallel …
• On Heterogeneous compute platforms (CPU, GPGPU, Cell, FPGA)

Information -> Knowledge
• Proper analysis and Visualization lead to …
• Human Understanding
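To give a feel for why data transfer and high-speed access appear in this list, the sketch below estimates how long it would take to move a petabyte-scale dataset over a network link. The 6 PB figure is the TCGA primary-data estimate from earlier in the talk; the link speeds and the `efficiency` parameter are hypothetical illustration values, not numbers from the presentation:

```python
# Rough transfer-time estimate for petabyte-scale data.
# Assumptions: decimal units; link speeds below are hypothetical examples.

def transfer_days(n_bytes, link_gbit_per_s, efficiency=1.0):
    """Days needed to move n_bytes over a link of the given nominal speed.

    efficiency is the fraction of nominal bandwidth actually achieved.
    """
    bits = n_bytes * 8
    seconds = bits / (link_gbit_per_s * 1e9 * efficiency)
    return seconds / 86_400


if __name__ == "__main__":
    six_pb = 6e15  # bytes, TCGA primary data estimate from the talk
    for gbit in (1, 10, 100):
        days = transfer_days(six_pb, gbit)
        print(f"{gbit:3d} Gb/s link: {days:7.1f} days for 6 PB")
```

Even at a sustained 10 Gb/s, moving 6 PB takes on the order of two months, which is consistent with the emphasis on keeping compute close to the storage.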

Acknowledgements

• Igor Topol
• Bob Stephens, Alex Levitsky
• Brian Luke
• Robert Wells and Albino Bacolla
• Yanling Liu, Stephen Lockett, Joe Kalen, Chris Kurcz
• Raul Cachau
• And you - Thank You

Discussion

Katsushika Hokusai, The Great Wave off Kanagawa, 1832

What do we need?

Computation:
• New algorithms for parallel computers
• Software/Languages to implement the algorithms
• Heterogeneous compute platforms (CPU, GPGPU, Cell, FPGA)

Storage and Access:
• High capacity (PB) storage farms
• High-speed access to PB of storage
• High-speed networks

Information -> Knowledge:
• Proper analysis and Visualization
• Understanding and progress

“Consolidate locally, distribute globally”

Traditional Imaging -> Quantitative Results