ben%raphael% september%12,%2011%
TRANSCRIPT
9/14/11
1
CSCI2950-‐C Topics in Computa8onal Biology
Lecture 1
Ben Raphael September 12, 2011
hDp://cs.brown.edu/courses/csci2950-‐c/
Topics in Computa8onal Biology
Computer Science
Mathema8cs & Sta8s8cs
Biology
Computa(onal biology: develop computa8onal techniques to solve biological problems.
Computa(onal molecular biology: develop computa8onal techniques to solve biological problems related to DNA, RNA, and/or proteins.
Ecology? Neuroscience? Botany?
9/14/11
2
Technology Transforming Biology Un8l late 20th Century
Hypothesis Genera8on and Valida8on
21th Century and Beyond
High throughput technologies
Hypothesis Genera8on and Valida8on
Algorithms
Human Genome Sequenced 2000-‐2003
June 26, 2000
9/14/11
3
Biology 101
Molecular Biology 101
9/14/11
4
Molecular Biology 101
Central Dogma
Molecular Biology 101
Three fundamental molecules. 1) DNA – Informa8on storage.
2) RNA – Old view: Mostly a
“messenger”. – New view: Performs many
important func8ons. 3) Proteins – Perform most cellular func8ons
(biochemistry, signaling, control, etc.)
9/14/11
5
DNA
…ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…
Double helix Two strands.
Each strand composed of sequence of covalently bonded nucleo(des (bases). Four possible nucleo8des: A, C, T, G
DNA sequence: string from 4 character alphabet
DNA 3’ 5’
5’ 3’
Bases (nucleo(des) in each strand are complementary: A ßà C, T ßàG Watson-‐Crick base-‐pairing
9/14/11
6
RNA • RNA: single stranded, 4 leDer alphabet: A, C, U, G (U = T)
• Can fold into structures due to base complementarity.
• Many classes of RNA – mRNA (messenger: transcrip8on
• DNA à RNA à protein) – tRNA (transfer RNA: transla8on
• mRNA à protein – miRNA – snoRNA – piwiRNA – Etc…
Protein
• String of amino acids: 20 leDer alphabet
• Folds into 3D structures to perform various func8ons in cells
9/14/11
7
Molecular Biology 101
Molecules of DNA, RNA, and protein interact with each other in 8me and space
Computer Science! algorithms on strings, trees, graphs
Molecules of DNA, RNA, and protein are built from an “alphabet” of “characters”.
Computa8onal Molecular Biology
1. Measurement/Technology. 2. Comparison. 3. Predic8on.
For DNA, RNA, protein:
9/14/11
8
Computa8onal Molecular Biology
Measurement Derive the DNA sequence of a human. Which RNA molecules or proteins are
present in a cell at a given 8me?
No technology to answer such ques8ons exactly.
For DNA, RNA, protein:
Measurement Technology + Algorithms
Biological knowledge
Genome Sequencing and Comparison
Human Genome Project
“Next-‐genera8on” DNA sequencing
DNA sequencing
… ACGTATTAGG …
Species
9/14/11
9
Computa8onal Molecular Biology
Comparison What are the similari8es/difference
between sequences? e.g. from human and chimpanzee? [String comparison ßà Algorithms]
For DNA, RNA, protein:
Computa8onal Molecular Biology
Predic8on What differences in DNA determine
specific traits? [Gene8cs] Which muta8ons in DNA cause cancer? Does a protein bind to DNA? To what sequence? How does the sequence of RNA or protein determine
its 3D structure? Do two proteins bind/interact? …
For DNA, RNA, protein:
9/14/11
10
Why Study Computa8onal Biology?
Interdisciplinary Biology Computer Science Mathema8cs Sta8s8cs
= FUN!
WSJ: Jan, 26, 2009.
Why choose just 1?
Course goals
• Learn applica8ons of computer science, mathema8cs, and sta8s8cs to biology.
• Learn some new CS/mathema8cs. • Learn to read and cri8que research papers. – Understand how to formulate a biological problem as a computa8onal problem.
• Undertake a research project – Course project: Use your skills to solve a biological problem.
• Write a proposal (NIH style)
9/14/11
11
Course Organiza8on
Seminar style 1) Introductory lectures by me: 30-‐50% of class mee8ngs. 2) Student-‐led paper presenta8ons and discussion. 50-‐70%
3) One computer assignment: next-‐gen sequencing data. 4) Project. Groups of 1-‐3 students. Proposal: NIH style Ini8al proposal Midterm report Final report Project presenta8on
Course Organiza8on See handout • Prerequisites • Grading • Department Requirements
9/14/11
12
Topics in Computa8onal Biology
Networks Cancer
Genomes
Genome Sequencing and Comparison
Human Genome Project
“Next-‐genera8on” DNA sequencing
DNA sequencing
… ACGTATTAGG …
Species
9/14/11
13
DNA Replica8on and Muta8on
… CGTAATTAG …
… CGTCATTAG …
Species
Gene
8c
Soma8
c
Compara8ve genomics Differences between species?
Genome Sequencing and Comparison
Personal genomics Gene8c basis for traits of individuals?
Cancer genomics Which soma8c muta8ons lead to cancer?
9/14/11
14
Course Topics
1. Measurement of genomes
Human genome: ~3 billion leDers
Algorithms
Next-‐genera(on DNA sequencing Genome and Muta(ons
10-‐100’s million reads: 30-‐1000 leDers
… GGTAGTTAG …
… TATAATTAG …
… AGCCATTAG …
… CGTACCTAG …
… CATTCAGTAG …
… GGTAAACTAG …
Course Topics
Genome Millions -‐billions nucleo8des
Next-‐genera8on DNA sequencing
10-‐100’s million reads Reads: 30-‐1000 nucleo8des
… GGTAGTTAG …
… TATAATTAG …
… AGCCATTAG …
… CGTACCTAG …
… CATTCAGTAG …
… GGTAAACTAG …
1. “Resequencing” algorithms (also RNA-‐Seq, ChIP-‐Seq, *-‐Seq)
2. De novo assembly algorithms
9/14/11
15
Course Topics
gene
Algorithms
Iden8fy func8onal variants
Cancer gene8cs Which soma8c muta8ons lead to cancer?
Personal genomics Gene8c basis for traits of individuals?
2. Interpreta(on of genomes
Course Topics
3. Beyond the genome: epigene8cs
Human genome: ~3 billion leDers
Algorithms
Next-‐genera(on DNA sequencing Genome Modifica(ons
DNA Methyla(on Packaging in chroma(n
9/14/11
16
Molecular Biology 101
Three fundamental molecules. 1) DNA – Informa8on storage.
2) RNA – Old view: Mostly a
“messenger”. – New view: Performs many
important func8ons. 3) Proteins – Perform most cellular func8ons
(biochemistry, signaling, control, etc.)
What molecules can we measure?
Sequencing (expensive, high quality) Hybridiza8on (inexpensive,lower qual)
Sequencing (expensive, high qual) Hybridiza8on (noisy, lower qual)
Mass spectrometry (low quality) Hybridiza8on (very low quality!)
9/14/11
17
DNA sequencing
How we obtain the sequence of nucleo8des of a species
…ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…
DNA Sequencing Goal: Find the complete sequence of A, C, G, T’s in a DNA molecule. Human genome: String of 3 billion nucleo8des.
Challenge: There is no machine that takes long DNA as an input, and gives the complete sequence as output
Can only sequence <1000 leDers at a 8me
9/14/11
18
Shotgun Sequencing
cut many times at random (shotgun)
genomic segment
Get one or two reads from each
segment ~500 bp ~500 bp
Fragment Assembly
Cover region with ~N-fold redundancy Overlap reads and extend to reconstruct the
original genomic region
reads
First Topic: Algorithms to solve fragment assembly problem
9/14/11
19
Progress in Genome Sequencing
454 Illumina ABI SOLiD
Human Genome Project
“Next-‐genera8on” DNA sequencing
~500 bp ~500 bp
~25-100 bp ~25-100 bp
Old technology
New technologies
Progress in Genome Sequencing
454 Illumina ABI SOLiD
Human Genome Project
“Next-‐genera8on” DNA sequencing
~500 bp ~500 bp
~25-100 bp ~25-100 bp
9/14/11
20
“Third-‐genera8on” DNA Sequencing
ACGTGCATGxxxxxxxxxxxxxTGCTAGCTGGxxxxxxxxxxxxxxCCTATGGTA
subread subread subread advance advance 3-‐strobe
Strobe sequencing: Mul8ple reads per fragment
!" #" $"
%" &" '"
!(" #(" $)"
'("&)"%("
$("
&("
#)"
*+,+-./"
#("!("
#)"
$("
$)"
$("$)"
#)"
%(" &)"
%("&("
'("
'("
!"#$ !%#$!("
#("
01231"
01231" +/*"
+/*"
Ritz, Bashir, and Raphael (2010). Bioinforma8cs.
Single Molecule Real Time (SMRT) Sequencing from Pacific Biosciences
Course Topics
gene
Algorithms
Iden8fy func8onal variants
Cancer gene8cs Which soma8c muta8ons lead to cancer?
Personal genomics Gene8c basis for traits of individuals?
2. Interpreta(on of genomes
9/14/11
21
Soma8c Muta8ons and Cancer
Time (cell divisions)
Passenger muta8ons
Driver muta8on
“typical tumor”: ~10 driver muta8ons 100’s – 1000’s of passenger muta8ons
Sequence genome
Founder cell
Cell po
pula8o
n
Clonal Theory (Nowell 1976)
Cancer Genomes
Breast
Leukemia
9/14/11
22
Rearrangements in Tumors Change gene structure, create novel fusion genes
Gleevec (Novar8s 2001) targets ABL-‐BCR fusion
Func8onal Muta8ons
Next-‐genera8on DNA sequencing gene : soma8c muta8on
Recurrent muta8ons/mutated genes à func8onal
91 glioblastoma mul8forme samples
Page 16 N
IH-P
A A
uthor Manuscript
NIH
-PA
Author M
anuscript N
IH-P
A A
uthor Manuscript
Figure 1. Significant copy number aberrations and pattern of somatic mutations (a) Frequency and significance of focal high-level copy-number alterations. Known and putative target genes are listed for each significant CNA, with “Number of Genes” denoting the total number of genes within each focal CNA boundary. (b–c) Distribution of the number of (b) silent and (c) non-silent mutations across the 91 GBM samples separated according to their treatment status, showing hypermutation in 7 out of the 19 treated samples. (d) Significantly mutated genes in 91 glioblastomas. The eight genes attaining a false discovery rate <0.1 are displayed here. Somatic mutations occurring in untreated samples are in dark blue; those found in statistically non-hypermutated and hypermutated samples among the treated cohort are in respectively lighter shades of blue.
Nature. Author manuscript; available in PMC 2009 April 23.
9/14/11
23
Recurrent Muta8ons
Cancer is a disease of “pathways”.
What pathways are altered/mutated?
36%
2%
6%
18%
21%
3% 2% 5% 3%
4%
[Hanahan and Weinberg, Cell 2000]
Biological Interac8on Networks Many types: • Protein-‐DNA (regulatory)
• Protein-‐metabolite (metabolic)
• Protein-‐protein (signaling)
• RNA-‐RNA (regulatory)
• Gene8c interac8ons (gene knockouts)
9/14/11
24
Protein-‐Protein Interac8on Network?
• Proteins are nodes • Interac8ons are edges • Edges may have weights
Yeast PPI network
H. Jeong et al. Nature 411, 41 (2001)
Problem Given:
1. Large-‐scale interac8on network 2. Muta8on data from mul8ple cancer
pa8ents Find: Subnetworks mutated in a significant number of samples
gene
= muta8on
9/14/11
25
Soma8c Muta8ons and Cancer
Time (cell divisions)
Passenger muta8ons
Driver muta8on
“typical tumor”: ~10 driver muta8ons 100’s – 1000’s of passenger muta8ons
Sequence genome
Founder cell
Cell po
pula8o
n
Clonal Theory (Nowell 1976)
Given: cross-‐sec8onal data from many individuals. Goal: Infer order of muta8ons.
Course Topics
gene
Algorithms
Iden8fy func8onal variants
Personal genomics Gene8c basis for traits of individuals?
2. Interpreta(on of genomes
Cases
Controls
9/14/11
26
Course Topics
3. Beyond the genome: epigene8cs
Human genome: ~3 billion leDers
Algorithms
Next-‐genera(on DNA sequencing Genome Modifica(ons
DNA Methyla(on Packaging in chroma(n
Course Topics
Machine Learning
Networks Cancer
Genomes
Algorithms
Systems/Programming
Data Mining/Analysis/Visualiza8on
9/14/11
27
Reading
• Biology background. • Probability background. • DNA sequencing technologies • See course website.