ben%raphael% september%12,%2011%

27
9/14/11 1 CSCI2950C Topics in Computa8onal Biology Lecture 1 Ben Raphael September 12, 2011 hDp://cs.brown.edu/courses/csci2950c/ Topics in Computa8onal Biology Computer Science Mathema8cs & Sta8s8cs Biology Computa(onal biology: develop computa8onal techniques to solve biological problems. Computa(onal molecular biology: develop computa8onal techniques to solve biological problems related to DNA, RNA, and/or proteins. Ecology? Neuroscience? Botany?

Upload: others

Post on 01-Jun-2022

1 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Ben%Raphael% September%12,%2011%

9/14/11  

1  

CSCI2950-­‐C  Topics  in  Computa8onal  Biology  

Lecture  1  

Ben  Raphael  September  12,  2011  

hDp://cs.brown.edu/courses/csci2950-­‐c/  

 

Topics  in  Computa8onal  Biology  

Computer  Science  

Mathema8cs  &  Sta8s8cs  

Biology  

Computa(onal  biology:  develop  computa8onal  techniques  to  solve  biological  problems.  

Computa(onal  molecular  biology:  develop  computa8onal  techniques  to  solve  biological  problems  related  to  DNA,  RNA,  and/or  proteins.  

Ecology?    Neuroscience?  Botany?  

Page 2: Ben%Raphael% September%12,%2011%

9/14/11  

2  

Technology  Transforming  Biology  Un8l  late  20th  Century  

Hypothesis  Genera8on  and  Valida8on  

21th  Century  and  Beyond  

High  throughput  technologies  

Hypothesis  Genera8on  and  Valida8on  

Algorithms  

Human  Genome  Sequenced  2000-­‐2003  

June  26,  2000  

Page 3: Ben%Raphael% September%12,%2011%

9/14/11  

3  

Biology  101  

Molecular  Biology  101  

Page 4: Ben%Raphael% September%12,%2011%

9/14/11  

4  

Molecular  Biology  101  

Central  Dogma  

Molecular  Biology  101  

Three  fundamental  molecules.  1)  DNA  –   Informa8on  storage.      

2)  RNA  –  Old  view:  Mostly  a  

“messenger”.  –  New  view:  Performs  many  

important  func8ons.  3)  Proteins  –  Perform  most  cellular  func8ons  

(biochemistry,  signaling,  control,  etc.)  

Page 5: Ben%Raphael% September%12,%2011%

9/14/11  

5  

DNA  

…ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

Double  helix  Two  strands.  

Each  strand  composed  of  sequence  of  covalently  bonded  nucleo(des  (bases).    Four  possible  nucleo8des:  A,  C,  T,  G  

DNA  sequence:  string  from  4  character  alphabet  

DNA  3’  5’  

5’  3’  

Bases  (nucleo(des)  in  each  strand  are  complementary:  A  ßà  C,        T  ßàG    Watson-­‐Crick  base-­‐pairing    

Page 6: Ben%Raphael% September%12,%2011%

9/14/11  

6  

RNA  •  RNA:  single  stranded,  4  leDer  alphabet:  A,  C,  U,  G    (U  =  T)  

•  Can  fold  into  structures  due  to  base  complementarity.  

•  Many  classes  of  RNA  –  mRNA  (messenger:  transcrip8on  

•  DNA  à  RNA  à  protein)  –  tRNA  (transfer  RNA:  transla8on  

•  mRNA  à  protein  –  miRNA  –  snoRNA  –  piwiRNA  –  Etc…  

Protein  

•  String  of  amino  acids:  20  leDer  alphabet  

•  Folds  into  3D  structures  to  perform  various  func8ons  in  cells  

Page 7: Ben%Raphael% September%12,%2011%

9/14/11  

7  

Molecular  Biology  101  

Molecules  of  DNA,  RNA,  and  protein  interact  with  each  other  in  8me  and  space  

 

Computer  Science!      algorithms  on  strings,  trees,  graphs  

 

Molecules  of  DNA,  RNA,  and  protein  are  built  from  an  “alphabet”  of  “characters”.  

 

Computa8onal  Molecular  Biology  

1.  Measurement/Technology.  2.  Comparison.  3.  Predic8on.  

For  DNA,  RNA,  protein:  

Page 8: Ben%Raphael% September%12,%2011%

9/14/11  

8  

Computa8onal  Molecular  Biology  

Measurement  Derive  the  DNA  sequence  of  a  human.  Which  RNA  molecules  or  proteins  are  

present  in  a  cell  at  a  given  8me?  

No  technology  to  answer  such  ques8ons  exactly.    

 

For  DNA,  RNA,  protein:  

Measurement  Technology    +  Algorithms    

Biological  knowledge  

Genome  Sequencing  and  Comparison  

Human  Genome    Project  

“Next-­‐genera8on”    DNA  sequencing    

DNA  sequencing  

…  ACGTATTAGG  …  

Species  

Page 9: Ben%Raphael% September%12,%2011%

9/14/11  

9  

Computa8onal  Molecular  Biology  

Comparison  What  are  the  similari8es/difference  

between  sequences?  e.g.  from  human  and  chimpanzee?  [String  comparison  ßà  Algorithms]  

For  DNA,  RNA,  protein:  

Computa8onal  Molecular  Biology  

Predic8on  What  differences  in  DNA  determine  

specific  traits?    [Gene8cs]  Which  muta8ons  in  DNA  cause  cancer?    Does  a  protein  bind  to  DNA?    To  what  sequence?      How  does  the  sequence  of  RNA  or  protein  determine  

its  3D  structure?    Do  two  proteins  bind/interact?  …  

For  DNA,  RNA,  protein:  

Page 10: Ben%Raphael% September%12,%2011%

9/14/11  

10  

Why  Study  Computa8onal  Biology?  

Interdisciplinary    Biology    Computer  Science    Mathema8cs    Sta8s8cs  

=  FUN!  

WSJ:  Jan,  26,  2009.  

Why  choose  just  1?  

Course  goals  

•  Learn  applica8ons  of  computer  science,  mathema8cs,  and  sta8s8cs  to  biology.  

•  Learn  some  new  CS/mathema8cs.  •  Learn  to  read  and  cri8que  research  papers.  – Understand  how  to  formulate  a  biological  problem  as  a  computa8onal  problem.  

•  Undertake  a  research  project  –  Course  project:    Use  your  skills  to  solve  a  biological  problem.  

•  Write  a  proposal  (NIH  style)    

 

Page 11: Ben%Raphael% September%12,%2011%

9/14/11  

11  

Course  Organiza8on  

Seminar  style  1)  Introductory  lectures  by  me:  30-­‐50%  of  class  mee8ngs.  2)  Student-­‐led  paper  presenta8ons  and  discussion.  50-­‐70%  

3)  One  computer  assignment:  next-­‐gen  sequencing  data.  4)  Project.    Groups  of  1-­‐3  students.    Proposal:  NIH  style        Ini8al  proposal        Midterm  report        Final  report    Project  presenta8on  

   

 

Course  Organiza8on  See  handout    •  Prerequisites  •  Grading  •  Department  Requirements    

 

Page 12: Ben%Raphael% September%12,%2011%

9/14/11  

12  

Topics  in  Computa8onal  Biology  

Networks   Cancer  

Genomes  

Genome  Sequencing  and  Comparison  

Human  Genome    Project  

“Next-­‐genera8on”    DNA  sequencing    

DNA  sequencing  

…  ACGTATTAGG  …  

Species  

Page 13: Ben%Raphael% September%12,%2011%

9/14/11  

13  

DNA  Replica8on  and  Muta8on  

…  CGTAATTAG  …  

…  CGTCATTAG  …  

Species  

Gene

8c  

Soma8

c  

Compara8ve  genomics  Differences  between  species?  

Genome  Sequencing  and  Comparison  

Personal  genomics  Gene8c  basis  for  traits  of  individuals?    

Cancer  genomics  Which  soma8c  muta8ons  lead  to  cancer?  

Page 14: Ben%Raphael% September%12,%2011%

9/14/11  

14  

Course  Topics  

1.   Measurement  of  genomes  

 

 

Human  genome:    ~3  billion  leDers  

Algorithms  

Next-­‐genera(on  DNA  sequencing   Genome  and  Muta(ons  

10-­‐100’s  million  reads:  30-­‐1000  leDers  

…  GGTAGTTAG  …  

…  TATAATTAG  …  

…  AGCCATTAG  …  

…  CGTACCTAG  …  

…  CATTCAGTAG  …  

…  GGTAAACTAG  …  

Course  Topics  

Genome  Millions  -­‐billions  nucleo8des  

Next-­‐genera8on  DNA  sequencing  

10-­‐100’s  million  reads  Reads:  30-­‐1000  nucleo8des  

…  GGTAGTTAG  …  

…  TATAATTAG  …  

…  AGCCATTAG  …  

…  CGTACCTAG  …  

…  CATTCAGTAG  …  

…  GGTAAACTAG  …  

1.  “Resequencing”  algorithms    (also  RNA-­‐Seq,  ChIP-­‐Seq,  *-­‐Seq)  

2.  De  novo  assembly  algorithms  

Page 15: Ben%Raphael% September%12,%2011%

9/14/11  

15  

Course  Topics  

gene  

Algorithms  

Iden8fy  func8onal  variants  

Cancer  gene8cs  Which  soma8c  muta8ons  lead  to  cancer?  

Personal  genomics  Gene8c  basis  for  traits  of  individuals?    

2.  Interpreta(on  of  genomes  

 

 

Course  Topics  

3.  Beyond  the  genome:  epigene8cs  

 

 

Human  genome:    ~3  billion  leDers  

Algorithms  

Next-­‐genera(on  DNA  sequencing   Genome  Modifica(ons  

 DNA  Methyla(on    Packaging  in  chroma(n  

Page 16: Ben%Raphael% September%12,%2011%

9/14/11  

16  

Molecular  Biology  101  

Three  fundamental  molecules.  1)  DNA  –   Informa8on  storage.      

2)  RNA  –  Old  view:  Mostly  a  

“messenger”.  –  New  view:  Performs  many  

important  func8ons.  3)  Proteins  –  Perform  most  cellular  func8ons  

(biochemistry,  signaling,  control,  etc.)  

What  molecules  can  we  measure?  

Sequencing  (expensive,  high  quality)  Hybridiza8on  (inexpensive,lower  qual)  

Sequencing  (expensive,  high  qual)  Hybridiza8on  (noisy,  lower  qual)  

Mass  spectrometry  (low  quality)  Hybridiza8on  (very  low  quality!)  

Page 17: Ben%Raphael% September%12,%2011%

9/14/11  

17  

DNA  sequencing  

How  we  obtain  the  sequence  of  nucleo8des  of  a  species  

…ACGTGACTGAGGACCGTG CGACTGAGACTGACTGGGT CTAGCTAGACTACGTTTTA TATATATATACGTCGTCGT ACTGATGACTAGATTACAG ACTGATTTAGATACCTGAC TGATTTTAAAAAAATATT…

DNA  Sequencing  Goal:    Find  the  complete  sequence  of  A,  C,  G,  T’s  in  a  DNA  molecule.    Human  genome:  String  of  3  billion  nucleo8des.  

Challenge:    There  is  no  machine  that  takes  long  DNA  as  an  input,  and  gives  the  complete  sequence  as  output  

   Can  only  sequence  <1000  leDers  at  a  8me  

Page 18: Ben%Raphael% September%12,%2011%

9/14/11  

18  

Shotgun  Sequencing  

cut many times at random (shotgun)

genomic segment

Get one or two reads from each

segment ~500 bp ~500 bp

Fragment  Assembly  

Cover region with ~N-fold redundancy Overlap reads and extend to reconstruct the

original genomic region

reads

First  Topic:  Algorithms  to  solve  fragment  assembly  problem  

Page 19: Ben%Raphael% September%12,%2011%

9/14/11  

19  

Progress  in  Genome  Sequencing  

454   Illumina   ABI  SOLiD  

Human  Genome    Project  

“Next-­‐genera8on”    DNA  sequencing    

~500 bp ~500 bp

~25-100 bp ~25-100 bp

Old  technology  

New  technologies  

Progress  in  Genome  Sequencing  

454   Illumina   ABI  SOLiD  

Human  Genome    Project  

“Next-­‐genera8on”    DNA  sequencing    

~500 bp ~500 bp

~25-100 bp ~25-100 bp

Page 20: Ben%Raphael% September%12,%2011%

9/14/11  

20  

“Third-­‐genera8on”  DNA  Sequencing  

ACGTGCATGxxxxxxxxxxxxxTGCTAGCTGGxxxxxxxxxxxxxxCCTATGGTA

subread   subread   subread  advance   advance  3-­‐strobe  

Strobe  sequencing:  Mul8ple  reads  per  fragment  

!" #" $"

%" &" '"

!(" #(" $)"

'("&)"%("

$("

&("

#)"

*+,+-./"

#("!("

#)"

$("

$)"

$("$)"

#)"

%(" &)"

%("&("

'("

'("

!"#$ !%#$!("

#("

01231"

01231" +/*"

+/*"

Ritz,  Bashir,  and  Raphael  (2010).    Bioinforma8cs.  

Single  Molecule  Real  Time  (SMRT)  Sequencing  from  Pacific  Biosciences  

Course  Topics  

gene  

Algorithms  

Iden8fy  func8onal  variants  

Cancer  gene8cs  Which  soma8c  muta8ons  lead  to  cancer?  

Personal  genomics  Gene8c  basis  for  traits  of  individuals?    

2.  Interpreta(on  of  genomes    

 

Page 21: Ben%Raphael% September%12,%2011%

9/14/11  

21  

Soma8c  Muta8ons  and  Cancer  

Time  (cell  divisions)  

Passenger  muta8ons  

Driver  muta8on  

“typical  tumor”:      ~10  driver  muta8ons          100’s  –  1000’s  of  passenger  muta8ons          

Sequence  genome  

Founder    cell  

Cell  po

pula8o

n    

Clonal  Theory  (Nowell  1976)  

Cancer  Genomes  

Breast  

Leukemia  

Page 22: Ben%Raphael% September%12,%2011%

9/14/11  

22  

Rearrangements  in  Tumors  Change  gene  structure,  create  novel  fusion  genes  

Gleevec  (Novar8s  2001)  targets  ABL-­‐BCR  fusion  

Func8onal  Muta8ons  

Next-­‐genera8on  DNA  sequencing   gene   :  soma8c  muta8on  

Recurrent  muta8ons/mutated  genes  à  func8onal    

91  glioblastoma    mul8forme  samples    

Page 16 N

IH-P

A A

uthor Manuscript

NIH

-PA

Author M

anuscript N

IH-P

A A

uthor Manuscript

Figure 1. Significant copy number aberrations and pattern of somatic mutations (a) Frequency and significance of focal high-level copy-number alterations. Known and putative target genes are listed for each significant CNA, with “Number of Genes” denoting the total number of genes within each focal CNA boundary. (b–c) Distribution of the number of (b) silent and (c) non-silent mutations across the 91 GBM samples separated according to their treatment status, showing hypermutation in 7 out of the 19 treated samples. (d) Significantly mutated genes in 91 glioblastomas. The eight genes attaining a false discovery rate <0.1 are displayed here. Somatic mutations occurring in untreated samples are in dark blue; those found in statistically non-hypermutated and hypermutated samples among the treated cohort are in respectively lighter shades of blue.

Nature. Author manuscript; available in PMC 2009 April 23.

Page 23: Ben%Raphael% September%12,%2011%

9/14/11  

23  

Recurrent  Muta8ons  

Cancer  is  a  disease  of  “pathways”.  

What  pathways  are  altered/mutated?  

36%  

2%  

6%  

18%  

21%  

3%  2%   5%   3%  

4%  

                   [Hanahan  and  Weinberg,  Cell  2000]  

Biological  Interac8on  Networks  Many  types:  •  Protein-­‐DNA  (regulatory)  

•  Protein-­‐metabolite  (metabolic)  

•  Protein-­‐protein  (signaling)  

•  RNA-­‐RNA  (regulatory)  

•  Gene8c  interac8ons  (gene  knockouts)  

Page 24: Ben%Raphael% September%12,%2011%

9/14/11  

24  

Protein-­‐Protein  Interac8on  Network?  

•  Proteins  are  nodes  •  Interac8ons  are  edges  •  Edges  may  have  weights  

Yeast  PPI  network  

H.  Jeong  et  al.  Nature  411,  41  (2001)  

Problem  Given:    

1.  Large-­‐scale  interac8on  network  2.  Muta8on  data  from  mul8ple  cancer  

pa8ents            Find:  Subnetworks  mutated  in  a  significant  number  of  samples  

gene  

=  muta8on  

Page 25: Ben%Raphael% September%12,%2011%

9/14/11  

25  

Soma8c  Muta8ons  and  Cancer  

Time  (cell  divisions)  

Passenger  muta8ons  

Driver  muta8on  

“typical  tumor”:      ~10  driver  muta8ons          100’s  –  1000’s  of  passenger  muta8ons          

Sequence  genome  

Founder    cell  

Cell  po

pula8o

n    

Clonal  Theory  (Nowell  1976)  

Given:  cross-­‐sec8onal  data  from  many  individuals.      Goal:  Infer  order  of  muta8ons.  

Course  Topics  

gene  

Algorithms  

Iden8fy  func8onal  variants  

Personal  genomics  Gene8c  basis  for  traits  of  individuals?    

2.  Interpreta(on  of  genomes  

 

 

Cases  

Controls  

Page 26: Ben%Raphael% September%12,%2011%

9/14/11  

26  

Course  Topics  

3.  Beyond  the  genome:  epigene8cs  

 

 

Human  genome:    ~3  billion  leDers  

Algorithms  

Next-­‐genera(on  DNA  sequencing   Genome  Modifica(ons  

 DNA  Methyla(on    Packaging  in  chroma(n  

Course  Topics  

Machine  Learning  

Networks   Cancer  

Genomes  

Algorithms  

Systems/Programming  

Data  Mining/Analysis/Visualiza8on  

Page 27: Ben%Raphael% September%12,%2011%

9/14/11  

27  

Reading  

•  Biology  background.  •  Probability  background.  •  DNA  sequencing  technologies  •  See  course  website.