
Computational Functional Genomics (26-BE-790)

(Statistical Models in Computational Biology)

Instructor: Mario Medvedovic, Mario.Medvedovic@uc.edu

Teaching Assistants: Johannes Freudenberg (Bioinformatics), Junhai Guo (Biostatistics)

http://eh3.uc.edu/ComputationalFunctionalGenomics.html

1-3-2004

Course Outline

• Everything will be posted on the web-site

– lecture slides, links to the papers to read, syllabus, computer programs, data, homework, etc.

• The course will start from the very beginning in three different areas:

– Molecular genetics

– Statistics and probability

– Programming

• People with different backgrounds will need to focus their efforts differently

• Independent reading and practice are expected

• Access to a reasonably good PC with the ability to install additional software is absolutely necessary

• Those without access to a decent computer need to send me an email right away

• The focus of the course is analysis of microarray data: experimental design, normalization, identification of differentially expressed genes, cluster analysis and microarray data-based classification

• Getting to actual practical microarray data analysis very quickly – next lecture

• Filling in gaps as we go

• Towards the end, statistical models for regulatory motifs will also be discussed

• If time permits, applications of general graphical models will also be discussed


Course Outline

• Basic concepts of molecular genetics, microarray technology, sources of variability, motivation of the need for statistical analysis

• Introduction to programming and data analysis using R and Bioconductor

• Basics of probability theory (random events, probability, random variables, probability distributions, conditional probability)

• Basics of statistical inference (statistical models, random sample, parameter estimation, hypothesis testing, p-value)

• Identifying differentially expressed genes (normalization approaches, t-test, multiple comparison adjustments)

• Cluster analysis and post-hoc analyses

• Mid-term exam (in-class)

• Elements of experimental design as applied to microarray data (random block design, confounding, analysis of variance, elements of optimal design)

• Basics of Bayesian statistical inference (Bayes theorem, Beta-Binomial and Gamma-Normal models, empirical Bayes approach, hierarchical models)

• Statistical models in cluster analysis (hierarchical approaches, partitioning approaches, mixture model based clustering, EM algorithm, Gibbs sampling)

• Statistical models and computational tools for identifying genomic regulatory elements

• Bayesian graphical models in functional genomics

• Final project


References

• No single universal reference textbook

• Dalgaard, P. Introductory Statistics with R. Springer-Verlag, NY, 2002.

• Speed, T. Statistical Analysis of Gene Expression Microarray Data.

• Parmigiani, G., Garrett, E.S., Irizarry, R.A., Zeger, S.L. The Analysis of Gene Expression Data: Methods and Software.

• Ewens, W.J., Grant, G.R. Statistical Methods in Bioinformatics: An Introduction.

• Baldi, P., Brunak, S. Bioinformatics: The Machine Learning Approach.


Lecture Outline

• Molecular genetics – “The Central Dogma”

• Functional Genomics – assigning function to genes

• Gene Expression

– Functional Genomics Data – Microarrays

– Transcription and Regulatory motifs

• Computational Functional Genomics

– Very wide area

– Computational analysis of functional genomics data

– Computational methods are just a “front” for the underlying statistical methods

• Stochasticity of functional genomics data and molecular biology in general

– Measurement error

– “Biologic variability”

– Stochasticity of underlying molecular processes

• Results in “noisy” data with significant stochastic components (microarray data, transcription factor binding motifs, protein folds, etc)


DNA

• In the nucleus of eukaryotic cells

• A linear polymer of 4 nucleotides (A, C, G, T)

• Two strands of DNA form a double helix by specific pairing of their nucleotides (A-T, C-G)

…AGCTGGCGGT…

…TCGACCGCCA…

• The specificity of pairings is used for preserving genetic information during cell division – individual strands of the double helix are separated and two identical copies are created by filling in the appropriate nucleotides

• Genes are portions of DNA coding for proteins

• Proteins are the functional molecules in a living system

• Proteins are linear polymers of 20 amino acids

• DNA encodes different proteins through the “genetic code” – each three letters code for one amino acid

…AGCTGGCGGT…

…-Ser-Trp-Arg-…
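The pairing rule and the three-letter code on this slide can be tried out in a few lines of Python. This is only a minimal sketch: the function names `complement` and `translate` are made up for illustration, and the codon table is deliberately partial, listing only the three codons that appear in the slide's example (a full genetic code has 64 entries).

```python
# Watson-Crick pairing rules from the slide: A-T and C-G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}

# Partial codon table (assumption for illustration): only the three codons
# that occur in the slide's example are listed.
CODON_TABLE = {"AGC": "Ser", "TGG": "Trp", "CGG": "Arg"}


def complement(strand: str) -> str:
    """Return the position-by-position complementary strand."""
    return "".join(PAIRING[base] for base in strand)


def translate(strand: str) -> list[str]:
    """Read the strand three letters at a time and look up each codon.

    Trailing bases that do not fill a complete codon are ignored.
    """
    codons = [strand[i:i + 3] for i in range(0, len(strand) - 2, 3)]
    return [CODON_TABLE[codon] for codon in codons]


print(complement("AGCTGGCGGT"))  # TCGACCGCCA
print(translate("AGCTGGCGGT"))   # ['Ser', 'Trp', 'Arg']
```

The last base (T) of the ten-letter example does not complete a codon, which is why only three amino acids are returned.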


DNA Replication


The Central Dogma – From Information to Function

• Translating Information stored in the DNA into function – protein production

• mRNA carries the information from the nucleus to cytoplasm where proteins are produced

• Transcription is the process of “copying” the genetic information from DNA into mRNA

• Translation is the process of protein synthesis by decoding the mRNA sequence

• The genome of a cell is its DNA – static

• The transcriptome of a cell is the set of all mRNA molecules in the cell – dynamic

• The proteome of a cell is the set of all proteins in the cell – dynamic

• The cell maintains proper functioning by regulating its protein levels

• A major mechanism for regulating protein levels is regulation of mRNA levels


Functional Genomics

• n: the branch of genomics that determines the biological function of the genes and their products

– Source: WordNet ® 2.0, © 2003 Princeton University

• Functional genomics data

– Data that facilitate assigning function to genes or that directly assess gene function (DNA/protein sequence, 3D protein structure, mRNA level measurements, etc.)

• Computational functional genomics (as assumed in this course)

– Computational methods that facilitate application of appropriate mathematical/statistical models for analysis and interpretation of functional genomics data

– In a broader sense, computational approaches to functional genomics


Reading Materials

Online Reading (in the suggested order):

• An Introduction to biocomputing

– http://www.techfak.uni-bielefeld.de/bcd/Curric/Introd/ch0.html

• Kimball’s Biology Pages – an online hypertext “textbook”

– http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/

– http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/T/Transcription.html

– http://users.rcn.com/jkimball.ma.ultranet/BiologyPages/T/Translation.html

Traditional References

• Lodish, H. et al. Molecular Cell Biology. (Ch1), Ch2, (Ch3), Ch4

• Lewin, B. Genes. Ch1-Ch3

Courses to Take

• Introduction to Molecular Genetics


Microarray Technology – Measuring levels of all mRNA species in parallel

• Base-pairing, or hybridization, is the underlying principle of DNA microarrays.

– Identify a representative fragment of a gene’s coding sequence (e.g. TCGACCGCCA)

– Synthesize corresponding DNA fragments

– Place such “probes” on the glass slide

– Repeat the process for all genes you want to include on the microarray and place each on a pre-defined position on the glass slide

– Some fancy technology is used to actually place up to 40k spots of DNA on a microscope slide


“Single-channel” Microarrays – Experimental Protocol

• mRNA is extracted from the biological sample of interest

• mRNA is labeled using a fluorescent dye

• Microarray is “hybridized” with labeled mRNA

[Figure: Biological Sample → Extracted mRNA → Labeled mRNA]


Hybridization Reaction

• Labeled mRNA fragments float around in search of their complementary DNA fragments immobilized on the microarray slide

• The amount of labeled mRNA that “sticks” to a “spot” representing a gene is proportional to the “copy” number of the corresponding mRNA

• The amount of labeled mRNA “stuck” to each spot is quantitated by measuring the fluorescence intensity of each spot

• The real-world dynamics of this process are complex and there is no simple relationship between the quantitative measurement of fluorescence and the actual number of copies of each mRNA


“Two-channel” Microarrays – Experimental Protocol

• Direct assessment of relative abundance of different mRNA species

• mRNA extracted from two different biological samples is labeled with different fluorescent dyes (usually Cy-3 and Cy-5)

• Two pools of labeled mRNA are “co-hybridized” on a single microarray

• After quantitating individual dye intensities, the results can be represented using the almost notorious shades of green and red


Color Coding of Intensity Ratios

[Figure: scanned images of the “Green Channel” (X_G) and the “Red Channel” (X_R), combined into a color-coded image of their log-ratio]

$$\log\left(\frac{X_R}{X_G}\right) = \log(X_R) - \log(X_G)$$

• The particular shade for each pixel of a spot on a microarray is calculated by a computer program based on the (log)ratio of the two intensity measurements

• The process of quantitating fluorescence intensities consists of several semi-automated steps:

– Identification of the position of all spots on the microarray

– Determination of the “foreground” and the “background” area for each spot

– Segmentation of the “spots” – measuring intensity of all pixels in the area

– Summarizing the intensity of individual pixels (mean or median, variability measures, etc.)
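The last two steps and the color coding can be sketched in Python as follows. Everything here is an illustrative assumption rather than the method of any particular image-analysis program: the function names, the use of the median as the pixel summary, subtracting the background summary from the foreground summary, the base-2 logarithm, and the linear color scale that saturates at a log-ratio of ±2.

```python
import math
from statistics import median


def spot_intensity(foreground, background):
    """Summarize one spot in one channel.

    Assumption: median of foreground pixels minus median of background pixels.
    """
    return median(foreground) - median(background)


def log_ratio(x_red, x_green):
    """log2(X_R / X_G), computed as log2(X_R) - log2(X_G)."""
    return math.log2(x_red) - math.log2(x_green)


def shade(lr, max_abs_lr=2.0):
    """Map a log-ratio to an (R, G, B) color.

    Positive log-ratios become shades of red, negative ones shades of green;
    the scale saturates at +/- max_abs_lr (an arbitrary choice).
    """
    scaled = max(-1.0, min(1.0, lr / max_abs_lr))
    level = round(255 * abs(scaled))
    return (level, 0, 0) if scaled > 0 else (0, level, 0)


# Hypothetical pixel values for one spot in each channel.
red = spot_intensity([900, 880, 920, 900], [100, 100, 90, 110])
green = spot_intensity([300, 310, 290, 300], [100, 100, 100, 100])
print(shade(log_ratio(red, green)))  # a shade of red, since red > green
```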


Graphical Presentation of Data From a Single Microarray

• Scatter plot of fluorescence intensity (>6000 genes)

• Raw measurements plotted on “logarithmic axes” – equivalent to plotting log-transformed data using regular “linear axes”

• Points close to the 45° line represent genes with similar expression in the two samples

• Points far away from the 45° line suggest differentially expressed genes

• In this experiment the same sample was split in two and labeled with two different dyes – we don’t expect any differentially expressed genes

• Red dots represent “spiked” control RNA species that should be


Two Technical Replicates

[Figure: two scatter plots of log-intensities, Treatment vs. Control, one for Experiment 1 and one for Experiment 2]

LR1 = TE1 - CE1

LR2 = TE2 - CE2

• What happens if we measure the same thing twice?

• The original

• Do we expect to get the same log-expression ratios?

• What does “same” really mean?

• Scatter plots of all gene expression values seem pretty similar…


Experimental Variability – Histogram describing the “distribution” of differences in log-ratios between two replicated experiments

[Figure: histogram titled “Fold Changes in Replicated Experiments”, log2 scale with axis ticks labeled 4, 2, 2, 4; maximum change = 4 fold]

ΔLR = LR1 - LR2

• Differences between two replicated measurements of expression ratios can be up to 4-fold!

• What is the “correct” ratio for a given gene?

• Expression measurements have a stochastic component

• The expression ratio can be characterized by a statistical model (i.e. a probability distribution) that defines the “probability” of an outcome

• The probability of an outcome in an experiment can be defined as the proportion of times that this particular outcome would occur in a very large (“infinite”) number of replicated experiments

• The appropriate statistical model for a particular experiment can be postulated by considering the nature of the experiment, its underlying physical nature, and exploratory data analysis

• The histogram can be used as an “empirical” model by assuming that the probability of the outcome occurring within a specific interval is equal to the observed proportion of measurements in this interval

• Various re-sampling and randomization approaches for establishing statistical significance are based on this assumption

• Such approaches are sometimes (wrongly) considered inherently superior to “parametric” methods
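The idea of using the histogram as an “empirical” model can be sketched as below. The five log2-ratios per replicate are invented numbers chosen only so that the arithmetic is easy to follow, and `empirical_probability` is a hypothetical helper name, not part of any package.

```python
def empirical_probability(values, low, high):
    """Observed proportion of values falling in the interval [low, high)."""
    inside = sum(1 for v in values if low <= v < high)
    return inside / len(values)


# Hypothetical log2 expression ratios for the same five genes,
# measured in two replicated experiments.
lr1 = [0.1, 1.5, -0.4, 2.0, 0.3]
lr2 = [0.3, 0.5, -0.2, 0.0, 0.3]

# Difference between replicated log-ratios for each gene.
delta = [a - b for a, b in zip(lr1, lr2)]

# On the log2 scale a difference of d corresponds to a 2**|d| fold change.
max_fold = 2 ** max(abs(d) for d in delta)
print(max_fold)  # 4.0

# Empirical probability that a replicate difference lies in [-0.5, 0.5).
print(empirical_probability(delta, -0.5, 0.5))  # 0.6
```

With real data the same proportion would be read off the histogram bars; with only five made-up values it is of course far too coarse to be useful.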


(Parametric) Statistical Model for Log Gene Expression Ratio Measurements

$$LR \sim N(\mu, \sigma^2)$$

$$f_N(LR \mid \mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(LR-\mu)^2}{2\sigma^2}}$$

$$P(3 < LR < 4) = \int_3^4 f_N(LR \mid \mu, \sigma^2)\, dLR$$

• LR = log expression ratio (observed)

• μ = mean expression ratio (assumed fixed; represents the signal of interest). This value is also the “expectation” of the LR, or the average of a very large number of (infinitely many) observations

• σ = standard deviation, quantifying the variability of observations

• f_N = the probability density function (pdf). The probability of any observed LR being in a given interval is the area under the curve defined by the pdf above this interval. The total area under the whole curve is equal to 1. The pdf can be interpreted as the histogram for a very large (infinite) number of measurements when the width of the boxes is made very small (very close to zero)
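The “area under the curve” interpretation can be checked numerically. The sketch below computes the probability of an interval in two ways: exactly, through the normal cumulative distribution function written with `math.erf`, and approximately, by summing the areas of many narrow boxes under the density. The parameter values μ = 2 and σ = 1 are assumptions picked for illustration; the slide does not state them.

```python
import math


def normal_pdf(lr, mu, sigma):
    """Normal density f_N evaluated at lr."""
    z = (lr - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))


def normal_cdf(lr, mu, sigma):
    """P(LR <= lr) for a normal distribution, via the error function."""
    return 0.5 * (1 + math.erf((lr - mu) / (sigma * math.sqrt(2))))


def interval_probability(low, high, mu, sigma):
    """Exact area under the density between low and high."""
    return normal_cdf(high, mu, sigma) - normal_cdf(low, mu, sigma)


def box_area(low, high, mu, sigma, boxes=10_000):
    """Approximate the same area with many narrow histogram-like boxes."""
    width = (high - low) / boxes
    heights = (normal_pdf(low + (k + 0.5) * width, mu, sigma) for k in range(boxes))
    return sum(heights) * width


mu, sigma = 2.0, 1.0  # hypothetical parameter values
print(interval_probability(3, 4, mu, sigma))
print(box_area(3, 4, mu, sigma))
```

The two printed numbers should agree to many decimal places, which is the sense in which the density is the limit of a histogram with very narrow boxes.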

[Figure: normal density curves for LR; x-axis from -2 to 6]


Transcription and Transcriptional Regulation – Grossly Over-Simplified

[Figure: schematic of a gene’s regulatory region, with labels: Transcription Factor, Regulatory Motif (ACGCGTAA), General Transcription Factors, TATA Box (TATAAA), RNA Polymerase, Coding Region]

• Transcription of a gene is initiated by a transcription factor that specifically binds to a “regulatory motif” in the gene’s regulatory region

• A number of other proteins (general transcription factors) are recruited and bind to DNA in the proximity of the transcription start site

• Finally, RNA polymerase, the protein that performs the synthesis of mRNA, is recruited and transcription is initiated

• Transcriptional regulation is one of the most important mechanisms by which a cell responds to external stimuli, and cell-type-specific gene expression defines the nature of different cells in a multicellular organism


Statistical Model of TF-Binding Motifs

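• If you identify a portion of the promoter region that is bound by these two TFs, the identity of the nucleotides at different positions within the motif will be to some extent random

$$
\begin{pmatrix}
\cdots & p_{7\text{-}A} & p_{8\text{-}A} & p_{9\text{-}A} \\
\cdots & p_{7\text{-}C} & p_{8\text{-}C} & p_{9\text{-}C} \\
\cdots & p_{7\text{-}G} & p_{8\text{-}G} & p_{9\text{-}G} \\
\cdots & p_{7\text{-}T} & p_{8\text{-}T} & p_{9\text{-}T}
\end{pmatrix}
=
\begin{pmatrix}
\cdots & 12/12 & 1/12 & 1/12 \\
\cdots & 0/12 & 2/12 & 2/12 \\
\cdots & 0/12 & 0/12 & 1/12 \\
\cdots & 0/12 & 9/12 & 8/12
\end{pmatrix}
$$

• For a specific position in the motif, the multinomial model for the probability of occurrence of a specific nucleotide is:

$$p(N_9) = p_{9\text{-}N_9}, \qquad N_9 \sim \mathrm{MULT}(p_{9\text{-}A}, p_{9\text{-}C}, p_{9\text{-}G}, p_{9\text{-}T};\ 1)$$

• The product-multinomial model for the probability of a whole sequence recognized by these TFs (assuming independence between different positions in the motif) is:

$$p(N_1 \ldots N_8 N_9) = \prod_{i=1}^{9} p_{i\text{-}N_i}$$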
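A minimal Python sketch of the product-multinomial idea follows. The four aligned binding sites are hypothetical (they are not the sites behind the slide's table), and the function names are made up for illustration. Position-specific nucleotide probabilities are estimated as observed proportions, and the probability of a whole sequence is the product of the per-position probabilities under the independence assumption.

```python
from collections import Counter

NUCLEOTIDES = "ACGT"


def position_frequencies(sites):
    """Estimate p[i][N]: the fraction of sites with nucleotide N at position i."""
    length = len(sites[0])
    freqs = []
    for i in range(length):
        counts = Counter(site[i] for site in sites)
        freqs.append({n: counts[n] / len(sites) for n in NUCLEOTIDES})
    return freqs


def sequence_probability(seq, freqs):
    """Product-multinomial probability: multiply p[i][N_i] over all positions."""
    prob = 1.0
    for i, nucleotide in enumerate(seq):
        prob *= freqs[i][nucleotide]
    return prob


# Hypothetical aligned binding sites of equal length.
sites = ["ACGT", "ACGA", "ACTT", "AGGT"]
freqs = position_frequencies(sites)

print(sequence_probability("ACGT", freqs))  # 1.0 * 0.75 * 0.75 * 0.75 = 0.421875
print(sequence_probability("TCGT", freqs))  # 0.0, T was never seen at position 1
```

The second result shows a practical weakness of raw proportions: any nucleotide not observed at a position gets probability zero, so the whole sequence does too.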


Stochasticity of Protein Folds

• 3D protein structure is often considered the ultimate determinant of its function

• It turns out that a more accurate description of the 3D protein structure is a probability distribution over different possible conformations. In some cases major features of the structure are preserved across a whole set of highly probable conformations. However, in some cases the

• The differences and the uncertainties related to the 3D protein structure are due to thermodynamic fluctuations, which are themselves inherently stochastic

[Figure: example protein structures 1AEY, 1MBA, 1B8Q]


Estimating Model Parameters from Data

$$LR \sim N(\mu, \sigma^2)$$

• Model parameters represent “population” properties of our measurements.

• The conclusions about the phenomenon under investigation are made in terms of (unknown) population parameters

• Example: is the log-ratio of expression measurements for a gene between two different types of biological samples on average greater than zero (i.e. μ > 0)?

• Actual measurements (the sample) are used to calculate sample statistics that are used as estimates of population parameters

• Example: if we have n replicated microarray experiments, the average of the observed log-ratios can be used to estimate the underlying population mean
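As a small illustration of the estimation step, the sketch below computes the sample mean and sample standard deviation of n replicated log-ratios for one gene, together with the standard error of the mean. The five log-ratios are invented values and `estimate_parameters` is a hypothetical helper name; the standard error is included as an assumption about what would be useful next, since the slide itself only mentions estimating the mean.

```python
import math
from statistics import mean, stdev


def estimate_parameters(log_ratios):
    """Return (mu_hat, sigma_hat, standard_error) from replicated log-ratios.

    mu_hat: sample average, an estimate of the population mean.
    sigma_hat: sample standard deviation (n - 1 in the denominator).
    standard_error: sigma_hat / sqrt(n), the variability of mu_hat itself.
    """
    n = len(log_ratios)
    mu_hat = mean(log_ratios)
    sigma_hat = stdev(log_ratios)
    return mu_hat, sigma_hat, sigma_hat / math.sqrt(n)


# Hypothetical log-ratios for one gene from n = 5 replicated experiments.
observed = [0.8, 1.2, 1.0, 0.6, 1.4]
mu_hat, sigma_hat, standard_error = estimate_parameters(observed)
print(mu_hat, sigma_hat, standard_error)
```

Whether an average like this is convincingly greater than zero depends on how large it is relative to its standard error, which is the kind of question hypothesis testing addresses.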

[Figure: normal density curves for LR; x-axis from -2 to 6]