Post on 20-Dec-2015
Microarrays
Regulation of Gene Expression
Cells respond to environment
Heat
Food supply
Responds to environmental conditions
Various external messages
Where gene regulation takes place
• Opening of chromatin
• Transcription
• Translation
• Protein stability
• Protein modifications
Transcriptional Regulation
• Strongest regulation happens during transcription
• Best place to regulate: No energy wasted making intermediate products
• However, slow response time
After a receptor notices a change:
1. Cascade message to nucleus
2. Open chromatin & bind transcription factors
3. Recruit RNA polymerase and transcribe
4. Splice mRNA and send to cytoplasm
5. Translate into protein
Transcription Factors Binding to DNA
Transcription regulation:
Certain transcription factors bind DNA
Binding recognizes DNA substrings:
Regulatory motifs
RNA Polymerase
TBP
Promoter and Enhancers
• Promoter necessary to start transcription
• Enhancers can affect transcription from afar
Enhancer 1 Enhancer 1 Enhancer 1
TATA box
Gene X
DNA binding sites
Transcription factors
Example: A Human heat shock protein
• TATA box: positioning transcription start
• TATA, CCAAT: constitutive transcription
• GRE: glucocorticoid response
• MRE: metal response
• HSE: heat shock element
[Figure: promoter of the heat shock gene hsp70, spanning positions -158 to 0 upstream of the gene. Motifs shown: TATA, SP1, CCAAT, AP2, HSE]
The Cell as a Regulatory Network
Make D (gene D, Promoter D), with inputs A, B and C:
If C then D
If B then NOT D
If A and B then D
Make B (gene B, Promoter B), with input D:
If D then B
DNA Microarrays
Measuring gene transcription in a high-throughput fashion
Measuring transcription
Gene (DNA)
RNA polymerase – cellular enzyme
Transcript (RNA), ending in a poly-A tail: AAAAAAAAA
Extract RNA
Synthetic primer (oligo dT), TTTTTTTTT, pairs with the poly-A tail
Reverse transcriptase (RT) – retroviral enzyme
Complementary DNA (cDNA)
- Fluorescence tags
Expression ~ RNA ~ fluorescence
What is a microarray
What is a microarray (2)
• A 2D array of DNA sequences from thousands of genes
• Each spot has many copies of same gene
• Allow mRNAs from a sample to hybridize
• Measure number of hybridizations per spot
How to make a microarray
• Method 1: Printed Slides (Stanford)
– Use PCR to amplify a 1 kb portion of each gene / EST
– Apply each sample on glass slide
• Method 2: DNA Chips (Affymetrix)
– Grow oligonucleotides (20bp) on glass
– Several words per gene (choose unique words)
If we know the gene sequences, we can sample all genes in one experiment!
Microarray Experiment
[Figure: RNA from a high-glucose sample and from a low-glucose sample each goes through RT-PCR, is hybridized to the DNA “Chip”, and is read with a laser]
Raw data – images
• Red (Cy5) dot – overexpressed or up-regulated
• Green (Cy3) dot – underexpressed or down-regulated
• Yellow dot – equally expressed
• Intensity – “absolute” level
cDNA plotted microarray
Levels of analysis
• Level 1: Which genes are induced / repressed? Gives a good understanding of the biology. Methods: Factor-2 rule, t-test.
• Level 2: Which genes are co-regulated? Inference of function. Methods: clustering algorithms.
• Level 3: Which genes regulate others? Reconstruction of networks. Methods: transcription factor binding sites.
Experiment: time course
[Figure: genes (rows) against time points (columns: Time 0, Time 0.5, ...), with Intensity (Red) and Intensity (Green) recorded per gene, plus sample annotations and gene annotations]
Time (hours): 00 0.50 20 50 70 90 110
Gene expression database
[Figure: gene expression matrix with genes as rows and samples as columns, holding the gene expression levels, flanked by sample annotations and gene annotations]
Samples can be: time series; conditions A, B, …; mutants in genes a, b, …; etc.
Data normalization
$E_i$: expression of gene x in experiment i
$R_i$: expression of gene x in the reference
$$X_i = \log(E_i / R_i)$$
Logarithm of ratio: treats induction and repression of identical magnitude as numerically equal but with opposite sign.
red/green: ratio of expression
– 2: 2x overexpressed
– 0.5: 2x underexpressed
log2(red/green): “log ratio”
– 1: 2x overexpressed
– -1: 2x underexpressed
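A minimal sketch of the log-ratio normalization, assuming base-2 logarithms and strictly positive intensities. The function name `log_ratio` and the example intensities are illustrative and do not come from the slides.

```python
import math


def log_ratio(experiment, reference):
    """Return log2(E / R) for one gene (hypothetical helper)."""
    if experiment <= 0 or reference <= 0:
        raise ValueError("intensities must be positive")
    return math.log2(experiment / reference)


# Made-up intensities: a 2x induction and a 2x repression.
two_fold_up = log_ratio(200.0, 100.0)
two_fold_down = log_ratio(50.0, 100.0)
unchanged = log_ratio(100.0, 100.0)
print(two_fold_up, two_fold_down, unchanged)
```

The two-fold changes land at the same distance from zero with opposite signs, which is the symmetry the raw ratio (2 versus 0.5) lacks.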
Analysis of multiple experiments
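$$X_i = \log(E_i / R_i)$$
Expression of gene x in m experiments can be represented by an expression vector with m elements:
$$X = (X_1, \ldots, X_m)$$
Z-transformation: if $X \sim N(\mu, \sigma)$, then
$$Z = \frac{X - \mu}{\sigma}$$
For measured data, using the sample mean and standard deviation:
$$\bar{X} = \frac{1}{m} \sum_{i=1}^{m} X_i$$
$$\hat{X}_i = \frac{X_i - \bar{X}}{\mathrm{stdev}(X)}$$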
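The Z-transformation of one expression vector can be sketched as follows. This assumes the sample standard deviation (divisor m - 1), which the slides do not specify; `z_transform` and the example profile are illustrative names and values.

```python
import statistics


def z_transform(values):
    """Centre a vector on its mean and scale by its standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # assumption: sample stdev (m - 1 divisor)
    if sd == 0:
        raise ValueError("constant vector cannot be z-transformed")
    return [(v - mean) / sd for v in values]


# Made-up log ratios of one gene across five experiments.
profile = [1.0, 2.0, 3.0, 4.0, 5.0]
z = z_transform(profile)
print(z)
```

After the transformation every gene has mean 0 and standard deviation 1, so profiles are compared by shape rather than by absolute level.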
Level 1
• 2-fold rule: Is a gene 2-fold up (or down) regulated?
• Student's t-test: Is the regulation significantly different from background variation? (Needs repeated measurements)
T-test
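$X \sim N(\mu, \sigma)$
$$H_0: \bar{X} = \mu$$
$$H_a: \bar{X} \neq \mu$$
$$Z = \frac{\bar{X} - \mu}{\sigma / \sqrt{m}}$$
[Figure: distribution of the test statistic, with one region labeled "Cannot reject H0" and one labeled "Reject H0"]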
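The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one measured. A small p-value therefore means a small risk of drawing the wrong conclusion by rejecting the null hypothesis.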
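A minimal sketch of the two Level 1 checks. Note that it uses a z-test with a known background mean `mu` and standard deviation `sigma` (both assumed inputs), not Student's t-test, because the Python standard library offers the normal distribution but no t distribution. All names and numbers are illustrative.

```python
import math
from statistics import NormalDist, mean


def is_two_fold(log2_ratio):
    """2-fold rule: |log2 ratio| >= 1 means at least 2x up or down."""
    return abs(log2_ratio) >= 1.0


def z_test(values, mu, sigma):
    """Two-sided z-test of repeated measurements against a background mean."""
    m = len(values)
    z = (mean(values) - mu) / (sigma / math.sqrt(m))
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Made-up repeated log ratios for one gene; background N(0, 0.5) is assumed.
replicates = [1.2, 0.9, 1.1, 1.0]
z_value, p_value = z_test(replicates, mu=0.0, sigma=0.5)
print(is_two_fold(mean(replicates)), z_value, p_value)
```

With real replicates the standard deviation is usually estimated from the data, in which case a t distribution with m - 1 degrees of freedom replaces the normal distribution.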
Multiple testing
In a microarray experiment, we perform 1 test / gene
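c: probability of a wrong conclusion in a single test; n: number of tests; e: probability of being wrong somewhere
$$\mathrm{Prob}(\text{correct}) = 1 - c$$
$$\mathrm{Prob}(\text{globally correct}) = (1 - c)^n$$
$$\mathrm{Prob}(\text{wrong somewhere}) = 1 - (1 - c)^n$$
$$e = 1 - (1 - c)^n$$
For small e:
$$c \approx \frac{e}{n}$$
Bonferroni correction for multiple testing of independent events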
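The effect of testing many genes at once can be checked numerically. The sketch below assumes independent tests, as the slide does; the function names and the choice of 20 tests at a 0.05 level are illustrative.

```python
def familywise_error(c, n):
    """Probability of at least one wrong call in n independent tests."""
    return 1.0 - (1.0 - c) ** n


def bonferroni_threshold(e, n):
    """Per-test level c that keeps the global error near e (c = e / n)."""
    return e / n


# 20 tests at the usual 0.05 level.
uncorrected_error = familywise_error(0.05, 20)
per_test_level = bonferroni_threshold(0.05, 20)
corrected_error = familywise_error(per_test_level, 20)
print(uncorrected_error, per_test_level, corrected_error)
```

Without correction the chance of at least one false call across 20 tests is about 64%; with the corrected per-test level it stays just under 5%.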
Single comparison
Experiment comparison
Multiple testing
Genes Treated 1 Treated 2 Control 1 Control 2 p-value
Gene 1 0.659081 0.97234 0.372675 0.69511 0.010362
Gene 2 0.341119 0.100549 0.56026 0.285965 0.052948
Gene 3 0.667136 0.29554 0.498284 0.019279 0.150739
Gene 4 0.880788 0.871784 0.552085 0.208167 0.20722
Gene 5 0.092942 0.756629 0.488266 0.84595 0.358535
Gene 6 0.07958 0.736049 0.022873 0.406469 0.391526
Gene 7 0.534497 0.146925 0.659746 0.951731 0.401714
Gene 8 0.062087 0.678039 0.979814 0.795904 0.418683
Gene 9 0.224166 0.17082 0.650215 0.16222 0.512849
Gene 10 0.372998 0.184738 0.353879 0.451197 0.545602
Gene 11 0.537619 0.853997 0.606766 0.083149 0.556954
Gene 12 0.232855 0.77575 0.275746 0.438622 0.58056
Gene 13 0.760863 0.508516 0.823947 0.074637 0.591919
Gene 14 0.568507 0.932771 0.72373 0.027096 0.60806
Gene 15 0.838437 0.549377 0.92673 0.100789 0.623721
Gene 16 0.017407 0.723751 0.310977 0.220452 0.836162
Gene 17 0.893638 0.293472 0.542273 0.886285 0.840617
Gene 18 0.536479 0.887943 0.859521 0.382404 0.861986
Gene 19 0.675622 0.604696 0.445713 0.916473 0.904506
Gene 20 0.836653 0.397073 0.438522 0.778742 0.986562
Significance level: 0.05
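As a rough check, the p-value column of the table can be screened with and without a Bonferroni correction. The p-values are copied from the table; the variable names are illustrative, and treating the 20 tests as independent is an assumption.

```python
# p-values for Gene 1 .. Gene 20, copied from the table above.
p_list = [
    0.010362, 0.052948, 0.150739, 0.20722, 0.358535,
    0.391526, 0.401714, 0.418683, 0.512849, 0.545602,
    0.556954, 0.58056, 0.591919, 0.60806, 0.623721,
    0.836162, 0.840617, 0.861986, 0.904506, 0.986562,
]
p_values = {f"Gene {i}": p for i, p in enumerate(p_list, start=1)}

alpha = 0.05
# Single-comparison view: every gene is judged against alpha on its own.
uncorrected = [gene for gene, p in p_values.items() if p < alpha]

# Bonferroni view: each gene is judged against alpha / number of tests.
bonferroni_alpha = alpha / len(p_values)
corrected = [gene for gene, p in p_values.items() if p < bonferroni_alpha]

print(uncorrected, bonferroni_alpha, corrected)
```

One gene passes the uncorrected 0.05 level, which is what chance alone would be expected to produce with 20 tests; none passes the corrected level.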
Clustering
Hierarchical clustering:
- Transforms the n (genes) * m (experiments) matrix into a symmetric n * n similarity (or distance) matrix
Similarity (or distance) measures:
- Euclidean distance
- Pearson's correlation coefficient
Eisen et al. 1998 PNAS 95:14863-14868
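The step from an n * m expression matrix to an n * n distance matrix can be sketched as below, using Euclidean distance. The function name `distance_matrix` and the three made-up gene profiles are illustrative.

```python
import math


def distance_matrix(expression):
    """Pairwise Euclidean distances between the rows (genes) of a matrix."""
    n = len(expression)
    return [
        [math.dist(expression[i], expression[j]) for j in range(n)]
        for i in range(n)
    ]


# Three hypothetical genes measured in three experiments.
expression = [
    [0.0, 0.0, 0.0],
    [3.0, 4.0, 0.0],
    [3.0, 4.0, 12.0],
]
d = distance_matrix(expression)
for row in d:
    print(row)
```

The result is symmetric with zeros on the diagonal, so only one triangle needs to be stored or displayed.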
Vectors in space: distances
[Figure: Gene 1 and Gene 2 drawn as points in a space with axes Experiment 1, Experiment 2 and Experiment 3; d is the distance between them]
Distance Measures: Minkowski Metric
Suppose two objects x and y both have m features:
$$x = (x_1, x_2, \ldots, x_m)$$
$$y = (y_1, y_2, \ldots, y_m)$$
The Minkowski metric is defined by
$$d(x, y) = \sqrt[r]{\sum_{i=1}^{m} |x_i - y_i|^r}$$
Most Common Minkowski Metrics
1. r = 2 (Euclidean distance):
$$d(x, y) = \sqrt{\sum_{i=1}^{m} |x_i - y_i|^2}$$
2. r = 1 (Manhattan distance):
$$d(x, y) = \sum_{i=1}^{m} |x_i - y_i|$$
3. r → +∞ ("sup" distance):
$$d(x, y) = \max_{1 \le i \le m} |x_i - y_i|$$
An Example
[Figure: points x and y separated by 4 along one axis and 3 along the other]
1. Euclidean distance: $\sqrt{4^2 + 3^2} = 5$
2. Manhattan distance: $4 + 3 = 7$
3. "sup" distance: $\max\{4, 3\} = 4$
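The three distances in this example can be reproduced with a small Minkowski function. The coordinates chosen for x and y are an assumption (the slide only fixes the differences 4 and 3), and `minkowski` is an illustrative name.

```python
import math


def minkowski(x, y, r):
    """Minkowski distance of order r; r = math.inf gives the sup distance."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of features")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(r):
        return max(diffs)
    return sum(d ** r for d in diffs) ** (1.0 / r)


# Assumed coordinates giving differences of 4 and 3.
x = (0.0, 0.0)
y = (4.0, 3.0)
euclidean = minkowski(x, y, 2)
manhattan = minkowski(x, y, 1)
sup = minkowski(x, y, math.inf)
print(euclidean, manhattan, sup)
```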
Similarity Measures: Correlation Coefficient
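$$s(x, y) = \frac{\sum_{i=1}^{m} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{m} (x_i - \bar{x})^2 \times \sum_{i=1}^{m} (y_i - \bar{y})^2}}$$
averages:
$$\bar{x} = \frac{1}{m} \sum_{i=1}^{m} x_i \quad \text{and} \quad \bar{y} = \frac{1}{m} \sum_{i=1}^{m} y_i$$
$$|s(x, y)| \le 1$$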
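A minimal sketch of the correlation coefficient as a similarity measure between two expression profiles. The function name `pearson` and the three made-up profiles are illustrative.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient s(x, y) of two equal-length vectors."""
    m = len(x)
    mean_x = sum(x) / m
    mean_y = sum(y) / m
    numerator = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    denominator = math.sqrt(
        sum((a - mean_x) ** 2 for a in x) * sum((b - mean_y) ** 2 for b in y)
    )
    if denominator == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return numerator / denominator


# Made-up time courses: gene_b follows gene_a at ten times the level,
# gene_c moves in the opposite direction.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]
gene_c = [4.0, 3.0, 2.0, 1.0]
same_shape = pearson(gene_a, gene_b)
opposite_shape = pearson(gene_a, gene_c)
print(same_shape, opposite_shape)
```

Unlike Euclidean distance, the correlation ignores differences in absolute level and scale, so two genes with the same profile shape count as similar even when one is expressed far more strongly.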
Similarity Measures: Correlation Coefficient
[Figure: three panels plotting Expression Level against Time for Gene A and Gene B]
Clustering of Genes and Conditions
• Unsupervised:
– Hierarchical clustering
– K-means clustering
– Self Organizing Maps (SOMs)
Ordered dendrograms
Hierarchical clustering:
Hypothesis: guilt-by-association
Common regulation -> common function
Eisen98
Hierarchical Clustering
Given a set of n items to be clustered, and an n*n distance (or similarity) matrix, the basic process of hierarchical clustering is this:
1. Start by assigning each item to its own cluster, so that if you have n items, you now have n clusters, each containing just one item. Let the distances (similarities) between the clusters equal the distances (similarities) between the items they contain.
2. Find the closest (most similar) pair of clusters and merge them into a single cluster, so that now you have one less cluster.
3. Compute distances (similarities) between the new cluster and each of the old clusters.
4. Repeat steps 2 and 3 until all items are clustered into a single cluster of size n.
Merge two clusters by:
• Single-Link Method / Nearest Neighbor (NN): minimum of pairwise dissimilarities
• Complete-Link / Furthest Neighbor (FN): maximum of pairwise dissimilarities
• Unweighted Pair Group Method with Arithmetic Mean (UPGMA): average of pairwise dissimilarities
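Single-Link Method
n*n distance matrix (lower triangle shown), Euclidean distance:
     a  b  c
b    2
c    5  3
d    6  5  4
(1) Merge a and b (distance 2). Clusters: a,b / c / d. Updated distances:
     a,b  c
c    3
d    5    4
(2) Merge a,b and c (distance 3). Clusters: a,b,c / d. Updated distances:
     a,b,c
d    4
(3) Merge a,b,c and d (distance 4). Clusters: a,b,c,d.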
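Complete-Link Method
Distance matrix (lower triangle shown), Euclidean distance:
     a  b  c
b    2
c    5  3
d    6  5  4
(1) Merge a and b (distance 2). Clusters: a,b / c / d. Updated distances:
     a,b  c
c    5
d    6    4
(2) Merge c and d (distance 4). Clusters: a,b / c,d. Updated distances:
     a,b
c,d  6
(3) Merge a,b and c,d (distance 6). Clusters: a,b,c,d.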
Compare Dendrograms
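[Figure: two dendrograms over the leaves a, b, c, d on a height axis from 0 to 6. Single-Link merges at heights 2, 3 and 4; Complete-Link merges at heights 2, 4 and 6]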
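The two dendrograms can be reproduced with a short agglomerative sketch on the four-item example, using the pairwise distances a-b 2, a-c 5, a-d 6, b-c 3, b-d 5, c-d 4. The function `agglomerate` and its data layout are illustrative choices, not a reference implementation; the linkage is passed in as a function over the list of pairwise dissimilarities (`min` for single-link, `max` for complete-link).

```python
from itertools import combinations


def agglomerate(items, pair_distance, linkage):
    """Merge the closest pair of clusters until one cluster remains.

    Returns a list of (sorted members of the new cluster, merge height).
    """
    clusters = [frozenset([item]) for item in items]
    merges = []

    def cluster_distance(c1, c2):
        # Linkage applied to all pairwise dissimilarities between members.
        return linkage([pair_distance[frozenset((p, q))] for p in c1 for q in c2])

    while len(clusters) > 1:
        c1, c2 = min(
            combinations(clusters, 2),
            key=lambda pair: cluster_distance(pair[0], pair[1]),
        )
        height = cluster_distance(c1, c2)
        clusters = [c for c in clusters if c != c1 and c != c2]
        clusters.append(c1 | c2)
        merges.append((sorted(c1 | c2), height))
    return merges


pair_distance = {
    frozenset("ab"): 2, frozenset("ac"): 5, frozenset("ad"): 6,
    frozenset("bc"): 3, frozenset("bd"): 5, frozenset("cd"): 4,
}
single = agglomerate("abcd", pair_distance, min)
complete = agglomerate("abcd", pair_distance, max)
print(single)
print(complete)
```

Passing a mean function such as `lambda ds: sum(ds) / len(ds)` gives the UPGMA rule. Ties between equally close pairs are broken by list order in this sketch, which can change the resulting tree.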
Serum stimulation of human fibroblasts (24h)
Clusters: Cholesterol biosynthesis, Cell cycle, I-E response, Signalling / Angiogenesis, Wound healing
Partitioning
• k-means clustering
• Self organizing maps (SOMs)
k-means clustering
Tavazoie et al. 1999 Nature Genet. 22:281-285
k-Means Clustering Algorithm
1) Select an initial partition of k clusters
2) Assign each object to the cluster with the closest centre
3) Compute the new centres of the clusters
4) Repeat steps 2 and 3 until no object changes cluster
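The four steps can be sketched in plain Python as follows. This version assumes the initial partition is given as a list of starting centres and uses Euclidean distance; both are assumptions, since the slides do not say how step 1 is done. The names `k_means`, `points` and `start` are illustrative.

```python
import math


def k_means(points, initial_centres, max_iter=100):
    """Alternate assignment and centre updates until assignments are stable."""
    centres = list(initial_centres)
    assignment = None
    for _ in range(max_iter):
        # Step 2: assign each object to the cluster with the closest centre.
        new_assignment = [
            min(range(len(centres)), key=lambda j: math.dist(p, centres[j]))
            for p in points
        ]
        # Step 4: stop when no object changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 3: recompute each centre as the mean of its members.
        for j in range(len(centres)):
            members = [p for p, a in zip(points, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centre
                centres[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return assignment, centres


# Two well-separated made-up groups, k = 2.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
start = [(0, 0), (10, 10)]
labels, centres = k_means(points, start)
print(labels, centres)
```

The outcome depends on the starting centres, so in practice the procedure is often run from several different starts.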
[Figures: three k-means snapshots with k = 6, each showing the centroids numbered 1 to 6]
Self organizing maps
Tamayo et al. 1999 PNAS 96:2907-2912
[Figures: self-organizing map with centroids 1 to 6 arranged on a 2 x 3 grid, k = (2,3) = 6]