clustering & microarray technology...200 10000 50.00 5.64 4800 4800 1.00 0.00 9000 300 0.03 -4.91...

51
1 Clustering & microarray technology A large scale way to measure gene expression levels. Thanks to Kevin Wayne, Matt Hibbs, & SMD for a few of the slides

Upload: others

Post on 28-Jan-2021

2 views

Category:

Documents


0 download

TRANSCRIPT

  • 1

    Clustering & microarray technology

    A large scale way to measure gene expression levels.

    Thanks to Kevin Wayne, Matt Hibbs, & SMD for a few of the slides

  • 2

    Car parts

    Automobiles

    Blueprints of automobile parts

    DNA Phenotypes

    Gene Expression

    Proteins

    Proteins Why is expression important?

  • 3

    From Genes to Proteins

    Transcription: DNA to mRNA

    Translation:

    mRNA to Proteins DNA

    mRNA

    Protein

    Ribosome

  • 4

    Proteins

    Proteins are the “workhorses” of cells • To understand how cells work is to understand

    proteins

    Understanding proteins and cells is key for finding disease treatments and cures • Modern drug development is centered on

    affecting proteins (receptors, hormones, etc.)

    But… Proteins are hard to study directly, so microarrays look at the mRNA instead

  • 5

    Hybridization Expression microarrays use the fact that complementary strands will hybridize (attach) to each other

  • 6

    Early cDNA microarray (18,000 clones)

  • 7

    Microarray Methodology

  • 8

    Microarray Methodology

    Spot slide with known sequences

    A C

    B D

  • 9

    Microarray Methodology

    reference mRNA test mRNA

    Spot slide with known sequences

    Reference sample Test cells

  • 10

    Microarray Methodology

    reference mRNA test mRNA

    add green dye add red dye

    Spot slide with known sequences

  • 11

    Microarray Methodology

    Add mRNA to slide for Hybridization

    reference mRNA test mRNA

    add green dye add red dye

    hybridize

    Spot slide with known sequences

  • 12

    Microarray Methodology

    Spot slide with known sequences

    Add mRNA to slide for Hybridization

    Scan hybridized array

    reference mRNA test mRNA

    add green dye add red dye

    hybridize

  • 13

    Microarray Methodology

    Spot slide with known sequences

    Add mRNA to slide for Hybridization

    Scan hybridized array

    reference mRNA test mRNA

    add green dye add red dye

    hybridize A 1.5

    B 0.8

    C -1.2

    D 0.1

  • 14

    Microarray Methodology

    Spot slide with known sequences

    Add mRNA to slide for Hybridization

    Scan hybridized array

    reference mRNA test mRNA

    add green dye add red dye

    hybridize A 1.5

    B 0.8

    C -1.2

    D 0.1

  • 15

    Microarray Outputs

    Measure amounts of green and red dye on each spot Represent level of expression as a log ratio between these amounts

    Raw Image from Spellman et al., 98

  • 16

    Experiments

    Extracting Data

    200 10000 50.00 5.644800 4800 1.00 0.009000 300 0.03 -4.91Genes

    Cy3 Cy5Cy5Cy3 log2

    Cy5Cy3

    ⎝ ⎜ ⎜

    ⎠ ⎟ ⎟

    Extracting Data

  • 17

    Why microarray analysis: the questions

    • Large-scale study of biological processes

    • What is going on in the cell at a certain point in time? • what pathways are active, which genes are

    involved in these pathways

    • On genomic level, what accounts for differences between phenotypes? • which pathways are activated in stress

    response and which genes belong to these pathways

    • Sequence important, but genes have effect through expression

  • Introduction to Computer Science • Robert Sedgewick and Kevin Wayne • http://www.cs.Princeton.EDU/IntroCS

    Clustering

    Outbreak of cholera deaths on map in 1850s. Reference: Nina Mishra, HP Labs

    History: London physicist John Snow plotted outbreak of cholera deaths on map in 1850s. Location indicated that clusters were around certain intersections with polluted wells; this exposed the problem and solution!

  • 19

    What is clustering? Reordering of vectors in a dataset so that similar patterns are next to each other

  • 20

    Why cluster microarray data?

    • “Guilt by association” => if unknown gene i is similar in expression to known gene j, maybe they are involved in the same/related pathway • Dimensionality reduction: datasets are too big to be able to get information out without reorganizing the data

  • 21 Botstein & Brown group

  • 22 From Eisen MB, et al, PNAS 1998 95(25):14863-8

    Clustering Random vs Biological Data

    Challenge – when is clustering “real”?

  • 23

    K-means clustering

    1. Define k = number of clusters 2. Randomly initialize a seed vector for each cluster (could use a random gene vector) 3. Go through all genes, and assign each gene to the cluster which it is most similar to 4. Recalculate all seed vectors as means (or medians) of patterns of each cluster Repeat 3&4 until

  • 24

    K-means clustering: stop conditions

    • Until the change in seed vectors is less than • Until all genes get assigned to the same partition twice in a row • Until some minimal number of genes (e.g. 90%) get assigned to the same partition twice in a row

  • 25

    K-means: problems

    • Have to set k ahead of time • Each gene only belongs to 1 cluster • One cluster has no influence on the others • Genes assigned to clusters on the basis of all experiments

  • 26

    Can a gene belong to N clusters?

    • Fuzzy clustering: each gene’s relationship to a cluster is probabilistic • Gene can belong to many clusters • More biologically realistic, but harder to get to work well/fast • Harder to interpret 0.85

    0.15

  • 27

    Self Organizing Maps (SOM)

    Similar to k-means BUT: allow clusters to influence each other

  • 28

    A D

    B E

    C F

    1 2 3 4 5

    6 4 5 6 7 8

    4 5 6 1 4

    1 5 2 3 1

    0 9 0 8 8

    3 3 2 2 3

    1. Initialize the seeds for each partition

  • 29

    A D

    B E

    C F

    1 2 3 4 5

    6 4 5 6 7 8

    4 5 6 1 4

    1 5 2 3 1

    0 9 0 8 8

    3 3 2 2 3

    2. Pick a gene at random, and adjust the closest partition

    2 3 3 4 5 2 3 3 4 5

    Iteration 1.

  • 30

    A D

    B E

    C F

    1 2 3 4 5

    6 4 5 6 7

    4 5 6 1 4

    1 5 2 3 1

    0 9 0 8 8

    3 3 2 2 3

    3. Adjust neighboring partitions

    2 3 3 4 5

    5 4 4 6 5

    2 4 2 4 5 R

    Iteration 1.

  • 31

    A D

    B E

    C F

    1 2 3 4 5

    6 4 5 6 7 8

    4 5 6 1 4

    1 5 2 3 1 4

    0 9 0 8 8

    3 3 2 2 3

    2. Pick a gene at random, and adjust the closest partition

    2 3 3 4 5

    5 4 4 6 5

    2 4 2 4 5

    Iteration 2.

    0 5 1 6 6

  • 32

    Self-organizing maps algorithm

    1. Partition data (e.g. 3x2 grid) 2. Randomly choose “seed” vectors for each partition (length = # experiments) 3. Pick a gene i at random, see which partition it is most similar to (e.g. partition A), and modify A’s seed vector to be more similar to gene i 4. Now modify neighboring partitions of A to be more similar to A After map “settles down”, assign each gene to the most similar partition

  • 33

    Self-organizing maps iterations

    • At higher iterations, smaller R • At higher iterations, smaller change to partition seeds => the map “settles down”

  • 34

    Self Organizing Maps: Result • SOMs result in genes being assigned to partitions of most similar genes

    • Neighboring partitions are more similar to each other than they are to distant partitions

  • 35

    SOM: problems

    • Have to set n and m ahead of time • Each gene only belongs to 1 cluster • Genes assigned to clusters on the basis of all experiments

  • 36

    Hierarchical clustering • Imposes hierarchical structure on all of the data • Easy visualization of similarities and differences between genes (experiments) and clusters of genes (experiments)

  • 37

    How does Hierarchical Clustering work?

    1.  Compare all expression patterns to each

    other.

    2.

    Join patterns that are the most similar out of all patterns.

    3.

    Compare joined patterns to all other un-joined patterns.

    4.

    Go to step 2, and repeat until all patterns are joined.

  • 38

    Hierarchical clustering

  • 39

    Hierarchical clustering

  • 40

    Hierarchical clustering

  • 41

    Hierarchical clustering

  • 42

    Hierarchical clustering

  • 43

    Hierarchical clustering

  • 44

    Dendrogram

    •  Dendrogram. Scientific visualization of hypothetical sequence of evolutionary events. –  Leaves = genes. –  Internal nodes = hypothetical ancestors.

    Reference: http://www.biostat.wisc.edu/bmi576/fall-2003/lecture13.pdf

  • 45

    Dendrogram of Human tumors

    Tumors in similar tissues cluster together.

    Reference: Botstein & Brown group

    Gene 1

    Gene n

    gene over expressed gene under expressed

  • 46

    Analysis and Micro-Optimizations

    Running time. Proportional to MN2. Memory. Proportional to N2. Ex. [M = 50, N = 6,000] Takes 280MB, 48 sec on fast PC. Micro-optimizations. •  Use float instead of double. •  Store only lower triangular part of distance matrix. •  Use squares of distances instead of distances.

    input size proportional to MN

  • 47

    Hierarchical clustering: problems • Hard to define distinct clusters

    • Genes assigned to clusters on the basis of all experiments

    • Optimizing node ordering hard (finding the optimal solution is NP-hard)

    • Can be influenced by one strong cluster – a problem for gene expression b/c data in row space is often highly correlated

    • Hard to partition into distinct clusters

  • 48

    Distance Metrics

    •  Choice of distance measure is important for most clustering techniques

    •  Linear measures: Euclidean distance, Pearson correlation •  Non-parametric: Spearman correlation, Kendall’s tau

    d = 1n

    (xi − yi)2

    i=1

    n

    r = 1n

    xi − x σ x

    $

    % &

    '

    ( )

    i=1

    n

    ∑ yi − y σ y

    $

    % & &

    '

    ( ) )

    ρ =1−6 [rank(xi) − rank(yi)]i=1

    n

    ∑n(n2 −1)

  • 49

    public static void main(String[] args) { double[] pdata = { 1.0, 2.0, 3.0, 4.0 }; double[] qdata = { 5.0, 2.0, 4.0, 1.0 }; Vector p = new Vector(pdata); Vector q = new Vector(qdata); double dist = p.distanceTo(q); Vector r = p.scale(1.0/4.0); Vector s = q.scale(3.0/4.0); Vector t = r.plus(s); System.out.println(t); }

    Vector Data Type Vector data type. • Set of values: sequence of N real numbers. • Set of operations: distanceTo, scale, plus. Ex. p = (1, 2, 3, 4), q = (5, 2, 4, 1). • dist(p, q) = sqrt(42 + 02 + 12 + 32) = 5.099. • t = 1/4 p + 3/4 q = (4, 2, 3.75, 1.75).

  • 50

    Vector: Array Implementation

    public class Vector { private int N; // dimension private double[] data; // components // create a vector from the array d public Vector(double[] d) { N = d.length; data = d; } // return Euclidean distance from this vector a to b public double distanceTo(Vector b) { Vector a = this; double sum = 0.0; for (int i = 0; i < N; i++) sum += (a.data[i] - b.data[i])*(a.data[i] - b.data[i]); return Math.sqrt(sum); }

  • 51

    Vector: Array Implementation public String toString() { String s = ""; for (int i = 0; i < N; i++) s = s + data[i] + " "; return s; } public Vector plus(Vector b) { Vector a = this; double[] d = new double[N]; for (int i = 0; i < N; i++) d[i] = a.data[i] + b.data[i]; return new Vector(d); }

    return string representation of this vector

    return vector sum of this vector a and vector b (could add error checking to ensure dimensions agree)