Clustering of DNA Microarray Data

Michael Slifker

CIS 526

DNA Microarrays

• Measure gene expression in a sample for thousands of genes simultaneously

• Used to compare gene expression among samples

– Between individuals or treatments
– Over time
– Between normal tissue and tumor
– Assess normal biological variation

Microarray Process

• Single-stranded DNA is printed onto slide

• Extract mRNA from cells

• Experimental mRNA sample & reference sample are fluorescently labelled (Cy3-green, Cy5-red)

• RNA samples are hybridized onto slide – bind to complementary DNA

• Laser scanning – fluorescent labels allow relative levels of bound mRNA to be measured

• Gridding, background correction, log-ratio transformation, normalization, analysis (finally!)
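The log-ratio transformation and normalization steps can be sketched roughly as follows. This is a minimal illustration only: the intensity values are made up, and median centering is assumed here as one simple normalization choice, not necessarily the one used in any particular study.

```python
import math
from statistics import median

def log_ratios(experimental, reference):
    """Return log2(experimental / reference) for each spot on the array."""
    return [math.log2(e / r) for e, r in zip(experimental, reference)]

def median_center(values):
    """Subtract the array-wide median so the typical log-ratio becomes 0."""
    m = median(values)
    return [v - m for v in values]

# Hypothetical background-corrected intensities for four spots.
experimental = [800.0, 200.0, 400.0, 1600.0]
reference = [200.0, 200.0, 400.0, 200.0]

raw = log_ratios(experimental, reference)
normalized = median_center(raw)
```

Taking log2 makes a two-fold increase (+1) and a two-fold decrease (-1) symmetric around zero, which is one reason ratios are usually analysed on the log scale.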

• Red = low expression relative to reference

• Green = high expression relative to reference

• Yellow = similar expression in two samples

• Black = no expression in either sample

Example

• DLBCL – Diffuse large B-cell lymphoma (Alizadeh et al., 2000)

• ~18,000 genes x 96 samples of normal and malignant leukocytes

• Clinical evidence of great heterogeneity in terms of survival

• Question: Are there subclasses of DLBCL that can be discovered by looking at gene expression profiles? (Answer: yes)

Why cluster?

• Very large numbers of genes and highly complex systems/pathways render clustering essential for interpretation and visualization

• Discover new tumor subclasses

• Describe common expression profiles (e.g., cell cycle)

What to cluster?

• Clustering genes:
– Look for groups of genes with similar expression profiles – may identify genes that are involved in the same biochemical pathway

• Clustering samples:
– Do clusters conform to known categories?
– Can new structure be discovered (e.g., new subclasses of tumor)?

• Clustering both genes and samples at once

Clustering methods

• Hierarchical (agglomerative) – most common

• K-means, PAM

• Self-organizing maps (SOMs)

• PCA clustering

• Ensemble methods

• “Fuzzy” methods – genes can belong to more than one cluster

• Model-based methods (e.g., mixtures of Gaussians)
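As a rough illustration of the most common method, agglomerative hierarchical clustering can be sketched as below. The choices here are assumptions made for the example: correlation distance (1 minus Pearson correlation), average linkage, and a tiny made-up matrix of log-ratio profiles. A real analysis would normally use an optimized library rather than this naive loop.

```python
import math

def pearson_distance(x, y):
    """1 - Pearson correlation: small when two profiles rise and fall together."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

def average_linkage(c1, c2, profiles):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(pearson_distance(profiles[i], profiles[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(profiles, n_clusters):
    """Start with one cluster per row, then repeatedly merge the closest pair."""
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > n_clusters:
        pairs = [(a, b) for a in range(len(clusters))
                 for b in range(a + 1, len(clusters))]
        a, b = min(pairs, key=lambda p: average_linkage(
            clusters[p[0]], clusters[p[1]], profiles))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # b > a, so index a is unaffected
    return sorted(sorted(c) for c in clusters)

# Hypothetical log-ratio profiles: rows are genes, columns are samples.
profiles = [
    [1.0, 2.0, 3.0, 4.0],
    [2.0, 4.1, 6.0, 8.2],
    [4.0, 3.0, 2.0, 1.0],
    [8.0, 6.1, 4.0, 2.1],
]
gene_clusters = agglomerate(profiles, 2)
```

Passing the transposed matrix (rows as samples) would cluster samples instead of genes with the same code.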

Challenges

• Noisy data in highly dimensional space

• Many choices of algorithm and algorithmic parameters
– What distance measure?
– What linkage?
– How many clusters?

• How can we assess quality/reliability?
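To see why the choice of distance measure matters, the small sketch below compares Euclidean distance with correlation distance on made-up profiles. The profile names and values are hypothetical and chosen only so that the two measures disagree about which profile is closest.

```python
import math

def euclidean(x, y):
    """Straight-line distance: sensitive to the magnitude of expression."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def pearson_distance(x, y):
    """1 - Pearson correlation: sensitive to the shape of the profile only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return 1.0 - cov / (sx * sy)

# Hypothetical expression profiles across four samples.
target = [1.0, 2.0, 3.0, 4.0]
candidates = {
    "same_shape": [10.0, 20.0, 30.0, 40.0],  # same trend, much larger magnitude
    "same_scale": [2.0, 1.0, 4.0, 3.0],      # similar magnitude, different trend
}

def nearest(distance):
    """Name of the candidate profile closest to the target under `distance`."""
    return min(candidates, key=lambda name: distance(target, candidates[name]))

nearest_euclidean = nearest(euclidean)
nearest_pearson = nearest(pearson_distance)
```

The two measures pick different nearest neighbours for the same target, so any clustering built on them can differ as well.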

• Two main sample clusters can be seen

• Genes distinguishing the clusters correspond to two different types of B-cell

• Clusters are associated with differential survival beyond traditional clinical indicators

Conclusions

• To be useful, clustering of microarray data must ultimately be informed by biology

• The large number of genes and the complexity of pathways mean clustering is an essential part of most microarray analyses

• There is no “best” method – results depend on choices of distance, linkage, algorithm, and gene filtering criteria. As much art as science.
