knowledge-based analysis of microarray gene expression data by using support vector machines


Page 1: Knowledge-based analysis of microarray gene expression data by using support vector machines

Knowledge-based analysis of microarray gene expression data by using support vector machines

• Michael P. S. Brown*, William Noble Grundy†‡, David Lin*, Nello Cristianini§, Charles Walsh Sugnet¶, Terrence S. Furey*, Manuel Ares, Jr.¶, and David Haussler*

• *Department of Computer Science and ¶Center for Molecular Biology of RNA, Department of Biology, University of California, Santa Cruz, Santa Cruz, CA 95064; †Department of Computer Science, Columbia University, New York, NY 10025; §Department of Engineering Mathematics, University of Bristol, Bristol BS8 1TR, United Kingdom

• Advisor: Dr. Hsu

• Reporter: Hung Ching-wen

Page 2: Knowledge-based analysis of microarray gene expression data by using support vector machines

Outline

• Motivation
• Objective
• An unsupervised learning method
• A supervised learning method
• Experiment data
• DNA Microarray Data
• Support Vector Machine
• Kernel
• An imbalance in the number of positive and negative examples
• Experimental Design
• Performance
• Results and Discussion
• Conclusions
• Opinion

Page 3: Knowledge-based analysis of microarray gene expression data by using support vector machines

Motivation

• DNA microarray technology makes it possible to measure the expression levels of thousands of genes in a single experiment.

• Experiments suggest that genes of similar function yield similar expression patterns in microarray hybridization experiments.

Page 4: Knowledge-based analysis of microarray gene expression data by using support vector machines

Objective

• We introduce a method of functionally classifying genes by using gene expression data from DNA microarray hybridization experiments.

• The method is the support vector machine (SVM), a supervised machine learning method: it is trained with prior knowledge of the true functional classes of the genes.

Page 5: Knowledge-based analysis of microarray gene expression data by using support vector machines

An unsupervised learning method

• Unsupervised gene expression analysis methods start from a measure of similarity (or distance) between expression patterns.

• They use no prior knowledge of the true functional classes of the genes.

• Examples are clustering algorithms such as hierarchical clustering or self-organizing maps.

Page 6: Knowledge-based analysis of microarray gene expression data by using support vector machines

A supervised learning method.

• A supervised learning technique begins with a set of genes that have a common function, for example, genes coding for ribosomal proteins.

• The training set contains expression data for two classes of genes: genes that belong to the functional class (positive) and genes known not to belong to it (negative).

Page 7: Knowledge-based analysis of microarray gene expression data by using support vector machines

A supervised learning method

• Using this training set, an SVM learns to discriminate between the members (positive) and non-members (negative) of a given functional class based on expression data.

• Having learned the expression features of the class, the SVM can classify new genes as members or non-members of the class based on their expression data.

Page 8: Knowledge-based analysis of microarray gene expression data by using support vector machines

Experiment data

• We analyze expression data for 2,467 genes from the budding yeast Saccharomyces cerevisiae, measured in 79 different DNA microarray hybridization experiments.

• We learn to recognize five functional classes defined in the MIPS Yeast Genome Database (MYGD).

• We subject these data to analyses by SVMs, Fisher’s linear discriminant, Parzen windows, and two decision tree learners.

Page 9: Knowledge-based analysis of microarray gene expression data by using support vector machines

DNA Microarray Data

• Each data point produced by a DNA microarray hybridization experiment represents the ratio of expression levels of a particular gene under two different experimental conditions.

Page 10: Knowledge-based analysis of microarray gene expression data by using support vector machines

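DNA Microarray Data

• The microarray technology used in the biochip laboratory relies on an arrayer (a microarray fabrication instrument) to fix thousands to tens of thousands of gene probes (cDNA, oligonucleotide) onto a glass slide in a specific arrangement, forming a DNA chip. Target RNA (or DNA) samples (control and reference) are labeled with different fluorescent dyes and hybridized to the gene probes on the DNA chip. A fluorescence scanner and analysis software read the hybridization signals to give the expression strength of each gene, and computer analysis software and databases then make it possible to obtain a large amount of biological information quickly.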

Page 11: Knowledge-based analysis of microarray gene expression data by using support vector machines

DNA Microarray Data

• Each gene X is represented by an expression vector X = (X1, . . . , X79).

• Element Xi is derived from the ratio of the expression level Ei for gene X in experiment i to the expression level Ri of gene X in the reference state.

• The data set: 79-element gene expression vectors for 2,467 yeast genes.
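
As a minimal sketch of how one such expression vector might be computed, the snippet below assumes that each element is the logarithm of the ratio Ei/Ri and that the vector is then scaled to unit Euclidean length. The log transform, the normalization, the function name `expression_vector`, and the sample measurements are all assumptions made for illustration; the slide itself only states that a ratio of expression levels is used.

```python
import math

def expression_vector(expressed, reference):
    """Build a unit-length log-ratio vector (assumed preprocessing, see lead-in)."""
    if len(expressed) != len(reference):
        raise ValueError("expressed and reference must have the same length")
    # Assumption: each element is log(E_i / R_i).
    log_ratios = [math.log(e / r) for e, r in zip(expressed, reference)]
    norm = math.sqrt(sum(v * v for v in log_ratios))
    if norm == 0.0:
        return log_ratios  # gene unchanged in every experiment
    # Assumption: scale the vector to Euclidean length 1.
    return [v / norm for v in log_ratios]

# Hypothetical gene measured in 3 experiments (the real data set has 79).
x = expression_vector([2.0, 1.0, 0.5], [1.0, 1.0, 1.0])
```

With this convention, a gene that is induced twofold and one that is repressed twofold get elements of equal magnitude and opposite sign.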

Page 12: Knowledge-based analysis of microarray gene expression data by using support vector machines

Support Vector Machines

• Each expression vector is a point in a high-dimensional space. A simple way to build a binary classifier is to construct a hyperplane separating positive from negative examples in this space.

• Unfortunately, most real-world problems involve nonseparable data.

• One solution to the inseparability problem is to use a kernel function to map the data into a higher-dimensional space, where a separating hyperplane may exist.

Page 13: Knowledge-based analysis of microarray gene expression data by using support vector machines

Kernel

• The simplest kernel is the dot product: K(X, Y) = X · Y

• K(X, Y) = (X · Y + 1)² yields a quadratic separating surface

• K(X, Y) = (X · Y + 1)³ yields a cubic separating surface
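
The kernels listed on this slide can be sketched in a few lines. The function names `dot` and `poly_kernel` and the sample vectors are hypothetical; treating degree 1 as the plain dot product (without the +1 term) follows the first formula on the slide.

```python
def dot(x, y):
    """Plain dot product X . Y of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))

def poly_kernel(x, y, degree=1):
    """degree=1 gives X . Y; degree d > 1 gives (X . Y + 1) ** d."""
    if degree == 1:
        return dot(x, y)
    return (dot(x, y) + 1.0) ** degree

# Hypothetical two-element vectors, chosen so the arithmetic is easy to follow.
a = [1.0, 0.0]
b = [0.5, 0.5]
k1 = poly_kernel(a, b, degree=1)  # 0.5
k2 = poly_kernel(a, b, degree=2)  # (0.5 + 1) ** 2 = 2.25
k3 = poly_kernel(a, b, degree=3)  # (0.5 + 1) ** 3 = 3.375
```

The kernel is evaluated on the original vectors, so the higher-dimensional mapping never has to be computed explicitly.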

Page 14: Knowledge-based analysis of microarray gene expression data by using support vector machines

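An imbalance in the number of positive and negative examples

• When one class has far fewer training examples than the other, the SVM is likely to make incorrect classifications.

• We solve this problem by modifying the matrix of kernel values computed during SVM optimization.

• Let X(1), . . . , X(n) be the genes in the training set and let K = [Kij] be the kernel matrix, with Kij = K(X(i), X(j)), where K is the kernel function.

• For a positive example, the diagonal element is modified: Kii = Kii + λ(n+/N), where n+ is the number of positive training examples, N is the total number of training examples, and λ is a scale factor.

• For a negative example, n+ is replaced by n−, the number of negative training examples.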
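
The kernel-matrix modification can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: it assumes the constant is added only to the diagonal entry of each training example, and the function name `add_class_dependent_diagonal`, the default `scale=0.1`, and the toy matrix are hypothetical.

```python
def add_class_dependent_diagonal(kernel_matrix, labels, scale=0.1):
    """Return a copy of the kernel matrix with scale * (class count / N)
    added to the diagonal entry of every training example."""
    n_total = len(labels)
    n_pos = sum(1 for y in labels if y > 0)
    n_neg = n_total - n_pos
    adjusted = [row[:] for row in kernel_matrix]  # leave the input untouched
    for i, y in enumerate(labels):
        # Positive examples use n+, negative examples use n-.
        class_count = n_pos if y > 0 else n_neg
        adjusted[i][i] += scale * class_count / n_total
    return adjusted

# Toy example: one positive and three negative genes, identity kernel matrix.
K = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
labels = [+1, -1, -1, -1]
K_adj = add_class_dependent_diagonal(K, labels)
```

In this toy case the rare positive example receives the smaller diagonal increment (0.1 × 1/4) and each negative example the larger one (0.1 × 3/4).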

Page 15: Knowledge-based analysis of microarray gene expression data by using support vector machines

Experimental Design

• Using the class definitions made by the MYGD, we trained SVMs to recognize six functional classes: tricarboxylic acid (TCA) cycle, respiration, cytoplasmic ribosomes, proteasome, histones, and helix-turn-helix proteins (the last included as a control class).

• The performance of the SVM classifiers was compared with that of four standard machine learning algorithms: Parzen windows, Fisher’s linear discriminant, and two decision tree learners (C4.5 and MOC1).

Page 16: Knowledge-based analysis of microarray gene expression data by using support vector machines

Experimental Design

• Performance was tested by using a three-way cross-validated experiment. The gene expression vectors were randomly divided into three groups.

• Classifiers were trained by using two-thirds of the data and were tested on the remaining third.

• This procedure was then repeated two more times, each time using a different third of the genes as test genes.
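
The three-way split described above can be sketched as follows. The function name `three_way_splits`, the fixed random seed, and the use of integer indices in place of gene identifiers are assumptions made for illustration.

```python
import random

def three_way_splits(n_items, seed=0):
    """Yield (train, test) index lists; each third is the test set exactly once."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)  # random division into groups
    folds = [indices[k::3] for k in range(3)]
    for k in range(3):
        test = folds[k]
        # Train on the other two thirds.
        train = [i for j, fold in enumerate(folds) if j != k for i in fold]
        yield train, test

# 2,467 genes, as in the data set described earlier.
splits = list(three_way_splits(2467))
```

Every gene is used for testing exactly once and for training twice, so performance can be accumulated over all 2,467 genes.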

Page 17: Knowledge-based analysis of microarray gene expression data by using support vector machines

Performance

• Each prediction is a false positive (FP), false negative (FN), true positive (TP), or true negative (TN).

• Overall performance is measured by the cost C(M) = fp(M) + 2·fn(M), where fp(M) is the number of false positives for method M and fn(M) is the number of false negatives for method M.

• The cost savings of method M is S(M) = C(N) − C(M), where N is the null method that classifies all test examples as negative.
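
A small worked example may make the cost measure concrete. The function names and the counts below are made up for illustration and are not results from the study. Because the null method N labels every example negative, it produces no false positives and one false negative per positive example.

```python
def cost(false_positives, false_negatives):
    """C(M) = fp(M) + 2 * fn(M): a false negative costs twice a false positive."""
    return false_positives + 2 * false_negatives

def cost_savings(false_positives, false_negatives, n_positive):
    """S(M) = C(N) - C(M), where the null method N calls everything negative."""
    null_cost = cost(0, n_positive)  # every positive becomes a false negative
    return null_cost - cost(false_positives, false_negatives)

# Hypothetical class with 17 positive genes; method M makes 4 FPs and 5 FNs.
c_m = cost(4, 5)              # 4 + 2 * 5 = 14
s_m = cost_savings(4, 5, 17)  # 2 * 17 - 14 = 20
```

A positive S(M) means the method does better than the null method; a value of zero or below means it learned nothing useful for that class.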

Page 18: Knowledge-based analysis of microarray gene expression data by using support vector machines

Results and Discussion(SVMs Outperform Other Methods)

Page 19: Knowledge-based analysis of microarray gene expression data by using support vector machines

Results and Discussion(SVMs Outperform Other Methods)

• For every class (except the helix-turn-helix class), the best performing method is a support vector machine using the radial basis or a higher-dimensional dot product kernel.

• But the results also show the inability of all classifiers to learn to recognize genes that produce helix-turn-helix proteins, as expected (S(M) < 0).

Page 20: Knowledge-based analysis of microarray gene expression data by using support vector machines

Results and Discussion (Significance of Consistently Misclassified Annotated Genes)

Page 21: Knowledge-based analysis of microarray gene expression data by using support vector machines

Results and Discussion (Significance of Consistently Misclassified Annotated Genes)

• Many of the false positive genes in Table 2 are known from biochemical studies to be important for the functional class assigned by the SVM, even though MYGD has not included these genes in their classification. For example, YAL003W and YPL037C,

Page 22: Knowledge-based analysis of microarray gene expression data by using support vector machines

Results and Discussion (Functional Class Predictions for Genes of Unknown Function)

• The predictions below may merit experimental testing. In some cases described in Table 3, additional information supports the prediction. For example, a recent annotation shows that a gene predicted to be involved in respiration, YPR020W, is a subunit of the ATP synthase complex, confirming this prediction

Page 23: Knowledge-based analysis of microarray gene expression data by using support vector machines

Conclusions

• We have demonstrated that support vector machines can accurately classify genes into some functional categories and have made predictions aimed at identifying the functions of unannotated yeast genes.

• SVMs that use a higher-dimensional kernel function provide the best performance.

Page 24: Knowledge-based analysis of microarray gene expression data by using support vector machines

Conclusions

• The supervised learning framework allows a researcher to start with a set of interesting genes and ask two questions: What other genes are coexpressed with my set? And does my set contain genes that do not belong? This ability to focus on the key genes is fundamental to extracting the biological meaning from genome-wide expression data.

Page 25: Knowledge-based analysis of microarray gene expression data by using support vector machines

Conclusions

• It is not clear how many other functional gene classes can be recognized from mRNA expression data by SVMs.

• We caution that several of the classes were selected based on evidence that they clustered using the mRNA expression vectors.

• Other functional classes may require different mRNA expression experiments, or may not be recognizable at all from mRNA expression data alone.

Page 26: Knowledge-based analysis of microarray gene expression data by using support vector machines

Opinion

• SVM is a powerful binary classifier.

• Constructing a suitable kernel function is important and requires good domain knowledge.

• Handling the imbalance between the numbers of positive and negative training examples is a good research topic.