TRANSCRIPT
- Slide 1
- Introduction to Machine Learning BMI/IBGP 730 Kun Huang Department of Biomedical Informatics The Ohio State University
- Slide 2
- Machine Learning Statistical learning Artificial intelligence Pattern recognition Data mining
- Slide 3
- Machine Learning Supervised Unsupervised Semi-supervised Regression
- Slide 4
- Clustering and Classification Preprocessing Distance measures Popular algorithms (not necessarily the best ones) More sophisticated ones Evaluation Data mining
- Slide 5
- - Clustering or classification? - Is training data available? - What domain specific knowledge can be applied? - What preprocessing of data is needed? - Log / data scale and numerical stability - Filtering / denoising - Nonlinear kernel - Feature selection (do I need to use all the data?) - Is the dimensionality of the data too high?
- Slide 6
- - Accuracy vs. generality - Overfitting - Model selection [Figure: prediction error vs. model complexity for training and testing samples; reproduced from Hastie et al.]
- Slide 7
- How do we process microarray data (clustering)? - Feature selection: genes, transformations of expression levels. - Genes discovered in class comparison (t-test). Risk: missing genes. - Iterative approach: select genes under different p-value cutoffs, then pick the one with good performance using cross-validation. - Principal components (pros and cons). - Discriminant analysis (e.g., LDA).
- Slide 8
- - Dimensionality Reduction - Principal component analysis (PCA) - Singular value decomposition (SVD) - Karhunen-Loève transform (KLT)
- Slide 9
- - Principal Component Analysis (PCA) - Other things to consider - Numerical balance/data normalization - Noisy direction - Continuous vs. discrete data - Principal components are orthogonal to each other, however, biological data are not - Principal components are linear combinations of original data - Prior knowledge is important - PCA is not clustering!
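The PCA bullets above can be made concrete with a minimal sketch (pure Python, hypothetical toy data): the first principal component of 2-D data from the closed-form eigen-decomposition of its 2x2 covariance matrix.

```python
import math

def pca_2d(points):
    """First principal component of 2-D data via the closed-form
    eigen-decomposition of the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Covariance entries (population form, divisor n)
    a = sum((p[0] - mx) ** 2 for p in points) / n
    c = sum((p[1] - my) ** 2 for p in points) / n
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Leading eigenvalue of [[a, b], [b, c]]
    lam = 0.5 * ((a + c) + math.sqrt((a - c) ** 2 + 4 * b * b))
    # Corresponding eigenvector (b, lam - a), normalized
    vx, vy = b, lam - a
    if vx == 0 and vy == 0:  # axis-aligned data: pick the dominant axis
        vx, vy = (1.0, 0.0) if a >= c else (0.0, 1.0)
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm)

# Points spread along the line y = x: the first PC should be ~(1,1)/sqrt(2)
pc = pca_2d([(0, 0), (1, 1.1), (2, 1.9), (3, 3.05)])
```

Note how the principal component is a linear combination of the original coordinates, as the slide warns.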
- Slide 10
- Visualization of Microarray Data - Multidimensional scaling (MDS) - High-dimensional coordinates are unknown - Distances between the points are known - The distance may not be Euclidean, but the embedding maintains the distances in a Euclidean space - Try different dimensions (from one to ???) - At each dimension, perform an optimal embedding to minimize the embedding error - Plot the embedding error (residual) vs. dimension - Pick the knee point
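The knee-point step above can be sketched as follows. The largest-second-difference rule used here is one simple heuristic (my assumption, not the slide's prescribed method), and the residual curve is hypothetical.

```python
def knee_point(dims, residuals):
    """Pick the 'knee' of a residual-vs-dimension curve as the point
    with the largest second difference (the sharpest bend)."""
    best_d, best_bend = dims[1], float("-inf")
    for i in range(1, len(dims) - 1):
        bend = (residuals[i - 1] - residuals[i]) - (residuals[i] - residuals[i + 1])
        if bend > best_bend:
            best_bend, best_d = bend, dims[i]
    return best_d

# Hypothetical residual curve: big drops up to dimension 3, flat afterwards
dims = [1, 2, 3, 4, 5, 6]
residuals = [10.0, 6.0, 1.0, 0.8, 0.7, 0.65]
knee = knee_point(dims, residuals)   # dimension 3 is where the curve bends
```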
- Slide 11
- Visualization of Microarray Data Multidimensional scaling (MDS)
- Slide 12
- Distance Measure (Metric?) -What do you mean by similar? -Euclidean -Uncentered correlation -Pearson correlation
- Slide 13
- Distance Metric - Euclidean
  102123_at Lip1: 1596.0, 2040.9, 1277.0, 4090.5, 1357.6, 1039.2, 1387.3, 3189.0, 1321.3, 2164.4, 868.6, 185.3, 266.4, 2527.8
  160552_at Ap1s1: 4144.4, 3986.9, 3083.1, 6105.9, 3245.8, 4468.4, 7295.0, 5410.9, 3162.1, 4100.9, 4603.2, 6066.2, 5505.8, 5702.7
  d_E(Lip1, Ap1s1) = 12883
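The Euclidean distance on the slide can be checked with a short sketch; the two vectors are the probe values from the slide itself.

```python
import math

lip1 = [1596.0, 2040.9, 1277.0, 4090.5, 1357.6, 1039.2, 1387.3,
        3189.0, 1321.3, 2164.4, 868.6, 185.3, 266.4, 2527.8]
ap1s1 = [4144.4, 3986.9, 3083.1, 6105.9, 3245.8, 4468.4, 7295.0,
         5410.9, 3162.1, 4100.9, 4603.2, 6066.2, 5505.8, 5702.7]

def euclidean(x, y):
    """Euclidean distance: square root of the summed squared differences."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

d_e = euclidean(lip1, ap1s1)   # ~12883, matching the slide
```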
- Slide 14
- Distance Metric - Pearson Correlation
  102123_at Lip1: 1596.0, 2040.9, 1277.0, 4090.5, 1357.6, 1039.2, 1387.3, 3189.0, 1321.3, 2164.4, 868.6, 185.3, 266.4, 2527.8
  160552_at Ap1s1: 4144.4, 3986.9, 3083.1, 6105.9, 3245.8, 4468.4, 7295.0, 5410.9, 3162.1, 4100.9, 4603.2, 6066.2, 5505.8, 5702.7
  d_P(Lip1, Ap1s1) = 0.904
- Slide 15
- Distance Metric - Pearson Correlation - Ranges from -1 to 1. [Figure: example profiles with r = 1 and r = -1]
- Slide 16
- Distance Metric - Uncentered Correlation
  102123_at Lip1: 1596.0, 2040.9, 1277.0, 4090.5, 1357.6, 1039.2, 1387.3, 3189.0, 1321.3, 2164.4, 868.6, 185.3, 266.4, 2527.8
  160552_at Ap1s1: 4144.4, 3986.9, 3083.1, 6105.9, 3245.8, 4468.4, 7295.0, 5410.9, 3162.1, 4100.9, 4603.2, 6066.2, 5505.8, 5702.7
  d_u(Lip1, Ap1s1) = 0.835 (an angle of about 33.4°)
- Slide 17
- Distance Metric - Difference between Pearson correlation and uncentered correlation
  102123_at Lip1: 1596.0, 2040.9, 1277.0, 4090.5, 1357.6, 1039.2, 1387.3, 3189.0, 1321.3, 2164.4, 868.6, 185.3, 266.4, 2527.8
  160552_at Ap1s1: 4144.4, 3986.9, 3083.1, 6105.9, 3245.8, 4468.4, 7295.0, 5410.9, 3162.1, 4100.9, 4603.2, 6066.2, 5505.8, 5702.7
  Pearson correlation: a baseline expression level is possible (the mean is subtracted out). Uncentered correlation: all values are treated as signal (no mean subtraction).
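A minimal sketch of the two measures on hypothetical toy vectors (not the probe values above): the second vector is the first shifted by a constant baseline, which Pearson correlation removes by centering but uncentered correlation treats as signal.

```python
import math

def pearson(x, y):
    """Pearson correlation: center each vector at its mean first."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    dx = [xi - mx for xi in x]
    dy = [yi - my for yi in y]
    num = sum(a * b for a, b in zip(dx, dy))
    den = math.sqrt(sum(a * a for a in dx)) * math.sqrt(sum(b * b for b in dy))
    return num / den

def uncentered(x, y):
    """Uncentered correlation (cosine of the angle): no mean subtraction,
    so a baseline offset is treated as part of the signal."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return num / den

# Two profiles with the same shape but different baselines
x = [1.0, 2.0, 3.0, 2.0, 1.0]
y = [101.0, 102.0, 103.0, 102.0, 101.0]   # x shifted up by 100
r_p = pearson(x, y)       # 1.0: the baseline is removed by centering
r_u = uncentered(x, y)    # < 1.0: the offset dilutes the angle
```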
- Slide 18
- Distance Metric -Difference between Euclidean and correlation
- Slide 19
- Distance Metric - PCC measures similarity; how can we transform it into a distance? - 1-PCC - Negative correlation may also mean closeness in a signaling pathway (1-|PCC|, 1-PCC^2)
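The three transformations can be sketched directly; the input correlation value is hypothetical.

```python
def pcc_distances(pcc):
    """Three common ways to turn a Pearson correlation into a distance.
    1 - PCC:    anti-correlated genes end up far apart (distance near 2)
    1 - |PCC|:  anti-correlated genes end up close (sign ignored)
    1 - PCC^2:  also sign-blind, but penalizes weak correlation more smoothly
    """
    return {"1-PCC": 1 - pcc, "1-|PCC|": 1 - abs(pcc), "1-PCC^2": 1 - pcc ** 2}

d = pcc_distances(-0.9)   # a strongly anti-correlated pair
```

Under 1-PCC the pair is nearly maximally distant (1.9), while under the sign-blind variants it is close (0.1 and 0.19), reflecting the pathway argument on the slide.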
- Slide 20
- Supervised Learning Perceptron neural networks
- Slide 21
- Supervised Learning Perceptron neural networks
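A minimal sketch of the classic perceptron learning rule, trained on the (linearly separable) logical-AND problem; the data and learning rate are illustrative assumptions.

```python
def train_perceptron(samples, epochs=20, lr=1.0):
    """Classic perceptron rule: on each misclassified sample,
    nudge the weights toward (or away from) that sample."""
    w = [0.0, 0.0]   # weights
    b = 0.0          # bias
    for _ in range(epochs):
        errors = 0
        for x, target in samples:
            pred = 1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0
            if pred != target:
                errors += 1
                delta = lr * (target - pred)
                w[0] += delta * x[0]
                w[1] += delta * x[1]
                b += delta
        if errors == 0:   # converged on linearly separable data
            break
    return w, b

# Logical AND is linearly separable, so the perceptron converges
data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(data)
preds = [1 if w[0] * x[0] + w[1] * x[1] + b > 0 else 0 for x, _ in data]
```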
- Slide 22
- - Supervised Learning - Support vector machines (SVM) and Kernels - Only a (binary) classifier; no data model
- Slide 23
- - Supervised Learning - Naïve Bayes classifier - Bayes rule - Maximum a posteriori (MAP): combine the prior probability with the conditional probability
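A minimal sketch of the MAP decision rule with a naïve (conditional-independence) likelihood; the class names, priors, and probabilities are all hypothetical.

```python
# Hypothetical two-class problem: pick the class maximizing
# posterior ∝ prior × conditional likelihood (Bayes rule, MAP decision).
priors = {"tumor": 0.3, "normal": 0.7}
# P(gene is highly expressed | class) for two marker genes,
# assumed conditionally independent given the class (the naive assumption)
likelihoods = {
    "tumor":  {"geneA": 0.9, "geneB": 0.8},
    "normal": {"geneA": 0.2, "geneB": 0.3},
}

def map_class(observed_genes):
    """Naive Bayes MAP: multiply the prior by each observed gene's
    conditional probability, then pick the class with the largest product."""
    scores = {}
    for cls, prior in priors.items():
        score = prior
        for g in observed_genes:
            score *= likelihoods[cls][g]
        scores[cls] = score
    return max(scores, key=scores.get), scores

best, scores = map_class(["geneA", "geneB"])
# tumor: 0.3 * 0.9 * 0.8 = 0.216 beats normal: 0.7 * 0.2 * 0.3 = 0.042
```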
- Slide 24
- - Dimensionality reduction: linear discriminant analysis (LDA) [Figure: two classes, A and B, projected onto the discriminant direction w; from S. Wu's website]
- Slide 25
- Linear Discriminant Analysis [Figure: two classes, A and B, projected onto the discriminant direction w; from S. Wu's website]
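A minimal sketch of Fisher's two-class LDA in 2-D, computing w = Sw⁻¹(m1 − m2) with an explicit 2x2 inverse; the toy classes are hypothetical.

```python
def fisher_lda(class1, class2):
    """Fisher's linear discriminant for two classes in 2-D:
    w = Sw^-1 (m1 - m2), where Sw is the pooled within-class scatter."""
    def mean(pts):
        n = len(pts)
        return (sum(p[0] for p in pts) / n, sum(p[1] for p in pts) / n)

    def scatter(pts, m):
        sxx = sum((p[0] - m[0]) ** 2 for p in pts)
        syy = sum((p[1] - m[1]) ** 2 for p in pts)
        sxy = sum((p[0] - m[0]) * (p[1] - m[1]) for p in pts)
        return sxx, sxy, syy

    m1, m2 = mean(class1), mean(class2)
    s1, s2 = scatter(class1, m1), scatter(class2, m2)
    a, b, c = s1[0] + s2[0], s1[1] + s2[1], s1[2] + s2[2]  # Sw = [[a,b],[b,c]]
    det = a * c - b * b
    dx, dy = m1[0] - m2[0], m1[1] - m2[1]
    # w = Sw^-1 (m1 - m2), using the 2x2 inverse [[c,-b],[-b,a]] / det
    return ((c * dx - b * dy) / det, (a * dy - b * dx) / det)

# Two well-separated classes with isotropic scatter: w should be
# parallel (up to scale) to the difference of the class means.
c1 = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
c2 = [(4.0, 4.0), (5.0, 4.0), (4.0, 5.0), (5.0, 5.0)]
w = fisher_lda(c1, c2)
```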
- Slide 26
- - Supervised Learning - Support vector machines (SVM) and Kernels - Kernel: nonlinear mapping
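The kernel idea can be illustrated with a Gaussian (RBF) kernel sketch; the gamma value and the points are illustrative assumptions.

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel: k(x, y) = exp(-gamma * ||x - y||^2).
    It equals an inner product after an implicit nonlinear mapping, so a
    linear SVM in that feature space is nonlinear in the input space."""
    sq_dist = sum((xi - yi) ** 2 for xi, yi in zip(x, y))
    return math.exp(-gamma * sq_dist)

points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
# Kernel (Gram) matrix: symmetric, with ones on the diagonal,
# and entries that shrink as points move apart
K = [[rbf_kernel(p, q) for q in points] for p in points]
```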
- Slide 27
- How do we use microarrays? Profiling Clustering - Cluster to detect patient subgroups - Cluster to detect gene clusters and regulatory networks
- Slide 28
- Slide 29
- How do we process microarray data (clustering)? - Unsupervised Learning Hierarchical Clustering
- Slide 30
- How do we process microarray data (clustering)? -Unsupervised Learning Hierarchical Clustering Single linkage: The linking distance is the minimum distance between two clusters.
- Slide 31
- How do we process microarray data (clustering)? -Unsupervised Learning Hierarchical Clustering Complete linkage: The linking distance is the maximum distance between two clusters.
- Slide 32
- How do we process microarray data (clustering)? -Unsupervised Learning Hierarchical Clustering Average linkage/UPGMA: The linking distance is the average of all pair-wise distances between members of the two clusters. Since all genes and samples carry equal weight, the linkage is an Unweighted Pair Group Method with Arithmetic Means (UPGMA).
- Slide 33
- How do we process microarray data (clustering)? -Unsupervised Learning Hierarchical Clustering Single linkage Prone to chaining and sensitive to noise Complete linkage Tends to produce compact clusters Average linkage Sensitive to distance metric
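The three linkage rules above can be sketched with a minimal bottom-up clustering on hypothetical 1-D data.

```python
def linkage_distance(c1, c2, dist, mode):
    """Distance between two clusters under a chosen linkage rule."""
    pair_dists = [dist(a, b) for a in c1 for b in c2]
    if mode == "single":
        return min(pair_dists)                    # minimum pairwise distance
    if mode == "complete":
        return max(pair_dists)                    # maximum pairwise distance
    return sum(pair_dists) / len(pair_dists)      # average linkage / UPGMA

def agglomerate(points, k, mode="single"):
    """Bottom-up hierarchical clustering: start with singletons and
    repeatedly merge the two closest clusters until k remain."""
    dist = lambda a, b: abs(a - b)                # 1-D toy metric
    clusters = [[p] for p in points]
    while len(clusters) > k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = linkage_distance(clusters[i], clusters[j], dist, mode)
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return [sorted(c) for c in clusters]

# Two obvious groups in 1-D; any of the three linkages recovers them here
clusters = agglomerate([1.0, 1.2, 0.8, 10.0, 10.3, 9.7], k=2)
```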
- Slide 34
- -Unsupervised Learning Hierarchical Clustering
- Slide 35
- Dendrograms - Distance: the height of each horizontal line represents the distance between the two groups it merges. - Order: open-source R uses the convention that tighter clusters are on the left. Others have proposed ordering by expression values, loci on chromosomes, and other ranking criteria.
- Slide 36
- -Unsupervised Learning - K-means -Vector quantization -K-D trees -Need to try different K, sensitive to initialization
- Slide 37
- - Unsupervised Learning - K-means (MATLAB): [cidx, ctrs] = kmeans(yeastvalueshighexp, 4, 'dist', 'corr', 'rep', 20); % here K = 4 and 'corr' selects the correlation distance metric, with 20 replicates
- Slide 38
- -Unsupervised Learning - K-means -Number of class K needs to be specified -Does not always converge -Sensitive to initialization
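A minimal sketch of Lloyd's K-means iteration in 1-D; the data and the fixed initial centers are illustrative, which also shows why initialization matters.

```python
def kmeans_1d(points, centers, iters=20):
    """Lloyd's algorithm: assign each point to its nearest center,
    then move each center to the mean of its assigned points."""
    for _ in range(iters):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)), key=lambda i: abs(p - centers[i]))
            groups[nearest].append(p)
        new_centers = [sum(g) / len(g) if g else c
                       for g, c in zip(groups, centers)]
        if new_centers == centers:   # assignments stopped changing
            break
        centers = new_centers
    return sorted(centers)

# K = 2 must be specified up front; the outcome depends on the initial
# centers, which is the sensitivity the slide warns about
data = [1.0, 1.1, 0.9, 8.0, 8.1, 7.9]
centers = kmeans_1d(data, centers=[0.0, 5.0])
```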
- Slide 39
- -Unsupervised Learning - K-means
- Slide 40
- - Unsupervised Learning - Self-organizing maps (SOM) - Neural-network-based method - Originally used as a method for visualizing (embedding) high-dimensional data - Also related to vector quantization - The idea is to map close data points to the same discrete level
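A minimal 1-D SOM sketch under the description above (the best-matching unit and its immediate neighbors are pulled toward each sample); the unit count, learning rate, and data are illustrative assumptions.

```python
import random

def train_som_1d(data, n_units=4, iters=200, lr=0.3):
    """Minimal 1-D self-organizing map: find the best-matching unit (BMU)
    for each sample and pull the BMU and its immediate neighbors toward it,
    so nearby data points end up mapped to the same discrete unit."""
    rng = random.Random(0)
    lo, hi = min(data), max(data)
    # Initialize the units evenly across the data range
    units = [lo + (hi - lo) * (i + 0.5) / n_units for i in range(n_units)]
    for _ in range(iters):
        x = rng.choice(data)
        bmu = min(range(n_units), key=lambda i: abs(units[i] - x))
        for i in range(max(0, bmu - 1), min(n_units, bmu + 2)):
            units[i] += lr * (x - units[i])   # convex step toward the sample
    return units

data = [0.0, 0.1, 0.2, 0.9, 1.0, 1.1]
units = train_som_1d(data)   # units stay inside the data range
```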
- Slide 41
- - Issues - Lack of consistency or representative features (5.3 TP53 + 0.8 PTEN doesn't make sense) - Data structure is missing - Not robust to outliers and noise D'Haeseleer 2005 Nat. Biotechnol. 23(12):1499-501
- Slide 42
- -Model-based clustering methods (Han) http://www.cs.umd.edu/~bhhan/research2.html Pan et al. Genome Biology 2002 3:research0009.1 doi:10.1186/gb-2002-3-2-research0009
- Slide 43
- -Structure-based clustering methods
- Slide 44
- Data Mining is searching for knowledge in data Knowledge mining from databases Knowledge extraction Data/pattern analysis Data dredging Knowledge Discovery in Databases (KDD)
- Slide 45
- The process of discovery Interactive + Iterative Scalable approaches
- Slide 46
- Popular Data Mining Techniques - Clustering: the most dominant technique in use for gene expression analysis in particular and bioinformatics in general; partitions data into groups by similarity. - Classification: the supervised counterpart of clustering; models class membership and can subsequently classify unseen data. - Frequent Pattern Analysis: a method for identifying frequently recurring patterns (structural and transactional). - Temporal/Sequence Analysis: model temporal data (wavelets, FFT, etc.). - Statistical Methods: regression, discriminant analysis.
- Slide 47
- Summary - A good clustering method will produce high-quality clusters with high intra-class similarity and low inter-class similarity. - The quality of a clustering result depends on both the similarity measure used by the method and its implementation. - Other metrics include density, information entropy, statistical variance, and radius/diameter. - The quality of a clustering method is also measured by its ability to discover some or all of the hidden patterns.
- Slide 48
- Recommended Literature 1. Bioinformatics: The Machine Learning Approach by P. Baldi & S. Brunak, 2nd edition, The MIT Press, 2001 2. Data Mining: Concepts and Techniques by J. Han & M. Kamber, Morgan Kaufmann Publishers, 2001 3. Pattern Classification by R. Duda, P. Hart and D. Stork, 2nd edition, John Wiley & Sons, 2001 4. The Elements of Statistical Learning by T. Hastie, R. Tibshirani and J. Friedman, Springer-Verlag, 2001