Heuristic PCA Based Feature Extraction and Its Application to Bioinformatics

Upload: y-h-taguchi

Post on 09-Jun-2015

DESCRIPTION

Presentation at "New Developments of Multivariate Statistical Methodologies - Robust, High Speed, and High-Accuracy", 25th-27th Nov 2014, Tsukuba Univ., Japan: http://www.math.tsukuba.ac.jp/~aoshima-lab/symposium.html The book chapter is here: https://www.researchgate.net/publication/271198208_Heuristic_Principal_Component_Analysis-Based_Unsupervised_Feature_Extraction_and_Its_Application_to_Bioinformatics

TRANSCRIPT

  • 1. Heuristic PCA Based Feature Extraction and Its Application to Bioinformatics. Y-h. Taguchi, Dept. Phys., Chuo Univ.; Y. Murakami, Grad. Sch. Med., Osaka City Univ.; M. Iwadate, Dept. Biol. Sci., Chuo Univ.; H. Umeyama, Dept. Biol. Sci., Chuo Univ.; A. Okamoto, Dept. Sch. Health Sci., Aichi Univ. Edu.

  • 2. 0. Why PCA? PCA = principal component analysis. Motivation: unsupervised feature selection. How PCA?

  • 3. Data: 100 features × 20 samples (Class 1, Class 2). 10 ordered features, each row 11111111110000000000. 90 random features, e.g. 01000000110110011111, 00011110000101011101, ..., 01000011000110101111. How to select the 10 ordered features without classification information?

  • 4. Embedding 100 features into 2D using PCA. [Figure labels: 90 random features, 10 ordered features]

  • 5. PC1 represents discrimination between class 1 and class 2. [Figure labels: Class 1, Class 2, 20 samples]

  • 6. Applying a weak unitary transformation to the space spanned by the 20 samples... [Figure labels: 20 samples, 100 features, Class 1, Class 2, 10 ordered features, 90 random features]

  • 7. The same 2D embedding. Thus we can select 10 features. [Figure labels: 10 ordered features, 90 random features]

  • 8. PC1 weakly represents discrimination between class 1 and class 2. [Figure labels: Class 1, Class 2, 20 samples]

  • 9. Linear discriminant analysis + leave-one-out cross-validation using the 10 ordered features. Confusion matrix (rows: predicted class, columns: true class): predicted 1: 8, 2; predicted 2: 2, 8. Accuracy = Sensitivity = Specificity = 80%. How about real examples?

  • 10. 1. Real example 1: Disease-associated aberrant promoter methylation. Methylation of gene promoters in three autoimmune diseases: SLE, RA, DM. [MZ twins (healthy + sick) + 2 healthy controls] × 5 = 20 samples; × 3 diseases = 60 samples, vs 1000 potential methylation sites.

  • 11. Embedding of 1000 promoters within 20 RA samples into 2D with PCA (PC2 vs PC3). [Figure labels: axes PC2 and PC3; outlier promoters, selected]

  • 12. PC2: RA. [Legend: Male, Female, Sick Twin, Healthy Twin, +: Healthy Control 1, (symbol lost): Healthy Control 2; 20 samples] Twins: Healthy > Sick. Controls: No. The 4th set: No. The reason why unsupervised feature selection is needed.

  • 13. Scatter plots between healthy/RA twins. Red dots = selected promoters. [Axis labels: Healthy twins, RA twins] P
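
The toy example in slides 3 to 5 can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code: the random seed, the column-wise centering, the helper names `make_toy_data` and `embed_features`, and the rule "keep the 10 features with the largest |PC1 score|" are all choices made here for the sketch (the slides pick the outlying features from the 2D embedding).

```python
import numpy as np

def make_toy_data(n_ordered=10, n_random=90, n_per_class=10, seed=0):
    """Build the 100-feature x 20-sample binary matrix described in slide 3.

    The seed and this helper are assumptions made for the sketch.
    """
    rng = np.random.default_rng(seed)
    # Ordered features: 1 for every class-1 sample, 0 for every class-2 sample.
    ordered_row = np.r_[np.ones(n_per_class), np.zeros(n_per_class)]
    ordered = np.tile(ordered_row, (n_ordered, 1))
    # Random features: independent 0/1 values, unrelated to the classes.
    random_part = rng.integers(0, 2, size=(n_random, 2 * n_per_class)).astype(float)
    return np.vstack([ordered, random_part])

def embed_features(X):
    """PCA that embeds features (rows), treating the samples as variables.

    Returns (scores, loadings): scores[i, k] is the PC(k+1) coordinate of
    feature i, and loadings[k] is the PC(k+1) direction across samples.
    Centering each sample column is an assumption of this sketch.
    """
    Xc = X - X.mean(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U * S, Vt

X = make_toy_data()
scores, loadings = embed_features(X)
pc1 = scores[:, 0]

# Unsupervised selection: keep the 10 features farthest from the origin on PC1.
# Knowing that there are exactly 10 is itself an assumption of the toy setup.
selected = np.argsort(-np.abs(pc1))[:10]
n_hits = int(np.sum(selected < 10))  # rows 0..9 are the ordered features
print(f"{n_hits} of the 10 selected features are ordered features")
print("mean PC1 loading, class 1:", loadings[0, :10].mean())
print("mean PC1 loading, class 2:", loadings[0, 10:].mean())
```

No class labels enter the selection step; they are used only afterwards to check how many ordered features were recovered. The selected rows could then be passed to a classifier with leave-one-out cross-validation, as in slide 9.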