23/4/21 Yan-Qing Zhang, Georgia State University 1
Yanqing Zhang
Department of Computer Science
Georgia State University
Atlanta, GA 30302-5060
Fuzzy Machine Learning Methods for Biomedical Data Analysis
Outline
• Background
• Fuzzy Association Rule Mining for Decision Support (FARM-DS)
• FARM-DS on Medical Data
• FARM-DS on Microarray Expression Data
• Fuzzy-Granular Gene Selection on Microarray Expression Data
• Conclusion and Future Work
Background
• Theory
– Computational Intelligence, Granular Computing, Fuzzy Sets
– Knowledge Discovery and Data Mining (KDD)
– Decision Support System (DS)
– Rule-Based Reasoning (RBR), Association Rule Mining
• Application
– Bioinformatics, Medical Informatics, etc.
• Concern
– Accuracy
– Interpretability
Motivation – deal with numeric data
• Fuzzy Logic (Zadeh, 1965)
– Feature transform
– Fuzzy AR mining
• Traditional association rule mining algorithm
– If X, then Y
– Conf = Pr(Y|X), Supp = Pr(X and Y)
– does not work on numeric data
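The two measures above can be illustrated on a handful of discrete transactions. This is a minimal sketch: the function names and the toy transactions are hypothetical and not taken from the slides.

```python
def support(transactions, itemset):
    """Fraction of transactions containing every item in itemset: Pr(X and Y)."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Pr(Y | X), computed as Supp(X and Y) / Supp(X)."""
    supp_x = support(transactions, antecedent)
    if supp_x == 0:
        return 0.0  # the rule never fires, so report zero confidence
    return support(transactions, set(antecedent) | set(consequent)) / supp_x


# Hypothetical toy transactions, for illustration only.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
supp = support(transactions, {"bread", "milk"})
conf = confidence(transactions, {"bread"}, {"milk"})
```

Both functions test set membership only, which is why numeric features have to be turned into discrete items before a rule miner of this kind can use them.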
Motivation – decision support
• FARs for classification
– Accuracy vs. Interpretability
• Very few works
– Hu et al. 2002
• Combinatorial rule explosion
– Chatterjee et al. 2004
• Human intervention
FARM-DS
• Target
– Numeric data
– Binary classification
• Effectiveness
– Accuracy
– Interpretability
• Modeling process
– Training
– Testing
Step 1: Fuzzy Interval Partition
• 1-in-1-out 0-order TSK model
• ANFIS for model optimization and parameter selection (Jang, 1993)
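A 1-in-1-out 0-order TSK model maps one feature value to a weighted average of constant rule outputs. The sketch below assumes two Gaussian membership functions ("low" and "high") with hand-picked parameters. The slides state neither the shape nor the number of membership functions, and in FARM-DS the parameters would be tuned by ANFIS rather than fixed by hand.

```python
import math


def gaussian(x, center, sigma):
    """Gaussian membership function (an assumed shape, not stated in the slides)."""
    return math.exp(-0.5 * ((x - center) / sigma) ** 2)


def tsk0_predict(x, rules):
    """0-order TSK output: firing-strength-weighted average of constant consequents.

    rules: list of (center, sigma, constant) tuples, one per fuzzy interval.
    """
    weights = [gaussian(x, c, s) for c, s, _ in rules]
    total = sum(weights)
    return sum(w * const for w, (_, _, const) in zip(weights, rules)) / total


# Hypothetical parameters for one feature.
rules = [
    (0.0, 1.0, -1.0),  # "low" interval, constant output -1
    (4.0, 1.0, +1.0),  # "high" interval, constant output +1
]
y_low = tsk0_predict(0.0, rules)
y_mid = tsk0_predict(2.0, rules)
y_high = tsk0_predict(4.0, rules)
```

With these parameters the output moves from about -1 near the "low" center to about +1 near the "high" center, and it is 0 where the two memberships are equal.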
Step 2: Data Abstraction
• Clustering
– K-Means
– Fuzzy C-means
• Validation
– #clusters
– Optimal cluster
– Silhouette Value
$$S(i) = \frac{\min_k b(i,k) - a(i)}{\max\left(a(i), \min_k b(i,k)\right)}$$
[Figure: positive cluster and negative cluster]
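The silhouette value can be computed directly from pairwise distances. The sketch below uses the usual definitions, which the slide does not spell out: a(i) is the mean distance from sample i to the other members of its own cluster, and b(i,k) is the mean distance from sample i to the members of another cluster k. Euclidean distance and the toy points are assumptions.

```python
import math


def silhouette(i, points, labels):
    """Silhouette value S(i) = (min_k b(i,k) - a(i)) / max(a(i), min_k b(i,k))."""
    own = labels[i]
    dists = {}
    for j, (p, lab) in enumerate(zip(points, labels)):
        if j != i:
            dists.setdefault(lab, []).append(math.dist(points[i], p))
    if own not in dists or len(dists) < 2:
        return 0.0  # singleton cluster or a single cluster: use 0 by convention
    a = sum(dists[own]) / len(dists[own])
    b = min(sum(d) / len(d) for lab, d in dists.items() if lab != own)
    denom = max(a, b)
    return (b - a) / denom if denom else 0.0


# Hypothetical toy data: two well-separated clusters.
points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 6.0)]
labels = [1, 1, -1, -1]
scores = [silhouette(i, points, labels) for i in range(len(points))]
mean_score = sum(scores) / len(scores)
```

A mean silhouette close to 1 indicates compact, well-separated clusters, so comparing `mean_score` across candidate cluster counts is one way to pick the number of clusters.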
Step 3: Generating Fuzzy Discrete Transactions
• Project the center of each cluster on each feature
• Create transactions
– With positive cluster, +1 is inserted
– With negative cluster, -1 is inserted
– If $\mu_{high}(f_i) < \mu_{low}(f_i)$, then $f_i$ is inserted into the transactions with the form of "i.0"
– If $\mu_{low}(f_i) < \mu_{high}(f_i)$, then $f_i$ is inserted into the transactions with the form of "i.1"
– If $\mu_{low}(f_i) = \mu_{high}(f_i)$, then $f_i$ is not inserted into the transactions
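The transaction-building step can be sketched as follows. The sketch assumes that ".0" marks a feature whose projected cluster center is more "low" than "high" and ".1" the opposite, and that a feature with equal memberships is skipped. The linear membership functions are hypothetical stand-ins for the fuzzy intervals from Step 1.

```python
def make_transaction(center, label, memberships):
    """Build one fuzzy discrete transaction from a cluster center.

    center: feature values of the cluster center
    label: +1 for a positive cluster, -1 for a negative cluster
    memberships: one (mu_low, mu_high) pair of callables per feature
    """
    items = [str(label)]
    for i, (value, (mu_low, mu_high)) in enumerate(zip(center, memberships), start=1):
        low, high = mu_low(value), mu_high(value)
        if low > high:
            items.append(f"{i}.0")  # feature i is "low"
        elif high > low:
            items.append(f"{i}.1")  # feature i is "high"
        # equal memberships: feature i is left out of the transaction
    return items


def mu_low(x):
    """Hypothetical linear membership on [0, 1]."""
    return 1.0 - x


def mu_high(x):
    """Hypothetical linear membership on [0, 1]."""
    return x


memberships = [(mu_low, mu_high)] * 3
tx = make_transaction([0.2, 0.9, 0.5], +1, memberships)
```

Because one transaction is produced per cluster center, the number of transactions grows with the number of clusters rather than with the number of samples.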
Step 3 - example
• 5-2 = 3 transactions
– 1 f1_1
– 1 f1_1
– 1 f1_1
[Figure: axes f1 and f2]
• Avoid combinatorial rule explosion
– The number of different transactions is decided by the number of clusters
Step 4: Association Rule Mining
• Association Rule Mining on fuzzy discrete transactions
– Traditional Apriori algorithm (Agrawal and Srikant 1994)
If f1 is low, f2 is high, …, fh is low, then y=1/-1
• Rule pruning:
– For a pair of rules A and B, if B is more specific than A (that means A is included by B), and B has the same support value as A, A is eliminated.
A: If f1 is low, then y=1, sup=50%
B: If f1 is low and f2 is high, then y=1, sup=50%
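The pruning rule can be sketched on the A/B pair above. The dictionary layout for a rule is hypothetical, and the sketch additionally assumes that two rules are only compared when they predict the same class.

```python
def prune_rules(rules):
    """Drop rule A when a more specific rule B has the same support.

    "More specific" means A's antecedent is a proper subset of B's.
    """
    kept = []
    for a in rules:
        redundant = any(
            a is not b
            and a["label"] == b["label"]
            and a["antecedent"] < b["antecedent"]  # proper subset test
            and a["support"] == b["support"]
            for b in rules
        )
        if not redundant:
            kept.append(a)
    return kept


# Rules A and B from the example above.
rules = [
    {"antecedent": frozenset({"f1=low"}), "label": 1, "support": 0.5},
    {"antecedent": frozenset({"f1=low", "f2=high"}), "label": 1, "support": 0.5},
]
pruned = prune_rules(rules)
```

Only rule B survives here. If supports were computed as floating-point ratios from different counts, comparing raw counts instead of ratios would be a safer equality test.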
Testing Phase
Adaptive FARM-DS
• Train
1. Fuzzy interval partition
2. Data abstraction
3. Generate fuzzy discrete transactions
4. AR mining
• Test
He, et al. 2006a, IJDMB
Empirical Studies
• Classification algorithms
– C4.5 decision trees (Quinlan, 1993)
– Support vector machines (Vapnik, 1995)
– FARM-DS (He, et al. 2006a, IJDMB)
• Accuracy Estimation
– 5-fold cross validation
• Interpretability
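The 5-fold estimate relies on splitting the samples into five disjoint test folds. A minimal sketch of such a split is shown below; the function name, the shuffling seed, and the sample count are hypothetical, and the slides do not say whether the folds were stratified.

```python
import random


def k_fold_indices(n_samples, k=5, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]  # k nearly equal folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


# Hypothetical dataset size, for illustration only.
splits = list(k_fold_indices(23, k=5))
```

Each sample lands in exactly one test fold, so averaging the five test errors uses every sample once for testing.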
Evaluation metrics
Bradley, 1997
• Accuracy
– Classification Error
– Area under ROC curve (future work)
• Interpretability
– Rule numbers
– Average rule lengths
Datasets
Merz, et al. UCI repository of machine learning databases, 1998
Result analysis on Accuracy
• FARM-DS ≈ SVM > C4.5
– SVM2 and C4.5 results from (Bennett et al. 1997)
Result analysis on Interpretability
• SVM, high accuracy, hard to interpret
• C4.5, low accuracy, easy to interpret
• FARM-DS, high accuracy, easy to interpret
Interpretability (1)
• FARs extracted by FARM-DS are short and compact, and hence easy to understand.
– 22 positive rules and 8 negative rules are extracted.
– On average, the length of a positive rule is 2.6, the length of a negative rule is 4.3, and every sample activates 3.3 positive rules and 5.6 negative rules.
Interpretability (2)
• FARs may help human experts to correct the wrongly classified samples.
Interpretability (3)
• The larger support of the negative rules may help human experts to make final correct decisions and find inherent disease-resulting mechanisms.
Interpretability (4)
• FARs are helpful to select important features.
– A higher activation frequency means a more important feature
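Activation frequency can be counted as follows. The sketch assumes that a rule is activated by a sample when every condition in the rule's antecedent matches the sample's discretized feature levels; the slides do not define activation precisely, and the data layout is hypothetical.

```python
from collections import Counter


def activation_frequency(samples, rules):
    """Count how often each feature takes part in an activated rule.

    samples: list of dicts mapping feature name to "low" or "high"
    rules: list of dicts mapping feature name to the required level
    """
    counts = Counter()
    for sample in samples:
        for rule in rules:
            if all(sample.get(f) == level for f, level in rule.items()):
                counts.update(rule.keys())
    return counts


# Hypothetical toy samples and rule antecedents.
samples = [
    {"f1": "low", "f2": "high"},
    {"f1": "low", "f2": "low"},
    {"f1": "high", "f2": "high"},
]
rules = [{"f1": "low"}, {"f1": "low", "f2": "high"}]
freq = activation_frequency(samples, rules)
ranking = freq.most_common()
```

Sorting features by this count gives a simple importance ranking; here f1 is activated three times and f2 once.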
Microarray Expression Data
• Extremely high dimensionality
• Gene selection
• Cancer classification
• Rule-based reasoning
Empirical Studies
• Rule-Based Reasoning/Classification
– CART for decision trees modeling (Breiman, et al. 1984)
– ANFIS for fuzzy neural networks modeling (Jang, 1993)
– FARM-DS (He, et al. 2006a, IJDMB)
Evaluation metrics
Bradley, 1997
• Accuracy
– Classification Error
– Area under ROC curve
– Accuracy Estimation: leave-one-out cross validation
• Interpretability
– Rule numbers
– Average rule lengths
AML/ALL leukemia dataset
Tang, et al. 2006
Result analysis: AML/ALL leukemia dataset
• Higher accuracy than CART
• Easier to interpret than ANFIS
Rules extracted by FARM-DS: AML/ALL leukemia dataset
• IF
– gene2 (Y12670),
– gene3 (D14659) and
– gene5 (M80254) are down-regulated,
• THEN the tissue is ALL (-1)
Prostate cancer dataset
Tang, et al. 2006
Result analysis: prostate cancer dataset
• Higher accuracy than CART
• Easier to interpret than ANFIS
Rules extracted by FARM-DS: prostate cancer dataset
Gene Selection and Cancer Classification on Microarray Expression Data
• Extremely high dimensionality
– AML/ALL leukemia dataset: 72 samples × 7129 genes
– no more than 10% relevant genes (Golub, et al. 1999)
• Gene selection
– accurate classification
– helpful for cancer study
Gene Categorization and Gene Ranking
• Informative genes
• Redundant genes
• Irrelevant genes
• Noisy genes
Information Loss
• Noise
– Overfitting themselves
– Complementary to redundant/irrelevant genes
– Conflict with informative genes
• Imbalanced gene selection
• Inflexibility
How to decrease information loss?
Granulation!
Coarse Granulation with Relevance Indexes
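[Equations: two relevance indexes, each of the form $R_i = 1 - (\cdot)_i^2 / (\cdot)_i^2$]
• Target: remove irrelevant genes
• Target: tune thresholds to select genes in balance
[Figure: two imbalanced selections and one balanced selection]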
Fine Granulation with Fuzzy C-Means Clustering
• clustering in the training samples space
• genes with similar expression patterns have similar functions
• a gene may have multiple functions (Fuzzy works here!)
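The fuzzy part of Fuzzy C-Means is the membership update, which gives every gene a degree of membership in every cluster instead of a single hard assignment. The sketch below shows only that update for one point, using the standard FCM formula with fuzzifier m. Treating a gene as a point whose coordinates are its expression values over the training samples follows the slide; the toy numbers are hypothetical.

```python
import math


def fcm_memberships(point, centers, m=2.0):
    """Fuzzy C-Means membership of one point in each cluster (fuzzifier m > 1)."""
    dists = [math.dist(point, c) for c in centers]
    if any(d == 0 for d in dists):
        # The point coincides with a center: share membership among those centers.
        zeros = [1.0 if d == 0 else 0.0 for d in dists]
        return [z / sum(zeros) for z in zeros]
    power = 2.0 / (m - 1.0)
    return [1.0 / sum((d_i / d_j) ** power for d_j in dists) for d_i in dists]


# Hypothetical gene profile and two cluster centers.
u = fcm_memberships((1.0, 0.0), [(0.0, 0.0), (4.0, 0.0)])
```

The memberships always sum to 1. A gene with a non-negligible membership in several clusters can be kept in each of them, which is one way to model a gene with multiple functions.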
Conquer with correlation-based Ranking
• Lower-ranked genes are removed as redundant genes
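Ranking inside one cluster can be sketched as below. The sketch assumes the absolute Pearson correlation between a gene's expression values and the class labels as the ranking score; the slides do not fix the score at this point, and any per-gene relevance measure could be swapped in. Gene names and values are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def rank_cluster(cluster, labels, keep):
    """Rank genes of one cluster by |correlation| with the labels; keep the top ones."""
    ranked = sorted(cluster, key=lambda g: abs(pearson(cluster[g], labels)), reverse=True)
    return ranked[:keep]


# Hypothetical cluster: gene name -> expression values over four samples.
labels = [1, 1, -1, -1]
cluster = {
    "geneA": [2.0, 2.1, 0.1, 0.0],
    "geneB": [1.0, 0.0, 1.0, 0.0],
    "geneC": [0.2, 0.1, 1.9, 2.0],
}
top = rank_cluster(cluster, labels, keep=2)
```

Genes that fall below the `keep` cutoff within their cluster are the ones dropped as redundant.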
Aggregation with Data Fusion
• Pick genes from different clusters in balance
• An informative gene is more likely to survive (due to fuzzy clustering)
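One way to pick genes "in balance" is a round-robin pass over the per-cluster rankings, sketched below. The round-robin policy is an assumption, not the slides' stated procedure, and the gene names are hypothetical.

```python
def aggregate(ranked_clusters, n_genes):
    """Round-robin pick from per-cluster rankings, skipping genes already chosen."""
    selected = []
    seen = set()
    depth = 0
    max_depth = max((len(r) for r in ranked_clusters), default=0)
    while len(selected) < n_genes and depth < max_depth:
        for ranking in ranked_clusters:
            if depth < len(ranking) and ranking[depth] not in seen:
                seen.add(ranking[depth])
                selected.append(ranking[depth])
                if len(selected) == n_genes:
                    break
        depth += 1
    return selected


# Hypothetical rankings; "g1" belongs to two clusters because clustering is fuzzy.
ranked_clusters = [["g1", "g2", "g3"], ["g4", "g1"], ["g5"]]
final_genes = aggregate(ranked_clusters, n_genes=4)
```

A gene that belongs to several clusters appears in several rankings, so it has more than one chance of being picked.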
Original Gene Set
→ Relevance index-based pre-filtering
→ Relevant Gene Set
→ Fuzzy C-Means Clustering
→ Gene Cluster 1, Gene Cluster 2, …, Gene Cluster K
→ Correlation-based Gene Ranking 1, 2, …, K
→ Final Gene Set
Empirical Study
• Comparison
– Signal to Noise (S2N) (Furey, et al. 2000)
– Fuzzy-Granular + S2N
– Fisher Criterion (FC) (Pavlidis, et al. 2001)
– Fuzzy-Granular + FC
– T-Statistics (TS) (Duan, et al. 2004)
– Fuzzy-Granular + TS
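The three baseline ranking scores can be sketched for a single gene as below. The formulas are the commonly used definitions, which the slides do not spell out; the use of the sample standard deviation and the toy expression values are assumptions.

```python
import math
from statistics import mean, stdev


def s2n(pos, neg):
    """Signal to noise: (mu+ - mu-) / (sigma+ + sigma-)."""
    return (mean(pos) - mean(neg)) / (stdev(pos) + stdev(neg))


def fisher_criterion(pos, neg):
    """Fisher criterion: (mu+ - mu-)^2 / (sigma+^2 + sigma-^2)."""
    return (mean(pos) - mean(neg)) ** 2 / (stdev(pos) ** 2 + stdev(neg) ** 2)


def t_statistic(pos, neg):
    """T-statistic: (mu+ - mu-) / sqrt(sigma+^2 / n+ + sigma-^2 / n-)."""
    return (mean(pos) - mean(neg)) / math.sqrt(
        stdev(pos) ** 2 / len(pos) + stdev(neg) ** 2 / len(neg)
    )


# Hypothetical expression values of one gene in each class.
pos = [4.0, 6.0]
neg = [1.0, 3.0]
scores = (s2n(pos, neg), fisher_criterion(pos, neg), t_statistic(pos, neg))
```

Each score is computed per gene, and genes are then ranked by the absolute value of the score.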
Evaluation Methods
$$\text{accuracy} = \frac{TP + TN}{TP + FP + FN + TN}$$
$$\text{sensitivity} = \frac{TP}{TP + FN}$$
$$\text{specificity} = \frac{TN}{TN + FP}$$
• Metrics
– Accuracy
– Sensitivity
– Specificity
– Area under ROC curve
• Estimation
– Leave-1-out CV
– .632 bootstrapping
.632 Perf = 0.368 * training perf + 0.632 * testing perf
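The metrics and the .632 estimate can be computed as follows. This is a minimal sketch using the standard confusion-matrix definitions of sensitivity and specificity; the function names and the counts are hypothetical.

```python
def confusion_metrics(tp, tn, fp, fn):
    """Accuracy, sensitivity and specificity from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),  # true positive rate
        "specificity": tn / (tn + fp),  # true negative rate
    }


def perf_632(training_perf, testing_perf):
    """.632 bootstrap estimate: 0.368 * training perf + 0.632 * testing perf."""
    return 0.368 * training_perf + 0.632 * testing_perf


# Hypothetical counts and performances, for illustration only.
metrics = confusion_metrics(tp=40, tn=45, fp=5, fn=10)
estimate = perf_632(training_perf=1.0, testing_perf=0.8)
```

The .632 weighting pulls the optimistic training performance toward the performance on samples left out of each bootstrap draw.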
Prostate cancer dataset
Result analysis: prostate cancer dataset
Colon cancer dataset
Result analysis: colon cancer dataset
Conclusion
• High-level data abstraction
– data clustering techniques
• Quantitative data transformed to fuzzy discrete transactions
– Fuzzy interval partition
– Apriori algorithm for AR mining
• Strong decision support for biomedical study
– High accuracy and easy to interpret
• More accurate cancer classification
– Eliminate irrelevant/redundant genes to decrease noise
– Select informative genes in balance
Future Work
• Applying FARM-DS on other biomedical applications
• Integrating more intelligent data analysis techniques.
• Cloud computing based fuzzy data mining algorithms for big data mining
• GPU based fuzzy data mining algorithms for big data mining
References
• [1] Y.C. He, Y.C. Tang, Y.-Q. Zhang and R. Sunderraman, “Mining Fuzzy Association Rules from Microarray Gene Expression Data for Leukemia Classification,” Proc. of International Conference on Granular Computing (GrC-IEEE 2006), Atlanta, pp. 461-465, May 10-12, 2006.
• [2] Y.C. He, Y.C. Tang, Y.-Q. Zhang and R. Sunderraman, “Adaptive Fuzzy Association Rule Mining for Effective Decision Support in Biomedical Applications,” International Journal of Data Mining and Bioinformatics, Vol. 1, No. 1, pp. 3-18, 2006.
• [3] Y.C. He, Y.C. Tang, Y.-Q. Zhang and R. Sunderraman, “Fuzzy-Granular Gene Selection from Microarray Expression Data,” Proc. of DMB2006 in conjunction with IEEE-ICDM2006, Hong Kong, Dec. 18, 2006, (accepted).
• [4] Y.C. He, Y.C. Tang, Y.-Q. Zhang and R. Sunderraman, “Fuzzy-Granular Methods for Identifying Marker Genes from Microarray Expression Data,” Computational Intelligence for Bioinformatics, Gary B. Fogel, David Corne, and Yi Pan (eds.), IEEE Press, 2007.
Acknowledgments
Thanks go to
– Dr. Yuchun Tang
– Dr. Yuanchen He
for their hard work on this research project.
Questions?
Comments?