pattern aided regression modeling & pattern aided problem solving guozhu dong professor, phd...

49
Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State University Please cite this paper on CPXR: Guozhu Dong & Vahid Taslimitehrani. Pattern-Aided Regression Modeling and Prediction Model Analysis. diction is difficult, even when is about the past.

Upload: juliet-harrington

Post on 22-Dec-2015

217 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Pattern Aided Regression Modeling & Pattern Aided Problem Solving

Guozhu DongProfessor, PhD

Data Mining Research Lab

CSE & Kno.e.sis Center

Wright State University

Please cite this paper on CPXR: Guozhu Dong & Vahid Taslimitehrani. Pattern-Aided Regression Modeling and Prediction Model Analysis.To appear in IEEE TKDE.

Prediction is difficult, even when it is about the past.

Page 2: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Overview Introduction Pattern aided regression modeling: PXR

new regression model type

Contrast pattern aided regression algorithm: CPXR Diverse predictor-response relationships CPXR(Log): Traumatic brain injury (TBI) and heart failure (HF) outcome

prediction Potential applications of the CPXR methodology Other pattern aided problem solving results/apps

Most are contrast pattern aided results Some use patterns only; some use something extra

Concluding remarks

2

Page 3: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Prediction is difficult

Prediction is difficult, especially if it is about the future Nils Bohr, Nobel laureate in Physics Danish Proverb

Those who have knowledge, don't predict. Those who predict, don't have knowledge. Lao Tzu, 6th Century BC Chinese Philosopher

Prediction is difficult, even when it is about the past. Guozhu Dong

Guozhu Dong: Pattern Aided Regression Modeling 3

Page 4: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Preliminaries on prediction using regression Training dataset: {(xi,yi) | 1 <= i <=n}

xi: vector of predictor variables

yi: value of response variable

Regression model evaluation

Pattern Aided Regression ModelingGuozhu Dong 4

LR: Linear regression

Page 5: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Teaser 1: Performance of contrast pattern aided regression (CPXR)

CPXR: highest accuracy in 41 out of 50 datasets Average RMSE reduction (relative to LR) of 42% in 50 datasets, much higher than that of best competing methodCPXR achieved 60+% RMSE reduction in 10 out the 50. CPXR is better than LR in all 50 datasets.

5

LR: Linear regressionGBM: Gradient Boosting (generalization of AdaBoost)

RMSE reduction =[RMSE(LR)-RMSE(M)]

/ RMSE(LR)

Page 6: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Teaser 2: AUC of ROC curves for CPXR(Log) vs other methods on HF and TBI

classification results

Guozhu Dong: Pattern Aided Regression Modeling 6

HF (with Mayo researchers)TBI

Page 7: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

CPXR is good because

PXR can effectively capture diverse predictor-response relationships to find good PXR models

Definition: The data (for given application) contains diverse predictor-response relationships if it contains different subgroups whose best-fit local models are highly different [Dong+TaslimiteheraniTKDE15]

Diverse predictor-response relationships are the main reason why best state-of-the-art regression methods perform often poorly

Guozhu Dong: Pattern Aided Regression Modeling 7

Page 8: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

PreliminariesPatterns & contrast patterns

Pattern: condition describing set of objects EG: age <= 35 & rank = full professor It describes all full professors with age <=35. A pattern describes a region of data in a low dimensional

subspace, of high dimensional data

Contrast patterns: conditions that distinguish objects in different classes/conditions

Contrast patterns are useful: They are strongly associated with issues of importance

8

Page 9: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Contrast patterns thru example

9

• CP: A1=b & A3=e• It matches all C1 objects.• It matches

no C2 objectsIts mds={t1,t2}Generally: A pattern is CP if it matches many more objects in

one class than in other classes (aka emerging patterns)

mds(P): the set of objects matching P.An equivalence class: A set of patterns with same mds (having

same behavior). Pick one as representative.

Page 10: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Why contrast pattern based approaches are successful & have big potentials? Contrasting is meaningful

Contrastingrisk indicatoradvantages in survival/wellbeing Contrasting is built into human (animal) instincts

Focus on contrast patterns focus on important issues High (3—7) dim contrast patterns capture important novel

multi-variable interactions related to goals Opportunity: Humans often use low dimensional CPs (?

brain’s computing power is low & lack of data?) They use independence assumption for high dimension apps WE WANT TO DO BETTER! WE CAN DO BETTER!!!

10

Page 11: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Pattern aided regression model A PXR model is represented by a tuple

Each Pi is a pattern

Each fi is a local regression model for data satisfying Pi,

fd is a default local regression model

Each wi is weight for Pi

The regression function of PM is defined by

11

Page 12: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

A pictorial illustration of a simple PXR model

Different patterns can involve different sets of variables [describing data regions in different subspaces]

Matching datasets of different patterns can overlapGuozhu Dong: Pattern Aided Regression

Modeling 12

Page 13: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

An example PXR model u,v z: predictor variables, y: response variable ((P1, f1, w1), (P2, f2, w2), fd) gives a PXR model

Pattern Aided Regression ModelingGuozhu Dong 13

Page 14: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Discussion PXR is a strict generalization of Piecewise Linear

Regression (PLR) PLR can be viewed as trying to model diverse PR

relationships, but it is limited in modeling capabilities and computing algorithms [and they didn’t see DPR]

Often a PXR model uses few patterns (e.g. 7) Local regression models

model type can be complex or simple we often use simple ones such as linear or piecewise

linear models

Pattern Aided Regression ModelingGuozhu Dong 14

Page 15: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Diversity of predictor-response relationships Different pattern-model pairs emphasize

different sets of variables Different pattern-model pairs use highly

different regression functions Each pattern-model pair captures a highly

distinct kind of behavior Diverse predictor-response relationships

may be neutralized at the global level

Pattern Aided Regression ModelingGuozhu Dong 15

Page 16: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

DPR in TBI (traumatic brain injury)

Pattern Aided Regression ModelingGuozhu Dong 16

Page 17: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

How CPXR builds PXR models

17

Training Data Dfor regression

Local RegressionModels for CPs

PXR model

D = LE U SELE: Large ErrorSE: Small Error

Baseline RegressionModel f0

Mine CPs RepresentativeCPs of LE & SE

Page 18: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

How CPXR builds PXR models: (D,f0) (LE,SE) Ps and fs PXR Starting with training dataset Build a baseline regression model f0 (or use given f0) Split data into LE (large error) and SE (small error),

based on f’s prediction error Mine CPs; Remove some CPs Build corresponding local regression models for

remaining CPs, and select patterns to construct PXR Many technical details, including variable binning,

splitting data, search objectives, baseline model type, local model type …

Pattern Aided Regression ModelingGuozhu Dong 18

Control overritting: don’t use P if |mds(P)| <= #vars

Page 19: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Summary of empirical results CPXR is highly accurate for building regression

models Outperforms other regression methods, often by big

margins• On accuracy• On overfitting• On. sensitivity to noise

Exp says: Diverse predictor-response relationships occur often in real life, for data with >=3 dimensions

We used 50 real datasets, and 20+ synthetic onesPattern Aided Regression Modeling

Guozhu Dong 19

Page 20: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Previous regression methods Linear regression (LR): uses a linear function Piecewise linear regression (PLR): splits one

variable into intervals, uses a different linear function for each interval

Support vector regression (SVR): SVM like, but minimizing prediction error

Bayesian additive regression trees (BART): ensemble of (hundreds of) decision trees

Neural networks, Gradient Boosting …

Pattern Aided Regression ModelingGuozhu Dong 20

Interpretability is low

Page 21: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Experiments used 50 real datasets used in previous regression studies

Pattern Aided Regression ModelingGuozhu Dong 21

6 example datasets

Page 22: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

CPXR achieved large RMSE reduction (accuracy improvement) consistently

CPXR: highest accuracy in 41 out of 50 datasets (4 competitors)

Average RMSE reduction (relative to LR) of 42% in 50 datasets, much higher than that of best competing methodCPXR achieved 60+% RMSE reduction in 10 out the 50. CPXR is better than LR in all 50 datasets.

22

GBM: Gradient Boosting (generalization of AdaBoost) We also tried other competitors but they are not competitiveCPXR is not better than other methods on random data

RMSE(LR)-RMSE(M)-----------------------------

RMSE(LR)

Page 23: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Box plot of RMSE reduction

Pattern Aided Regression ModelingGuozhu Dong 23

CPXR(LP)’s median > Q3 of PLR,LR,BART

CPXR(LP)’s Q1 > median of PLR,LR,BART

CPXR(LL) is a variant of CPXR

Page 24: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Evaluation on overfitting CPXR is more accurate than PLR, LR and BART on testing

data; its model complexity is fairly low. CPXR has smaller relative accuracy drop (from training to

test)

Pattern Aided Regression ModelingGuozhu Dong 24

Page 25: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Evaluation on sensitivity to noise

Pattern Aided Regression ModelingGuozhu Dong 25

Build PXR on clean training data

Compare accuracies on •training data•noise-added test data

Page 26: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

CPXR’s outperformance vs degree of diversity of PR-Relationships

Pattern Aided Regression ModelingGuozhu Dong 26

PIP: Positive impact pattern Diff = ratio of largest coefficient local model making big improvement of pairs of local models

High: CPXR has large RMSE reduction; Low: CPXR has low RMSE reduction

Page 27: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Diverse Predictor-Response Relationships May Neutralize Each Other at High Level

Example: We considered a set S1 of 4 variables, S2=S1 + 2 variables, on soil water content data

For LR, S2 does not give improvement over S1 Ditto for PLR, SVR, BART

For CPXR, PXR model on S2 gives 20% RMSE improvement over PXR model on S1 The new variables are involved in most of the diverse PR

relationships (the patterns in the PXR model) These relationships somehow cancelled each other’s

effect, at the whole data set level. missed by LR etc.

Pattern Aided Regression ModelingGuozhu Dong 27

Page 28: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Other apps of CPXR, besides building accurate prediction models

Analysis on a given prediction model On what kinds of data it make large prediction errors How to correct those prediction errors

Do important models in science and medicine have systematic mistakes?

Analysis on comparing two given prediction models, w.r.t. their differences

Discovering policy errors, niche opportunities, … Discovering true importance of variables (medicine) Discovering intricate multi-variable interactions ……

Pattern Aided Regression ModelingGuozhu Dong 28

Page 29: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

CPXR for logistic regression modeling and results on outcome prediction for HF/TBI The PXR-CPXR approach is not limited to linear

regression. We adapted it for logistic regression to get CPXR(Log) We used CPXR(Log) for outcome prediction for

traumatic brain injury patients & heart failure patients. CPXR(Log) is much more accurate than standard

logistic regression and SVM CPXR(Log) also identifies important variables that are

considered unimportant by standard logistic regression

Guozhu Dong: Pattern Aided Regression Modeling 29

Page 30: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Results on TBI

Guozhu Dong: Pattern Aided Regression Modeling 30

Page 31: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

AUC of ROC curves for CPXR(Log) and other methods (on HF and TBI)

Guozhu Dong: Pattern Aided Regression Modeling 31

HF TBI

Page 32: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Details on CPXR(Log) Model for TBI

Guozhu Dong: Pattern Aided Regression Modeling 32

Page 33: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Guozhu Dong: Pattern Aided Regression Modeling 33

Page 34: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Work on Heart Failure Patient Risk Prediction

Vahid Taslimitehrani, Guozhu Dong, Naveen L. Pereira, Maryam Panahiazar, Jyotishman Pathak: Developing an EHR-driven Heart Failure Risk Prediction Model using CPXR(Log). Submitted to journal

Mayo’s EHR Data: Patient’s demographic data -- age, gender, race and ethnicity. Lab results -- cholesterol, sodium, hemoglobin and lymphocytes, and EF. Medications -- Angiotensin Converting Enzyme (ACE) inhibitors,

Angiotensin Receptor Blockers (ARBs), β-adrenoceptor antagonists (β-blockers), Statins, and Calcium Channel Blocker (CCB).

26 major chronic conditions (co-morbidities). Many variables; many are important/needed

Guozhu Dong: Pattern Aided Regression Modeling 34

Page 35: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Finished CPXR-based projects/papersupto March 2015 Vahid Taslimitehrani, Guozhu Dong, Naveen L. Pereira, Maryam Panahiazar,

Jyotishman Pathak: Developing an EHR-driven Heart Failure Risk Prediction Model using CPXR(Log). Submitted to journal

Behzad Ghanbarian, Vahid Taslimitehrani, Guozhu Dong, Yakov A. Pachepsky. Measurement Scale Effect on Prediction of Soil Water Retention

Curve and Saturated Hydraulic Conductivity. Submitted.    Vahid Taslimitehrani, Guozhu Dong. A New CPXR Based Logistic

Regression Method and Clinical Prognostic Modeling Results Using the Method on Traumatic Brain Injury. In Proceedings of IEEE International Conference on BioInformatics and BioEngineering (BIBE) 2014

Building accurate loan default risk models for a private company. 2014-2015. Looking for collaborators

Guozhu Dong: Pattern Aided Regression Modeling 35

Page 36: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Summary of strength of PXR/CPXR

More accurate than state-of-the-art methods Philosophy: Using patterns to identify data groups

where given model makes large prediction errors that can be corrected systematically

Using different pattern-model pairs to model diverse predictor-response variable relationships

The approach is better suited to high dimensional data than other methods

Offering insights to mistakes in business strategies …

Guozhu Dong: Pattern Aided Regression Modeling 36

Page 37: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

I have focused on PXR/CPXR. I will now discuss other pattern aided problem solving methods and applications

Other methods: Contrast pattern aided clustering Contrast pattern based classification and improvement of

traditional methods Contrast pattern aided gene ranking for complex diseases Contrast pattern aided outlier detection

Applications: Diagnosis of diseases, study of complex diseases, blog analysis, compound selection for drug design, crime environ analysis, apartment rental price prediction, activity recognition, …

37

Page 38: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

38

We published a book on CDM in 2012; 3 out of 6 parts on applications

1.Preliminaries and Measures on Contrasts2.Contrast Mining Algorithms3.Mining Generalized Contrasts4.Contrast Mining for Classification & Clustering5.Contrast Mining for Bioinformatics & Chemoinformatics6.Contrast Mining for Special Application Domains44 contributing authors, from ~dozen countries; not comprehensiveMethods used by many scientists.

Page 39: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

39

My recent results in this area1.CAEP-style classification: discriminative power aggregation of emerging patterns 2.Outlier detection / intrusion detection: almost model free; using discriminative pattern length3.Clustering quality evaluation using patterns (quality, abundance, diversity): no distance function4.CP based clustering and cluster description: no distance func needed5.Interaction based gene/SNP ranking for complex diseases6.Contrast pattern aided regression: Effectively handling diverse predictor-response relationships

Page 40: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Key Challenges for Pattern Aided Problem Solving

For each problem to solve, our general approach is to use a selected pattern set to help reach our goals.

Q: (1) What kinds of pattern sets? (2) How to use the patterns in the set? (3) How to efficiently search for desired pattern sets?

We need effective techniques There are millions of (contrast) patterns The search space is huge

Pattern Aided Regression ModelingGuozhu Dong 40

Using CPs more efficient

Page 41: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

CPCQ Clustering Quality Index•CPCQ Rationale: A high-quality clustering, capturing natural

concepts in data, should have many diversified high-quality contrast

patterns (CPs) contrasting its clusters.

•A CP characterizes its home cluster and discriminates its home

cluster against other clusters.

•Home cluster of a CP: the cluster where it has highest

frequency among all clusters

•Think of a cluster as a class.

Page 42: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

CPC AlgorithmContrast Pattern Based Clustering – aimed to maximize CPCQ

C2

tuples

1. Select Seed CPs

S1

S2

items .....

S1

S2

2. Assign Patterns to CP Group G1 of Clusters, Using MPQ

3. Assign Patterns as CPs of Clusters, Using Tuple Overlap

4. Assign Tuples to Clusters, Using Tuple Overlap

S1

S2

C1

PS(C1)

PS(C2)

Page 43: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Clustering data using CPC into groups, each having succinct informative group descriptions

43

• EG: Given a collection of texts/blogs (collected at ASU). • We cluster the blogs into four groups

• each group is associated with a small set of patterns • the patterns clearly indicate what the groups are about.

Page 44: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

A special feature is ECP (an adaptation of CAEP)’s ability to accurately classify molecules on the basis of very small training sets containing only a few (e.g. 3 per class) compounds.

This feature is highly relevant for virtual compound screening when very few experimental hits are available as templates.

Reference: Jens Auer et al. Simulation of sequential screening experiments using emerging chemical patterns. Medicinal Chemistry, 4(1):80–90, 2008. [from its abstract]

44

CAEP: Semi-supervised chemical compound screening with few training samples

Page 45: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

IBIG vs IG & FC for Colon Cancer

45

Traditional generanking methods:

Fold change & entropy:rank genes by considering impact of one gene at a time.

Are they suitable for complex diseases?

Genes lowly ranked by FC could be highly ranked by IBIG

Page 46: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Quote from our Contrast Data Mining book

… the most important contribution of contrast mining will come when we no longer need … simplifying approaches to handle … challenge of high dimensional data, when we have developed the methodology to systematically analyze, and accurately use, sets of multi-feature contrast patterns …. … contrast mining has made useful progress …. Success … will have a large impact on … handling of intrinsically complex processes, such as complex diseases whose behaviors are influenced by the interaction of multiple … factors.

46

Page 47: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Potentials (1): Develop New Pattern Aided Methods

Selecting set of patterns can help solve existing challenging problem in much better way Identify such problems, work on them …

Using sets of patterns can lead to systematic ways of handling multiple multi-variable interactions

Contrast mining, pattern aided problem solving, & pattern aided data analytics, have potential to help effectively handle challenges of high dimensional data

47

Page 48: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

Potentials (2): Use Current Pattern Aided Methods to Solve Problems

Our developed methods can help perform regression more accurately, characterize & correct errors of given models, for vital applications in science, medicine, & economics Improving scientific models Changing what we believe in?

Our developed clustering, classification, outlier detection, gene ranking, multiple multi-variable interaction mining/selection methods can help solve challenging problems and offer new insights

48

Page 49: Pattern Aided Regression Modeling & Pattern Aided Problem Solving Guozhu Dong Professor, PhD Data Mining Research Lab CSE & Kno.e.sis Center Wright State

April 19, 2023 49

www.cs.wright.edu/~gdong

QuestionsQuestions

Next step – wish to find collaborators To work on high impact prediction modeling problems