REDHYTE: TOWARDS A SELF-DIAGNOSING, SELF-CORRECTING, AND HELPFUL ANALYTIC PLATFORM
Isaac TOH Wei Zhong, CHOI Kwok Pui, WONG Limsoon
National University of Singapore
ACIIDS 2016
8th Asian Conference on Intelligent Information and
Database Systems
14-16 March 2016, Da Nang, Vietnam
A BIT ON ME
Data Scientist at Singapore Telecommunications Limited
(Singtel)
Background in Computational Biology / Bioinformatics
Data mining, machine learning, R
Delivery of Advanced Analytics to mostly government
agencies in the area of Defence and Public Safety
OUTLINE
Introduction
Implementation
Conclusion
INTRODUCTION
OVERVIEW
Our objective was to develop a system that complements the analysis of scientific data
In particular, we focus on data exploration and hypothesis testing
BACKGROUND
Data analysis: the process of inspecting, visualizing, and
modeling data so as to derive knowledge and insights
One of the primary activities of the scientist and data analyst is making comparisons, which is formally known as hypothesis testing
CONVENTIONAL DATA ANALYSIS
Conventional data analysis workflows primarily consist of
these steps:
Have a question in mind
Formulate some assertion or hypothesis
Design an appropriate experiment (Experimental design)
Collect and clean relevant data
Test that hypothesis using the collected data and statistical
techniques, in order to decide whether to reject or not reject it
CONVENTIONAL DATA ANALYSIS
Putting together a hypothesis with a statistical test allows
the analyst to make justifiable conclusions from the data
The process starts from the initial question/hypothesis in
mind
CONVENTIONAL DATA ANALYSIS
In short, conventional data analysis (and science) is expert-dependent, requiring ample domain knowledge, intuition and experience
This has been very appropriate and successful in the pre-“Big Data” era
DATA ANALYSIS IN “BIG DATA” CONTEXT
Centralized storage, high quality curation, convenient
retrieval and dissemination (data pipelines)
High throughput assaying (-omics)
Electronic medical and health records (EMR, EHR)
Data is assembled and pulled from wherever we can get our hands on it
Are conventional data analysis approaches (question →
analysis → insights) still viable?
Two phenomena
DATA ANALYSIS IN “BIG DATA” CONTEXT: TWO
PHENOMENA
Collection of data without scientific question and
experimental design a priori
In a traditional cohort or cross-sectional study, subjects are carefully selected
Assumptions of any statistical tests to be used are met
Routine collection of data (using established data pipelines) makes it easy for statistical assumptions to be violated
E.g. the normality assumption of the t-test
DATA ANALYSIS IN “BIG DATA” CONTEXT: TWO
PHENOMENA
“Large p, small n”: having a large number of variables or attributes in datasets
Formulating a hypothesis concerning a small number of attributes, and testing it in a large dataset while ignoring the other attributes, is
Wasteful
Flawed
AN EXAMPLE
Suppose some dataset with 100 attributes is lying around in the repository, and we are interested in the relationship between a small number of them (guided by domain knowledge)
Statistical test or correlation
Which tests? Or use correlation? Look at the types of the attributes:
Both numerical: correlation
One numerical, one categorical: t-test or ANOVA
Both categorical: χ2-test
Perhaps their non-parametric equivalents?
Get a p-value, and make a conclusion about the initial domain-knowledge-driven question
But is this the end of the story?
WHAT COULD GO WRONG?
Violations of statistical test assumptions
Especially so for datasets collected without prior scientific
questions (vs. traditional scientific studies)
Correlation and statistical tests consider the two attributes “in a vacuum”, ignoring the effects of other attributes
Confounding: Simpson’s Paradox (an omitted “third variable”)
SIMPSON’S PARADOX
Classic example (kidney stones):

All patients:
                Success        Failure   Total
Treatment A     273 (78.0%)    77        350
Treatment B     289 (82.6%)    61        350
Total           562            138       700

Small stones:
                Success        Failure   Total
Treatment A     81 (93.1%)     6         87
Treatment B     234 (86.7%)    36        270
Total           315            42        357

Large stones:
                Success        Failure   Total
Treatment A     192 (73.0%)    71        263
Treatment B     55 (68.8%)     25        80
Total           247            96        343
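The reversal can be verified mechanically. A minimal check in Python, using the counts from the table above:

```python
# The kidney-stone table as (successes, total) per treatment, per stratum.
strata = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    return successes / total

# Aggregated over strata, treatment B appears more successful...
agg = {
    t: (sum(s[t][0] for s in strata.values()),
        sum(s[t][1] for s in strata.values()))
    for t in ("A", "B")
}
aggregate_reversed = rate(*agg["A"]) < rate(*agg["B"])

# ...but within every stratum, treatment A is more successful: Simpson's Paradox.
a_wins_each_stratum = all(
    rate(*s["A"]) > rate(*s["B"]) for s in strata.values()
)
```

Both flags come out true: B wins in aggregate (78.0% vs 82.6%), yet A wins in each stratum.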
INFAMOUS UC BERKELEY EXAMPLE
All departments:
          Admitted        Rejected        Total
Males     1198 (44.5%)    1493 (55.5%)    2691
Females   557 (30.4%)     1278 (69.6%)    1835
Total     1755            2771            4526

Dept A:
          Admitted       Rejected       Total
Males     512 (62.1%)    313 (37.9%)    825
Females   89 (82.4%)     19 (17.6%)     108
Total     601            332            933
WHAT COULD BE IMPROVED?
Test diagnostics: checking of tests assumptions
WHAT COULD BE IMPROVED?
Test diagnostics: checking of tests assumptions
Consider the other attributes in the large dataset
To account for confounding
To incorporate information from other attributes into the hypothesis:
trend amplification for a certain stratum/class of a third attribute
TREND AMPLIFICATION
All workers:
                 High income   Low income    Total
Administrative   507 (13%)     3263 (87%)    3770
Craftsmen        929 (23%)     3170 (77%)    4099
Total            1436          6433          7869

Some college education:
                 High income   Low income    Total
Administrative   142 (11%)     1139 (89%)    1281
Craftsmen        241 (28%)     627 (72%)     868
Total            383           1766          2149
WHAT COULD BE IMPROVED?
Test diagnostics: checking of tests assumptions
Consider the other attributes in the large dataset
Simpson’s Paradox, trend amplification
Hypothesis analysis: looking deeper into hypotheses
HYPOTHESIS ANALYSIS
Vaccine   Had flu   Avoided flu   Total
A         43        237           280
B         52        198           250
C         25        245           270
D         48        212           260
E         57        233           290
Total     225       1125          1350

H0: all vaccines are equally effective
Use a χ2-test
HYPOTHESIS ANALYSIS (2)
χ2 = 13.803 + 2.761 = 16.564, d.f. = 4
p-value < 0.05

Vaccine   Had flu (E)   (O−E)²/E   Avoided flu (E)   (O−E)²/E
A         43 (46.7)     0.293      237 (233.3)       0.059
B         52 (41.7)     2.544      198 (208.3)       0.509
C         25 (45.0)     8.889      245 (225.0)       1.778
D         48 (43.3)     0.510      212 (216.7)       0.102
E         57 (48.3)     1.567      233 (241.7)       0.313
Total     225           13.803     1125              2.761
HYPOTHESIS ANALYSIS (3)
Vaccine C contributes 64.4% of the χ2 test statistic
Clearly there is something special with vaccine C
Vaccine   Had flu (E)   (O−E)²/E   Avoided flu (E)   (O−E)²/E
A         43 (46.7)     0.293      237 (233.3)       0.059
B         52 (41.7)     2.544      198 (208.3)       0.509
C         25 (45.0)     8.889      245 (225.0)       1.778
D         48 (43.3)     0.510      212 (216.7)       0.102
E         57 (48.3)     1.567      233 (241.7)       0.313
Total     225           13.803     1125              2.761
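The per-cell contributions can be recomputed directly from the observed counts. With unrounded expected counts the total comes to roughly 16.56 (the slide's 16.564 reflects rounded expected values), and vaccine C's share is about 64%:

```python
# Per-cell chi-square contributions for the vaccine table above.
observed = {
    "A": (43, 237), "B": (52, 198), "C": (25, 245),
    "D": (48, 212), "E": (57, 233),
}
n = sum(f + a for f, a in observed.values())      # 1350 subjects
col = [sum(f for f, _ in observed.values()),      # 225 had flu
       sum(a for _, a in observed.values())]      # 1125 avoided flu

contrib = {}
for vaccine, (flu, avoided) in observed.items():
    row = flu + avoided
    expected = [row * col[0] / n, row * col[1] / n]
    contrib[vaccine] = sum((o - e) ** 2 / e
                           for o, e in zip((flu, avoided), expected))

chi2 = sum(contrib.values())        # total chi-square statistic, 4 d.f.
share_C = contrib["C"] / chi2       # vaccine C's share of the statistic
```

Inspecting `contrib` immediately singles out vaccine C as the dominant cell, which is the point of the decomposition.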
HYPOTHESIS ANALYSIS (4)
Vaccine   Had flu   Avoided flu   Total
A         43        237           280
B         52        198           250
D         48        212           260
E         57        233           290
Total     200       880           1080

Without vaccine C, χ2 = 2.983, d.f. = 3
p-value > 0.1, not significant
HYPOTHESIS ANALYSIS (5)
Vaccine      Had flu   Avoided flu   Total
C            25        245           270
A, B, D, E   200       880           1080
Total        225       1125          1350

χ2 = 12.7, d.f. = 1
p-value < 0.001, significant
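The 12.7 above appears to include Yates' continuity correction (the uncorrected 2×2 statistic would be about 13.3). A quick check on the collapsed table:

```python
# Collapsed 2x2 table: vaccine C vs. the rest, with Yates' continuity
# correction, which reproduces the slide's chi-square of about 12.7.
table = [[25, 245],     # C:          had flu, avoided flu
         [200, 880]]    # A, B, D, E: had flu, avoided flu

n = sum(map(sum, table))
row = [sum(r) for r in table]
col = [sum(c) for c in zip(*table)]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        e = row[i] * col[j] / n
        chi2 += (abs(table[i][j] - e) - 0.5) ** 2 / e  # Yates correction
# chi2 works out to 380.25 / 30 = 12.675, on 1 d.f.
```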
PROBLEMS AND IMPROVEMENTS
Violation of statistical test assumptions
Consider the other attributes in the large dataset
Simpson’s Paradox, trend amplification
Hypothesis analysis
These are potential problems and improvements that are
often
Undiscovered
Discovered by chance
Most importantly, not addressed in unison
BUT HOW?
How do we sieve these out all at once? Do we
a) Continue to rely on domain knowledge?
b) Iteratively go through the other attributes (and their
stratifications) and scrutinize them? Or,
c) Rely on some form of algorithm and automation to sieve
them out automatically, and then fall back again on
domain knowledge to scrutinize the output?
DATA MINING
Data mining is a well-established class of techniques
commonly used to search for interesting and global
relationships in large datasets
Classification
Clustering
Frequent pattern mining
Data mining techniques can account for large number of
attributes at once
DATA MINING VS. HYPOTHESIS TESTING
The problem with data mining is that it does not contribute to the fundamental endeavor of making comparisons
Knowing that an attribute A contributes greatly to the classification of a response attribute R is not nearly as intuitive as putting A and R in a contingency table, as below

Gene A        Diseased   Control   Total
Up-reg.       43         27        70
Not up-reg.   12         44        56
Total         55         71        126
MOTIVATION
Therefore, in this piece of work we have developed a
system named “Redhyte”
Short for Rapid Exploration of Data and Hypothesis Testing
Using data mining techniques in a specific and novel manner, Redhyte allows users to remain in the hypothesis-testing framework while working with a large and possibly unexplored dataset
MOTIVATION
Redhyte first takes in the user’s initial hypothesis, which
may be very general or intuitive
For example, does smoking increase risk of lung cancer?
Using this initial hypothesis, Redhyte
Tests, diagnoses, and analyzes the initial hypothesis and test
Generates hypotheses that are potentially interesting to the user
Simpson’s Paradox, trend amplification
We call the chief objectives of Redhyte “hypothesis
analysis” and “hypothesis mining”
IMPLEMENTATION
FRAMEWORK
Main modules
1. Initial test
2. Test diagnostics and hypothesis analysis
3. Context mining
4. Mined-hypothesis formulation, scoring, and ranking
Other auxiliary functionalities
Data visualizations
Log documentation for reproducibility of results
(Framework diagram: modules 1 and 2 make Redhyte self-diagnosing and self-correcting; modules 3 and 4 constitute hypothesis mining.)
CONTEXT MINING
Given an initial hypothesis, it is possible to include
additional attributes to make the hypothesis more specific
Small stones:
                Success        Failure   Total
Treatment A     81 (93.1%)     6         87
Treatment B     234 (86.7%)    36        270
Total           315            42        357
CONTEXT MINING
Context mining is concerned with the search for such
attributes, to give the initial hypothesis some “context”
Generates a list of attributes that may be interesting to
consider as mined context attributes
CONTEXT MINING
Using classification models from data mining:
Build two classification models, one predicting the target attribute and the other the comparing attribute
Target attribute: stipulated in the initial hypothesis, the attribute
that represents outcome/response
Comparing attribute: the basis of comparison
(Illustration: a 2×2 table of Treatment vs. Placebo against Relapse vs. No Relapse.)
CLASSIFICATION MODELS IN CONTEXT MINING
From the classification models, we take the top attributes
from each model (if their accuracies are high), and use
them as mined context attributes
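Redhyte itself uses random forests for this ranking (next slides). As a toy stand-in, ranking candidate context attributes by the information gain they yield on the target attribute illustrates the same idea; the tiny dataset and attribute names below are invented for illustration:

```python
import math
from collections import Counter

# Invented toy records: attribute -> value per row.
rows = [
    {"stone": "small", "treatment": "A", "success": "yes"},
    {"stone": "small", "treatment": "B", "success": "yes"},
    {"stone": "large", "treatment": "A", "success": "yes"},
    {"stone": "large", "treatment": "B", "success": "no"},
    {"stone": "large", "treatment": "B", "success": "no"},
    {"stone": "small", "treatment": "B", "success": "yes"},
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    base = entropy([r[target] for r in rows])
    split = {}
    for r in rows:
        split.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(v) / len(rows) * entropy(v) for v in split.values())
    return base - remainder

# Attributes with the highest gain become the mined context attributes.
gains = {a: info_gain(rows, a, "success") for a in ("stone", "treatment")}
ranked = sorted(gains, key=gains.get, reverse=True)
```

Here "stone" splits the target more cleanly than "treatment", so it would be mined first; a random forest's variable-importance ranking plays this role in the actual system.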
MINED HYPOTHESES
Using the mined context attributes, we consider each class
within these attributes,
E.g. {Dept = A}, {Dept = B}, … {Dept = F}
E.g. {Kidney stones = small}, etc.
These are the mined context items, to be inserted into
the initial hypothesis to form mined hypotheses
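The step from mined context items to mined hypotheses can be sketched as a scan: restrict the initial comparison to each context item's class and flag trend reversals or amplifications. The scan logic below is my illustration (using the kidney-stone counts from earlier), not Redhyte's actual code:

```python
# Rebuild the 700 kidney-stone records from the counts on the earlier slide.
counts = {  # (stone size, treatment, outcome) -> count
    ("small", "A", "success"): 81,  ("small", "A", "failure"): 6,
    ("small", "B", "success"): 234, ("small", "B", "failure"): 36,
    ("large", "A", "success"): 192, ("large", "A", "failure"): 71,
    ("large", "B", "success"): 55,  ("large", "B", "failure"): 25,
}
rows = [{"stone": s, "treatment": t, "outcome": o}
        for (s, t, o), c in counts.items() for _ in range(c)]

def diff(rows):
    """p1 - p2: success-rate difference, treatment A minus treatment B."""
    def p(t):
        grp = [r for r in rows if r["treatment"] == t]
        return sum(r["outcome"] == "success" for r in grp) / len(grp)
    return p("A") - p("B")

base = diff(rows)                       # negative: B looks better overall
mined = {}
for v in {r["stone"] for r in rows}:    # candidate context items
    sub = [r for r in rows if r["stone"] == v]
    d = diff(sub)
    if d * base < 0:
        mined[("stone", v)] = ("reversal", d)
    elif abs(d) > abs(base):
        mined[("stone", v)] = ("amplification", d)
```

On this data both context items {stone = small} and {stone = large} flip the sign of the trend, so both surface as Simpson-reversal hypotheses.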
INTUITION FOR USING CLASSIFICATION IN
CONTEXT MINING
If some attribute A contributes to the classification of
either the target or the comparing attribute, then A is
somehow associated with either of them
Specifically, adding a particular class of A into the initial
hypothesis may result in trend amplification or reversals
CONTEXT MINING
Classification model of choice: random forests
High accuracy
Able to do attribute selection
Able to tolerate levels of class-imbalance better than most
classifiers
Robust to redundancies and multicollinearity
Does not require linearity
Does not require training and testing datasets for cross-validation
SCORING AND RANKING OF MINED
HYPOTHESES
For each mined context item, we compute four different scores, or hypothesis mining metrics:
Difference lift
Contribution
Independence lift
Adjusted independence lift
Each metric was designed to capture specific aspects of
“interestingness” of a mined hypothesis
ASPECTS OF INTERESTINGNESS
Trend changes: trend amplification, Simpson’s Reversal
Relative support: if a mined context item shrinks the
subpopulations of the hypothesis too much, it is
considered too specific and less interesting
Shrinkage manner: the way with which the mined context
item shrinks the subpopulations of the initial hypothesis
SHRINKAGE MANNER
Arguably, directed shrinkage may be interesting to consider –
especially in a large dataset, where it is possibly unexpected
Hinitial:
       T1   T2
C1     50   20
C2     60   30

Undirected shrinkage:
       T1   T2
C1     45   15
C2     55   25

Directed shrinkage:
       T1   T2
C1     10   17
C2     50   22
CONCLUSION
IN SUMMARY
We have developed Redhyte, an interactive platform for
rapid exploration of data and hypothesis testing
Redhyte is capable of:
Automated statistical test diagnostics and hypothesis analysis
Automated discovery of Simpson’s Paradoxes, etc., via hypothesis mining
What’s novel in Redhyte:
χ2 contributions
Hypothesis mining, context mining using classification
New hypothesis mining metrics
ADDITIONAL SLIDES
Some issues
Examples for terminologies
Test diagnostics and hypothesis analysis
Context mining vs. CMH test
Equivalent methods in context mining
Class-imbalance learning: adjusted geometric mean
Definitions of hypothesis mining metrics
Future work
SOME ISSUES
Scalability
Class-imbalance learning
Multiple testing
INITIAL HYPOTHESIS (T-TEST)
Hinitial: Is there a difference in AGE when comparing the
samples on WORKCLASS between {State-gov} vs.
{Private}?
Target attribute: Atgt = Age
Comparing attribute, Acmp = Workclass
Initial context, Cinitial= {Workclass = State-gov, Workclass =
Private}
Tinitial: t-test on means of Age
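For concreteness, the components of such an initial hypothesis could be carried in a small structure like the one below. This is a hypothetical representation whose field names mirror the slide's notation (Atgt, Acmp, Cinitial), not Redhyte's actual internals:

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    target_attr: str                   # A_tgt
    comparing_attr: str                # A_cmp
    comparing_classes: tuple = ()      # the two groups being compared
    context: dict = field(default_factory=dict)   # C_initial context items

# The first example above: compare Age across Workclass groups.
h_initial = Hypothesis(
    target_attr="Age",
    comparing_attr="Workclass",
    comparing_classes=("State-gov", "Private"),
)
```

Adding a context such as {Gender = Male} (second example) would just populate `context`, leaving the rest of the hypothesis unchanged.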
INITIAL HYPOTHESIS (T-TEST)
Hinitial : In the context of {GENDER = Male}, is there a
difference in AGE when comparing the samples on
WORKCLASS between {State-gov} vs. {Private}?
Atgt = Age
Acmp = Workclass
Cinitial= {Gender = Male, Workclass = State-gov, Workclass =
Private}
Tinitial: t-test on means of Age
INITIAL HYPOTHESIS (Χ2-TEST)
Hinitial : In the context of {GENDER = Male}, is there a
difference in EDUCATION between {Bachelors} vs.
{Masters} when comparing the samples on WORKCLASS
between {State-gov} vs. {Private}
Atgt = Education
Acmp = Workclass
vtgt = Bachelors
Cinitial= {Gender = Male, Education = Bachelors, Education =
Masters, Workclass = State-gov, Workclass = Private}
Tinitial: χ2-test on 2x2 contingency table
INITIAL HYPOTHESIS (COLLAPSED Χ2-TEST)
Hinitial : In the context of {GENDER = Male}, is there a
difference in EDUCATION between {Bachelors} vs.
{Masters} when comparing the samples on WORKCLASS
between {State-gov & Federal-gov} vs. {Private}
Atgt = Education
Acmp = Workclass
vtgt = Bachelors
Cinitial = {Gender = Male, Education = Bachelors, Education = Masters, Workclass = State-gov, Workclass = Federal-gov, Workclass = Private}
Tinitial: χ2-test on 2x2 contingency table
TEST DIAGNOSTICS AND HYPOTHESIS ANALYSIS
t-test:
Normality: Shapiro-Wilk test
Equal variances: F-test
Nonparametric alternative: Mann-Whitney U
Collapsed χ2-test:
χ2 contributions
Cochran-Mantel-Haenszel test for independence
assumption
CONTEXT MINING VS. CMH TEST
The CMH test requires the trends across all k tables to be in the same direction
Multiple testing
EQUIVALENT METHODS IN CONTEXT MINING
Could have used regression-based models, e.g. logistic
regression, or even correlation
But with potential multicollinearity, nonlinearity, and class-imbalance, GLMs are not ideal
Context mining also requires the model to give some form of variable importance
CLASS-IMBALANCE LEARNING: ADJUSTED
GEOMETRIC MEAN
Gm refers to the geometric mean accuracy of the model, SP to the specificity of the model, and Nn to the proportion of samples that belong to the majority class:

AGm = (Gm + SP · Nn) / (1 + Nn)

Gm = (SP · SE)^0.5
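The formula above translates directly into code. A minimal sketch (the example values of SE, SP, Nn are illustrative, not from the paper):

```python
import math

def adjusted_gmean(se, sp, nn):
    """AGm = (Gm + SP * Nn) / (1 + Nn), with Gm = sqrt(SP * SE).

    se: sensitivity, sp: specificity, nn: majority-class proportion.
    """
    gm = math.sqrt(sp * se)
    return (gm + sp * nn) / (1 + nn)
```

For example, with SE = 0.9, SP = 0.8, and Nn = 0.7, AGm is about 0.829; as Nn grows, the specificity term (performance on the majority class) is weighted more heavily.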
DEFINITIONS OF HYPOTHESIS MINING METRICS
I = {Actx = vctx}:
        T1            T2
C1   c'11 (p'1)    c'12
C2   c'21 (p'2)    c'22

Hinitial:
        T1           T2
C1   c11 (p1)     c12
C2   c21 (p2)     c22
DEFINITIONS OF HYPOTHESIS MINING METRICS
DiffLift(I = {Actx = vctx}) = (p'1 − p'2) / (p1 − p2)

Contribution(I = {Actx = vctx})
= 1/(p1 − p2) · [ ((c'11 + c'12)/(c11 + c12)) · (p'1 − p1) − ((c'21 + c'22)/(c21 + c22)) · (p'2 − p2) ]

IndpLift(I = {Actx = vctx})
= (n'/n) · [ c'11/((c'11 + c'12)(c'11 + c'21)) − c'21/((c'21 + c'22)(c'21 + c'11)) ]
        / [ c11/((c11 + c12)(c11 + c21)) − c21/((c21 + c22)(c21 + c11)) ]

AdjustedIndpLift(I = {Actx = vctx}) = DiffLift(I = {Actx = vctx}) · (1 − 1/iI)
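The first two metrics can be computed directly from the two 2×2 tables. A sketch following the definitions above; the table layout and function names are mine, and Redhyte's exact implementation may differ:

```python
# Tables are nested lists: t[0] = (c11, c12), t[1] = (c21, c22).
def props(t):
    """(p1, p2): proportion of the first column within each row."""
    return (t[0][0] / (t[0][0] + t[0][1]),
            t[1][0] / (t[1][0] + t[1][1]))

def diff_lift(c, cp):
    p1, p2 = props(c)        # initial hypothesis
    q1, q2 = props(cp)       # with mined context item (p'1, p'2)
    return (q1 - q2) / (p1 - p2)

def contribution(c, cp):
    p1, p2 = props(c)
    q1, q2 = props(cp)
    w1 = sum(cp[0]) / sum(c[0])   # (c'11 + c'12) / (c11 + c12)
    w2 = sum(cp[1]) / sum(c[1])   # (c'21 + c'22) / (c21 + c22)
    return (w1 * (q1 - p1) - w2 * (q2 - p2)) / (p1 - p2)

# The undirected-shrinkage example from a later slide:
c_init = [[50, 20], [60, 30]]
c_mined = [[45, 15], [55, 25]]
dl = diff_lift(c_init, c_mined)   # about 1.31, matching that slide
```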
PROBLEM WITH CONTRIBUTION
Suppose p1 = 0.6, p2 = 0.3, p'1 = 0.7, p'2 = 0.5. Then

DiffLift = (0.7 − 0.5) / (0.6 − 0.3) > 0

Contribution = 1/(0.6 − 0.3) · [ (n'1/n1) · (0.7 − 0.6) − (n'2/n2) · (0.5 − 0.3) ]

which may be negative, depending on n'1/n1 and n'2/n2.
WHAT MAKES A HYPOTHESIS INTERESTING?
Relationship between relative support of mined hypothesis
and interestingness is not straightforward
Undirected shrinkage:

Hinitial:
       T1         T2
C1   50 (71%)   20
C2   60 (67%)   30

I = {Actx = vctx}:
       T1         T2
C1   45 (75%)   15
C2   55 (69%)   25

DiffLift = 1.31, AdjustedIndpLift = 0.05
FUTURE WORK
Scalability
Other types of supervised learning models for context
mining
Multiple context items for mined hypotheses
Improved hypothesis mining metrics
Visualizations of mined hypotheses