REDHYTE: TOWARDS A SELF-DIAGNOSING, SELF-CORRECTING, AND HELPFUL ANALYTIC PLATFORM
Isaac TOH Wei Zhong, CHOI Kwok Pui, WONG Limsoon
National University of Singapore
ACIIDS 2016
8th Asian Conference on Intelligent Information and
Database Systems
14-16 March 2016, Da Nang, Vietnam
A BIT ON ME
Data Scientist at Singapore Telecommunications Limited
(Singtel)
Background in Computational Biology / Bioinformatics
Data mining, machine learning, R
Delivery of Advanced Analytics to mostly government
agencies in the area of Defence and Public Safety
OUTLINE
Introduction
Implementation
Conclusion
INTRODUCTION
OVERVIEW
Our objective was to develop a system that complements the analysis of scientific data
In particular, we focus on data exploration and hypothesis testing
BACKGROUND
Data analysis: the process of inspecting, visualizing, and
modeling data so as to derive knowledge and insights
One of the primary activities of the scientist and data analyst is making comparisons, which is formally known as hypothesis testing
CONVENTIONAL DATA ANALYSIS
Conventional data analysis workflows primarily consist of
these steps:
Have a question in mind
Formulate some assertion or hypothesis
Design an appropriate experiment (Experimental design)
Collect and clean relevant data
Test that hypothesis using the collected data and statistical
techniques, in order to decide whether to reject or not reject it
CONVENTIONAL DATA ANALYSIS
Putting together a hypothesis with a statistical test allows
the analyst to make justifiable conclusions from the data
The process starts from the initial question/hypothesis in
mind
CONVENTIONAL DATA ANALYSIS
In short, conventional data analysis (and science) is expert-dependent, requiring ample domain knowledge, intuition and experience
This has been very appropriate and successful in the pre-“Big Data” era
DATA ANALYSIS IN “BIG DATA” CONTEXT
Centralized storage, high quality curation, convenient
retrieval and dissemination (data pipelines)
High throughput assaying (-omics)
Electronic medical and health records (EMR, EHR)
Data is assembled and pulled from wherever we can get our hands on it
Are conventional data analysis approaches (question →
analysis → insights) still viable?
Two phenomena
DATA ANALYSIS IN “BIG DATA” CONTEXT: TWO
PHENOMENA
Collection of data without scientific question and
experimental design a priori
In a traditional cohort or cross-sectional study, subjects are carefully selected
Assumptions of any statistical tests to be used are met
Routine collection of data (using established data pipelines) makes it easy for statistical assumptions to be violated
E.g. the normality assumption of the t-test
DATA ANALYSIS IN “BIG DATA” CONTEXT: TWO
PHENOMENA
“Large p, small n”: having a large number of variables or attributes in datasets
Formulating a hypothesis concerning a small number of attributes, and testing it in a large dataset while ignoring the other attributes, is
Wasteful
Flawed
AN EXAMPLE
Suppose some dataset with 100 attributes is lying around in the repository, and we are interested in the relationship between a small number of them (guided by domain knowledge)
Statistical test or correlation
Which tests? Or use correlation? Look at the types of the attributes:
Both numerical: correlation
One numerical, one categorical: t-test or ANOVA
Both categorical: χ2-test
Perhaps their non-parametric equivalents?
Get a p-value, and make a conclusion about the initial domain-knowledge-driven question
But is this the end of the story?
WHAT COULD GO WRONG?
Violations of statistical test assumptions
Especially so for datasets collected without prior scientific
questions (vs. traditional scientific studies)
Correlation and statistical tests consider the two attributes “in a vacuum”, ignoring the effects of other attributes
Confounding: Simpson’s Paradox (an omitted “third variable”)
SIMPSON’S PARADOX
Classic example (kidney stones):

All patients:
                Success        Failure   Total
Treatment A     273 (78.0%)    77        350
Treatment B     289 (82.6%)    61        350
Total           562            138       700

Small stones:
                Success        Failure   Total
Treatment A     81 (93.1%)     6         87
Treatment B     234 (86.7%)    36        270
Total           315            42        357

Large stones:
                Success        Failure   Total
Treatment A     192 (73.0%)    71        263
Treatment B     55 (68.8%)     25        80
Total           247            96        343
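The reversal can be verified mechanically. A minimal check in Python, using the counts from the table above:

```python
# The kidney-stone table as (successes, total) per treatment, per stratum.
strata = {
    "small stones": {"A": (81, 87),   "B": (234, 270)},
    "large stones": {"A": (192, 263), "B": (55, 80)},
}

def rate(successes, total):
    return successes / total

# Aggregated over strata, treatment B appears more successful...
agg = {
    t: (sum(s[t][0] for s in strata.values()),
        sum(s[t][1] for s in strata.values()))
    for t in ("A", "B")
}
aggregate_reversed = rate(*agg["A"]) < rate(*agg["B"])

# ...but within every stratum, treatment A is more successful: Simpson's Paradox.
a_wins_each_stratum = all(
    rate(*s["A"]) > rate(*s["B"]) for s in strata.values()
)
```

Both flags come out true: B wins in aggregate (78.0% vs 82.6%), yet A wins in each stratum.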
INFAMOUS UC BERKELEY EXAMPLE
All departments:
          Admitted        Rejected        Total
Males     1198 (44.5%)    1493 (55.5%)    2691
Females   557 (30.4%)     1278 (69.6%)    1835
Total     1755            2771            4526

Dept A:
          Admitted       Rejected       Total
Males     512 (62.1%)    313 (37.9%)    825
Females   89 (82.4%)     19 (17.6%)     108
Total     601            332            933
WHAT COULD BE IMPROVED?
Test diagnostics: checking of tests assumptions
WHAT COULD BE IMPROVED?
Test diagnostics: checking of tests assumptions
Consider the other attributes in the large dataset
To account for confounding
To incorporate information from other attributes into the hypothesis:
trend amplification for a certain stratum/class of a third attribute
TREND AMPLIFICATION
All workers:
                 High income   Low income    Total
Administrative   507 (13%)     3263 (87%)    3770
Craftsmen        929 (23%)     3170 (77%)    4099
Total            1436          6433          7869

Some college education:
                 High income   Low income    Total
Administrative   142 (11%)     1139 (89%)    1281
Craftsmen        241 (28%)     627 (72%)     868
Total            383           1766          2149
WHAT COULD BE IMPROVED?
Test diagnostics: checking of tests assumptions
Consider the other attributes in the large dataset
Simpson’s Paradox, trend amplification
Hypothesis analysis: looking deeper into hypotheses
HYPOTHESIS ANALYSIS
Vaccine   Had flu   Avoided flu   Total
A         43        237           280
B         52        198           250
C         25        245           270
D         48        212           260
E         57        233           290
Total     225       1125          1350

H0: all vaccines are equally effective
Use a χ2-test
HYPOTHESIS ANALYSIS (2)
χ2 = 13.803 + 2.761 = 16.564, d.f. = 4
p-value < 0.05

Vaccine   Had flu (E)   (O−E)²/E   Avoided flu (E)   (O−E)²/E
A         43 (46.7)     0.293      237 (233.3)       0.059
B         52 (41.7)     2.544      198 (208.3)       0.509
C         25 (45.0)     8.889      245 (225.0)       1.778
D         48 (43.3)     0.510      212 (216.7)       0.102
E         57 (48.3)     1.567      233 (241.7)       0.313
Total     225           13.803     1125              2.761
HYPOTHESIS ANALYSIS (3)
Vaccine C contributes 64.4% of the χ2 test statistic
Clearly there is something special with vaccine C
Vaccine   Had flu (E)   (O−E)²/E   Avoided flu (E)   (O−E)²/E
A         43 (46.7)     0.293      237 (233.3)       0.059
B         52 (41.7)     2.544      198 (208.3)       0.509
C         25 (45.0)     8.889      245 (225.0)       1.778
D         48 (43.3)     0.510      212 (216.7)       0.102
E         57 (48.3)     1.567      233 (241.7)       0.313
Total     225           13.803     1125              2.761
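The per-cell contributions can be recomputed directly from the observed counts. With unrounded expected counts the total comes to roughly 16.56 (the slide's 16.564 reflects rounded expected values), and vaccine C's share is about 64%:

```python
# Per-cell chi-square contributions for the vaccine table above.
observed = {
    "A": (43, 237), "B": (52, 198), "C": (25, 245),
    "D": (48, 212), "E": (57, 233),
}
n = sum(f + a for f, a in observed.values())      # 1350 subjects
col = [sum(f for f, _ in observed.values()),      # 225 had flu
       sum(a for _, a in observed.values())]      # 1125 avoided flu

contrib = {}
for vaccine, (flu, avoided) in observed.items():
    row = flu + avoided
    expected = [row * col[0] / n, row * col[1] / n]
    contrib[vaccine] = sum((o - e) ** 2 / e
                           for o, e in zip((flu, avoided), expected))

chi2 = sum(contrib.values())        # total chi-square statistic, 4 d.f.
share_C = contrib["C"] / chi2       # vaccine C's share of the statistic
```

Inspecting `contrib` immediately singles out vaccine C as the dominant cell, which is the point of the decomposition.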
HYPOTHESIS ANALYSIS (4)
Vaccine   Had flu   Avoided flu   Total
A         43        237           280
B         52        198           250
D         48        212           260
E         57        233           290
Total     200       880           1080

Without vaccine C, χ2 = 2.983, d.f. = 3
p-value > 0.1, not significant
HYPOTHESIS ANALYSIS (5)
Vaccine      Had flu   Avoided flu   Total
C            25        245           270
A, B, D, E   200       880           1080
Total        225       1125          1350

χ2 = 12.7, d.f. = 1
p-value < 0.001, significant
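The 12.7 above appears to include Yates' continuity correction (the uncorrected 2×2 statistic would be about 13.3). A quick check on the collapsed table:

```python
# Collapsed 2x2 table: vaccine C vs. the rest, with Yates' continuity
# correction, which reproduces the slide's chi-square of about 12.7.
table = [[25, 245],     # C:          had flu, avoided flu
         [200, 880]]    # A, B, D, E: had flu, avoided flu

n = sum(map(sum, table))
row = [sum(r) for r in table]
col = [sum(c) for c in zip(*table)]

chi2 = 0.0
for i in range(2):
    for j in range(2):
        e = row[i] * col[j] / n
        chi2 += (abs(table[i][j] - e) - 0.5) ** 2 / e  # Yates correction
# chi2 works out to 380.25 / 30 = 12.675, on 1 d.f.
```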
PROBLEMS AND IMPROVEMENTS
Violation of statistical test assumptions
Consider the other attributes in the large dataset
Simpson’s Paradox, trend amplification
Hypothesis analysis
These are potential problems and improvements that are
often
Undiscovered
Discovered by chance
Most importantly, not addressed in unison
BUT HOW?
How do we sieve these out all at once? Do we
a) Continue to rely on domain knowledge?
b) Iteratively go through the other attributes (and their
stratifications) and scrutinize them? Or,
c) Rely on some form of algorithm and automation to sieve
them out automatically, and then fall back again on
domain knowledge to scrutinize the output?
DATA MINING
Data mining is a well-established class of techniques
commonly used to search for interesting and global
relationships in large datasets
Classification
Clustering
Frequent pattern mining
Data mining techniques can account for large number of
attributes at once
DATA MINING VS. HYPOTHESIS TESTING
The problem with data mining is that it does not contribute to the fundamental endeavor of making comparisons
Knowing that an attribute A contributes greatly to the classification of a response attribute R is not nearly as intuitive as putting A and R in a contingency table, as below

Gene A        Diseased   Control   Total
Up-reg.       43         27        70
Not up-reg.   12         44        56
Total         55         71        126
MOTIVATION
Therefore, in this piece of work we have developed a
system named “Redhyte”
Short for Rapid Exploration of Data and Hypothesis Testing
Using data mining techniques in a specific and novel manner, Redhyte allows users to remain in the hypothesis-testing framework while working with a large and possibly unexplored dataset
MOTIVATION
Redhyte first takes in the user’s initial hypothesis, which
may be very general or intuitive
For example, does smoking increase risk of lung cancer?
Using this initial hypothesis, Redhyte
Tests, diagnoses, and analyzes the initial hypothesis and test
Generates hypotheses that are potentially interesting to the user
Simpson’s Paradox, trend amplification
We call the chief objectives of Redhyte “hypothesis
analysis” and “hypothesis mining”
IMPLEMENTATION
FRAMEWORK
Main modules
1. Initial test
2. Test diagnostics and hypothesis analysis
3. Context mining
4. Mined-hypothesis formulation, scoring, and ranking
Other auxiliary functionalities
Data visualizations
Log documentation for reproducibility of results
(Framework diagram: modules 1 and 2 make Redhyte self-diagnosing and self-correcting; modules 3 and 4 constitute hypothesis mining.)
CONTEXT MINING
Given an initial hypothesis, it is possible to include
additional attributes to make the hypothesis more specific
Small stones:
                Success        Failure   Total
Treatment A     81 (93.1%)     6         87
Treatment B     234 (86.7%)    36        270
Total           315            42        357
CONTEXT MINING
Context mining is concerned with the search for such
attributes, to give the initial hypothesis some “context”
Generates a list of attributes that may be interesting to
consider as mined context attributes
CONTEXT MINING
Using classification models from data mining:
Build two classification models, one predicting the target attribute and the other the comparing attribute
Target attribute: stipulated in the initial hypothesis, the attribute
that represents outcome/response
Comparing attribute: the basis of comparison
(Illustration: a 2×2 table of Treatment vs. Placebo against Relapse vs. No Relapse.)
CLASSIFICATION MODELS IN CONTEXT MINING
From the classification models, we take the top attributes
from each model (if their accuracies are high), and use
them as mined context attributes
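Redhyte itself uses random forests for this ranking (next slides). As a toy stand-in, ranking candidate context attributes by the information gain they yield on the target attribute illustrates the same idea; the tiny dataset and attribute names below are invented for illustration:

```python
import math
from collections import Counter

# Invented toy records: attribute -> value per row.
rows = [
    {"stone": "small", "treatment": "A", "success": "yes"},
    {"stone": "small", "treatment": "B", "success": "yes"},
    {"stone": "large", "treatment": "A", "success": "yes"},
    {"stone": "large", "treatment": "B", "success": "no"},
    {"stone": "large", "treatment": "B", "success": "no"},
    {"stone": "small", "treatment": "B", "success": "yes"},
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr, target):
    base = entropy([r[target] for r in rows])
    split = {}
    for r in rows:
        split.setdefault(r[attr], []).append(r[target])
    remainder = sum(len(v) / len(rows) * entropy(v) for v in split.values())
    return base - remainder

# Attributes with the highest gain become the mined context attributes.
gains = {a: info_gain(rows, a, "success") for a in ("stone", "treatment")}
ranked = sorted(gains, key=gains.get, reverse=True)
```

Here "stone" splits the target more cleanly than "treatment", so it would be mined first; a random forest's variable-importance ranking plays this role in the actual system.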
MINED HYPOTHESES
Using the mined context attributes, we consider each class
within these attributes,
E.g. {Dept = A}, {Dept = B}, … {Dept = F}
E.g. {Kidney stones = small}, etc.
These are the mined context items, to be inserted into
the initial hypothesis to form mined hypotheses
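The step from mined context items to mined hypotheses can be sketched as a scan: restrict the initial comparison to each context item's class and flag trend reversals or amplifications. The scan logic below is my illustration (using the kidney-stone counts from earlier), not Redhyte's actual code:

```python
# Rebuild the 700 kidney-stone records from the counts on the earlier slide.
counts = {  # (stone size, treatment, outcome) -> count
    ("small", "A", "success"): 81,  ("small", "A", "failure"): 6,
    ("small", "B", "success"): 234, ("small", "B", "failure"): 36,
    ("large", "A", "success"): 192, ("large", "A", "failure"): 71,
    ("large", "B", "success"): 55,  ("large", "B", "failure"): 25,
}
rows = [{"stone": s, "treatment": t, "outcome": o}
        for (s, t, o), c in counts.items() for _ in range(c)]

def diff(rows):
    """p1 - p2: success-rate difference, treatment A minus treatment B."""
    def p(t):
        grp = [r for r in rows if r["treatment"] == t]
        return sum(r["outcome"] == "success" for r in grp) / len(grp)
    return p("A") - p("B")

base = diff(rows)                       # negative: B looks better overall
mined = {}
for v in {r["stone"] for r in rows}:    # candidate context items
    sub = [r for r in rows if r["stone"] == v]
    d = diff(sub)
    if d * base < 0:
        mined[("stone", v)] = ("reversal", d)
    elif abs(d) > abs(base):
        mined[("stone", v)] = ("amplification", d)
```

On this data both context items {stone = small} and {stone = large} flip the sign of the trend, so both surface as Simpson-reversal hypotheses.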
INTUITION FOR USING CLASSIFICATION IN
CONTEXT MINING
If some attribute A contributes to the classification of
either the target or the comparing attribute, then A is
somehow associated with either of them
Specifically, adding a particular class of A into the initial
hypothesis may result in trend amplification or reversals
CONTEXT MINING
Classification model of choice: random forests
High accuracy
Able to do attribute selection
Able to tolerate levels of class-imbalance better than most
classifiers
Robust to redundancies and multicollinearity
Does not require linearity
Does not require training and testing datasets for cross-validation
SCORING AND RANKING OF MINED
HYPOTHESES
For each mined context item, we compute four different scores, or hypothesis mining metrics:
Difference lift
Contribution
Independence lift
Adjusted independence lift
Each metric was designed to capture specific aspects of
“interestingness” of a mined hypothesis
ASPECTS OF INTERESTINGNESS
Trend changes: trend amplification, Simpson’s Reversal
Relative support: if a mined context item shrinks the
subpopulations of the hypothesis too much, it is
considered too specific and less interesting
Shrinkage manner: the way with which the mined context
item shrinks the subpopulations of the initial hypothesis
SHRINKAGE MANNER
Arguably, directed shrinkage may be interesting to consider –
especially in a large dataset, where it is possibly unexpected
Hinitial:
       T1   T2
C1     50   20
C2     60   30

Undirected shrinkage:
       T1   T2
C1     45   15
C2     55   25

Directed shrinkage:
       T1   T2
C1     10   17
C2     50   22
CONCLUSION
IN SUMMARY
We have developed Redhyte, an interactive platform for
rapid exploration of data and hypothesis testing
Redhyte is capable of:
Automated statistical test diagnostics and hypothesis analysis
Automated discovery of Simpson’s Paradoxes, etc., via hypothesis mining
What’s novel in Redhyte:
χ2 contributions
Hypothesis mining, context mining using classification
New hypothesis mining metrics
ADDITIONAL SLIDES
Some issues
Examples for terminologies
Test diagnostics and hypothesis analysis
Context mining vs. CMH test
Equivalent methods in context mining
Class-imbalance learning: adjusted geometric mean
Definitions of hypothesis mining metrics
Future work
SOME ISSUES
Scalability
Class-imbalance learning
Multiple testing
INITIAL HYPOTHESIS (T-TEST)
Hinitial: Is there a difference in AGE when comparing the
samples on WORKCLASS between {State-gov} vs.
{Private}?
Target attribute: Atgt = Age
Comparing attribute, Acmp = Workclass
Initial context, Cinitial= {Workclass = State-gov, Workclass =
Private}
Tinitial: t-test on means of Age
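For concreteness, the components of such an initial hypothesis could be carried in a small structure like the one below. This is a hypothetical representation whose field names mirror the slide's notation (Atgt, Acmp, Cinitial), not Redhyte's actual internals:

```python
from dataclasses import dataclass, field

@dataclass
class Hypothesis:
    target_attr: str                   # A_tgt
    comparing_attr: str                # A_cmp
    comparing_classes: tuple = ()      # the two groups being compared
    context: dict = field(default_factory=dict)   # C_initial context items

# The first example above: compare Age across Workclass groups.
h_initial = Hypothesis(
    target_attr="Age",
    comparing_attr="Workclass",
    comparing_classes=("State-gov", "Private"),
)
```

Adding a context such as {Gender = Male} (second example) would just populate `context`, leaving the rest of the hypothesis unchanged.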
INITIAL HYPOTHESIS (T-TEST)
Hinitial : In the context of {GENDER = Male}, is there a
difference in AGE when comparing the samples on
WORKCLASS between {State-gov} vs. {Private}?
Atgt = Age
Acmp = Workclass
Cinitial= {Gender = Male, Workclass = State-gov, Workclass =
Private}
Tinitial: t-test on means of Age
INITIAL HYPOTHESIS (Χ2-TEST)
Hinitial : In the context of {GENDER = Male}, is there a
difference in EDUCATION between {Bachelors} vs.
{Masters} when comparing the samples on WORKCLASS
between {State-gov} vs. {Private}
Atgt = Education
Acmp = Workclass
vtgt = Bachelors
Cinitial= {Gender = Male, Education = Bachelors, Education =
Masters, Workclass = State-gov, Workclass = Private}
Tinitial: χ2-test on 2x2 contingency table
INITIAL HYPOTHESIS (COLLAPSED Χ2-TEST)
Hinitial : In the context of {GENDER = Male}, is there a
difference in EDUCATION between {Bachelors} vs.
{Masters} when comparing the samples on WORKCLASS
between {State-gov & Federal-gov} vs. {Private}
Atgt = Education
Acmp = Workclass
vtgt = Bachelors
Cinitial = {Gender = Male, Education = Bachelors, Education = Masters, Workclass = State-gov, Workclass = Federal-gov, Workclass = Private}
Tinitial: χ2-test on 2x2 contingency table
TEST DIAGNOSTICS AND HYPOTHESIS ANALYSIS
t-test:
Normality: Shapiro-Wilk test
Equal variances: F-test
Nonparametric alternative: Mann-Whitney U
Collapsed χ2-test:
χ2 contributions
Cochran-Mantel-Haenszel test for independence
assumption
CONTEXT MINING VS. CMH TEST
The CMH test requires the trends across all k tables to be in the same direction
Multiple testing
EQUIVALENT METHODS IN CONTEXT MINING
Could have used regression-based models, e.g. logistic
regression, or even correlation
But with potential multicollinearity, nonlinearity, and class-imbalance, GLMs are not ideal
Context mining also requires the model to give some form of variable importance
CLASS-IMBALANCE LEARNING: ADJUSTED
GEOMETRIC MEAN
Gm refers to the geometric mean accuracy of the model, SP to the specificity of the model, and Nn to the proportion of samples that belong to the majority class:

AGm = (Gm + SP · Nn) / (1 + Nn)

Gm = (SP · SE)^0.5
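The formula above translates directly into code. A minimal sketch (the example values of SE, SP, Nn are illustrative, not from the paper):

```python
import math

def adjusted_gmean(se, sp, nn):
    """AGm = (Gm + SP * Nn) / (1 + Nn), with Gm = sqrt(SP * SE).

    se: sensitivity, sp: specificity, nn: majority-class proportion.
    """
    gm = math.sqrt(sp * se)
    return (gm + sp * nn) / (1 + nn)
```

For example, with SE = 0.9, SP = 0.8, and Nn = 0.7, AGm is about 0.829; as Nn grows, the specificity term (performance on the majority class) is weighted more heavily.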
DEFINITIONS OF HYPOTHESIS MINING METRICS
I = {Actx = vctx}:
        T1            T2
C1   c'11 (p'1)    c'12
C2   c'21 (p'2)    c'22

Hinitial:
        T1           T2
C1   c11 (p1)     c12
C2   c21 (p2)     c22
DEFINITIONS OF HYPOTHESIS MINING METRICS
DiffLift(I = {Actx = vctx}) = (p'1 − p'2) / (p1 − p2)

Contribution(I = {Actx = vctx})
= 1/(p1 − p2) · [ ((c'11 + c'12)/(c11 + c12)) · (p'1 − p1) − ((c'21 + c'22)/(c21 + c22)) · (p'2 − p2) ]

IndpLift(I = {Actx = vctx})
= (n'/n) · [ c'11/((c'11 + c'12)(c'11 + c'21)) − c'21/((c'21 + c'22)(c'21 + c'11)) ]
        / [ c11/((c11 + c12)(c11 + c21)) − c21/((c21 + c22)(c21 + c11)) ]

AdjustedIndpLift(I = {Actx = vctx}) = DiffLift(I = {Actx = vctx}) · (1 − 1/iI)
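The first two metrics can be computed directly from the two 2×2 tables. A sketch following the definitions above; the table layout and function names are mine, and Redhyte's exact implementation may differ:

```python
# Tables are nested lists: t[0] = (c11, c12), t[1] = (c21, c22).
def props(t):
    """(p1, p2): proportion of the first column within each row."""
    return (t[0][0] / (t[0][0] + t[0][1]),
            t[1][0] / (t[1][0] + t[1][1]))

def diff_lift(c, cp):
    p1, p2 = props(c)        # initial hypothesis
    q1, q2 = props(cp)       # with mined context item (p'1, p'2)
    return (q1 - q2) / (p1 - p2)

def contribution(c, cp):
    p1, p2 = props(c)
    q1, q2 = props(cp)
    w1 = sum(cp[0]) / sum(c[0])   # (c'11 + c'12) / (c11 + c12)
    w2 = sum(cp[1]) / sum(c[1])   # (c'21 + c'22) / (c21 + c22)
    return (w1 * (q1 - p1) - w2 * (q2 - p2)) / (p1 - p2)

# The undirected-shrinkage example from a later slide:
c_init = [[50, 20], [60, 30]]
c_mined = [[45, 15], [55, 25]]
dl = diff_lift(c_init, c_mined)   # about 1.31, matching that slide
```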
PROBLEM WITH CONTRIBUTION
Suppose p1 = 0.6, p2 = 0.3, p'1 = 0.7, p'2 = 0.5. Then

DiffLift = (0.7 − 0.5) / (0.6 − 0.3) > 0

Contribution = 1/(0.6 − 0.3) · [ (n'1/n1) · (0.7 − 0.6) − (n'2/n2) · (0.5 − 0.3) ]

which may be negative, depending on n'1/n1 and n'2/n2.
WHAT MAKES A HYPOTHESIS INTERESTING?
Relationship between relative support of mined hypothesis
and interestingness is not straightforward
Undirected shrinkage:

Hinitial:
       T1         T2
C1   50 (71%)   20
C2   60 (67%)   30

I = {Actx = vctx}:
       T1         T2
C1   45 (75%)   15
C2   55 (69%)   25

DiffLift = 1.31, AdjustedIndpLift = 0.05
FUTURE WORK
Scalability
Other types of supervised learning models for context
mining
Multiple context items for mined hypotheses
Improved hypothesis mining metrics
Visualizations of mined hypotheses