Data Discretization Unification
Ruoming Jin, Yuri Breitbart, Chibuike Muoh
Kent State University, Kent, USA
Outline
• Motivation
• Problem Statement
• Prior Art & Our Contributions
• Goodness Function Definition
• Unification
• Parameterized Discretization
• Discretization Algorithm & Its Performance
Motivation
Patients Table:

| Age | Success | Failure | Total |
|-----|---------|---------|-------|
| 18  | 10      | 1       | 19    |
| 25  | 5       | 2       | 7     |
| 41  | 100     | 5       | 105   |
| …   | …       | …       | …     |
| 51  | 250     | 10      | 260   |
| 52  | 360     | 5       | 365   |
| 53  | 249     | 10      | 259   |
| …   | …       | …       | …     |
Motivation
Possible implications of the table:
– If a person is between 18 and 25, the probability of procedure success is 'much higher' than if the person is between 45 and 55.
– Is that a good rule, or is this one better: if a person is between 18 and 30, the probability of procedure success is 'much higher' than if the person is between 46 and 61?

• What is the best interval?
Motivation

• Without data discretization, some rules would be difficult to establish.
• Several existing data mining systems cannot handle continuous variables without discretization.
• Data discretization significantly improves the quality of the discovered knowledge.
• New discretization methods are needed for tables with rare events.
• Data discretization significantly improves the performance of data mining algorithms; some studies report a tenfold increase in performance. However:

Any discretization generally leads to a loss of information. Minimizing this loss is the mark of a good discretization method.
Problem Statement
Given an input table:

| Intervals  | Class1 | Class2 | …  | ClassJ | Row Sum   |
|------------|--------|--------|----|--------|-----------|
| S1         | r11    | r12    | …  | r1J    | N1        |
| S2         | r21    | r22    | …  | r2J    | N2        |
| …          | …      | …      | …  | …      | …         |
| SI         | rI1    | rI2    | …  | rIJ    | NI        |
| Column Sum | M1     | M2     | …  | MJ     | N (Total) |
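Such an input table can be built directly from raw (value, class) records. A minimal Python sketch (the record layout and helper names are illustrative, not from the paper):

```python
from collections import Counter

def contingency_table(records):
    """Build an intervals-by-classes count table from (value, class) pairs.

    Each distinct attribute value starts as its own base interval S_i;
    discretization will later merge consecutive intervals.
    """
    counts = Counter(records)                    # (value, class) -> r_ij
    values = sorted({v for v, _ in records})     # base interval labels
    classes = sorted({c for _, c in records})
    rows = [[counts[(v, c)] for c in classes] for v in values]
    return values, classes, rows

# Toy data in the spirit of the patients table: (age, outcome)
data = [(18, "S")] * 10 + [(18, "F")] + [(25, "S")] * 5 + [(25, "F")] * 2
values, classes, rows = contingency_table(data)
row_sums = [sum(r) for r in rows]                                   # N_i
col_sums = [sum(r[j] for r in rows) for j in range(len(classes))]   # M_j
```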
Problem Statement
Obtain an output table:

| Intervals  | Class1 | Class2 | …  | ClassJ | Row Sum   |
|------------|--------|--------|----|--------|-----------|
| S1         | C11    | C12    | …  | C1J    | N1        |
| S2         | C21    | C22    | …  | C2J    | N2        |
| …          | …      | …      | …  | …      | …         |
| SI'        | CI'1   | CI'2   | …  | CI'J   | NI'       |
| Column Sum | M1     | M2     | …  | MJ     | N (Total) |

where each Si is a union of consecutive intervals of the input table. The quality of a discretization is measured by

cost(model) = cost(data/model) + penalty(model)
Prior Art
• Unsupervised Discretization – no class information is provided:
  – Equal-width
  – Equal-frequency
• Supervised Discretization – class information is provided with each attribute value:
  – MDLP
  – Methods based on Pearson's X2 or Wilks' G2 statistic
• Dougherty and Kohavi (1995) compared the unsupervised methods, the supervised method of Holte (1993), and the entropy-based methods of Fayyad and Irani (1993). They conclude that supervised methods give fewer classification errors than unsupervised ones, and that entropy-based supervised methods are better than the other supervised methods.
Prior Art
• Several recent (2003-2006) papers introduced new discretization algorithms: Yang and Webb; Kurgan and Cios (CAIM); Boulle (Khiops).
• CAIM attempts to minimize the number of discretization intervals while at the same time minimizing information loss.
• Khiops uses Pearson's X2 statistic to select merges of consecutive intervals that minimize the value of X2.
• Yang and Webb studied discretization for naïve Bayesian classifiers. They report that their method produces fewer classification errors than the alternative discretization methods in the literature.
Our Results
• There is a strong connection between discretization methods based on statistics and those based on entropy.
• There is a parametric goodness function such that any prior discretization method is derivable from it by choosing at most two parameters.
• There is an optimal dynamic programming method, derived from our discretization approach, that outperformed the prior discretization methods in most of the experiments we conducted.
Goodness Function Definition (Preliminaries)

| Intervals  | Class1 | Class2 | …  | ClassJ | Row Sum   |
|------------|--------|--------|----|--------|-----------|
| S1         | C11    | C12    | …  | C1J    | N1        |
| S2         | C21    | C22    | …  | C2J    | N2        |
| …          | …      | …      | …  | …      | …         |
| SI'        | CI'1   | CI'2   | …  | CI'J   | NI'       |
| Column Sum | M1     | M2     | …  | MJ     | N (Total) |
Goodness Function Definition (Preliminaries)

• Binary encoding of a row requires H(Si) binary characters.
• Binary encoding of a set of rows requires H(S1, S2, …, SI') binary characters.
• Binary encoding of a table requires SL binary characters.

Entropy of the i-th row of a contingency table:

H(Si) = −Σj (Cij/Ni)*log(Cij/Ni)

Total entropy of all intervals:

SL = Σi Ni*H(Si) = −Σi Σj Cij*log(Cij/Ni)

Entropy of a contingency table:

N*H(S1, S2, …, SI') = SL
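The row entropy H(Si) and the total encoding length SL can be computed directly from the count table. A small sketch (the helper names are ours, not the paper's):

```python
import math

def row_entropy(row):
    """H(S_i) = -sum_j (C_ij/N_i) * log(C_ij/N_i); empty cells contribute 0."""
    n = sum(row)
    return -sum(c / n * math.log(c / n) for c in row if c > 0)

def table_encoding_length(rows):
    """S_L = sum_i N_i * H(S_i): total cost of encoding the table row by row."""
    return sum(sum(r) * row_entropy(r) for r in rows)

rows = [[10, 1], [5, 2]]        # a tiny two-interval, two-class table
sl = table_encoding_length(rows)
```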
Goodness Function Definition

• Cost(Model, T) = Cost(T/Model) + Penalty(Model)   (Mannila et al.)
• Cost(T/Model) is the complexity of encoding the table under the given model.
• Penalty(Model) reflects the complexity of the resulting table.
• GF_Model(T) = Cost(Model, T0) − Cost(Model, T), where T0 is the trivial table in which all intervals are merged into one.
Goodness Function Definition: Models To Be Considered
• MDLP (Information Theory)
• Statistical Model Selection
• Confidence Level of Rows Independence
• Gini Index
Goodness Function Examples

• Entropy (MDLP):

Cost(MDLP, T) = SL + (I'−1)*log(N/(I'−1)) + I'(J−1)*log J
GF_MDLP(T) = Cost(MDLP, T0) − Cost(MDLP, T)

• Statistical Akaike (AIC):

Cost(AIC, T) = 2*SL + 2*I'(J−1)
GF_AIC(T) = Cost(AIC, T0) − Cost(AIC, T)

• Statistical Bayesian (BIC):

Cost(BIC, T) = 2*SL + I'(J−1)*log N
GF_BIC(T) = Cost(BIC, T0) − Cost(BIC, T)
Goodness Function Examples (Con't)

• Confidence-level based goodness functions:

Pearson's X2-statistic: X2 = Σi Σj (Cij − Eij)^2 / Eij, where Eij = Ni*Mj/N
Wilks' G2-statistic: G2 = 2*Σi Σj Cij*log(Cij/Eij)

• The table's degrees of freedom are df = (I'−1)(J−1).
• It is known in statistics that, asymptotically, both Pearson's X2 and Wilks' G2 statistics have a chi-square distribution with df degrees of freedom.
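Both statistics follow directly from the expected counts Eij = Ni*Mj/N. A plain-Python sketch (function names are ours; it assumes every column sum Mj is positive):

```python
import math

def pearson_x2(rows):
    """X^2 = sum_ij (C_ij - E_ij)^2 / E_ij with E_ij = N_i * M_j / N."""
    n = sum(map(sum, rows))
    col = [sum(r[j] for r in rows) for j in range(len(rows[0]))]
    x2 = 0.0
    for r in rows:
        ni = sum(r)
        for j, c in enumerate(r):
            e = ni * col[j] / n          # expected count under independence
            x2 += (c - e) ** 2 / e
    return x2

def wilks_g2(rows):
    """G^2 = 2 * sum_ij C_ij * log(C_ij / E_ij); empty cells contribute 0."""
    n = sum(map(sum, rows))
    col = [sum(r[j] for r in rows) for j in range(len(rows[0]))]
    g2 = 0.0
    for r in rows:
        ni = sum(r)
        for j, c in enumerate(r):
            if c > 0:
                g2 += 2 * c * math.log(c * n / (ni * col[j]))
    return g2
```

Both vanish on a table whose rows have identical class proportions, and grow as the rows diverge from independence.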
Unification
• The following relationship between G2 and goodness functions for MDLP, AIC, and BIC holds:
G2/2 = N*H(S1U ……USI’) – SL
• Thus, the goodness functions for MDLP, AIC and BIC can be rewritten as follows:
GF_MDLP(T) = G2/2 − df*log J − (I'−1)*log(N/(I'−1))

GF_AIC(T) = G2 − 2*df

GF_BIC(T) = G2 − df*log N
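The identity G2/2 = N*H(S1 U … U SI') − SL behind these rewrites can be checked numerically. A small sketch (names are ours):

```python
import math

def entropy(counts):
    """Entropy of a class-count vector, in nats."""
    n = sum(counts)
    return -sum(c / n * math.log(c / n) for c in counts if c > 0)

def check_identity(rows):
    """Return |G^2/2 - (N * H(S_1 U ... U S_I') - S_L)| for one table."""
    n = sum(map(sum, rows))
    col = [sum(r[j] for r in rows) for j in range(len(rows[0]))]
    s_l = sum(sum(r) * entropy(r) for r in rows)          # S_L
    g2 = sum(2 * c * math.log(c * n / (sum(r) * col[j]))  # Wilks' G^2
             for r in rows for j, c in enumerate(r) if c > 0)
    return abs(g2 / 2 - (n * entropy(col) - s_l))

residual = check_identity([[10, 1], [5, 2], [100, 5]])
```

The residual is zero up to floating-point error for any table with positive column sums.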
Unification

• A normal deviate is the difference between a variable and its mean, divided by the standard deviation.
• Let t be a chi-square distributed random variable and let u(t) be the normal deviate corresponding to the same tail probability.
• The following theorem holds (see Wallace 1959, 1960):

For all t > df and all df > 0.37, with w(t) = [t − df − df*log(t/df)]^(1/2):

0 < w(t) ≤ u(t) ≤ w(t) + 0.6*df^(−1/2)
Unification
• From this theorem it follows that if df goes to infinity and w(t) >> 0, then u(t)/w(t) → 1.
• Finally, w2(t) ~ u2(t) ~ t − df − df*log(t/df), so

GF_G2(T) = u2(G2) = G2 − df*(1 + log(G2/df)),

and the goodness function GF_X2(T) is asymptotically the same.
Unification
• G2 estimate:
– If G2 > df, then G2 < 2N*log J. This follows from the upper bound log J on H(S1 U … U SI') and the lower bound 0 on the entropy of any row of the contingency table.
– Recall that u2(G2) = G2 − df*(1 + log(G2/df)).
– Thus, the penalty term of u2(G2) is between O(df) and O(df*log N):

• If G2 ~ c*df with c > 1, then the penalty is O(df).
• If G2 ~ c*df and N/df ~ N/(I'J) = c, then the penalty is also O(df).
• If G2 ~ c*df and N → ∞ with N/(I'J) → ∞, then the penalty is O(df*log N).
Unification
• GF_MDLP = G2/2 minus a penalty of O(df) or O(df*log N), depending on the ratio N/I'
• GF_AIC = G2 − 2*df
• GF_BIC = G2 − df*log N
• GF_G2 = G2 − df*[either a constant or log N, depending on the ratio between N, I', and J]
• In general, GF = G2 − df*f(N, I', J).
• To unify the Gini function as one of the cost functions, we resort to the parameterized approach to goodness of discretization.
Gini Based Goodness Function

• Let Si be a row of a contingency table. The Gini index on Si is defined as:

Gini(Si) = 1 − Σj (Cij/Ni)^2

• Cost(Gini, T) = Σi Ni*Gini(Si)

• Gini goodness function:

GF_Gini(T) = c*( Σi Σj Cij^2/Ni − Σj Mj^2/N ) − 2(I'−1)
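The per-row Gini index and the resulting table cost can be computed directly from the counts. A sketch (function names are ours):

```python
def gini(row):
    """Gini(S_i) = 1 - sum_j (C_ij / N_i)^2 for one row of class counts."""
    n = sum(row)
    return 1.0 - sum((c / n) ** 2 for c in row)

def gini_cost(rows):
    """Cost(Gini, T) = sum_i N_i * Gini(S_i)."""
    return sum(sum(r) * gini(r) for r in rows)
```

A pure row ([10, 0]) has Gini 0; a maximally mixed two-class row ([5, 5]) has Gini 0.5.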
Parametrized Discretization
• Parametrized Entropy
• Entropy of the row
• Gini of the row
• Parametrized Data Cost
Parametrized Discretization

• Parametrized Cost of T0
• Parametrized Goodness Function
Parameters for Known Goodness Functions
Parametrized Dynamic Programming Algorithm
Dynamic Programming Algorithm
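The algorithm details are on the slide figures, so the following is only a generic sketch of optimal discretization by dynamic programming: it merges consecutive base intervals so as to minimize an entropy encoding cost plus a fixed per-interval penalty (a stand-in for the df-based penalties above). The names and the penalty form are our assumptions, not the paper's.

```python
import math

def discretize(rows, penalty):
    """Merge consecutive base intervals to minimize
    sum_i N_i * H(S_i) + penalty * (#intervals), by dynamic programming.

    best[k] = optimal cost of discretizing the first k base intervals.
    Kept naive (block cost recomputed per split) for clarity.
    """
    def block_cost(lo, hi):
        # Cost of merging rows[lo:hi] into one interval.
        merged = [sum(r[j] for r in rows[lo:hi]) for j in range(len(rows[0]))]
        n = sum(merged)
        h = -sum(c / n * math.log(c / n) for c in merged if c > 0)
        return n * h + penalty

    n = len(rows)
    best = [0.0] + [math.inf] * n
    cut = [0] * (n + 1)          # cut[k]: start of the last interval
    for k in range(1, n + 1):
        for lo in range(k):
            cost = best[lo] + block_cost(lo, k)
            if cost < best[k]:
                best[k], cut[k] = cost, lo
    bounds, k = [], n            # recover the chosen interval boundaries
    while k > 0:
        bounds.append((cut[k], k))
        k = cut[k]
    return best[n], bounds[::-1]
```

On four base intervals whose first two rows lean toward class 1 and last two toward class 2, a moderate penalty yields the expected two-interval split.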
Experiments
Classification errors for Glass dataset (Naïve Bayesian)
Iris+C4.5
Experiments: C4.5 Validation
Experiments: Naïve Bayesian Validation
Conclusions
• We considered several seemingly different approaches to discretization and demonstrated that they can be unified through a notion of generalized entropy.
• Each of the methods discussed in the literature can be derived from generalized entropy by selecting at most two parameters.
• For a given pair of parameters, the dynamic programming algorithm selects an optimal discretization (in terms of the corresponding goodness function).
What Remains To be Done
• How can one analytically relate the goodness function of a model to the number of classification errors?
• What is the algorithm for selecting the best parameters for a given data set?