1 intelligent choices preceeding data analysis katharina morik univ. dortmund, knowledge discovery...
TRANSCRIPT
1
Intelligent Choices Preceeding Data Analysis
Katharina MorikUniv. Dortmund, www-ai.cs.uni-dortmund.de
• Knowledge Discovery in Databases (KDD)• The Mining Mart approach• Case studies
– Item sales– Intensive care
2
The UCI Library Approach
• Learning task: classification• Evaluation criteria: accuracy and coverage• Data sets
– Small number of examples– Small number of features– All and only relevant features included – No noise
3
KDD Task
• Learning task of the application needs to be transferred to a formal learning task (classification, regression, clustering)– „I want to predict sales 4 weeks ahead“– „I want to know more about my best (worst) customers“– „I want to detect fraud“
• Databases– Very large number of records– Very large number of features– Relevant fatures missing– Noise included
4
Observation
Experienced users can apply any learning system successfully to any application, since they prepare the data well...
• The representation LE of examples and the choice of a sample determines the applicability of learning methods.
• A chain of data transformations (learning steps or manual preprocessing) leads to LE of the method that delivers the desired result.
Experienced users remember prototypical successful transformation/learning chains
5
LE1
LH1
LE2
LH2
...
LEn-1
LHn-1 = LEn
LHn
LHn+m
LEn+m
...
LHn+1
= LEn+1
learning/data mining
The Real Process
application:data users performance system
6
Intelligent Choices
80% of the KDD work is invested into:• Choosing the learning task• Sampling• Feature generation, extraction, and selection• Data cleaning• Model selection or tuning the hypothesis space• Defining appropriate evaluation criteria
7
The Mining Mart Approach
Best practice cases of preprocessing chains exist...
• Data, LE and LH are described on the meta level.
• The meta-level description is presented in application terms.
• MiningMart users choose a case and apply the corresponding transformation and learning chain to their application.
... and more can be obtained!
8
Call for Participation
• MiningMart develops an operational meta-language for describing data and operators.
• MiningMart prepares the first cases of KDD.• MiningMart will present the case-base in the WWW.• You may contribute to the endavour!
– Apply the meta-language to your application and deliver it as a positive example to the case-base; or
– apply a case of MiningMart to your data.
9
The Consortium
• Katharina Morik Univ. Dortmund, D (Coordinator)• Lorenza Saitta Univ. Piemonte del Avogadro, I• Pieter Adriaans Perot Systems Netherland, NL
? Michael May GMD, D• Jörg-Uwe Kietz SwissLife, CH• Fabio Malabocchia TILab, I
10
The Mining Mart System
Manual Pre-processingOperators: Time multi-relation
Meta-dataApplicability
ML-Operators: Time Parameters Features Description
Logic
Raw-data
Meta-data
Augmented data of results
Meta-data
KDD process tasks, problem models
Case-base of successful KDD process
Human Computer Interface
11
The Meta Model for Meta Data
The Relational Modeldescribes the database
The Conceptual Modeldescribes the individuals and classes of the domainwith their relations
The Case Modeldescribes chains of preprocessing operators
The Execution Modelgenerates SQL statementsor calls to external tools
12
Use of the Meta Model
• The meta model is stored in a database.• The database manager delivers the relational model.• The data analyst delivers the conceptual model.• The KDD expert delivers or adjusts a case model.
First cases are delivered by the Mining Mart project.• The system compiles meta data into SQL statements
and calls to external tools hence executing the case model on the data.
13
Sales of Items of a Drugstore
0
20
40
60
80
100
120
140
160
01
|96
07
|96
13
|96
19
|96
25
|96
31
|96
37
|96
43
|96
49
|96
03
|9
7 09
|97
15
|97
21
|97
27
|97
33
|97
39
|97
45
|97
51
|97
05
|98
11
|98
17
|98
23
|98
29
|98
35
|98
41
|98
47
|98
53
|98
Week
Sal
es
Insect killers 1Insect killers 2Sun milkCandles 1Baby food 1BeautySweetsSelf-tanning creamCandles 2Baby food 2
14
Learning Task 1:Predict Sales of an Item
Given drug store sales data of 50 items in 20 shops over 104 weeks
predict the sales of an item such that
the prediction never underestimates the sale,
the prediction overestimates less than the rule of thumb.
Observation: 90% of the items are sold less than 10 times a week.
Requirement: prediction horizon is more than 4 weeks ahead.
15
Shop Application -- Data
Shop Week Item1 ... Item50Dm1 1 4 ... 12Dm1 ... ... ... ...Dm1 104 9 ... 16Dm2 1 3 ... 19... ... ... ... ...Dm20 104 12 ... 16
LE DB1: I: T1 A1 ... A 50; set of multivariate time series
16
Preprocessing
• From shops to items: multivariate to univariate
LE1´: i:t1 a1 ... tk ak
For all shops for all items:Create view Univariate asSelect shop, week, itemi
Where shop=“dmj”From Source;
• Multiple learning
Dm1_Item1...
1 4 ... 104 9
Dm1_Item50 1 12... 104 16
....
Dm20_Item50 1 14... 104 16
17
Method 1 for Task 1:Exponential Smoothing
• Univariate time series as input ( LE1` ),
• incremental method: current hypothesis h and new observation o yield next hypothesis by h := h + o, where is given by the user,
• predicts sales of n-next week by last h.
18
Method 2 for Task 1:SVM in the Regression Mode
• Multiple learning: for each shop and each item, the support vector machine learns a function which is then used for prediction.
• Asymmetric loss:– underestimation is multiplied by 20,
i.e. 3 sales too few predicted -- 60 loss– overestimation is counted as it is,
i.e. 3 sales too much predicted -- 3 loss(Stefan Rüping 1999)
19
Further Preprocessing
• Obtaining many vectors from one series by sliding windows
LH5 i:t1 a1 ... tw aw
move window of size w by m stepsDm1_Item1_1Dm1_Item1_2
12
4...4...
56
78
...Dm1_Item1_100 100 6... 104 9
...
...Dm20_Item50_100 100 12... 104 16
20
window horizon value to predict
time
sales
Article 766933 (bag?)
21
horizon SVM exp. smoothing
1 56.764 52.40
2 57.044 59.04
3 57.855 65.62
4 58.670 71.21
8 60.286 88.44
13 59.475 102.24
Comparison with Exponential Smoothing
22
loss
horizon
23
Learning Task 2:Learning Sequences
Are there typical sequences that are valid for all items?– After an action for an item its sales decrease.– Each decrease of sales is followed by an increase.
• Given a set of subsequent eventsfind frequent sequences.
24
Univariate time series
?
Subsequent events
?
LHn-1 = LEn
LHn frequent event sequences
From Sales Data toEvent Sequences
Multivariate time series
?
25
From Series to Sequences
• Given some time series detect events (states, intervals)
An event is a triple (state, begin,finish).
The state might be a label or a (mean) value.
Typical labels are: increase, decrease, stable...
26
Unsupervised Methods
• All contiguous observations within one level (range) form one event (Bauer).
• All contiguous observations with more or less the same gradient form one event (Morik, Wessel).
• Clusters of subsequences form events (Das).
27
Moving Gradient
•• •
•
• • •
• • •
Time1 4 5 7 10 12
Determining the time intervals with user-given tolerance threshhold.Abstracting into classes of gradients: increase,peak,decrease, stable...
28
Sales of Item 182830 in Shop 55
29
Summarizing Sales byTolerant Moving Gradient
(Wessel, Morik 1999)
30
Univariate time series
Moving gradient
Subsequent events
?
LHn-1 = LEnLHn frequent event sequences
From Subsequent Eventsto Event Sequences
Multivariate time series
?
31
Transformation into Facts
stable(182830,1,33,0).decreasing(182830, 33,34,-6).stable(182830, 34, 39,0).increasing(182830, 39, 40,7).decreasing(182830, 40, 42,-5).stable(182830, 42,108,0).
LE4:
32
Summarizing Item 646152 in Shop 55 by Intolerant Moving Gradient
33
Corresponding Facts
increasing(646152,1,2,3).decreasing(646152,2,3,-11).increasingPeak(646152,3,4,22)....stable(646152, 25,37,0).increasing(646152, 37, 38, 8).decreasing(646152, 38, 39, -7).stable(646152, 39,40, 0).increasing(646152, 40, 41,7).decreasing(646152, 41, 42,-8).increasing(646152, 42, 43,10).stable(646152, 43, 48,-1).
small time intervals
34
Method 3 for Task 2: Inductive Logic Programming
• Rules about sequences:p1(I, Tb, Te, A r), p2(I, Te, Te2, As) p3(I, Te2, Te3, A t)
• Results for sequences of sales trends:increasing (Item, Tb, Te) decreasing(Item, Te, Te2) increasing (Item, Tb, Te), decreasing(Item, Te, Te2)
stable(Item, Te2, Te3)
35
Same Data -- Several Cases
• Predict sales of a particular item in a particular shopmultivariate to univariate, multiple exponential smoothing ORmultivariate to univariate, sliding windows, multiple learning with regression SVM
• Find relations between trends that are valid for all sales in all shopsmultivariate to univariate, summarizing, transformation into facts, rule learning
36
Applications in Intensive Care
• On-line monitoring of intensive care patients• high-dimensional data about patient and medication• measured every minute• stored in the Emtec database of patient records ---• learning when to intervene in which way.
37
Patient G.C., male, 60 years old Hemihepatektomie right
38
The Data
LE DB2 i 1: t 1 a 1 1 ... a 1 k
i1: t 2 a 2 1 ... a 2 k
...
i2: t 1 a 1 1 ... a 1 k
...
set of rows for each patient:1 row for each minute
39
Preprocessing
• Chaining database rowsi 1: t 1 a 1 1
... a 1 k, t 2 a 2 1 ... a 2 k , ...
• Multivariate to univariatei 1: t 1 a 1, t 2 a 1
... t m a 1
i 1: t 1 a 2, t 2 a 2 ... t m a 2
...
• Detecting level changes and outliers
40
Phase State Analysis
Phase state )1ty,t(yyt +=r
Time series Ny,...,1y
Deter-ministicProcess
yt
time t yt
yt+1
AR(1)-process with outlier (AO)
yt
timet yt
yt+1
Heart rate
HRt
time t yt
yt+1
U.Gather, M. Bauer
41
Level Change Detection
level_change(pat4999, 50, 112, hr, up)
level_change(pat4999, 112, 164, hr, down)
level_change(pat4999, 10, 74, art, constant)
level_change(pat4999, 74, 110, art, down)
Computed Feature
Comparing norm values for a vital sign and its mean in a time interval (± standard deviation):
deviation(pat4999, 10, 74, art, up)
42
Learning Task 3:Recommend Interventions for Patients
Are there valid rules
for all multivariate time series,
such that therapeutical interventions follow from a patient’s state?
43
Method 3:Inductive Logic Programming
Given patient records in the form of facts:• deviations -- time intervals• therapeutical interventions -- time points• types of vital signs (group1: hr, swi, co; group2: art, vr)
Learn rules about interventions:
group1(V), deviation(P, T1, T2, V, Dir)
noradrenaline(P, T2, Dir)
44
The Chain of Preprocessing Steps
LE DB2 : i 1: t 1 a 1 1
... a 1 k
i1: t 2 a 2 1 ... a 2 k
... i2: t 1 a 1 1
... a 1 k
chaining db rowsi 1: t 1 a 1 1
... a k1 t 2 a 1 2
... a k 2 ...i2: t 1 a 1 1
... a k 1t 2 a 12
... a k 2 ...
multi- tounivariatei 1: t 1 a 1 1 t 2 a 1 2
i 1: t 1 a 2 1 t 2 a 2 2
...
level changes(i 1,t i ,t j,A)...
computedfeature(i 1,t i ,t j,A,D)
relational learningp 1(I,T i,T j,A,D), p 2 (I,Tj,Tk,A,D)
p 3(I,Tk, Dir)
45
Learning Task 4:Predict Next Minute‘s
Intervention
Given a patient’s state at time ti,
learn whether and how to intervene at t i+1
Preprocessing:• Selection of time points where an intervention was done• Multiple to binary class
for each drug, form the concepts drug_up, drug_down
• Multiple learning for each binary class resulting inclassifiers for each drug and direction of dose change (SVM_light)
46
The Chain of Preprocessing Steps
LE DB2 : i 1: t 1 a 1 1
... a 1 k
i1: t 2 a 2 1 ... a 2 k
... i2: t 1 a 1 1
... a 1 k
Select time points with interventionsi 1: t i a 1 i
... a ki
i2: t j a 1j ... a kj
...
Form binary classesa1_up +: a 2 ...
a k
...
a1_up-: a 2 ...a k
....
a6_down+: a 2 ...a k..
a6_down-: a 2 ...
a k....
Learning classifiers using SVM_lighta1_up +: w2 a 2 ...
wk a k
...
a6_down +: w2 a 2 ... wk a k
47
Same Data -- Several Cases
• Find time relations that express therapy protocols chaining db rows, multivariate to univariate, level changes, deviations, RDT
• Predict intervention for a particular drug select time points, multiple to binary class, SVM_light
48
Behind the Boxes
Db schemaindicating time attribute(s),granularity,...
Select statement in abstract form,instantiated by db schema
Creating views in abstract form,instantiated by db schema andlearning task
Calling SVM_light and writing results
Syntactictransformation for SVM
Multiple learning control
49
Functionality of MD-Compiler
Table_x Table_y Table_zBase Tables
V_01
RowSelection
V_02 V_03
RowSelection RowSelection
Views,created by theMD-Compiler
V_01 vc_1
MultiColumnFeatureConstruction
vc_2 vc_3
V_04
FeatureSelection
RowSelection
V_05
Manual preprocessing operators of M4 are very elementary.
Results of operators are mostly views.
50
Several view definitions
CREATE VIEWV_01(attrib_a , attrib_b , attrib_c )ASSELECTx_id , x_a, x_bFROM Table_x;
Information from M4-Relational(BaseAttribute)
Inline-View
Physical View-Object
created by MD-Compiler
for reading data and
executing statistics
Inline-View
stored as sql-string
in M4-Relational
Inline View versus Physical View-Object
51
Several view definitions
Inline View versus Physical View-Object
Table_x Table_y Table_zBase Tables
V_01
RowSelection
V_02 V_03
RowSelection RowSelection
Views,created by theMD-Compiler
V_01 vc_1
MultiColumnFeatureConstruction
vc_2 vc_3
V_04
FeatureSelection
RowSelection
V_05
Create View V_05 (...) as
select ... from (select ... from (select ... from Table_x)
instead of
Create View V_05 (...) as select ... from V_04
52
Several view definitions
Materialized view:
Materialized_View_1
Table_x Table_y
V_01
V_04
Table_z
V_02 V_03
V_05
V_06
V_07
Created by the MD-Compiler
automatically in the background
+ performance gain when selecting
data from V_04 or V_07
+ all operation-outputs can be realized
as views- additional storage needed
53
System-Architecture
Mining Mart Database
T1
T4
T2 T3
T5
T6
MD Compiler
Java-Code
M4-RelationEditor
Statistics
PL/Sql
TimeOperators
Java-Codefrom UniDo
M4-Relational Model
M4-Case Model
M4-Concept
EditorM4-Conceptual Model
M4-Case Editor
54
Summary of Cases Involving Time
Db schemaindicating time attribute(s)
Inductive logic programming (RDT)SVM_light for classificationSVM for regressionExponential smoothing
SyntactictransformationsL E1
...LE4
Sliding windowsSummarizing windowsLevel changes
55
Summary
• Preprocessing is the key issue in data analysis!• Goal: Support users in making intelligent choices• Approach: Cases of best practice• View of a computer scientist:
– Scalability to very large databases– Meta-data driven processing
• Case studies on analysing data involving time
56
MiningMart Approach
• Manager -- end-userknows about the business case
• Database manager knows about the data
• Case designer -- power-userexpert in KDD
• Developer supplies (learning) operators