TRANSCRIPT
[Page 1]
Information-theoretic Learning and Data Mining
Kenji Yamanishi
The University of Tokyo, JAPAN
Oct. 18th, 2009
[Page 2]
Contents
1. Information-theoretic Learning based Novelty Detection
2. Theory of Dynamic Model Selection
3. Applications of DMS to Data Mining
   3-1. Masquerade Detection
   3-2. Failure Detection
   3-3. Topic Dynamics Detection
   3-4. Network Structure Mining
4. Summary
[Page 3]

1. Information-theoretic Learning and Novelty Detection
[Page 4]

Data Mining Issues

Data Mining = Machine Learning + Pre/Post-Processing
Data sources (WEB, logs, RFID, cellular, media, sensor): distributed and heterogeneous data; dynamic and non-stationary data.
Tasks: pattern discovery, anomaly detection, knowledge discovery, future prediction.
Applications: marketing, CRM, SCM, ERP, community management; risk management (fraud detection, failure detection, compliance management, security).
[Time-series chart, 2004/8/18 12:00 through 2004/8/21 12:00: measured values, predicted/estimated values, upper limit, and threshold plotted on a 0-300 scale.]
Mining from large amount of dynamic and heterogeneous data is a critical issue
[Page 5]
Information-theoretic Learning and Data Mining
Non-stationary Dynamic Information Sources
Stochastic Complexity Minimization ..... Minimum Description Length Principle (Data Compression and Summarization)
Dynamics of Statistical Models
Structural Changes
Novelty Detection
Structured Information
Dynamic Data
Structural changes from dynamic data are discovered using stochastic complexity minimization
[Page 6]

2. Theory of Dynamic Model Selection
[Page 7]
Concept of Dynamic Model Selection
Issue: How do you track changes of statistical models
under non-stationary environments?
Selecting a model sequence that explains the data best:

  time:  ------------------------------------>
  Data:  x1, ..., xt-1, xt, ..., xn
  Model: M1, ..., M1,   M2, ..., M2   (Model 1 switches to Model 2)
Methodologies
■ Discounting Learning
■ Windowing
■ Dynamic Model Selection (DMS)
[Page 8]

Ex.1: UNIX Command Pattern Analysis

Probabilistic model: mixture model. Tracking changes of the mixture size leads to detection of a new pattern.

UNIX command sequences (one command per time step):
  1 emacs  2 cp  3 bash  4 perl  ...
  1 cd  2 ls  3 emacs  4 platex  ...
  1 emacs  2 make  3 ./a.exe  ...

Over time, #patterns: 1 → 2
  Pattern 1 (cd, ls)  →  Pattern 1 (cd, ls) + Pattern 2 (sh, ps, vi)
[Page 9]

Ex.2: Curve Fitting

K: degree of polynomial. The degree of the curve changes over time (e.g., K=2 → K=4 → K=5).
Static model selection picks a single degree, whereas dynamic model selection tracks curve degrees as they change over time.
[Page 10]

Related Work

• Static model selection
  AIC [Akaike 1973], BIC [Schwarz 1978], MDL [Rissanen 1978], MML [Wallace & Freeman 1987]
  - Under the stationarity assumption
• Tracking the best expert [Herbster & Warmuth 1998]
  Derandomization [Vovk 1997] [Cesa-Bianchi & Lugosi 2006]
  - The best model (= "best expert") may change over time
  - Cannot track the sequence of best models itself
• MDL-based dynamic model selection under the non-stationarity assumption [Yamanishi and Maruyama 2007]
• Switching [van Erven, Grunwald, de Rooij 2008]
  - Convergence as with MDL and convergence rate as with AIC
• Changing dependency detection [Fearnhead & Liu 2007]
[Page 11]

MDL-based Information Criterion for DMS

Evaluating the goodness of a model sequence from the Minimum Description Length (MDL) Principle [Rissanen 1978].

Data sequence: x^n = x_1, ..., x_n.  Model sequence: k^n = k_1, ..., k_n.
Criterion: the code-length required to encode both x^n and k^n under the prefix condition (Kraft's inequality).
Minimize with respect to k^n for a given x^n:
・Batch DMS
・Sequential DMS
[Page 12]

Predictive Distribution

The dynamics of model changes are described using a predictive distribution.
[Page 13]
Switching Distribution
[Page 14]

Description of Model Transition

[Diagram: the model index switches from k to k+1 at a change point along the time axis. The code length consists of the predictive code length on each segment plus the code length for the model transition.]
[Page 15]

DMS (Dynamic Model Selection) Criterion

Assumption: the model transits probabilistically.
Model transition probability (\varepsilon: parameter):

  P(k_t \mid k_{t-1}; \varepsilon) =
    1 - \varepsilon,     if k_t = k_{t-1}
    \varepsilon / 2,     if k_t = k_{t-1} \pm 1 and k_t \in \{1, ..., K\}

(From state k, the model stays with probability 1 - \varepsilon and moves to k-1 or k+1 with probability \varepsilon/2 each.)

DMS criterion:

  \ell(x^n : k^n) = -\sum_{t=1}^{n} \log P(x_t \mid x^{t-1}, k_t; \hat{\theta}) - \sum_{t=1}^{n} \log P(k_t \mid k_{t-1})

i.e., (predictive code-length for the data) + (predictive code-length for the model sequence), where \hat{\theta} is the parameter estimate.

Input:  x^n = x_1, ..., x_n
Output: k^n = k_1, ..., k_n minimizing \ell(x^n : k^n)
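The transition kernel and the model-sequence term of the criterion above can be sketched directly in code. A minimal sketch assuming the simple ±1 kernel shown on this slide (function names are illustrative, not from the paper):

```python
import math

def transition_code_length(k_prev, k_curr, eps, K):
    """Code length -log P(k_t | k_{t-1}; eps) for the +/-1 kernel:
    stay with probability 1 - eps, move one step with probability eps/2."""
    if k_curr == k_prev:
        p = 1.0 - eps
    elif abs(k_curr - k_prev) == 1 and 1 <= k_curr <= K:
        p = eps / 2.0
    else:
        return float("inf")  # transition not allowed by this kernel
    return -math.log(p)

def model_seq_code_length(ks, eps, K):
    """Predictive code length of a model sequence k_1..k_n
    (the second sum of the DMS criterion above)."""
    return sum(transition_code_length(a, b, eps, K)
               for a, b in zip(ks, ks[1:]))
```

Boundary handling (what happens at k = 1 and k = K) is a detail of the full treatment in the paper; the sketch simply disallows moves outside {1, ..., K}.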
[Page 16]

Coding Bound for Batch DMS for Switching Distributions

Theorem 1 [Yamanishi and Maruyama IEEE Trans IT 07]: There exists a batch DMS algorithm (DMS1) for switching distributions that runs in time O(Kn^2), and its total code-length is upper bounded by (data complexity for switching) + (model sequence complexity).

[Trellis diagram over states 1, ..., K: at each time, at each state, select an optimal path from the path sets selected at the previous time.]

POINT 1: Optimal path search based on dynamic programming.
POINT 2: Krichevsky-Trofimov estimation of model transition probabilities.
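As an illustration of POINT 1, here is a minimal dynamic-programming (Viterbi-style) path search over model indices. For simplicity it fixes the transition parameter instead of estimating it with the Krichevsky-Trofimov method, and it assumes a precomputed table `data_cost[t][k]` of per-point predictive code lengths; all names are illustrative, and this simplified version costs O(K^2 n) rather than matching the theorem's algorithm:

```python
import math

def batch_dms(data_cost, eps):
    """Select the model sequence minimizing
    sum_t data_cost[t][k_t] + sum_t -log P(k_t | k_{t-1}; eps)
    by dynamic programming. Models are indexed 0..K-1 here."""
    n, K = len(data_cost), len(data_cost[0])
    INF = float("inf")

    def trans(a, b):  # +/-1 transition kernel code length
        if a == b:
            return -math.log(1.0 - eps)
        if abs(a - b) == 1:
            return -math.log(eps / 2.0)
        return INF

    cost = list(data_cost[0])  # best-path cost ending in each state at t=0
    back = []                  # backpointers for each time step
    for t in range(1, n):
        new_cost, ptr = [], []
        for k in range(K):
            j = min(range(K), key=lambda m: cost[m] + trans(m, k))
            new_cost.append(cost[j] + trans(j, k) + data_cost[t][k])
            ptr.append(j)
        cost = new_cost
        back.append(ptr)
    # backtrack the minimizing model sequence
    k = min(range(K), key=lambda m: cost[m])
    seq = [k]
    for ptr in reversed(back):
        k = ptr[k]
        seq.append(k)
    return seq[::-1]
```

For example, if model 0 fits the first half of the data cheaply and model 1 the second half, the returned sequence switches once, at the point where the saved data cost outweighs the switching cost.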
[Page 17]

Sequential DMS Problem

At each time t, choose a model sequence from the model sequence set restricted to a window of size B (the interval [t-B, t]) so that the DMS criterion, i.e., (data code-length for the switching distribution) + (predictive code-length of the model sequence), is minimum.
[Yamanishi and Sakurai WITMSE 2010]
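A minimal sketch of this windowed re-selection, assuming some batch model-sequence selector `select` is available (the interface is illustrative, not the paper's algorithm):

```python
def sequential_dms(stream_costs, B, select):
    """At each time t, re-run a batch selector over only the last B
    points (the window ending at t) and emit the model it assigns
    to the newest point."""
    out = []
    for t in range(len(stream_costs)):
        window = stream_costs[max(0, t - B + 1): t + 1]
        out.append(select(window)[-1])
    return out
```

The window keeps the per-step cost bounded regardless of stream length, at the price of ignoring evidence older than B steps.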
[Page 18]
3. Applications of DMS to Data Mining
[Page 19]

3-1. Masquerade Detection

Computer usage logs (UNIX command sequences over time):
  1 emacs  2 cp  3 bash  4 perl  ...
  1 cd  2 ls  3 emacs  4 platex  ...
  1 emacs  2 make  3 ./a.exe  ...

Latent information: structural changes. A "program coding" pattern (emacs, cp, make) is later joined by a new pattern (cp **.exe) suggesting an information leak.

Tracking the emergence of a new pattern leads to the discovery of a masquerader's behavior.
[Page 20]

Model  [Maruyama & Yamanishi ITW04]

Command session y_j = (ps, cd, ls, cp, ...) = (y_1, ..., y_{T_j}), of length T_j.
y^1, ..., y^M: command patterns are modeled using an HMM mixture. We wish to detect a masquerade by tracking changes in the mixture size.

Mixture model (K: number of different behavior patterns, i.e., the mixture size):

  P(y_j \mid \theta) = \sum_{k=1}^{K} \pi_k P_k(y_j \mid \theta_k)

Each behavior pattern is represented by a first-order HMM with hidden states (x_1, ..., x_{T_j}) forming a Markov chain:

  P_k(y_j \mid \theta_k) = \sum_{x_1, ..., x_{T_j}} \gamma_k(x_1) \prod_{t=2}^{T_j} a_k(x_t \mid x_{t-1}) \prod_{t=1}^{T_j} b_k(y_t \mid x_t)

\gamma_k: initial prob., a_k: transition prob., b_k: event prob.
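The per-component likelihood P_k(y_j) above can be computed without the exponential sum over hidden states via the standard forward algorithm; a minimal sketch (array shapes and names are illustrative):

```python
import numpy as np

def hmm_likelihood(y, gamma, A, B):
    """Forward algorithm for one first-order HMM component.
    gamma: (S,) initial probs; A: (S, S) transitions, A[i, j] = a(j | i);
    B: (S, V) emissions, B[s, v] = b(v | s); y: list of symbol indices."""
    alpha = gamma * B[:, y[0]]
    for sym in y[1:]:
        alpha = (alpha @ A) * B[:, sym]  # propagate, then emit
    return float(alpha.sum())

def mixture_likelihood(y, weights, components):
    """P(y) = sum_k pi_k P_k(y) for the HMM mixture."""
    return sum(w * hmm_likelihood(y, g, A, B)
               for w, (g, A, B) in zip(weights, components))
```

The forward recursion makes each session's likelihood O(T S^2) instead of O(S^T), which is what makes per-session scoring over a stream practical.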
[Page 21]

Flow of Behavior Change Detection

The session stream (sessions over time) is processed in three stages:
① On-line discounting learning algorithm for each mixture model (HMM mixtures with K = 1, ..., Kmax): learning parameters.
② Dynamic model selection: selection of the optimal mixture size K*; emerging-pattern identification.
③ Universal-test-statistic-based scoring with a dynamically optimized threshold: anomaly score computation, then anomalous behavior detection.

[Plots: selected mixture size over time; anomaly score over time against the threshold.]
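Stage ③ can be illustrated with a much simpler stand-in: score each session by its log loss under the learned mixture and compare against an empirical-quantile threshold. The talk's universal-test-statistic scoring and threshold optimization are more elaborate; this is only an assumed sketch:

```python
import math

def anomaly_score(p_session):
    """Log loss of a session under the learned model:
    the rarer the session, the higher the score."""
    return -math.log(max(p_session, 1e-300))  # floor avoids log(0)

def quantile_threshold(recent_scores, q=0.95):
    """A simple dynamically updated threshold: the empirical
    q-quantile of recently observed scores."""
    s = sorted(recent_scores)
    return s[min(int(q * len(s)), len(s) - 1)]
```

Sessions whose score exceeds the threshold are flagged; recomputing the quantile over a recent window lets the threshold adapt as normal behavior drifts.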
[Page 22]

Masquerade Pattern Detection

[Plot: selected mixture size (0-3) over sessions t = 0, ..., 1400; masquerade data at t = 820, ..., 849.]

Ex. User30: T-DMS1 detects the masquerader's command pattern at t = 820.

Top 7 transitions of each cluster (k = 2 at t = 820):
  Pattern 1: file→ridstd, cat→rdistd, touch→tcsh, df→tcsh, rshd→rdistd, tcsh→rshd, rdistd→tcsh
  Pattern 2: cat→post, cat→file, ln→ln, generic→ln, m→generic, file→post, awk→cat

The masquerader's pattern deviates largely from the normal user's pattern (mainly operating "remote shells").
[Page 23]

3-2. Failure Detection

■ Syslog: a sequence of events collected using the BSD syslog protocol (fields: ID, time stamp, event, severity, attributes, message).
■ Includes information crucial to network failures. Examples:

  ## Nov 13 00:06:23: ERR bridge: !brdgursrv: queue is full. discarding a message.
  ## Nov 13 10:15:00: WARN: INTR: ether2atm: Ethernet Slot 2L/1 Lock-Up!!
  ## Nov 13 10:15:10: WARN: INTR: ether2atm: Ethernet Slot 2L/2 Lock-Up!!
  ## Nov 13 10:15:20: WARN: INTR: ether2atm: Ethernet Slot 2L/3 Lock-Up!!

y^1, ..., y^M: syslog patterns are also modeled using an HMM mixture. We wish to identify emerging failure patterns by tracking changes in the mixture size.

[Yamanishi & Maruyama KDD2005]
![Page 24: Information-theoretic Learning and Data Mining...Information-theoretic Learning and Data Mining Non-stationary Dynamic Information Sources Stochastic Complexity Minimization …..Minimum](https://reader030.vdocuments.us/reader030/viewer/2022040720/5e2b9283d387d320a7452bea/html5/thumbnails/24.jpg)
Emerging Pattern Identification

■ Goal: track a new behavior pattern caused by a network failure.
Identify a new pattern by detecting the change point of the mixture size; the component with the smallest occurrence probability corresponds to the new pattern.

[Plot: the lock-up failure is detected as an emerging pattern.]
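The identification rule above amounts to two small steps, sketched here (helper names are illustrative):

```python
def mixture_size_change_points(sizes):
    """Time points at which the selected mixture size K_t changes."""
    return [t for t in range(1, len(sizes)) if sizes[t] != sizes[t - 1]]

def emerging_component(weights):
    """When the size increases, the component with the smallest
    occurrence probability is taken as the newly emerged pattern."""
    return min(range(len(weights)), key=weights.__getitem__)
```

At each detected size increase, the lowest-weight component is reported, since a pattern that has just emerged has had little data to accumulate weight.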
[Page 25]

3-3. Topic Dynamics Detection

Text processing at a contact center: inquiries and claims are clustered into topics (Topic 1, Topic 2, Topic 3). DMS tracks changes of the topic organization (e.g., to Topic 1, Topic 2, Topic 4, Topic 5), enabling topic emergence detection.

[Plot: activity number of each topic vs. time; emergence of Topic 5.]

[Morinaga and Yamanishi KDD2004]
[Page 26]

Dynamic Topic Modeling with Gaussian Mixture

Each input document is a dot; each p.d.f. component of a Gaussian mixture corresponds to a main topic. The mixture is learned online over time, and an emerging component is detected as a new topic.
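The online learning step can be sketched as a discounting-style update in the spirit of the SDEM algorithm of reference 1). This is an assumed simplification, not the paper's exact update; the responsibilities `resp` are assumed computed elsewhere:

```python
import numpy as np

def discounting_update(weights, means, resp, x, r=0.01):
    """One online step for a Gaussian mixture: old statistics are
    discounted by (1 - r), so recent documents weigh more and an
    emerging topic component can grow quickly.
    weights: (K,) mixing weights; means: (K, d) component means;
    resp: (K,) responsibilities of each component for document x."""
    weights = (1 - r) * weights + r * resp
    step = (r * resp / np.maximum(weights, 1e-12))[:, None]
    means = means + step * (x - means)  # pull responsible means toward x
    return weights / weights.sum(), means
```

Because the discounting factor forgets old statistics geometrically, a burst of documents from a new topic quickly inflates one component's weight, which is what the emergence detection keys on.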
[Page 27]

Topic Dynamics Detection at Contact Center

[Plot: topic activity vs. date (3/23 through 5/22) for the topic "Transfer" and the topic "Failure of service X"; the period when the topics "Service ZZZ" and "Failure" are active is highlighted. The emergence of a new topic and the disappearance of an old topic are both visible.]

Topic structure changes in the contact-center text stream.
[Page 28]

Changes of Topic Number

[Plot: number of main topics over time, 2/23 through 5/13, on a 0-50 scale.]

• The total number of inquiries at the contact center is 1202.
[Page 29]

3-4. Network Structure Mining

Time-evolving networks exhibit the emergence of new hubs and the emergence of new hierarchies. Tracking network hierarchy changes leads to new hub detection.

Ex. SNS networks: detection of influencers, criminal groups, new communities.
Other examples: physical networks, coauthor networks, co-purchase networks.
[Page 30]

Graph Structure Change Detection

Time-evolving graphs show changes of partitioning structure and the emergence of new clusters (nodes ①-⑦ regrouped over a time series of correlation matrices). Tracking a new graph cluster leads to discovery of a new community.

[Hirose, Yamanishi, Nakata, Fujimaki KDD2009]
[Page 31]
4. Summary
• Novelty detection from dynamic data is a challenging issue in data mining.
• Dynamic model selection based on the MDL principle is a novel method to detect structural changes in a data stream.
• Applications cover masquerade detection, failure detection, topic emergence detection, and network structure mining, and the method is applicable to a wide range of areas.
[Page 32]
References

1) K. Yamanishi, J. Takeuchi, G. Williams, and P. Milne: On-line Unsupervised Outlier Detection Using Finite Mixtures with Discounting Learning Algorithms. Data Mining and Knowledge Discovery Journal, Vol. 8, Issue 3, pp. 275-300, May 2004.
2) J. Takeuchi and K. Yamanishi: A Unifying Framework for Detecting Outliers and Change-points from Time Series. IEEE Transactions on Knowledge and Data Engineering, 18(4), pp. 482-492, 2006.
3) S. Morinaga and K. Yamanishi: Tracking Dynamics of Topic Trends Using a Finite Mixture Model. Proceedings of the Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2004), ACM Press, 2004.
4) K. Yamanishi and Y. Maruyama: Dynamic Syslog Mining for Network Failure Monitoring. Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2005), pp. 499-508, ACM Press, 2005.
5) K. Yamanishi and Y. Maruyama: Dynamic Model Selection with Its Applications to Novelty Detection. IEEE Transactions on Information Theory, Vol. 53, No. 6, pp. 2180-2189, June 2007.
6) S. Hirose, K. Yamanishi, T. Nakata, and R. Fujimaki: Network Anomaly Detection based on Eigen Equation Compression. Proceedings of the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD2009), 2009.
7) K. Yamanishi and E. Sakurai: Extensions and Probabilistic Analysis of Dynamic Model Selection. 2010 Workshop on Information Theoretic Methods for Science and Engineering (WITMSE 2010), Tampere, Finland, August 2010.