04/22/2023 1
ADVANCED TOPICS IN DATA MINING
CSE 8331 Spring 2010
Part I
Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University

Data Mining Outline
EMM
Stream Mining
Text Mining
Bioinformatics Mining

EMM Overview
Time Varying Discrete First Order Markov Model
Nodes are clusters of real world states.
Learning continues during prediction phase.
Learning:
  Transition probabilities between nodes
  Node labels (centroid of cluster)
  Nodes are added and removed as data arrives

MM
A first order Markov Chain is a finite or countably infinite sequence of events {E1, E2, …} over discrete time points, where Pij = P(Ej | Ei), and at any time the future behavior of the process is based solely on the current state.
A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that:
  S = {N1, N2, …, Nm}, and
  A = {Lij | i ∈ 1, 2, …, m, j ∈ 1, 2, …, m}, and
  each arc, Lij = <Ni, Nj>, is labeled with a transition probability Pij = P(Nj | Ni).
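
The definition above can be illustrated with a minimal sketch that estimates Pij = P(Nj | Ni) from an observed state sequence by counting consecutive pairs. The function name and the sample sequence are illustrative assumptions, not part of the slides.

```python
from collections import defaultdict


def transition_probabilities(sequence):
    """Estimate Pij = P(Nj | Ni) from an observed sequence of state labels."""
    counts = defaultdict(lambda: defaultdict(int))
    # Count every consecutive pair (current state, next state).
    for current, nxt in zip(sequence, sequence[1:]):
        counts[current][nxt] += 1
    probs = {}
    for i, row in counts.items():
        total = sum(row.values())
        # Normalize each row so the outgoing probabilities of a state sum to 1.
        probs[i] = {j: c / total for j, c in row.items()}
    return probs


# Hypothetical state sequence, for illustration only.
P = transition_probabilities(["N1", "N2", "N1", "N2", "N3", "N1"])
```

Each row of P depends only on the current state, which is the first order Markov property stated above.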
EMM Definition
Extensible Markov Model (EMM): at any time t, EMM consists of an MC with designated current node, Nn, and algorithms to modify it, where algorithms include:
EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t.
EMMIncrement algorithm, which updates MC at time t + 1 given the MC at time t and clustering measure result at time t + 1.
EMMDecrement algorithm, which removes nodes from the EMM when needed.

EMM Cluster
Find closest node to incoming event.
If none “close” create new node
Labeling of cluster is centroid of members in cluster
O(n)
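
A minimal sketch of the matching step described above, assuming Euclidean distance and a fixed distance threshold for "close" (the slides do not specify either). The function name, the node layout and the threshold value are hypothetical.

```python
import math


def emm_cluster(nodes, event, threshold):
    """Return the index of the node matched to `event`, adding a node if none is close.

    nodes: list of dicts {"centroid": [...], "count": int}, modified in place.
    """
    best, best_dist = None, math.inf
    for idx, node in enumerate(nodes):  # O(n) scan over the existing nodes
        d = math.dist(node["centroid"], event)
        if d < best_dist:
            best, best_dist = idx, d
    if best is None or best_dist > threshold:
        # No node is "close": the event starts a new node.
        nodes.append({"centroid": list(event), "count": 1})
        return len(nodes) - 1
    node = nodes[best]
    node["count"] += 1
    # Incremental mean keeps the label equal to the centroid of the members.
    node["centroid"] = [c + (x - c) / node["count"]
                        for c, x in zip(node["centroid"], event)]
    return best


nodes = []
events = [(18, 10, 3, 3, 1, 0, 0), (17, 10, 2, 3, 1, 0, 0), (16, 9, 2, 3, 1, 0, 0),
          (14, 8, 2, 3, 1, 0, 0), (14, 8, 2, 3, 0, 0, 0), (18, 10, 3, 3, 1, 1, 0)]
# threshold=2.0 is an assumed value chosen only for this example.
assignments = [emm_cluster(nodes, e, threshold=2.0) for e in events]
```

The number of nodes created depends entirely on the assumed threshold, so this sketch is not meant to reproduce any particular figure.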

EMMSim
Find closest node to incoming event.
If none “close” create new node
Labeling of cluster is centroid/medoid of members in cluster
Problem
  Nearest Neighbor O(n)
  BIRCH O(lg n)
    Requires second phase to recluster initial

EMM Increment
[Figure: successive states of the Markov chain over nodes N1, N2 and N3 as input events arrive. Arcs are labeled with transition probabilities such as 1/3, 2/3, 1/2 and 1/1.]
Vectors shown in the figure:
<18,10,3,3,1,0,0>
<17,10,2,3,1,0,0>
<16,9,2,3,1,0,0>
<14,8,2,3,1,0,0>
<14,8,2,3,0,0,0>
<18,10,3,3,1,1,0>
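
One way to sketch the increment step is to keep counts of observed transitions and derive probabilities as count ratios. This assumes each probability is the number of transitions from i to j divided by the number of transitions leaving i; the class and method names are hypothetical.

```python
from collections import defaultdict


class TransitionCounter:
    """Hypothetical bookkeeping for an incrementally updated Markov chain."""

    def __init__(self):
        self.current = None
        self.link_count = defaultdict(int)  # (i, j) -> transitions observed from i to j
        self.out_count = defaultdict(int)   # i -> transitions observed leaving i

    def increment(self, node):
        """Record that the stream moved to `node` and make it the current node."""
        if self.current is not None:
            self.link_count[(self.current, node)] += 1
            self.out_count[self.current] += 1
        self.current = node

    def probability(self, i, j):
        """Transition probability from i to j as a ratio of counts."""
        total = self.out_count.get(i, 0)
        return self.link_count.get((i, j), 0) / total if total else 0.0


emm = TransitionCounter()
# Hypothetical sequence of matched nodes, e.g. the output of a clustering step.
for matched in ["N1", "N1", "N2", "N3", "N1", "N2"]:
    emm.increment(matched)
```

Because only counts are stored, each new event updates the model in constant time, and probabilities are computed on demand.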
EMM Forget
[Figure: Markov chain over nodes N1, N2, N3, N5 and N6, shown first with node N2 and then without it. Arcs are labeled with transition probabilities such as 1/2, 1/3 and 1/6.]

Data Mining Outline
EMM
Stream Mining
  Data Stream Overview
  Data Stream Modeling
  Data Stream Clustering
  TRAC-DS
  Anomaly Detection
Text Mining
Bioinformatics Mining

Motivation
A growing number of applications generate streams of data.
  Computer network monitoring data
  Call detail records in telecommunications (Cisco VoIP 2003)
  Highway transportation traffic data (MnDot 2005)
  Online web purchase log records (JCPenney 2003, Travelocity 2005)
  Sensor network data (Ouse, Serwent 2002)
  Stock exchange, transactions in retail chains, ATM operations in banks, credit card transactions.
Data mining techniques play a key role in data models in Data Stream Management Systems.

Background
Characteristics of data stream:
  Data are raw
  Records may arrive at a rapid rate
  High volume (possibly infinite) of continuous data
  Concept drifts: Data distribution changes on the fly
  Multidimensional
  Temporality
Stream processing restrictions:
  Data modeling (synopsis)
  Single pass: Each record is examined at most once
  Bounded storage: Limited Memory for storing synopsis
  Real-time: Per record processing time must be low
Haixun Wang, Jian Pei, Philip S. Yu, ICDE 2005; Keogh, ICDM’04
From Sensors to Streams
Data captured and sent by a set of sensors is usually referred to as “stream data”.
A real-time sequence of encoded signals which contain desired information. It is a continuous, ordered (implicitly by arrival time or explicitly by timestamp or by geographic coordinates) sequence of items.
Stream data is infinite - the data keeps coming.

Suppose There Were MANY Sensors
Traditional line graphs would be very difficult to read
Requirements for new visualization technique:
  High level summary of data
  Handle multiple sensors at once
  Continuous
  Temporal
  Spatial

Spatiotemporal Environment
Events arriving in a stream
At any time, t, we can view the state of the problem as represented by a vector of n numeric values:
  Vt = <S1t, S2t, ..., Snt>

        V1    V2    …   Vq
  S1    S11   S12   …   S1q
  S2    S21   S22   …   S2q
  …
  Sn    Sn1   Sn2   …   Snq
  (columns ordered by time)

Data Stream Management Systems (DSMS)
Software to facilitate querying and managing stream data.
  Retrieve the most recent information from the stream
  Data aggregation facilitates merging together multiple streams
  Modeling stream data to “summarize” stream
  Visualization needed to observe in real-time the spatial and temporal patterns and trends hidden in the data.

DSMS Problems
Stream Management development in state similar to that of databases prior to the 1970s
  Each system/researcher looks at specific application or system
  No standards concerning functionality
  No standard query language
Unreasonable to expect end users will access raw data, data in the DSMS, or even data at a summarized view
Domain experts need to “see” a higher level of data

Data Stream Modeling
Single pass: Each record is examined at most once
Bounded storage: Limited Memory for storing synopsis
Real-time: Per record processing time must be low
Summarization (Synopsis) of data
Use data NOT SAMPLE
Temporal and Spatial
Dynamic
Continuous (infinite stream)
Learn
Forget
Sublinear growth rate - Clustering

Problem with Markov Chains
The required structure of the MC may not be certain at the model construction time.
As the real world being modeled by the MC changes, so should the structure of the MC.
Not scalable – grows linearly with the number of events.
Markov Property
Our solution:
  Extensible Markov Model (EMM)
  Cluster real world events
  Allow Markov chain to grow and shrink dynamically

EMM Sublinear Growth Rate
Minnesota Department of Transportation (MnDot)
Traditional Clustering
TRAC-DS

Motivation
Temporal Ordering is a major feature of stream data.
Many stream applications depend on this ordering
  Prediction of future values
  Anomaly (rare event) detection
  Concept drift

Stream Clustering Requirements
Dynamic updating of the clusters
Identify outliers
Barbara:
  compactness
  fast incremental processing

Stream Clustering Algorithms
LOCALSEARCH
  Partitions stream into segments
  Clusters each segment individually by solving the k-medians problem
  Iteratively reclusters the resulting centers
CluStream
  Micro-clusters represented by summary statistics.
  Micro-clusters are handled online
  Micro-clusters merged offline
MONIC
  Evolution of clusters over time
  Cluster transitions over time

TRAC-DS NOTE
TRAC-DS is not:
  Another stream clustering algorithm
TRAC-DS is:
  A new way of looking at clustering
  Built on top of an existing clustering algorithm
TRAC-DS may be used with any stream clustering algorithm
TRAC-DS Overview

Data Stream Clustering
At each point in time a data stream clustering ζ is a partitioning of D', the data seen thus far.
Instead of the whole partitions C1, C2, ..., Ck only synopses Cc1, Cc2, ..., Cck are available and k is allowed to change over time.
The summaries Cci with i = 1, 2, ..., k typically contain information about the size, distribution and location of the data points in Ci.
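
As one possible illustration of such a summary, the sketch below keeps a point count, a per-dimension linear sum and a per-dimension sum of squares, in the style of the cluster-feature summaries used by BIRCH. This is an assumed representation, not one prescribed by the slides, and the class name is hypothetical.

```python
class Synopsis:
    """Hypothetical cluster summary: size, location and spread without storing points."""

    def __init__(self, dim):
        self.n = 0                       # size: number of points absorbed
        self.linear_sum = [0.0] * dim    # per-dimension sum of values
        self.square_sum = [0.0] * dim    # per-dimension sum of squared values

    def add(self, point):
        """Absorb one data point in a single pass."""
        self.n += 1
        for d, x in enumerate(point):
            self.linear_sum[d] += x
            self.square_sum[d] += x * x

    def centroid(self):
        """Location of the cluster (mean per dimension)."""
        return [s / self.n for s in self.linear_sum]

    def variance(self):
        """Distribution of the cluster (population variance per dimension)."""
        return [sq / self.n - (s / self.n) ** 2
                for s, sq in zip(self.linear_sum, self.square_sum)]


summary = Synopsis(dim=2)
for p in [(1.0, 2.0), (3.0, 6.0)]:   # illustrative points
    summary.add(p)
```

The storage per cluster is fixed regardless of how many points arrive, which fits the bounded-storage restriction of stream processing.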

TRAC-DS Definition
Given a data stream clustering ζ, a temporal relationship among clusters (TRAC-DS) overlays a data stream clustering ζ with an EMM M, in such a way that the following are satisfied:
  (1) There is a one-to-one correspondence between the clusters in ζ and the states S in M.
  (2) A transition aij in the EMM M represents the probability that given a data point in cluster i, the next data point in the data stream will belong to cluster j with i, j = 1, 2, ..., k.
  (3) The EMM M is created online together with the data stream clustering.

Clustering Operations
A clustering operation is a function q : ζ × x → ζ which is used by the data stream clustering algorithm to update the clustering ζ given some additional information x which either is a new data point or other information (e.g., the number of the cluster to be deleted to simplify the clustering).

TRAC-DS Operations
A TRAC-DS operation is a function r : M × sc × y → M × sc that updates the temporal relationship among clusters represented by the EMM M with states S given a current state sc ∈ S and additional information y and returns an updated EMM and possibly a new current state.
In order to be able to dynamically update the EMM M we need to store a transition count matrix C. The count cij in C contains the number of times we observed a new point being assigned by the clustering algorithm to cluster i followed by a point being assigned to cluster j.

Stream Clustering Operations *
q_assign_point(ζ, x): Assigns the new data point x to an existing cluster.
q_new_cluster(ζ, x): Creates a new cluster.
q_remove_cluster(ζ, x): Removes a cluster. Here x is the cluster, i, to be removed. In this case the associated summary Cci is removed from ζ and k is decremented by one.
q_merge_clusters(ζ, x): Merges two clusters.
q_fade_clusters(ζ, x): Fades the cluster structure.
q_split_clusters(ζ, x): Splits a cluster.
* Inspired by MONIC

TRAC-DS Operations
r_assign_point(M, sc, y): Assigns the new data point to the state representing an existing cluster.
r_new_cluster(M, sc, y): Creates a state for a new cluster.
r_remove_cluster(M, sc, y): Removes state.
r_merge_clusters(M, sc, y): Merges two states.
r_fade_clusters(M, sc, y): Fades the transition probabilities using an exponential decay f(t) = 2^(−λt)
r_split_clusters(M, sc, y): Splits states.
Y clustering operations.
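
A few of these operations can be sketched on a transition count structure. This is a minimal illustration under two assumptions: fading is applied to the stored counts (so that derived probabilities favor recent transitions), and states are identified by cluster ids. The class and method names are hypothetical, and merge and split are omitted.

```python
class TracDS:
    """Hypothetical sketch of TRAC-DS bookkeeping over cluster ids."""

    def __init__(self, decay_rate):
        self.decay_rate = decay_rate  # λ in f(t) = 2^(−λt)
        self.counts = {}              # counts[i][j] = c_ij
        self.current = None           # current state s_c

    def new_cluster(self, i):
        """Create a state for a new cluster."""
        self.counts.setdefault(i, {})

    def assign_point(self, j):
        """A point was assigned to cluster j: count the transition and move there."""
        self.new_cluster(j)
        if self.current is not None:
            row = self.counts[self.current]
            row[j] = row.get(j, 0.0) + 1.0
        self.current = j

    def remove_cluster(self, i):
        """Remove state i together with its incoming and outgoing counts."""
        self.counts.pop(i, None)
        for row in self.counts.values():
            row.pop(i, None)
        if self.current == i:
            self.current = None

    def fade(self, t=1.0):
        """Scale all counts by the decay factor 2^(−λt)."""
        factor = 2.0 ** (-self.decay_rate * t)
        for row in self.counts.values():
            for j in row:
                row[j] *= factor

    def probability(self, i, j):
        """Transition probability a_ij derived from the (faded) counts."""
        row = self.counts.get(i, {})
        total = sum(row.values())
        return row.get(j, 0.0) / total if total else 0.0


model = TracDS(decay_rate=1.0)   # decay rate chosen only for illustration
for cluster_id in [1, 2, 1, 2]:
    model.assign_point(cluster_id)
model.fade(t=1.0)                # older transitions now weigh half as much
model.assign_point(3)
```

After the fade, the single new transition from state 2 to state 3 outweighs the older transition from 2 to 1, which is the intended effect of forgetting.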
TRAC-DS Example

TRAC-DS Advantages
Dynamic
Flexible –
  Use any Clustering Algorithm
  Supports any clustering operations
Scalable
Merges Clustering & Markov Modeling

What is Anomaly?
Event that is unusual
Event that doesn’t occur frequently
Predefined event
What is unusual?
What is deviation?

What is Anomaly in Stream Data?
Rare - Anomalous – Surprising
Out of the ordinary
Not outlier detection
  No knowledge of data distribution
  Data is not static
  Must take temporal and spatial values into account
  May be interested in sequence of events
Ex: Snow in upstate New York is not an anomaly
  Snow in upstate New York in June is rare
Rare events may change over time

Statistical View of Anomaly
Outlier
Data item that is outside the normal distribution of the data
Identify by Box Plot
Image from Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.

Statistical View of Anomaly
Identify by looking at distribution
THIS DOES NOT WORK with stream data
Image from www.wikipedia.org, Normal distribution.

Data Mining View of Anomaly
Classification Problem
  Build classifier from training data
  Problem is that training data shows what is NOT an anomaly
  Thus an anomaly is anything that is not viewed as normal by the classification technique
  MUST build dynamic classifier
Identify anomalous behavior
  Signatures of what anomalous behavior looks like
  Input data is identified as anomaly if it is similar enough to one of these signatures
Mixed – Classification and Signature

EMM Advantages
Dynamic
Adaptable
Use of clustering
Learns rare event
Scalable:
  Growth of EMM is not linear on size of data.
  Hierarchical feature of EMM
Creation/evaluation quasi-real time
Distributed / Hierarchical extensions
Growth of EMM
[Figure: number of states in the model versus number of input data (total 1574), with one curve for each threshold: 0.994, 0.995, 0.996, 0.997, 0.998. Servent Data.]

TRAC-DS Approach to Detect Anomalies
By learning what is normal, the model can predict what is not
Normal is based on likelihood of occurrence
Use TRAC-DS to build clusters and behavior between clusters
We view a rare event as:
  Unusual event
  Transition between event states which does not frequently occur.
Continue learning

Determining Rare
Occurrence Frequency (OFi) of an EMM state Si is normalized count of state:
  $OF_i = \frac{n_i}{\sum_j n_j}$
Normalized Transition Probability (NTPmn), from one state, Sm, to another, Sn, is a normalized transition count:
  $NTP_{m,n} = \frac{C_{m,n}}{\sum_j n_j}$
Here $n_i$ is the count of state Si and $C_{m,n}$ is the count of transitions from Sm to Sn.

EMMRare
EMMRare algorithm indicates if the current input event is rare. Using a threshold occurrence percentage, the input event is determined to be rare if either of the following occurs:
  The frequency of the node at time t+1 is below this threshold
  The updated transition probability of the MC transition from node at time t to the node at t+1 is below the threshold
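
The two conditions can be sketched as a single check. This assumes node and transition counts are both normalized by the total node count, and that the counts already include the current event. The function name, argument layout and sample counts are hypothetical.

```python
def is_rare(node_counts, link_counts, prev_node, curr_node, threshold):
    """Return True if the current event is rare under either condition.

    node_counts: {node: occurrences}, link_counts: {(from, to): transitions}.
    """
    total = sum(node_counts.values())
    if total == 0:
        return True  # nothing has been learned yet, so any event is unseen
    # Normalized count of the node reached at time t+1.
    occurrence_frequency = node_counts.get(curr_node, 0) / total
    # Normalized count of the transition from the node at t to the node at t+1.
    normalized_transition = link_counts.get((prev_node, curr_node), 0) / total
    return occurrence_frequency < threshold or normalized_transition < threshold


# Hypothetical counts, for illustration only.
node_counts = {"A": 90, "B": 9, "C": 1}
link_counts = {("A", "A"): 80, ("A", "B"): 9, ("B", "A"): 9, ("A", "C"): 1}
common = is_rare(node_counts, link_counts, "A", "A", threshold=0.05)
unusual = is_rare(node_counts, link_counts, "A", "C", threshold=0.05)
```

Because the model keeps learning, a transition flagged as rare today may stop being rare once it has been observed often enough.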