ADVANCED TOPICS IN DATA MINING
CSE 8331 Spring 2010
Part I

Margaret H. Dunham
Department of Computer Science and Engineering
Southern Methodist University


Data Mining Outline

- EMM
- Stream Mining
- Text Mining
- Bioinformatics Mining

EMM Overview

- Time-varying, discrete, first-order Markov model
- Nodes are clusters of real-world states.
- Learning continues during the prediction phase.
- Learning:
  - Transition probabilities between nodes
  - Node labels (centroid of cluster)
  - Nodes are added and removed as data arrives

MM

- A first-order Markov chain is a finite or countably infinite sequence of events {E1, E2, …} over discrete time points, where Pij = P(Ej | Ei), and at any time the future behavior of the process is based solely on the current state.
- A Markov Model (MM) is a graph with m vertices or states, S, and directed arcs, A, such that:
  - S = {N1, N2, …, Nm}, and
  - A = {Lij | i ∈ 1, 2, …, m, j ∈ 1, 2, …, m}, and
  - each arc, Lij = <Ni, Nj>, is labeled with a transition probability Pij = P(Nj | Ni).
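
The first-order property above means the probability of a whole path factors into a product of one-step transition probabilities. The following is a minimal sketch of that idea; the function name `path_probability`, the nested-dictionary layout of `P`, and the example probabilities are illustrative assumptions, not taken from the slides.

```python
def path_probability(P, path):
    """Probability of following `path`, given that the chain starts at path[0].

    P maps a state to a dict of {next_state: Pij}. Under the first-order
    Markov property each step depends only on the current state, so the
    path probability is the product of the one-step probabilities.
    """
    prob = 1.0
    for i, j in zip(path, path[1:]):
        prob *= P.get(i, {}).get(j, 0.0)  # missing arc means probability 0
    return prob


# Hypothetical three-state model, used only for illustration.
P = {
    "N1": {"N2": 2 / 3, "N3": 1 / 3},
    "N2": {"N1": 1.0},
    "N3": {"N1": 1.0},
}
p = path_probability(P, ["N1", "N2", "N1", "N3"])
```

Here `p` is the product of P(N2 | N1), P(N1 | N2), and P(N3 | N1).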

EMM Definition

Extensible Markov Model (EMM): at any time t, an EMM consists of a Markov chain (MC) with a designated current node, Nn, and algorithms to modify it, where the algorithms include:

- EMMCluster, which defines a technique for matching between input data at time t + 1 and existing states in the MC at time t.
- EMMIncrement, which updates the MC at time t + 1 given the MC at time t and the clustering measure result at time t + 1.
- EMMDecrement, which removes nodes from the EMM when needed.

EMM Cluster

- Find the closest node to the incoming event.
- If none is “close”, create a new node.
- The label of a cluster is the centroid of the members in the cluster.
- O(n)
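
The steps above can be sketched as a linear nearest-node scan. This is only an illustration under stated assumptions: Euclidean distance, a distance `threshold` as the meaning of “close”, and a node stored as a dictionary with `centroid` and `size` keys are all hypothetical choices, since the slides do not fix a distance measure or data layout.

```python
import math


def emm_cluster(nodes, event, threshold):
    """Return the index of the node that absorbs `event`, adding a node if needed.

    `nodes` is a list of {"centroid": [...], "size": int} dicts (assumed layout).
    """
    best, best_dist = None, math.inf
    for i, node in enumerate(nodes):  # O(n) scan over existing nodes
        d = math.dist(node["centroid"], event)
        if d < best_dist:
            best, best_dist = i, d

    if best is None or best_dist > threshold:
        # Nothing is "close": the event starts a new node.
        nodes.append({"centroid": list(event), "size": 1})
        return len(nodes) - 1

    # Absorb the event and move the centroid to the new mean of the members.
    node = nodes[best]
    node["size"] += 1
    node["centroid"] = [
        c + (x - c) / node["size"] for c, x in zip(node["centroid"], event)
    ]
    return best


nodes = []
first = emm_cluster(nodes, [0.0, 0.0], threshold=1.0)
second = emm_cluster(nodes, [0.5, 0.0], threshold=1.0)
third = emm_cluster(nodes, [5.0, 5.0], threshold=1.0)
```

The incremental centroid update keeps only the running mean and member count, so individual events need not be stored.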

EMMSim

- Find the closest node to the incoming event.
- If none is “close”, create a new node.
- The label of a cluster is the centroid/medoid of the members in the cluster.
- Problem:
  - Nearest Neighbor: O(n)
  - BIRCH: O(lg n)
    - Requires second phase to recluster initial

EMM Increment

[Figure: EMM Increment example. Input vectors <18,10,3,3,1,0,0>, <17,10,2,3,1,0,0>, <16,9,2,3,1,0,0>, <14,8,2,3,1,0,0>, <14,8,2,3,0,0,0>, and <18,10,3,3,1,1,0>, with successive snapshots of the Markov chain over nodes N1, N2, and N3 and their transition probabilities.]
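
A count-based update is one plausible way to realize EMM Increment. The sketch below assumes the clustering step has already mapped each input to a node label; the class name `SimpleEMM` and its fields are hypothetical and are not the authors' implementation.

```python
from collections import defaultdict


class SimpleEMM:
    """Hypothetical count-based Markov chain with a designated current node."""

    def __init__(self):
        self.link_count = defaultdict(int)  # (from_node, to_node) -> transitions seen
        self.out_count = defaultdict(int)   # from_node -> transitions leaving it
        self.current = None                 # designated current node

    def increment(self, node):
        """Record a move from the current node to `node`, then make it current."""
        if self.current is not None:
            self.link_count[(self.current, node)] += 1
            self.out_count[self.current] += 1
        self.current = node

    def probability(self, i, j):
        """Estimated P(Nj | Ni) as a ratio of counts (0.0 if Ni was never left)."""
        leaving = self.out_count.get(i, 0)
        return self.link_count.get((i, j), 0) / leaving if leaving else 0.0


emm = SimpleEMM()
for node in ["N1", "N2", "N1", "N3", "N1", "N2"]:
    emm.increment(node)
```

Storing counts rather than probabilities lets each arrival be processed with a constant number of updates; probabilities are derived only when they are needed.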

EMM Forget

[Figure: EMM Forget example. A Markov chain over nodes N1, N2, N3, N5, and N6 with its transition probabilities, shown with and without node N2.]
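
Removing a node can be sketched as deleting it together with every transition count that touches it. This is a simplified illustration: the function name `emm_decrement` and the set/dictionary layout are assumptions, and the slides do not say how the surviving transition probabilities are renormalized, so the sketch leaves the remaining counts untouched.

```python
def emm_decrement(nodes, link_count, node):
    """Remove `node` and every transition count that starts or ends at it."""
    nodes.discard(node)
    for link in [l for l in link_count if node in l]:
        del link_count[link]


# Hypothetical chain, used only for illustration.
nodes = {"N1", "N2", "N3"}
link_count = {("N1", "N2"): 1, ("N2", "N3"): 1, ("N1", "N3"): 2}
emm_decrement(nodes, link_count, "N2")
```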

Data Mining Outline

- EMM
- Stream Mining
  - Data Stream Overview
  - Data Stream Modeling
  - Data Stream Clustering
  - TRAC-DS
  - Anomaly Detection
- Text Mining
- Bioinformatics Mining

Motivation

- A growing number of applications generate streams of data:
  - Computer network monitoring data
  - Call detail records in telecommunications (Cisco VoIP 2003)
  - Highway transportation traffic data (MnDot 2005)
  - Online web purchase log records (JCPenney 2003, Travelocity 2005)
  - Sensor network data (Ouse, Serwent 2002)
  - Stock exchange, transactions in retail chains, ATM operations in banks, credit card transactions
- Data mining techniques play a key role in data models in Data Stream Management Systems.

Background

- Characteristics of a data stream:
  - Data are raw
  - Records may arrive at a rapid rate
  - High volume (possibly infinite) of continuous data
  - Concept drift: the data distribution changes on the fly
  - Multidimensional
  - Temporality
- Stream processing restrictions:
  - Data modeling (synopsis)
  - Single pass: each record is examined at most once
  - Bounded storage: limited memory for storing the synopsis
  - Real-time: per-record processing time must be low

Haixun Wang, Jian Pei, Philip S. Yu, ICDE 2005; Keogh, ICDM’04

From Sensors to Streams

- Data captured and sent by a set of sensors is usually referred to as “stream data”.
- It is a real-time sequence of encoded signals which contain the desired information: a continuous, ordered (implicitly by arrival time, or explicitly by timestamp or by geographic coordinates) sequence of items.
- Stream data is infinite: the data keeps coming.

Suppose There Were MANY Sensors

- Traditional line graphs would be very difficult to read.
- Requirements for a new visualization technique:
  - High-level summary of data
  - Handle multiple sensors at once
  - Continuous
  - Temporal
  - Spatial

Spatiotemporal Environment

- Events arriving in a stream
- At any time, t, we can view the state of the problem as represented by a vector of n numeric values:

Vt = <S1t, S2t, ..., Snt>

        V1    V2    …    Vq
S1      S11   S12   …    S1q
S2      S21   S22   …    S2q
…
Sn      Sn1   Sn2   …    Snq
        Time →

Data Stream Management Systems (DSMS)

- Software to facilitate querying and managing stream data
  - Retrieve the most recent information from the stream
  - Data aggregation facilitates merging together multiple streams
  - Modeling stream data to “summarize” the stream
  - Visualization is needed to observe in real time the spatial and temporal patterns and trends hidden in the data

DSMS Problems

- Stream management development is in a state similar to that of databases prior to the 1970s:
  - Each system/researcher looks at a specific application or system
  - No standards concerning functionality
  - No standard query language
- It is unreasonable to expect that end users will access raw data, data in the DSMS, or even data at a summarized view.
- Domain experts need to “see” a higher level of data.

Data Stream Modeling

- Single pass: each record is examined at most once
- Bounded storage: limited memory for storing the synopsis
- Real-time: per-record processing time must be low
- Summarization (synopsis) of data
- Use data, NOT a SAMPLE
- Temporal and spatial
- Dynamic
- Continuous (infinite stream)
- Learn
- Forget
- Sublinear growth rate: clustering

Problem with Markov Chains

- The required structure of the MC may not be certain at model construction time.
- As the real world being modeled by the MC changes, so should the structure of the MC.
- Not scalable: grows linearly with the number of events.
- Markov property
- Our solution:
  - Extensible Markov Model (EMM)
  - Cluster real-world events
  - Allow the Markov chain to grow and shrink dynamically

EMM Sublinear Growth Rate

[Figure: EMM growth rate. Data: Minnesota Department of Transportation (MnDot)]

Traditional Clustering

TRAC-DS

Motivation

- Temporal ordering is a major feature of stream data.
- Many stream applications depend on this ordering:
  - Prediction of future values
  - Anomaly (rare event) detection
  - Concept drift

Stream Clustering Requirements

- Dynamic updating of the clusters
- Identify outliers
- Barbara:
  - Compactness
  - Fast incremental processing

Stream Clustering Algorithms

- LOCALSEARCH
  - Partitions the stream into segments
  - Clusters each segment individually by solving the k-medians problem
  - Iteratively reclusters the resulting centers
- CluStream
  - Micro-clusters are represented by summary statistics
  - Micro-clusters are handled online
  - Micro-clusters are merged offline
- MONIC
  - Evolution of clusters over time
  - Cluster transitions over time

TRAC-DS NOTE

- TRAC-DS is not:
  - Another stream clustering algorithm
- TRAC-DS is:
  - A new way of looking at clustering
  - Built on top of an existing clustering algorithm
- TRAC-DS may be used with any stream clustering algorithm

TRAC-DS Overview

Data Stream Clustering

- At each point in time, a data stream clustering ζ is a partitioning of D', the data seen thus far.
- Instead of the whole partitions C1, C2, ..., Ck, only synopses Cc1, Cc2, ..., Cck are available, and k is allowed to change over time.
- The summaries Cci with i = 1, 2, ..., k typically contain information about the size, distribution and location of the data points in Ci.
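
To make the notion of a synopsis concrete, the sketch below keeps a count, a per-dimension linear sum, and a per-dimension sum of squares, in the spirit of the summary statistics used by micro-cluster methods. The class name `ClusterSynopsis` and its fields are hypothetical; the slides do not prescribe what a synopsis Cci must store.

```python
class ClusterSynopsis:
    """Fixed-size summary of a cluster: size, location, and spread."""

    def __init__(self, dim):
        self.n = 0                      # size: number of points absorbed
        self.linear_sum = [0.0] * dim   # per-dimension sum of values
        self.square_sum = [0.0] * dim   # per-dimension sum of squared values

    def add(self, point):
        """Absorb one point in a single pass; the point itself is not stored."""
        self.n += 1
        for d, x in enumerate(point):
            self.linear_sum[d] += x
            self.square_sum[d] += x * x

    def centroid(self):
        """Location of the cluster (mean of its points)."""
        return [s / self.n for s in self.linear_sum]

    def variance(self):
        """Per-dimension population variance, derived from the two sums."""
        return [
            sq / self.n - (s / self.n) ** 2
            for s, sq in zip(self.linear_sum, self.square_sum)
        ]


synopsis = ClusterSynopsis(dim=2)
synopsis.add([1.0, 2.0])
synopsis.add([3.0, 4.0])
```

Memory use depends only on the dimensionality, not on how many points have arrived, which is what the bounded-storage restriction asks for.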

TRAC-DS Definition

Given a data stream clustering ζ, a temporal relationship among clusters (TRAC-DS) overlays the data stream clustering ζ with an EMM M in such a way that the following are satisfied:

(1) There is a one-to-one correspondence between the clusters in ζ and the states S in M.
(2) A transition aij in the EMM M represents the probability that, given a data point in cluster i, the next data point in the data stream will belong to cluster j, with i, j = 1, 2, ..., k.
(3) The EMM M is created online together with the data stream clustering.

Clustering Operations

A clustering operation is a function q : ζ × x → ζ which is used by the data stream clustering algorithm to update the clustering ζ given some additional information x, which is either a new data point or other information (e.g., the number of the cluster to be deleted to simplify the clustering).

TRAC-DS Operations

- A TRAC-DS operation is a function r : M × sc × y → M × sc that updates the temporal relationship among clusters represented by the EMM M with states S, given a current state sc ∈ S and additional information y, and returns an updated EMM and possibly a new current state.
- In order to be able to dynamically update the EMM M, we need to store a transition count matrix C. The count cij in C contains the number of times we observed a new point being assigned by the clustering algorithm to cluster i followed by a point being assigned to cluster j.
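
The slide does not state how the transition probabilities are recovered from the count matrix C. One common choice, assumed here for illustration, is the maximum-likelihood estimate obtained by normalizing each row of C, where k is the current number of clusters:

```latex
a_{ij} = \frac{c_{ij}}{\sum_{l=1}^{k} c_{il}}, \qquad i, j = 1, 2, \ldots, k
```

With this choice each row of transition probabilities sums to one whenever cluster i has at least one observed outgoing transition.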

Stream Clustering Operations *

- q_assign_point(ζ, x): Assigns the new data point x to an existing cluster.
- q_new_cluster(ζ, x): Creates a new cluster.
- q_remove_cluster(ζ, x): Removes a cluster. Here x is the cluster, i, to be removed. In this case the associated summary Cci is removed from ζ and k is decremented by one.
- q_merge_clusters(ζ, x): Merges two clusters.
- q_fade_clusters(ζ, x): Fades the cluster structure.
- q_split_clusters(ζ, x): Splits a cluster.

* Inspired by MONIC

TRAC-DS Operations

- r_assign_point(M, sc, y): Assigns the new data point to the state representing an existing cluster.
- r_new_cluster(M, sc, y): Creates a state for a new cluster.
- r_remove_cluster(M, sc, y): Removes a state.
- r_merge_clusters(M, sc, y): Merges two states.
- r_fade_clusters(M, sc, y): Fades the transition probabilities using an exponential decay f(t) = 2^(−λt).
- r_split_clusters(M, sc, y): Splits states.
- Y clustering operations.
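
The fade operation can be illustrated by scaling stored transition counts with the decay factor f(t) = 2^(−λt). Applying the factor to the counts, rather than to the probabilities directly, is an assumption made for this sketch, as are the function name `fade_counts` and the dictionary layout.

```python
def fade_counts(link_count, lam, t=1.0):
    """Return a copy of the transition counts scaled by f(t) = 2 ** (-lam * t)."""
    factor = 2.0 ** (-lam * t)
    return {link: count * factor for link, count in link_count.items()}


# Hypothetical counts, used only for illustration.
faded = fade_counts({("N1", "N2"): 4.0, ("N1", "N3"): 2.0}, lam=1.0, t=2.0)
```

Scaling every count by the same factor leaves the current row-normalized probabilities unchanged, but transitions observed after the fade carry more weight than older ones, so the model gradually forgets.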

TRAC-DS Example

TRAC-DS Advantages

- Dynamic
- Flexible:
  - Use any clustering algorithm
  - Supports any clustering operations
- Scalable
- Merges clustering and Markov modeling

What is Anomaly?

- Event that is unusual
- Event that doesn’t occur frequently
- Predefined event
- What is unusual?
- What is deviation?

What is Anomaly in Stream Data?

- Rare, anomalous, surprising
- Out of the ordinary
- Not outlier detection:
  - No knowledge of the data distribution
  - Data is not static
  - Must take temporal and spatial values into account
  - May be interested in a sequence of events
  - Ex: Snow in upstate New York is not an anomaly; snow in upstate New York in June is rare
- Rare events may change over time

Statistical View of Anomaly

- Outlier
  - Data item that is outside the normal distribution of the data
  - Identify by box plot

[Figure: box plot. Image from Data Mining, Introductory and Advanced Topics, Prentice Hall, 2002.]
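
As an illustration of identifying outliers with a box plot, the sketch below uses the conventional whisker rule that flags values more than 1.5 interquartile ranges beyond the quartiles. The 1.5 multiplier, the function name `box_plot_outliers`, and the sample data are assumptions; the slide only names the box plot.

```python
import statistics


def box_plot_outliers(values, whisker=1.5):
    """Return the values lying beyond the box-plot whiskers."""
    q1, _, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    iqr = q3 - q1
    low, high = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if v < low or v > high]


outliers = box_plot_outliers([1, 2, 3, 4, 5, 6, 7, 8, 9, 100])
```

Note that this needs the whole data set to compute the quartiles, which is one reason a static statistical view is awkward for unbounded streams.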

Statistical View of Anomaly

- Identify by looking at the distribution
- THIS DOES NOT WORK with stream data

[Figure: normal distribution. Image from www.wikipedia.org, Normal distribution.]

Data Mining View of Anomaly

- Classification problem
  - Build a classifier from training data
  - The problem is that training data shows what is NOT an anomaly
  - Thus an anomaly is anything that is not viewed as normal by the classification technique
  - MUST build a dynamic classifier
- Identify anomalous behavior
  - Signatures of what anomalous behavior looks like
  - Input data is identified as an anomaly if it is similar enough to one of these signatures
- Mixed: classification and signature

EMM Advantages

- Dynamic
- Adaptable
- Use of clustering
- Learns rare events
- Scalable:
  - Growth of the EMM is not linear in the size of the data
  - Hierarchical feature of EMM
- Creation/evaluation in quasi-real time
- Distributed / hierarchical extensions

Growth of EMM

[Figure: number of states in the model versus number of input data (total 1574) on the Servent data, with one curve for each threshold: 0.994, 0.995, 0.996, 0.997, and 0.998.]

TRAC-DS Approach to Detect Anomalies

- By learning what is normal, the model can predict what is not.
- Normal is based on likelihood of occurrence.
- Use TRAC-DS to build clusters and the behavior between clusters.
- We view a rare event as:
  - An unusual event
  - A transition between event states which does not frequently occur
- Continue learning.

Determining Rare

Occurrence Frequency (OFi) of an EMM state Si is the normalized count of the state, where ni is the count of state Si:

$$OF_i = \frac{n_i}{\sum_j n_j}$$

Normalized Transition Probability (NTPmn), from one state, Sm, to another, Sn, is a normalized transition count, where Cm,n is the count of transitions from Sm to Sn:

$$NTP_{m,n} = \frac{C_{m,n}}{\sum_j n_j}$$
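
A small sketch of the two measures, computed from stored counts. The function names and the dictionary layout are hypothetical, and the counts are made-up values chosen only to exercise the formulas.

```python
def occurrence_frequency(node_count, i):
    """OF_i: count of state i divided by the total count over all states."""
    return node_count.get(i, 0) / sum(node_count.values())


def normalized_transition_probability(node_count, link_count, m, n):
    """NTP_mn: count of transitions m -> n divided by the total state count."""
    return link_count.get((m, n), 0) / sum(node_count.values())


# Hypothetical counts, used only for illustration.
node_count = {"S1": 6, "S2": 3, "S3": 1}
link_count = {("S1", "S2"): 2, ("S2", "S3"): 1}

of_s3 = occurrence_frequency(node_count, "S3")
ntp_s1_s2 = normalized_transition_probability(node_count, link_count, "S1", "S2")
```

Both measures share the same denominator, so a state or a transition is judged against the overall volume of data seen, not only against its source state.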

EMMRare

- The EMMRare algorithm indicates whether the current input event is rare. Using a threshold occurrence percentage, the input event is determined to be rare if either of the following occurs:
  - The frequency of the node at time t+1 is below this threshold
  - The updated transition probability of the MC transition from the node at time t to the node at t+1 is below the threshold
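
The two conditions can be sketched as a single check over stored counts. This is an illustration under assumptions: the function name `emm_rare` and the dictionary layout are hypothetical, the counts are taken to be already updated with the current event, and the normalized measures (state count and transition count divided by the total state count) stand in for the frequency and transition probability named on the slide.

```python
def emm_rare(node_count, link_count, prev, curr, threshold):
    """Return True if arriving at `curr` from `prev` looks rare.

    Assumes the counts already include the current event.
    """
    total = sum(node_count.values())
    if total == 0:
        return True  # nothing learned yet, so everything is unseen

    frequency = node_count.get(curr, 0) / total           # node at time t+1
    transition = link_count.get((prev, curr), 0) / total  # move from t to t+1
    return frequency < threshold or transition < threshold


# Hypothetical counts, used only for illustration.
node_count = {"S1": 60, "S2": 39, "S3": 1}
link_count = {("S1", "S2"): 30, ("S2", "S3"): 1}

common = emm_rare(node_count, link_count, "S1", "S2", threshold=0.05)
rare = emm_rare(node_count, link_count, "S2", "S3", threshold=0.05)
```

Because the counts keep changing as the stream is processed, an event flagged as rare now may stop being rare later, which matches the slides' point that rare events may change over time.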