Advances in Bayesian Learning


Page 1: Advances in Bayesian Learning

Machine Learning in Performance Management

Irina Rish, IBM T.J. Watson Research Center

January 24, 2001

Page 2: Advances in Bayesian Learning


Outline

Introduction

Machine learning applications in Performance Management

Bayesian learning tools: extending ABLE

Advancing theory

Summary and future directions

Page 3: Advances in Bayesian Learning

Learning problems: examples

Pattern discovery, classification, diagnosis, and prediction.

System event mining
[Figure: events from hosts plotted over time]

End-user transaction recognition
[Figure: a stream of Remote Procedure Calls (RPCs), R2 R1 R2 R2 R3 R5 R5, segmented into Transaction1 and Transaction2; which EUT (BUY? SELL? OPEN_DB? SEARCH?) produced which RPCs?]

Page 4: Advances in Bayesian Learning

Approach: Bayesian learning

Learn (probabilistic) dependency models: Bayesian networks.

[Figure: Bayesian network over S, C, B, X, D with factors P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B)]

Diagnosis: P(cause | symptom) = ?
Prediction: P(symptom | cause) = ?
Pattern classification: max_class P(class | data) = ?

Numerous important applications: medicine, stock market, bio-informatics, eCommerce, military, ...

Page 5: Advances in Bayesian Learning


Outline

Introduction

Machine-learning applications in Performance Management

Transaction Recognition
In progress: Event Mining, Probe Placement, etc.

Bayesian learning tools: extending ABLE

Advancing theory

Summary and future directions

Page 6: Advances in Bayesian Learning

End-User Transaction Recognition: why is it important?

[Diagram: a Client Workstation issues End-User Transactions (EUTs), which generate Remote Procedure Calls (RPCs) to a Server (Web, DB, Lotus Notes) within a session (connection)]

Realistic workload models (for testing performance)

Resource management (anticipating requests)

Quantifying end-user perception of performance (response times)

Examples: Lotus Notes (OpenDB, Search, SendMail), Web/eBusiness (on-line stores, travel agencies, trading): database transactions, buy/sell, search, email, etc. Which RPCs implement each EUT?

Page 7: Advances in Bayesian Learning

Why is it hard? Why learn from data?

Example: EUTs and RPCs in Lotus Notes

EUTs: MoveMsgToFolder, FindMailByKey
RPCs:
1. OPEN_COLLECTION
2. UPDATE_COLLECTION
3. DB_REPLINFO_GET
4. GET_MOD_NOTES
5. READ_ENTRIES
6. OPEN_COLLECTION
7. FIND_BY_KEY
8. READ_ENTRIES

Many RPC and EUT types (92 RPCs and 37 EUTs)
Large (unlimited) data sets (10,000+ transaction instances)
Manual classification of a data subset took about a month
Non-deterministic and unknown EUT -> RPC mapping: "noise" sources are client/server states
No client-side instrumentation: unknown EUT boundaries

Page 8: Advances in Bayesian Learning

Our approach: classification + segmentation

Problem 1: label segmented data (classification), similar to text classification.
[Figure: segmented RPCs, e.g. (1 2 3 4), (1 3), (1), are mapped to labeled transactions Tx3, Tx1, Tx2]

Problem 2: both segment and label (EUT recognition), similar to speech understanding and image segmentation.
[Figure: an unsegmented RPC stream, e.g. 1 2 1 3 4 1 2 3 ..., is split into segmented RPCs and labeled transactions Tx1, Tx2, Tx3]

Page 9: Advances in Bayesian Learning

How to represent transactions? "Feature vectors"

[Figure: a transaction of type i generates the RPC sequence R2 R1 R2 R2 R3 R5 R5]

RPC occurrences: f = (1, 1, 1, 0, 1, 0, ...)
RPC counts: f = (1, 3, 1, 0, 2, 0, ...)

Candidate models for the features of transaction type i:

Bernoulli: P(R_ij = 1 | T_i) = p_ij
Multinomial: P(n_i1, ..., n_iM | T_i) = (n_i! / ∏_{j=1..M} n_ij!) ∏_{j=1..M} p_ij^(n_ij)
Geometric: P(n_ij | T_i) = p_ij (1 - p_ij)^(n_ij)
Shifted geometric: P(n_ij | T_i) = p_ij (1 - p_ij)^(n_ij - s_ij)

Best fit to the data (χ² test): shifted geometric.
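To make the two feature vectors concrete, here is a minimal Python sketch (the six-RPC vocabulary is hypothetical; the real trace has 92 RPC types) that extracts both representations from the slide's example sequence:

```python
from collections import Counter

# Hypothetical RPC vocabulary; the real Lotus Notes trace has 92 RPC types.
RPC_TYPES = ["R1", "R2", "R3", "R4", "R5", "R6"]

def feature_vectors(rpc_sequence):
    """Return (occurrence, count) feature vectors for one transaction."""
    counts = Counter(rpc_sequence)
    count_vec = [counts.get(r, 0) for r in RPC_TYPES]   # RPC counts
    occur_vec = [1 if c > 0 else 0 for c in count_vec]  # RPC occurrences
    return occur_vec, count_vec

# The slide's example sequence: R2 R1 R2 R2 R3 R5 R5
occ, cnt = feature_vectors(["R2", "R1", "R2", "R2", "R3", "R5", "R5"])
print(occ)  # [1, 1, 1, 0, 1, 0]
print(cnt)  # [1, 3, 1, 0, 2, 0]
```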

Page 10: Advances in Bayesian Learning

Classification scheme

Training phase:
  Training data (RPCs labeled with EUTs) -> Feature Extraction -> Learning -> Classifier

Operation phase:
  "Test" data (unlabeled RPCs) -> Feature Extraction -> Classification -> EUTs

Page 11: Advances in Bayesian Learning

Our classifier: naïve Bayes (NB)

Features (f_1, ..., f_n): RPC occurrences or RPC counts.
Simplifying ("naïve") assumption: feature independence given the class.

[Figure: class node EUT with prior P(EUT); feature nodes f_1, f_2, ..., f_n with P(f_1|EUT), P(f_2|EUT), ..., P(f_n|EUT)]

1. Training: estimate the parameters P(EUT) and P(f_i | EUT) (e.g., ML-estimates).
2. Classification: given an (unlabeled) instance (f_1, ..., f_n), choose the most likely class (Bayesian decision rule):

EUT* = arg max_i P(EUT_i | f_1, ..., f_n)
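A minimal Python sketch of this train/classify loop for the Bernoulli (RPC-occurrence) variant; the Laplace smoothing is an assumption on top of the slide's "e.g., ML-estimates":

```python
import math
from collections import defaultdict

def train_nb(instances, labels):
    """Estimate P(EUT) and P(f_i = 1 | EUT) from binary occurrence vectors.
    Laplace smoothing is an assumption (the slide only says 'ML-estimates')."""
    n = len(instances[0])
    class_count = defaultdict(int)
    feat_count = defaultdict(lambda: [0] * n)
    for x, y in zip(instances, labels):
        class_count[y] += 1
        for i, v in enumerate(x):
            feat_count[y][i] += v
    prior = {c: class_count[c] / len(labels) for c in class_count}
    cond = {c: [(feat_count[c][i] + 1) / (class_count[c] + 2) for i in range(n)]
            for c in class_count}
    return prior, cond

def classify_nb(x, prior, cond):
    """Bayesian decision rule: arg max_EUT P(EUT) * prod_i P(f_i | EUT),
    computed in log space for numerical stability."""
    def log_posterior(c):
        lp = math.log(prior[c])
        for v, p in zip(x, cond[c]):
            lp += math.log(p if v else 1.0 - p)
        return lp
    return max(prior, key=log_posterior)
```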

Page 12: Advances in Bayesian Learning

Classification results on Lotus CoC data

Baseline classifier: always selects the most frequent transaction.

[Plot: accuracy vs. training set size; NB with Bernoulli, multinomial, or geometric features reaches ~87.2%, NB with shifted geometric ~79.3%, and the baseline ~10.1%]

Significant improvement over the baseline classifier (75%).
NB is simple, efficient, and comparable to state-of-the-art classifiers (SVM: 85-87%, decision tree: 90-92%).
The best-fit distribution (shifted geometric) is not necessarily the best classifier! (?)

Page 13: Advances in Bayesian Learning

Transaction recognition: segmentation + classification

Segmentation task: given an RPC sequence R_{1,n} = r_1 ... r_n, find the segment boundaries V = (i_1, ..., i_m):

r_1 ... r_{i_1} | r_{i_1+1} ... r_{i_2} | ... | r_{i_{k-1}+1} ... r_{i_k} | ...

Objective: the most probable segmentation

V* = arg max_V P(V | R_{1,n}) = arg max_V P(V, R_{1,n})

Dynamic programming (Viterbi search), with the naïve Bayes classifier scoring candidate segments. (Recursive) DP equation:

α_k = max_{1 <= j <= k} α_{j-1} · max_T P(T, R_{j,k}),   α_0 = 1
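A sketch of this DP recursion in Python. The segment score, i.e. max_T P(T, R_{j,k}) under the naïve Bayes model, is passed in as a function, and max_len is a hypothetical cap on segment length:

```python
def segment(rpcs, score, max_len=20):
    """Viterbi-style DP search for the most probable segmentation.
    score(segment) should return max_T P(T, segment) under the chosen
    naive Bayes model (left abstract here)."""
    n = len(rpcs)
    alpha = [0.0] * (n + 1)   # alpha[k] = prob. of best segmentation of rpcs[:k]
    alpha[0] = 1.0
    back = [0] * (n + 1)      # back[k] = start index of the last segment
    for k in range(1, n + 1):
        for j in range(max(0, k - max_len), k):
            cand = alpha[j] * score(rpcs[j:k])
            if cand > alpha[k]:
                alpha[k], back[k] = cand, j
    bounds = []               # recover segment boundaries by backtracking
    k = n
    while k > 0:
        bounds.append((back[k], k))
        k = back[k]
    return list(reversed(bounds))
```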

Page 14: Advances in Bayesian Learning

Transaction recognition results

Model          Classification   Segmentation
Bernoulli      best             second best
Multinomial    best             third best
Geometric      best             fourth best
Shift. Geom.   worst            best

[Plot: recognition accuracy vs. training set size; shifted geometric reaches 64%, ahead of Bernoulli, multinomial, and geometric]

Good EUT recognition accuracy: 64% (a harder problem than classification!).
Reversed order of results: the best classifier is not necessarily the best recognizer! (?) Further research needed.

Page 15: Advances in Bayesian Learning

EUT recognition: summary

A novel approach: learning EUTs from RPCs.
Patent, conference paper (AAAI-2000), prototype system.
Successful results on Lotus Notes data (Lotus CoC):
  Classification: naive Bayes (up to 87% accuracy)
  EUT recognition: Viterbi + Bayes (up to 64% accuracy)

Work in progress:
  Better feature selection (RPC subsequences?)
  Selecting the "best classifier" for the segmentation task
  Learning more sophisticated classifiers (Bayesian networks)
  An information-theoretic approach to segmentation (MDL)

Page 16: Advances in Bayesian Learning


Outline

Introduction

Machine-learning applications in Performance Management

Transaction Recognition
In progress: Event Mining, Probing Strategy, etc.

Bayesian learning tools: extending ABLE

Advancing theory

Summary and future directions

Page 17: Advances in Bayesian Learning

Event Mining: analyzing system event sequences

What is it? Why is it important? Learning system behavior patterns for better performance management.

Example: USAA data
  858 hosts, 136 event types
  67,184 data points (13 days, by second)
  Event examples:
    High-severity events: 'Cisco_Link_Down', 'chassisMinorAlarm_On', etc.
    Low-severity events: 'tcpConnectClose', 'duplicate_ip', etc.

[Plot: events from hosts over time (sec)]

Why is it hard?
  Large, complex systems (networks) with many dependencies; prior models are not always available.
  Many events/hosts; data sets are huge and constantly growing.

Page 18: Advances in Bayesian Learning

1. Learning event dependency models

[Figure: dependency graph over Event1, Event2, ..., EventM, ..., EventN, with an unknown "???" node]

Current approach: learn dynamic probabilistic graphical models (temporal, or dynamic, Bayes nets). Predict:
  time to failure
  event co-occurrence
  existence of hidden nodes ("root causes")

Recognize sequences of high-level system states: an unsupervised version of the EUT recognition problem.

Important issue: incremental learning from data streams.

Page 19: Advances in Bayesian Learning

2. Clustering hosts by their history

Group hosts with similar event sequences (e.g., "problematic" hosts, "silent" hosts). What is an appropriate similarity ("distance") metric?

One example: distance between "compressed" sequences, i.e., event distribution models. Let sequence S_1 yield P_1(e) and sequence S_2 yield P_2(e), where e is the event type. Then

dist(S_1, S_2) = D(P_1 || P_2), where D(P_1 || P_2) = Σ_e P_1(e) log(P_1(e) / P_2(e))

is relative entropy (Kullback-Leibler distance).
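A small Python sketch of this metric; the symmetrization (adding both KL directions) is an assumption for metric-like behavior, since D(P1 || P2) itself is asymmetric:

```python
import math
from collections import Counter

def event_distribution(seq):
    """'Compress' an event sequence into its event-type distribution P(e)."""
    counts = Counter(seq)
    return {e: c / len(seq) for e, c in counts.items()}

def kl(p, q, eps=1e-12):
    """Relative entropy D(P||Q) = sum_e P(e) log(P(e)/Q(e)); eps guards
    against zero probabilities (a practical assumption, not on the slide)."""
    return sum(pe * math.log((pe + eps) / (q.get(e, 0.0) + eps))
               for e, pe in p.items())

def seq_dist(s1, s2):
    """Distance between two hosts' histories, symmetrized KL between their
    compressed event distributions."""
    p1, p2 = event_distribution(s1), event_distribution(s2)
    return kl(p1, p2) + kl(p2, p1)

print(seq_dist(["up", "down", "down"], ["up", "up", "up"]))
```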

Page 20: Advances in Bayesian Learning

Probing strategy (EPP)

Objective: find the probe frequency F that minimizes
1. E(Tprobe - Tstart): failure detection delay, or
2. E(total "failure" time - total "estimated" failure time): gives an accurate performance estimate.

Constraint on the additional load induced by probes: L(F) < MaxLoad.

[Plot: response time vs. time; availability violations are the intervals where response time exceeds the threshold R_max, with start/end times (t_s, t_e); probes sample the response-time curve]
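For intuition only, a toy closed form under assumptions the slide does not make (strictly periodic probes, failures arriving uniformly at random, load linear in probe rate): the expected detection delay is 1/(2F), so the best feasible F sits at the load constraint.

```python
def best_probe_frequency(max_load, load_per_probe):
    """Toy model, not the EPP formulation: with periodic probing at
    frequency F, a failure arriving uniformly at random within a probe
    interval waits 1/(2F) on average before detection, so
    E(Tprobe - Tstart) = 1/(2F) decreases in F and the optimum sits at
    the load constraint L(F) = load_per_probe * F = MaxLoad."""
    f_star = max_load / load_per_probe        # largest feasible frequency
    expected_delay = 1.0 / (2.0 * f_star)     # expected detection delay
    return f_star, expected_delay

print(best_probe_frequency(max_load=10.0, load_per_probe=0.5))  # (20.0, 0.025)
```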

Page 21: Advances in Bayesian Learning

Irina Rish, IBM TJWRC21

Outline

Introduction

Machine-learning applications in Performance Management

Bayesian learning tools: extending ABLE

Advancing theory

Summary and future directions

Page 22: Advances in Bayesian Learning

ABLE: Agent Building and Learning Environment

Page 23: Advances in Bayesian Learning

What is ABLE? What is my contribution?

A Java toolbox for building reasoning and learning agents.
Provides: a visual environment, boolean and fuzzy rules, neural networks, genetic search.

My contributions:
  naïve Bayes classifier (batch and incremental)
  discretization
Future releases: general Bayesian learning and inference tools.

Available at alphaWorks: www.alphaWorks.ibm.com/tech
Project page: w3.rchland.ibm.com/projects/ABLE

Page 24: Advances in Bayesian Learning

How does it work?

Page 25: Advances in Bayesian Learning

Who is using the naïve Bayes tools? Impact on other IBM projects

Video character recognition (w/ C. Dorai):
  Naïve Bayes: 84% accuracy
  Better than SVM on some pairs of characters (average SVM: 87%)
  Current work: combining naïve Bayes with SVMs

Environmental data analysis (w/ Yuan-Chi Chang):
  Learning mortality rates from data on air pollutants
  Naïve Bayes is currently being evaluated

Performance management:
  Event mining: in progress
  EUT recognition: successful results

Page 26: Advances in Bayesian Learning


Outline

Introduction

Machine-learning in Performance Management

Bayesian learning tools: extending ABLE

Advancing theory:

analysis of naïve Bayes classifier

inference in Bayesian Networks

Summary and future directions

Page 27: Advances in Bayesian Learning

Why does naïve Bayes do well? And when?

Class-conditional feature independence: P^(f | class) = ∏_j P(f_j | class)

[Figure: class node with feature nodes f_1, f_2, ..., f_n]

An unrealistic assumption! But why/when does it work? When do the independence assumptions not hurt classification?

Bayes-optimal: class_opt = arg max_i P(class_i | f)
Naïve Bayes: class_NB = arg max_i P^(class_i | f)

Intuition: wrong probability estimates do not necessarily imply wrong classification!

[Plot: the true posterior P(class | f) vs. the NB estimate P^(class | f); the estimate is distorted but peaks at the same class]
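The intuition fits in a few lines of Python with toy numbers: a badly distorted posterior estimate still yields the optimal decision as long as the arg max is preserved.

```python
# Toy numbers: a badly distorted posterior can still give the Bayes-optimal
# decision, because only the arg max matters.
true_posterior = {"Tx1": 0.6, "Tx2": 0.4}   # P(class | f)
nb_estimate    = {"Tx1": 0.9, "Tx2": 0.1}   # P^(class | f): wrong values
assert (max(true_posterior, key=true_posterior.get)
        == max(nb_estimate, key=nb_estimate.get))  # same predicted class
```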

Page 28: Advances in Bayesian Learning

Case 1: functional dependencies

Lemma 1: Naïve Bayes is optimal when features are functionally dependent given the class.

Proof sketch (reconstructed; assumes the feature maps are one-to-one): Let C ∈ {+, -} and write P_+(X) = P(X | C = +), P_-(X) = P(X | C = -). Assume:
1. uniform priors: P(C = +) = P(C = -), and
2. functional dependence: x_i = f_i(x_1) for i = 1, ..., n, in both classes.

The Bayes-optimal decision rule compares P_+(x_1, ..., x_n) with P_-(x_1, ..., x_n); by functional dependence this is just P_+(x_1) vs. P_-(x_1). The naïve Bayes decision rule compares ∏_i P_+(x_i) with ∏_i P_-(x_i); with one-to-one f_i, P_±(x_i) = P_±(x_1), so this is P_+(x_1)^n vs. P_-(x_1)^n. The two comparisons always agree, so naïve Bayes makes the Bayes-optimal decision.
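A tiny enumeration check of the lemma in Python, with hypothetical class-conditionals and one-to-one feature maps as assumed in the proof sketch above:

```python
# Sanity check of Lemma 1 (toy numbers, invertible feature maps assumed):
# binary class with uniform priors, features x_i = f_i(x_1).
P = {"+": {0: 0.8, 1: 0.2}, "-": {0: 0.3, 1: 0.7}}   # hypothetical P(x_1 | class)
f = [lambda x: x, lambda x: 1 - x, lambda x: x]       # deterministic feature maps

def marginal(c, i, v):
    # P(x_i = v | c) = total mass of x_1 values mapped to v by f_i
    return sum(p for x1, p in P[c].items() if f[i](x1) == v)

for x1 in (0, 1):
    xs = [g(x1) for g in f]
    joint = {c: P[c][x1] for c in P}          # P(x_1,...,x_n | c) = P(x_1 | c)
    nb = {c: 1.0 for c in P}                  # naive Bayes: product of marginals
    for i, v in enumerate(xs):
        for c in P:
            nb[c] *= marginal(c, i, v)
    assert max(joint, key=joint.get) == max(nb, key=nb.get)  # same decision
print("naive Bayes matches the Bayes-optimal rule on all inputs")
```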

Page 29: Advances in Bayesian Learning

Case 2: "almost-functional" (low-entropy) distributions

Lemma 2: Naïve Bayes is a "good approximation" for "almost-functional" dependencies: the joint distribution is close to the product of marginals (the independence assumption).

Formally: if P(f_i = a_i) >= 1 - δ for i = 1, ..., n, then

|P(f_1 = a_1, ..., f_n = a_n) - ∏_i P(f_i = a_i)| <= nδ.

[Plot: a class-conditional distribution P(f_i | class) concentrated at f_i = a_i, with deviation δ]

Related practical examples:
  RPC occurrences in EUTs are often almost-deterministic (and NB does well).
  Successful "local inference" in almost-deterministic Bayesian networks (turbo coding, "mini-buckets"; see Dechter & Rish 2000).
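A quick numeric check of the bound in Python, on a hand-built almost-deterministic joint (the construction is illustrative, not from the slide):

```python
# Check |P(f = a) - prod_i P(f_i = a_i)| <= n*delta on a small
# almost-deterministic joint: mass 1 - n*delta sits on a = (1,...,1),
# and mass delta on each point with exactly one zero.
n, delta = 5, 0.01
a = (1,) * n
joint = {a: 1 - n * delta}
for i in range(n):
    joint[tuple(0 if j == i else 1 for j in range(n))] = delta

marginals = [sum(p for x, p in joint.items() if x[i] == 1) for i in range(n)]
product = 1.0
for m in marginals:
    product *= m                 # here: (1 - delta)^n
gap = abs(joint[a] - product)
print(f"gap = {gap:.6f} <= n*delta = {n * delta}")
assert gap <= n * delta
```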

Page 30: Advances in Bayesian Learning

Experimental results support the theory

Random problem generator: uniform P(class); random P(f | class), built as follows (a sketch appears below):
1. A randomly selected entry in P(f | class) is assigned 1 - δ.
2. The remaining entries are filled by uniform random sampling + normalization.

1. Less "noise" (smaller δ) => NB closer to optimal.
2. Feature dependence does NOT correlate with NB error.
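A Python sketch of the generator's inner step, as described above (one row of P(f | class) per call):

```python
import random

def random_conditional(n_values, delta, rng=random):
    """One row P(f | class) from the slide's generator: a randomly selected
    entry gets 1 - delta; the remaining entries are drawn uniformly at
    random and normalized so the row sums to one."""
    row = [rng.random() for _ in range(n_values)]
    peak = rng.randrange(n_values)
    row[peak] = 0.0
    total = sum(row)
    row = [delta * v / total for v in row]   # leftover mass delta, spread out
    row[peak] = 1.0 - delta
    return row

random.seed(0)
print(random_conditional(n_values=4, delta=0.1))
```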

Page 31: Advances in Bayesian Learning

Outline

Introduction

Machine-learning in Performance Management
  Transaction Recognition
  Event Mining

Bayesian learning tools: extending ABLE

Advancing theory:

analysis of naïve Bayes classifier

inference in Bayesian Networks

Summary and future directions

Page 32: Advances in Bayesian Learning

From naïve Bayes to Bayesian networks

Naïve Bayes model: independent features given the class.
[Figure: class node with features f_1, f_2, ..., f_n and CPTs P(f_1 | class), P(f_2 | class), ..., P(f_n | class)]

Bayesian network (BN) model: any joint probability distribution.

[Figure: Smoking (S) -> lung Cancer (C) and Bronchitis (B); X-ray (X) depends on C, S; Dyspnoea (D) depends on C, B]

P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)

CPD for P(D | C, B):

  C  B  D=0  D=1
  0  0  0.1  0.9
  0  1  0.7  0.3
  1  0  0.8  0.2
  1  1  0.9  0.1

Query: P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
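The query can be answered by brute-force enumeration over this factorization. In the Python sketch below, P(D|C,B) is the slide's CPD; all other tables are made-up numbers for illustration:

```python
# Answering the slide's query by enumeration over the factorization
# P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B).
# P(D|C,B) is from the slide's CPD; the other tables are hypothetical.
pS = {0: 0.7, 1: 0.3}
pC_S = {0: {0: 0.99, 1: 0.01}, 1: {0: 0.90, 1: 0.10}}     # pC_S[s][c]
pB_S = {0: {0: 0.80, 1: 0.20}, 1: {0: 0.40, 1: 0.60}}     # pB_S[s][b]
pX_CS = {(0, 0): {0: 0.95, 1: 0.05}, (0, 1): {0: 0.80, 1: 0.20},
         (1, 0): {0: 0.20, 1: 0.80}, (1, 1): {0: 0.10, 1: 0.90}}  # key (c, s)
pD_CB = {(0, 0): {0: 0.1, 1: 0.9}, (0, 1): {0: 0.7, 1: 0.3},
         (1, 0): {0: 0.8, 1: 0.2}, (1, 1): {0: 0.9, 1: 0.1}}      # key (c, b)

def joint(s, c, b, x, d):
    return pS[s] * pC_S[s][c] * pB_S[s][b] * pX_CS[(c, s)][x] * pD_CB[(c, b)][d]

# P(C = 1 | S = 0, D = 1): sum out B and X, then normalize.
num = sum(joint(0, 1, b, x, 1) for b in (0, 1) for x in (0, 1))
den = sum(joint(0, c, b, x, 1) for c in (0, 1) for b in (0, 1) for x in (0, 1))
print(num / den)
```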

Page 33: Advances in Bayesian Learning

Example: Printer Troubleshooting (Microsoft Windows 95) [Heckerman, 95]

[Figure: Bayesian network with nodes including Print Output OK, Correct Driver, Uncorrupted Driver, Correct Printer Path, Net Cable Connected, Net/Local Printing, Printer On and Online, Correct Local Port, Correct Printer Selected, Local Cable Connected, Application Output OK, Print Spooling On, Correct Driver Settings, Printer Memory Adequate, Network Up, Spooled Data OK, GDI Data Input OK, GDI Data Output OK, Print Data OK, PC to Printer Transport OK, Printer Data OK, Spool Process OK, Net Path OK, Local Path OK, Paper Loaded, Local Disk Space Adequate]

Page 34: Advances in Bayesian Learning

How to use Bayesian networks?

[Figure: causes C_1, C_2 with symptoms as descendants]

Diagnosis: P(cause | symptom) = ?
Prediction: P(symptom | cause) = ?
Classification: max_class P(class | data) = ?
MEU decision-making (given a utility function)

Applications: medicine, stock market, bio-informatics, eCommerce, performance management, etc.

Inference problems are NP-complete => approximate algorithms.

Page 35: Advances in Bayesian Learning

Local approximation scheme: "mini-buckets" (paper submitted to JACM)

Idea: reduce the complexity of inference by ignoring some dependencies.

Successfully used for approximating the Most Probable Explanation (MPE), x* = arg max_x P(x). Very efficient on real-life (medical, decoding) and synthetic problems, and yields both lower and upper bounds on the MPE.

[Plot: approximation accuracy vs. noise]

Less "noise" => higher accuracy, similarly to naïve Bayes! A general theory is needed: independence assumptions and "almost-deterministic" distributions.

Potential impact: efficient inference in complex performance-management models (e.g., event mining, system dependence models). A minimal sketch of the bounding idea appears below.
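The bound itself is one inequality: ignoring the dependencies between factors lets each be maximized independently, max_x ∏_k f_k(x) <= ∏_k max f_k. A toy Python sketch (made-up factors; not the paper's mini-bucket algorithm, which partitions buckets more carefully):

```python
import itertools

factors = [  # hypothetical factors over pairs of binary variables
    {"vars": (0, 1), "table": {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7}},
    {"vars": (1, 2), "table": {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.4, (1, 1): 0.6}},
]

def exact_mpe(factors, n_vars=3):
    """Exact MPE value by enumerating all assignments (exponential)."""
    best = 0.0
    for x in itertools.product((0, 1), repeat=n_vars):
        p = 1.0
        for f in factors:
            p *= f["table"][tuple(x[v] for v in f["vars"])]
        best = max(best, p)
    return best

def upper_bound(factors):
    """Ignore inter-factor dependencies: max_x prod f_k <= prod max f_k."""
    b = 1.0
    for f in factors:
        b *= max(f["table"].values())
    return b

print(exact_mpe(factors), "<=", upper_bound(factors))  # 0.72 <= 0.72 here
```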

Page 36: Advances in Bayesian Learning

Summary

Performance management:
  End-user transaction recognition (Lotus CoC): novel method, patent, paper; applied to Lotus Notes.
  In progress: event mining (USAA), probing strategies (EPP).

Machine-learning tools (alphaWorks):
  Extending ABLE with a Bayesian classifier.
  Applying the classifier to other IBM projects: video character recognition, environmental data analysis.

Theory and algorithms:
  Analysis of naïve Bayes accuracy (Research Report).
  Approximate Bayesian inference (submitted paper).
  Patent on meta-learning.

Page 37: Advances in Bayesian Learning

Future directions

Research interest: automated learning and inference, spanning practical problems, generic tools, and theory.

Practical problems / performance management:
  Transaction recognition: better feature selection, segmentation
  Event mining: Bayes net models, clustering
  Web log analysis: segmentation / classification / clustering
  Modeling system dependencies: Bayes nets
  "Technology transfer": a generic approach to "event streams" (EUTs, system events, web page accesses)

Generic tools / ML library / ABLE:
  Bayesian learning: general Bayes nets, temporal BNs, incremental learning
  Bayesian inference: exact inference, approximations
  Other tools: SVMs, decision trees; combined tools, meta-learning tools

Theory / analysis of algorithms:
  Naïve Bayes accuracy: other distribution types
  Accuracy of local inference approximations
  Comparing model selection criteria (e.g., Bayes net learning)
  Relative analysis and combination of classifiers (Bayes / max. margin / DT)
  Incremental learning

Page 38: Advances in Bayesian Learning

Collaborations

Transaction recognition: J. Hellerstein, T. Jayram (Watson)
Event mining: J. Hellerstein, R. Vilalta, S. Ma, C. Perng (Watson)
ABLE: J. Bigus, R. Vilalta (Watson)
Video character recognition: C. Dorai (Watson)
MDL approach to segmentation: B. Dom (Almaden)
Approximate inference in Bayes nets: R. Dechter (UCI)
Meta-learning: R. Vilalta (Watson)
Environmental data analysis: Y. Chang (Watson)

Page 39: Advances in Bayesian Learning

Machine learning discussion group

Weekly seminars: 11:30-2:30 (w/ lunch) in 1S-F40

Active group members: Mark Brodie, Vittorio Castelli, Joe Hellerstein, Daniel Oblinger, Jayram Thathachar, Irina Rish (more people have joined recently)

Agenda:
  discussions of recent ML papers and book chapters ("Pattern Classification" by Duda, Hart, and Stork, 2000)
  brainstorming sessions on particular ML topics
  recent discussions: accuracy of Bayesian classifiers (naïve Bayes)

Web site: http://reswat4.research.ibm.com/projects/mlreadinggroup/mlreadinggroup.nsf/main/toppage