Advances in Bayesian Learning


Page 1: Advances in Bayesian Learning

Machine Learning in Performance Management

Irina Rish, IBM T.J. Watson Research Center

January 24, 2001

Page 2: Advances in Bayesian Learning


Outline

Introduction

Machine learning applications in Performance Management

Bayesian learning tools: extending ABLE

Advancing theory

Summary and future directions

Page 3: Advances in Bayesian Learning

Learning problems: examples

Pattern discovery, classification, diagnosis, and prediction.

System event mining
[Figure: events from hosts plotted over time]

End-user transaction recognition
[Figure: a stream of Remote Procedure Calls (RPCs), R2 R1 R2 R2 R3 R5 R5, segmented into Transaction1 and Transaction2; which EUT (BUY? SELL? OPEN_DB? SEARCH?) produced which RPCs?]

Page 4: Advances in Bayesian Learning

Approach: Bayesian learning

Learn (probabilistic) dependency models: Bayesian networks.

[Figure: Bayesian network over S, C, B, X, D with factors P(S), P(C|S), P(B|S), P(X|C,S), P(D|C,B)]

Diagnosis: P(cause | symptom) = ?
Prediction: P(symptom | cause) = ?
Pattern classification: max_class P(class | data) = ?

Numerous important applications: medicine, stock market, bio-informatics, eCommerce, military, ...

Page 5: Advances in Bayesian Learning


Outline

Introduction

Machine-learning applications in Performance Management

Transaction Recognition
In progress: Event Mining, Probe Placement, etc.

Bayesian learning tools: extending ABLE

Advancing theory

Summary and future directions

Page 6: Advances in Bayesian Learning

End-User Transaction Recognition: why is it important?

[Diagram: a Client Workstation issues End-User Transactions (EUTs), which generate Remote Procedure Calls (RPCs) to a Server (Web, DB, Lotus Notes) within a session (connection)]

Realistic workload models (for testing performance)

Resource management (anticipating requests)

Quantifying end-user perception of performance (response times)

Examples: Lotus Notes (OpenDB, Search, SendMail), Web/eBusiness (on-line stores, travel agencies, trading): database transactions, buy/sell, search, email, etc. Which RPCs implement each EUT?

Page 7: Advances in Bayesian Learning

Why is it hard? Why learn from data?

Example: EUTs and RPCs in Lotus Notes

EUTs: MoveMsgToFolder, FindMailByKey
RPCs:
1. OPEN_COLLECTION
2. UPDATE_COLLECTION
3. DB_REPLINFO_GET
4. GET_MOD_NOTES
5. READ_ENTRIES
6. OPEN_COLLECTION
7. FIND_BY_KEY
8. READ_ENTRIES

Many RPC and EUT types (92 RPCs and 37 EUTs)
Large (unlimited) data sets (10,000+ transaction instances)
Manual classification of a data subset took about a month
Non-deterministic and unknown EUT -> RPC mapping: "noise" sources are client/server states
No client-side instrumentation: unknown EUT boundaries

Page 8: Advances in Bayesian Learning

Our approach: classification + segmentation

Problem 1: label segmented data (classification), similar to text classification.
[Figure: segmented RPCs, e.g. (1 2 3 4), (1 3), (1), are mapped to labeled transactions Tx3, Tx1, Tx2]

Problem 2: both segment and label (EUT recognition), similar to speech understanding and image segmentation.
[Figure: an unsegmented RPC stream, e.g. 1 2 1 3 4 1 2 3 ..., is split into segmented RPCs and labeled transactions Tx1, Tx2, Tx3]

Page 9: Advances in Bayesian Learning

How to represent transactions? "Feature vectors"

[Figure: a transaction of type i generates the RPC sequence R2 R1 R2 R2 R3 R5 R5]

RPC occurrences: f = (1, 1, 1, 0, 1, 0, ...)
RPC counts: f = (1, 3, 1, 0, 2, 0, ...)

Candidate models for the features of transaction type i:

Bernoulli: P(R_ij = 1 | T_i) = p_ij
Multinomial: P(n_i1, ..., n_iM | T_i) = (n_i! / ∏_{j=1..M} n_ij!) ∏_{j=1..M} p_ij^(n_ij)
Geometric: P(n_ij | T_i) = p_ij (1 - p_ij)^(n_ij)
Shifted geometric: P(n_ij | T_i) = p_ij (1 - p_ij)^(n_ij - s_ij)

Best fit to the data (χ² test): shifted geometric.
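To make the two feature vectors concrete, here is a minimal Python sketch (the six-RPC vocabulary is hypothetical; the real trace has 92 RPC types) that extracts both representations from the slide's example sequence:

```python
from collections import Counter

# Hypothetical RPC vocabulary; the real Lotus Notes trace has 92 RPC types.
RPC_TYPES = ["R1", "R2", "R3", "R4", "R5", "R6"]

def feature_vectors(rpc_sequence):
    """Return (occurrence, count) feature vectors for one transaction."""
    counts = Counter(rpc_sequence)
    count_vec = [counts.get(r, 0) for r in RPC_TYPES]   # RPC counts
    occur_vec = [1 if c > 0 else 0 for c in count_vec]  # RPC occurrences
    return occur_vec, count_vec

# The slide's example sequence: R2 R1 R2 R2 R3 R5 R5
occ, cnt = feature_vectors(["R2", "R1", "R2", "R2", "R3", "R5", "R5"])
print(occ)  # [1, 1, 1, 0, 1, 0]
print(cnt)  # [1, 3, 1, 0, 2, 0]
```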

Page 10: Advances in Bayesian Learning

Classification scheme

Training phase:
  Training data (RPCs labeled with EUTs) -> Feature Extraction -> Learning -> Classifier

Operation phase:
  "Test" data (unlabeled RPCs) -> Feature Extraction -> Classification -> EUTs

Page 11: Advances in Bayesian Learning

Our classifier: naïve Bayes (NB)

Features (f_1, ..., f_n): RPC occurrences or RPC counts.
Simplifying ("naïve") assumption: feature independence given the class.

[Figure: class node EUT with prior P(EUT); feature nodes f_1, f_2, ..., f_n with P(f_1|EUT), P(f_2|EUT), ..., P(f_n|EUT)]

1. Training: estimate the parameters P(EUT) and P(f_i | EUT) (e.g., ML-estimates).
2. Classification: given an (unlabeled) instance (f_1, ..., f_n), choose the most likely class (Bayesian decision rule):

EUT* = arg max_i P(EUT_i | f_1, ..., f_n)
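A minimal Python sketch of this train/classify loop for the Bernoulli (RPC-occurrence) variant; the Laplace smoothing is an assumption on top of the slide's "e.g., ML-estimates":

```python
import math
from collections import defaultdict

def train_nb(instances, labels):
    """Estimate P(EUT) and P(f_i = 1 | EUT) from binary occurrence vectors.
    Laplace smoothing is an assumption (the slide only says 'ML-estimates')."""
    n = len(instances[0])
    class_count = defaultdict(int)
    feat_count = defaultdict(lambda: [0] * n)
    for x, y in zip(instances, labels):
        class_count[y] += 1
        for i, v in enumerate(x):
            feat_count[y][i] += v
    prior = {c: class_count[c] / len(labels) for c in class_count}
    cond = {c: [(feat_count[c][i] + 1) / (class_count[c] + 2) for i in range(n)]
            for c in class_count}
    return prior, cond

def classify_nb(x, prior, cond):
    """Bayesian decision rule: arg max_EUT P(EUT) * prod_i P(f_i | EUT),
    computed in log space for numerical stability."""
    def log_posterior(c):
        lp = math.log(prior[c])
        for v, p in zip(x, cond[c]):
            lp += math.log(p if v else 1.0 - p)
        return lp
    return max(prior, key=log_posterior)
```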

Page 12: Advances in Bayesian Learning

Classification results on Lotus CoC data

Baseline classifier: always selects the most frequent transaction.

[Plot: accuracy vs. training set size; NB with Bernoulli, multinomial, or geometric features reaches ~87.2%, NB with shifted geometric ~79.3%, and the baseline ~10.1%]

Significant improvement over the baseline classifier (75%).
NB is simple, efficient, and comparable to state-of-the-art classifiers (SVM: 85-87%, decision tree: 90-92%).
The best-fit distribution (shifted geometric) is not necessarily the best classifier! (?)

Page 13: Advances in Bayesian Learning

Transaction recognition: segmentation + classification

Segmentation task: given an RPC sequence R_{1,n} = r_1 ... r_n, find the segment boundaries V = (i_1, ..., i_m):

r_1 ... r_{i_1} | r_{i_1+1} ... r_{i_2} | ... | r_{i_{k-1}+1} ... r_{i_k} | ...

Objective: the most probable segmentation

V* = arg max_V P(V | R_{1,n}) = arg max_V P(V, R_{1,n})

Dynamic programming (Viterbi search), with the naïve Bayes classifier scoring candidate segments. (Recursive) DP equation:

α_k = max_{1 <= j <= k} α_{j-1} · max_T P(T, R_{j,k}),   α_0 = 1
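A sketch of this DP recursion in Python. The segment score, i.e. max_T P(T, R_{j,k}) under the naïve Bayes model, is passed in as a function, and max_len is a hypothetical cap on segment length:

```python
def segment(rpcs, score, max_len=20):
    """Viterbi-style DP search for the most probable segmentation.
    score(segment) should return max_T P(T, segment) under the chosen
    naive Bayes model (left abstract here)."""
    n = len(rpcs)
    alpha = [0.0] * (n + 1)   # alpha[k] = prob. of best segmentation of rpcs[:k]
    alpha[0] = 1.0
    back = [0] * (n + 1)      # back[k] = start index of the last segment
    for k in range(1, n + 1):
        for j in range(max(0, k - max_len), k):
            cand = alpha[j] * score(rpcs[j:k])
            if cand > alpha[k]:
                alpha[k], back[k] = cand, j
    bounds = []               # recover segment boundaries by backtracking
    k = n
    while k > 0:
        bounds.append((back[k], k))
        k = back[k]
    return list(reversed(bounds))
```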

Page 14: Advances in Bayesian Learning

Transaction recognition results

Model          Classification   Segmentation
Bernoulli      best             second best
Multinomial    best             third best
Geometric      best             fourth best
Shift. Geom.   worst            best

[Plot: recognition accuracy vs. training set size; shifted geometric reaches 64%, ahead of Bernoulli, multinomial, and geometric]

Good EUT recognition accuracy: 64% (a harder problem than classification!).
Reversed order of results: the best classifier is not necessarily the best recognizer! (?) Further research needed.

Page 15: Advances in Bayesian Learning

EUT recognition: summary

A novel approach: learning EUTs from RPCs.
Patent, conference paper (AAAI-2000), prototype system.
Successful results on Lotus Notes data (Lotus CoC):
  Classification: naive Bayes (up to 87% accuracy)
  EUT recognition: Viterbi + Bayes (up to 64% accuracy)

Work in progress:
  Better feature selection (RPC subsequences?)
  Selecting the "best classifier" for the segmentation task
  Learning more sophisticated classifiers (Bayesian networks)
  An information-theoretic approach to segmentation (MDL)

Page 16: Advances in Bayesian Learning


Outline

Introduction

Machine-learning applications in Performance Management

Transaction Recognition
In progress: Event Mining, Probing Strategy, etc.

Bayesian learning tools: extending ABLE

Advancing theory

Summary and future directions

Page 17: Advances in Bayesian Learning

Event Mining: analyzing system event sequences

What is it? Why is it important? Learning system behavior patterns for better performance management.

Example: USAA data
  858 hosts, 136 event types
  67,184 data points (13 days, by second)
  Event examples:
    High-severity events: 'Cisco_Link_Down', 'chassisMinorAlarm_On', etc.
    Low-severity events: 'tcpConnectClose', 'duplicate_ip', etc.

[Plot: events from hosts over time (sec)]

Why is it hard?
  Large, complex systems (networks) with many dependencies; prior models are not always available.
  Many events/hosts; data sets are huge and constantly growing.

Page 18: Advances in Bayesian Learning

1. Learning event dependency models

[Figure: dependency graph over Event1, Event2, ..., EventM, ..., EventN, with an unknown "???" node]

Current approach: learn dynamic probabilistic graphical models (temporal, or dynamic, Bayes nets). Predict:
  time to failure
  event co-occurrence
  existence of hidden nodes ("root causes")

Recognize sequences of high-level system states: an unsupervised version of the EUT recognition problem.

Important issue: incremental learning from data streams.

Page 19: Advances in Bayesian Learning

2. Clustering hosts by their history

Group hosts with similar event sequences (e.g., "problematic" hosts, "silent" hosts). What is an appropriate similarity ("distance") metric?

One example: distance between "compressed" sequences, i.e., event distribution models. Let sequence S_1 yield P_1(e) and sequence S_2 yield P_2(e), where e is the event type. Then

dist(S_1, S_2) = D(P_1 || P_2), where D(P_1 || P_2) = Σ_e P_1(e) log(P_1(e) / P_2(e))

is relative entropy (Kullback-Leibler distance).
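A small Python sketch of this metric; the symmetrization (adding both KL directions) is an assumption for metric-like behavior, since D(P1 || P2) itself is asymmetric:

```python
import math
from collections import Counter

def event_distribution(seq):
    """'Compress' an event sequence into its event-type distribution P(e)."""
    counts = Counter(seq)
    return {e: c / len(seq) for e, c in counts.items()}

def kl(p, q, eps=1e-12):
    """Relative entropy D(P||Q) = sum_e P(e) log(P(e)/Q(e)); eps guards
    against zero probabilities (a practical assumption, not on the slide)."""
    return sum(pe * math.log((pe + eps) / (q.get(e, 0.0) + eps))
               for e, pe in p.items())

def seq_dist(s1, s2):
    """Distance between two hosts' histories, symmetrized KL between their
    compressed event distributions."""
    p1, p2 = event_distribution(s1), event_distribution(s2)
    return kl(p1, p2) + kl(p2, p1)

print(seq_dist(["up", "down", "down"], ["up", "up", "up"]))
```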

Page 20: Advances in Bayesian Learning

Probing strategy (EPP)

Objective: find the probe frequency F that minimizes
1. E(Tprobe - Tstart): failure detection delay, or
2. E(total "failure" time - total "estimated" failure time): gives an accurate performance estimate.

Constraint on the additional load induced by probes: L(F) < MaxLoad.

[Plot: response time vs. time; availability violations are the intervals where response time exceeds the threshold R_max, with start/end times (t_s, t_e); probes sample the response-time curve]
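For intuition only, a toy closed form under assumptions the slide does not make (strictly periodic probes, failures arriving uniformly at random, load linear in probe rate): the expected detection delay is 1/(2F), so the best feasible F sits at the load constraint.

```python
def best_probe_frequency(max_load, load_per_probe):
    """Toy model, not the EPP formulation: with periodic probing at
    frequency F, a failure arriving uniformly at random within a probe
    interval waits 1/(2F) on average before detection, so
    E(Tprobe - Tstart) = 1/(2F) decreases in F and the optimum sits at
    the load constraint L(F) = load_per_probe * F = MaxLoad."""
    f_star = max_load / load_per_probe        # largest feasible frequency
    expected_delay = 1.0 / (2.0 * f_star)     # expected detection delay
    return f_star, expected_delay

print(best_probe_frequency(max_load=10.0, load_per_probe=0.5))  # (20.0, 0.025)
```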

Page 21: Advances in Bayesian Learning

Irina Rish, IBM TJWRC21

Outline

Introduction

Machine-learning applications in Performance Management

Bayesian learning tools: extending ABLE

Advancing theory

Summary and future directions

Page 22: Advances in Bayesian Learning

ABLE: Agent Building and Learning Environment

Page 23: Advances in Bayesian Learning

What is ABLE? What is my contribution?

A Java toolbox for building reasoning and learning agents.
Provides: a visual environment, boolean and fuzzy rules, neural networks, genetic search.

My contributions:
  naïve Bayes classifier (batch and incremental)
  discretization
Future releases: general Bayesian learning and inference tools.

Available at alphaWorks: www.alphaWorks.ibm.com/tech
Project page: w3.rchland.ibm.com/projects/ABLE

Page 24: Advances in Bayesian Learning

How does it work?

Page 25: Advances in Bayesian Learning

Who is using the naïve Bayes tools? Impact on other IBM projects

Video character recognition (w/ C. Dorai):
  Naïve Bayes: 84% accuracy
  Better than SVM on some pairs of characters (average SVM: 87%)
  Current work: combining naïve Bayes with SVMs

Environmental data analysis (w/ Yuan-Chi Chang):
  Learning mortality rates from data on air pollutants
  Naïve Bayes is currently being evaluated

Performance management:
  Event mining: in progress
  EUT recognition: successful results

Page 26: Advances in Bayesian Learning


Outline

Introduction

Machine-learning in Performance Management

Bayesian learning tools: extending ABLE

Advancing theory:

analysis of naïve Bayes classifier

inference in Bayesian Networks

Summary and future directions

Page 27: Advances in Bayesian Learning

Why does naïve Bayes do well? And when?

Class-conditional feature independence: P^(f | class) = ∏_j P(f_j | class)

[Figure: class node with feature nodes f_1, f_2, ..., f_n]

An unrealistic assumption! But why/when does it work? When do the independence assumptions not hurt classification?

Bayes-optimal: class_opt = arg max_i P(class_i | f)
Naïve Bayes: class_NB = arg max_i P^(class_i | f)

Intuition: wrong probability estimates do not necessarily imply wrong classification!

[Plot: the true posterior P(class | f) vs. the NB estimate P^(class | f); the estimate is distorted but peaks at the same class]
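The intuition fits in a few lines of Python with toy numbers: a badly distorted posterior estimate still yields the optimal decision as long as the arg max is preserved.

```python
# Toy numbers: a badly distorted posterior can still give the Bayes-optimal
# decision, because only the arg max matters.
true_posterior = {"Tx1": 0.6, "Tx2": 0.4}   # P(class | f)
nb_estimate    = {"Tx1": 0.9, "Tx2": 0.1}   # P^(class | f): wrong values
assert (max(true_posterior, key=true_posterior.get)
        == max(nb_estimate, key=nb_estimate.get))  # same predicted class
```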

Page 28: Advances in Bayesian Learning

Case 1: functional dependencies

Lemma 1: Naïve Bayes is optimal when features are functionally dependent given the class.

Proof sketch (reconstructed; assumes the feature maps are one-to-one): Let C ∈ {+, -} and write P_+(X) = P(X | C = +), P_-(X) = P(X | C = -). Assume:
1. uniform priors: P(C = +) = P(C = -), and
2. functional dependence: x_i = f_i(x_1) for i = 1, ..., n, in both classes.

The Bayes-optimal decision rule compares P_+(x_1, ..., x_n) with P_-(x_1, ..., x_n); by functional dependence this is just P_+(x_1) vs. P_-(x_1). The naïve Bayes decision rule compares ∏_i P_+(x_i) with ∏_i P_-(x_i); with one-to-one f_i, P_±(x_i) = P_±(x_1), so this is P_+(x_1)^n vs. P_-(x_1)^n. The two comparisons always agree, so naïve Bayes makes the Bayes-optimal decision.
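A tiny enumeration check of the lemma in Python, with hypothetical class-conditionals and one-to-one feature maps as assumed in the proof sketch above:

```python
# Sanity check of Lemma 1 (toy numbers, invertible feature maps assumed):
# binary class with uniform priors, features x_i = f_i(x_1).
P = {"+": {0: 0.8, 1: 0.2}, "-": {0: 0.3, 1: 0.7}}   # hypothetical P(x_1 | class)
f = [lambda x: x, lambda x: 1 - x, lambda x: x]       # deterministic feature maps

def marginal(c, i, v):
    # P(x_i = v | c) = total mass of x_1 values mapped to v by f_i
    return sum(p for x1, p in P[c].items() if f[i](x1) == v)

for x1 in (0, 1):
    xs = [g(x1) for g in f]
    joint = {c: P[c][x1] for c in P}          # P(x_1,...,x_n | c) = P(x_1 | c)
    nb = {c: 1.0 for c in P}                  # naive Bayes: product of marginals
    for i, v in enumerate(xs):
        for c in P:
            nb[c] *= marginal(c, i, v)
    assert max(joint, key=joint.get) == max(nb, key=nb.get)  # same decision
print("naive Bayes matches the Bayes-optimal rule on all inputs")
```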

Page 29: Advances in Bayesian Learning

Case 2: "almost-functional" (low-entropy) distributions

Lemma 2: Naïve Bayes is a "good approximation" for "almost-functional" dependencies: the joint distribution is close to the product of marginals (the independence assumption).

Formally: if P(f_i = a_i) >= 1 - δ for i = 1, ..., n, then

|P(f_1 = a_1, ..., f_n = a_n) - ∏_i P(f_i = a_i)| <= nδ.

[Plot: a class-conditional distribution P(f_i | class) concentrated at f_i = a_i, with deviation δ]

Related practical examples:
  RPC occurrences in EUTs are often almost-deterministic (and NB does well).
  Successful "local inference" in almost-deterministic Bayesian networks (turbo coding, "mini-buckets"; see Dechter & Rish 2000).
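A quick numeric check of the bound in Python, on a hand-built almost-deterministic joint (the construction is illustrative, not from the slide):

```python
# Check |P(f = a) - prod_i P(f_i = a_i)| <= n*delta on a small
# almost-deterministic joint: mass 1 - n*delta sits on a = (1,...,1),
# and mass delta on each point with exactly one zero.
n, delta = 5, 0.01
a = (1,) * n
joint = {a: 1 - n * delta}
for i in range(n):
    joint[tuple(0 if j == i else 1 for j in range(n))] = delta

marginals = [sum(p for x, p in joint.items() if x[i] == 1) for i in range(n)]
product = 1.0
for m in marginals:
    product *= m                 # here: (1 - delta)^n
gap = abs(joint[a] - product)
print(f"gap = {gap:.6f} <= n*delta = {n * delta}")
assert gap <= n * delta
```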

Page 30: Advances in Bayesian Learning

Experimental results support the theory

Random problem generator: uniform P(class); random P(f | class), built as follows (a sketch appears below):
1. A randomly selected entry in P(f | class) is assigned 1 - δ.
2. The remaining entries are filled by uniform random sampling + normalization.

1. Less "noise" (smaller δ) => NB closer to optimal.
2. Feature dependence does NOT correlate with NB error.
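A Python sketch of the generator's inner step, as described above (one row of P(f | class) per call):

```python
import random

def random_conditional(n_values, delta, rng=random):
    """One row P(f | class) from the slide's generator: a randomly selected
    entry gets 1 - delta; the remaining entries are drawn uniformly at
    random and normalized so the row sums to one."""
    row = [rng.random() for _ in range(n_values)]
    peak = rng.randrange(n_values)
    row[peak] = 0.0
    total = sum(row)
    row = [delta * v / total for v in row]   # leftover mass delta, spread out
    row[peak] = 1.0 - delta
    return row

random.seed(0)
print(random_conditional(n_values=4, delta=0.1))
```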

Page 31: Advances in Bayesian Learning

Outline

Introduction

Machine-learning in Performance Management
  Transaction Recognition
  Event Mining

Bayesian learning tools: extending ABLE

Advancing theory:

analysis of naïve Bayes classifier

inference in Bayesian Networks

Summary and future directions

Page 32: Advances in Bayesian Learning

From naïve Bayes to Bayesian networks

Naïve Bayes model: independent features given the class.
[Figure: class node with features f_1, f_2, ..., f_n and CPTs P(f_1 | class), P(f_2 | class), ..., P(f_n | class)]

Bayesian network (BN) model: any joint probability distribution.

[Figure: Smoking (S) -> lung Cancer (C) and Bronchitis (B); X-ray (X) depends on C, S; Dyspnoea (D) depends on C, B]

P(S, C, B, X, D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B)

CPD for P(D | C, B):

  C  B  D=0  D=1
  0  0  0.1  0.9
  0  1  0.7  0.3
  1  0  0.8  0.2
  1  1  0.9  0.1

Query: P(lung cancer = yes | smoking = no, dyspnoea = yes) = ?
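The query can be answered by brute-force enumeration over this factorization. In the Python sketch below, P(D|C,B) is the slide's CPD; all other tables are made-up numbers for illustration:

```python
# Answering the slide's query by enumeration over the factorization
# P(S,C,B,X,D) = P(S) P(C|S) P(B|S) P(X|C,S) P(D|C,B).
# P(D|C,B) is from the slide's CPD; the other tables are hypothetical.
pS = {0: 0.7, 1: 0.3}
pC_S = {0: {0: 0.99, 1: 0.01}, 1: {0: 0.90, 1: 0.10}}     # pC_S[s][c]
pB_S = {0: {0: 0.80, 1: 0.20}, 1: {0: 0.40, 1: 0.60}}     # pB_S[s][b]
pX_CS = {(0, 0): {0: 0.95, 1: 0.05}, (0, 1): {0: 0.80, 1: 0.20},
         (1, 0): {0: 0.20, 1: 0.80}, (1, 1): {0: 0.10, 1: 0.90}}  # key (c, s)
pD_CB = {(0, 0): {0: 0.1, 1: 0.9}, (0, 1): {0: 0.7, 1: 0.3},
         (1, 0): {0: 0.8, 1: 0.2}, (1, 1): {0: 0.9, 1: 0.1}}      # key (c, b)

def joint(s, c, b, x, d):
    return pS[s] * pC_S[s][c] * pB_S[s][b] * pX_CS[(c, s)][x] * pD_CB[(c, b)][d]

# P(C = 1 | S = 0, D = 1): sum out B and X, then normalize.
num = sum(joint(0, 1, b, x, 1) for b in (0, 1) for x in (0, 1))
den = sum(joint(0, c, b, x, 1) for c in (0, 1) for b in (0, 1) for x in (0, 1))
print(num / den)
```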

Page 33: Advances in Bayesian Learning

Example: Printer Troubleshooting (Microsoft Windows 95) [Heckerman, 95]

[Figure: Bayesian network with nodes including Print Output OK, Correct Driver, Uncorrupted Driver, Correct Printer Path, Net Cable Connected, Net/Local Printing, Printer On and Online, Correct Local Port, Correct Printer Selected, Local Cable Connected, Application Output OK, Print Spooling On, Correct Driver Settings, Printer Memory Adequate, Network Up, Spooled Data OK, GDI Data Input OK, GDI Data Output OK, Print Data OK, PC to Printer Transport OK, Printer Data OK, Spool Process OK, Net Path OK, Local Path OK, Paper Loaded, Local Disk Space Adequate]

Page 34: Advances in Bayesian Learning

How to use Bayesian networks?

[Figure: causes C_1, C_2 with symptoms as descendants]

Diagnosis: P(cause | symptom) = ?
Prediction: P(symptom | cause) = ?
Classification: max_class P(class | data) = ?
MEU decision-making (given a utility function)

Applications: medicine, stock market, bio-informatics, eCommerce, performance management, etc.

Inference problems are NP-complete => approximate algorithms.

Page 35: Advances in Bayesian Learning

Local approximation scheme: "mini-buckets" (paper submitted to JACM)

Idea: reduce the complexity of inference by ignoring some dependencies.

Successfully used for approximating the Most Probable Explanation (MPE), x* = arg max_x P(x). Very efficient on real-life (medical, decoding) and synthetic problems, and yields both lower and upper bounds on the MPE.

[Plot: approximation accuracy vs. noise]

Less "noise" => higher accuracy, similarly to naïve Bayes! A general theory is needed: independence assumptions and "almost-deterministic" distributions.

Potential impact: efficient inference in complex performance-management models (e.g., event mining, system dependence models). A minimal sketch of the bounding idea appears below.
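The bound itself is one inequality: ignoring the dependencies between factors lets each be maximized independently, max_x ∏_k f_k(x) <= ∏_k max f_k. A toy Python sketch (made-up factors; not the paper's mini-bucket algorithm, which partitions buckets more carefully):

```python
import itertools

factors = [  # hypothetical factors over pairs of binary variables
    {"vars": (0, 1), "table": {(0, 0): 0.9, (0, 1): 0.1, (1, 0): 0.3, (1, 1): 0.7}},
    {"vars": (1, 2), "table": {(0, 0): 0.8, (0, 1): 0.2, (1, 0): 0.4, (1, 1): 0.6}},
]

def exact_mpe(factors, n_vars=3):
    """Exact MPE value by enumerating all assignments (exponential)."""
    best = 0.0
    for x in itertools.product((0, 1), repeat=n_vars):
        p = 1.0
        for f in factors:
            p *= f["table"][tuple(x[v] for v in f["vars"])]
        best = max(best, p)
    return best

def upper_bound(factors):
    """Ignore inter-factor dependencies: max_x prod f_k <= prod max f_k."""
    b = 1.0
    for f in factors:
        b *= max(f["table"].values())
    return b

print(exact_mpe(factors), "<=", upper_bound(factors))  # 0.72 <= 0.72 here
```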

Page 36: Advances in Bayesian Learning

Summary

Performance management:
  End-user transaction recognition (Lotus CoC): novel method, patent, paper; applied to Lotus Notes.
  In progress: event mining (USAA), probing strategies (EPP).

Machine-learning tools (alphaWorks):
  Extending ABLE with a Bayesian classifier.
  Applying the classifier to other IBM projects: video character recognition, environmental data analysis.

Theory and algorithms:
  Analysis of naïve Bayes accuracy (Research Report).
  Approximate Bayesian inference (submitted paper).
  Patent on meta-learning.

Page 37: Advances in Bayesian Learning

Future directions

Research interest: automated learning and inference, spanning practical problems, generic tools, and theory.

Practical problems / performance management:
  Transaction recognition: better feature selection, segmentation
  Event mining: Bayes net models, clustering
  Web log analysis: segmentation / classification / clustering
  Modeling system dependencies: Bayes nets
  "Technology transfer": a generic approach to "event streams" (EUTs, system events, web page accesses)

Generic tools / ML library / ABLE:
  Bayesian learning: general Bayes nets, temporal BNs, incremental learning
  Bayesian inference: exact inference, approximations
  Other tools: SVMs, decision trees; combined tools, meta-learning tools

Theory / analysis of algorithms:
  Naïve Bayes accuracy: other distribution types
  Accuracy of local inference approximations
  Comparing model selection criteria (e.g., Bayes net learning)
  Relative analysis and combination of classifiers (Bayes / max. margin / DT)
  Incremental learning

Page 38: Advances in Bayesian Learning

Collaborations

Transaction recognition: J. Hellerstein, T. Jayram (Watson)
Event mining: J. Hellerstein, R. Vilalta, S. Ma, C. Perng (Watson)
ABLE: J. Bigus, R. Vilalta (Watson)
Video character recognition: C. Dorai (Watson)
MDL approach to segmentation: B. Dom (Almaden)
Approximate inference in Bayes nets: R. Dechter (UCI)
Meta-learning: R. Vilalta (Watson)
Environmental data analysis: Y. Chang (Watson)

Page 39: Advances in Bayesian Learning

Machine learning discussion group

Weekly seminars: 11:30-2:30 (w/ lunch) in 1S-F40

Active group members: Mark Brodie, Vittorio Castelli, Joe Hellerstein, Daniel Oblinger, Jayram Thathachar, Irina Rish (more people have joined recently)

Agenda:
  discussions of recent ML papers and book chapters ("Pattern Classification" by Duda, Hart, and Stork, 2000)
  brainstorming sessions on particular ML topics
  recent discussions: accuracy of Bayesian classifiers (naïve Bayes)

Web site: http://reswat4.research.ibm.com/projects/mlreadinggroup/mlreadinggroup.nsf/main/toppage