Exploiting Common SubRelations: Learning One Belief Net for Many Classification Tasks

R. Greiner, Wei Zhou
University of Alberta

Situation

CHALLENGE: Need to learn k classifiers:
  Cancer, from medical symptoms
  Meningitis, from medical symptoms
  Hepatitis, from medical symptoms
  …

Option 1: Learn k different classifier systems {S_Cancer, S_Menin, …, S_k}
  Then use S_i to deal with the i-th “query class”
  but… need to re-learn the inter-relations among Factors, Symptoms, … common to all k classifiers

Common Interrelationships

[Figure: the Cancer and Menin classifiers share the same inter-relationships among the remaining variables]

Use Common Structure!

CHALLENGE: Need to learn k classifiers:
  Cancer, from medical symptoms
  Meningitis, from symptoms
  Hepatitis, from symptoms
  …

Option 2: Learn 1 “structure” S of relationships, then use S to address all k classification tasks

Actual Approach: Learn 1 Bayesian Belief Net, inter-relating info for all k types of queries

Outline

Motivation
  Handle multiple class variables
Framework
  Formal model: Belief Nets, …-classifier
Results
  Theoretical Analysis
  Algorithms (Likelihood vs Conditional Likelihood)
  Empirical Comparison: 1 Structure vs k Structures; LL vs LCL
Contributions

Training data: labeled, partially-specified tuples (· = unspecified)

Cancer  Menin  Gender  Age  Smoke  Height  Btest
T       ·      F       35   T      ·       ·
F       ·      M       25   ·      6’      ·
·       T      F       ·    ·      ·       t
F       T      ·       ·    ·      5’3”    t

MultiClassifier answering queries:

Cancer  Menin  Gender  Age  Smoke  Height  Btest
?       ·      M       18   T      ·       f
·       ?      M       ·    ·      5’0”    f

        ⇓ MC fills in the queried values ⇓

Cancer  Menin  Gender  Age  Smoke  Height  Btest
T       ·      M       18   T      ·       f
·       F      M       ·    ·      5’0”    f

[Figure: Training Data → MC-Learner → MC; given a query “Q=?, E=e”, MC returns the value Q = q]

Multi-Classifier I/O

Given “query”: “class variable” Q and “evidence” E=e
  e.g., Cancer=?, given Gender=F, Age=35, Smoke=t
Answer: Cancer = Yes

Cancer  Menin  Gender  Age  Smoke  Height  Btest
?       ·      F       35   T      ·       ·
        ⇓
Yes     ·      F       35   T      ·       ·

MultiClassifier

Like standard classifiers, can deal with
  different evidence variables E
  different evidence values e

Unlike standard classifiers, can deal with different class variables Q:
  MC(Cancer; Gender=M, Age=25, Height=6’) = No
  MC(Meningitis; Gender=F, BloodTest=t) = Severe

Able to “answer queries” / classify new unlabeled tuples:
  Given “Q=?, given E=e”, return “q”

MC-Learner’s I/O

Input: set of “queries” (labeled, partially-specified tuples)
  ≈ input to standard (partial-data) learners
Output: MultiClassifier

Query var    Evidence vars
Cancer = t   Gender=F, Age=35, Smoke=t
Cancer = f   Gender=M, Age=25, Height=6’
Menin = t    Gender=F, Btest=t

Error Measure

“Labeled query”: ⟨[Q, E=e], q⟩
Query distribution: Prob([Q, E=e] asked)
  …can be uncorrelated with the “tuple distribution”

MultiClassifier MC returns MC(Q, E=e) = q′

Classification error of MC:
  CE(MC) = Σ_⟨[Q,E=e],q⟩ Prob([Q, E=e] asked) · [| MC(Q, E=e) ≠ q |]

  where [| a ≠ b |] = 1 if a ≠ b, 0 otherwise (“0/1” error)
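The CE measure above can be sketched directly, assuming a toy multiclassifier interface (the `mc(query_var, evidence)` callable and the example queries below are illustrative, not from the talk):

```python
# Sketch (not the authors' code): classification error of a multiclassifier,
# weighted by the query distribution, as in CE(MC) above.

def classification_error(mc, labeled_queries):
    """labeled_queries: list of (prob_asked, query_var, evidence, answer).

    mc(query_var, evidence) returns the predicted value; CE sums the
    probability mass of the queries the multiclassifier gets wrong ("0/1" error).
    """
    return sum(p for p, q, e, ans in labeled_queries if mc(q, e) != ans)

# Hypothetical multiclassifier that always answers "t":
always_t = lambda q, e: "t"
queries = [
    (0.5, "Cancer", {"Gender": "F", "Age": 35}, "t"),
    (0.3, "Cancer", {"Gender": "M", "Age": 25}, "f"),
    (0.2, "Menin",  {"Gender": "F", "Btest": "t"}, "t"),
]
print(classification_error(always_t, queries))  # 0.3 (only the second query is wrong)
```

Note the weights are probabilities of *being asked*, not tuple probabilities, which is the point of the next slides.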

Learner’s Task

Given:
  space of “MultiClassifiers” { MC_i }
  sample of labeled queries, drawn from the “query distribution”

Find:
  MC* = argmin_{MC_i} { CE(MC_i) }
  i.e., the multiclassifier with minimal error over the query distribution.

Outline

Motivation
  Handle multiple class variables
Framework
  Formal model: Belief Nets, …-classifier
Results
  Theoretical Analysis
  Algorithms (Likelihood vs Conditional Likelihood)
  Empirical Comparison: 1 Structure vs k Structures; LL vs LCL
Contributions

Simple Belief Net

H → B, H → J

P(J | H, B=0) = P(J | H, B=1) ∀ J, H  ⟹  P(J | H, B) = P(J | H)
J is INDEPENDENT of B, once we know H ⟹ don’t need the B → J arc!

P(H=1)
0.05

h   P(B=1 | H=h)
1   0.95
0   0.03

h   P(J=1 | H=h)
1   0.8
0   0.3

Example of a Belief Net

Simple belief net: H → B, H → J, B → J

P(H=0)  P(H=1)
0.95    0.05

h   P(B=0 | H=h)  P(B=1 | H=h)
0   0.97          0.03
1   0.05          0.95

h   b   P(J=0 | h,b)  P(J=1 | h,b)
0   0   0.7           0.3
0   1   0.7           0.3
1   0   0.2           0.8
1   1   0.2           0.8

Node ~ Variable
Link ~ “Causal dependency”
“CPTable” ~ P(child | parents)

Encoding Causal Links

H → B, H → J

P(J | H, B=0) = P(J | H, B=1) ∀ J, H  ⟹  P(J | H, B) = P(J | H)
J is INDEPENDENT of B, once we know H ⟹ don’t need the B → J arc!

P(H=1)
0.05

h   P(B=1 | H=h)
1   0.95
0   0.03

Since the b=1 and b=0 rows agree, the J CPtable collapses:

h   b   P(J=1 | h,b)           h   P(J=1 | H=h)
1   1   0.8                    1   0.8
1   0   0.8            ⟹       0   0.3
0   1   0.3
0   0   0.3

Include Only Causal Links

Sufficient belief net: H → B, H → J

Requires only:
  P(H=1)
  P(J=1 | H=h)
  P(B=1 | H=h)
(Only 5 parameters, not 7)

P(H=1)
0.05

h   P(B=1 | H=h)     h   P(J=1 | H=h)
1   0.95             1   0.8
0   0.03             0   0.3

Hence:
  P(H=1 | J=0, B=1) ∝ P(H=1) · P(J=0 | H=1) · P(B=1 | J=0, H=1)
                    = P(H=1) · P(J=0 | H=1) · P(B=1 | H=1)

BeliefNet as (Multi)Classifier

For query [Q, E=e], the BN returns a distribution:
  P_BN(Q=q1 | E=e), P_BN(Q=q2 | E=e), …, P_BN(Q=qm | E=e)

(Multi)Classifier: MC_BN(Q, E=e) = argmax_{qi} { P_BN(Q=qi | E=e) }

[Figure: bar chart of Prob over q1, q2, q3, …, qm]
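The argmax rule can be sketched on the slides’ H → B, H → J net, assuming brute-force enumeration inference (fine at this scale; `posterior` and `mc` are illustrative names, not the paper’s code):

```python
import itertools

# The H->B, H->J net from the slides, used as a multiclassifier via
# enumeration: MC(Q, E=e) = argmax_q P(Q=q | E=e).

P_H = {1: 0.05, 0: 0.95}
P_B_given_H = {1: 0.95, 0: 0.03}          # P(B=1 | H=h)
P_J_given_H = {1: 0.8, 0: 0.3}            # P(J=1 | H=h)

def joint(h, b, j):
    pb = P_B_given_H[h] if b == 1 else 1 - P_B_given_H[h]
    pj = P_J_given_H[h] if j == 1 else 1 - P_J_given_H[h]
    return P_H[h] * pb * pj

def posterior(query_var, evidence):
    # Enumerate all worlds consistent with the evidence, marginalize, normalize.
    scores = {0: 0.0, 1: 0.0}
    for h, b, j in itertools.product([0, 1], repeat=3):
        world = {"H": h, "B": b, "J": j}
        if all(world[v] == val for v, val in evidence.items()):
            scores[world[query_var]] += joint(h, b, j)
    z = scores[0] + scores[1]
    return {q: s / z for q, s in scores.items()}

def mc(query_var, evidence):
    dist = posterior(query_var, evidence)
    return max(dist, key=dist.get)

post = posterior("H", {"J": 0, "B": 1})
print(round(post[1], 4))            # 0.3226 = P(H=1 | J=0, B=1)
print(mc("H", {"J": 0, "B": 1}))    # 0
```

This reproduces the inference on the earlier slide: P(H=1)·P(J=0|H=1)·P(B=1|H=1) = 0.05·0.2·0.95 = 0.0095, normalized against the H=0 case, gives ≈ 0.3226, so the classifier answers H=0.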

Learning Belief Nets

Belief Net = ⟨G, Θ⟩
  G = directed acyclic graph (“structure”: what’s related to what)
  Θ = “parameters”: strength of connections

Learning Belief Net ⟨G, Θ⟩ from “data”:
  1. Learn structure G
  2. Find parameters Θ that are best, for G

Our focus: #2 (parameters); “best” = minimal CE-error

Learning BN Multi-Classifier

Input: structure G + labeled queries
Goal: find CPtables Θ to minimize CE error

  Θ* = argmin_Θ { Σ_⟨[Q,E=e],q⟩ Prob([Q, E=e] asked) · [| MC_{G,Θ}(Q, E=e) ≠ q |] }

Issues

Q1: How many labeled queries are required?

Q2: How hard is learning, given distributional info?

Q3: What is the best algorithm for learning…
  … a Belief Net?
  … a Belief Net Classifier?
  … a Belief Net Multiclassifier?

Q1, Q2: Theoretical Results

PAC(ε,δ)-learn CPtables:
  Given BN structure, find CPtables whose CE-error is, with probability 1−δ, within ε of optimal.

Sample Complexity: for a BN structure with N variables, K CPtable entries, and ε, δ > 0, a sample of
  M(ε,δ) = O( (K²/ε²) · log(K/δ) )
labeled queries suffices (constants omitted).

Computational Complexity: NP-hard to find CPtables with minimal CE error (even to within additive O(1/N)) from labeled queries … even given the structure!

Use Conditional Likelihood

Goal: minimize “classification error”, based on training sample ⟨[Qi, Ei=ei], qi*⟩

Sample typically includes
  high-probability queries [Q, E=e]
  only the most likely answers to these queries: q* = argmax_q { P(Q=q | E=e) }

⟹ Maximize Conditional Likelihood:
  LCL_D(Θ) = Σ_{[q*,e] ∈ D} log P_Θ(Q=q* | E=e)

As NP-hard… use gradient methods (ILQ). Not the standard model? (see backup slides)
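The LCL objective above is just a sum of log conditional probabilities of the sampled answers; a minimal sketch, assuming a hypothetical model interface that maps (answer, evidence) to P_Θ(Q=q* | E=e):

```python
import math

# Sketch of the LCL objective: higher is better; maximizing it pushes the
# model's conditional probabilities toward the observed answers.

def lcl(cond_prob, sample):
    """sample: list of (q_star, evidence) labeled queries;
    cond_prob[(q_star, evidence)] = P_Theta(Q=q_star | E=evidence)."""
    return sum(math.log(cond_prob[(q_star, e)]) for q_star, e in sample)

# Hypothetical model probabilities for two labeled queries:
model = {("t", "Gender=F,Age=35"): 0.9, ("f", "Gender=M,Age=25"): 0.6}
sample = [("t", "Gender=F,Age=35"), ("f", "Gender=M,Age=25")]
print(round(lcl(model, sample), 4))  # log(0.9) + log(0.6) ≈ -0.6162
```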

Gradient Descent Alg: ILQ

How to change CPtable entry θ_{c|f} = B(C=c | F=f), given datum “[Q=q, E=e]”?

[Figure: node C with parents F1, F2, whose CPtable P(C | f1, f2) contains the entry θ_{c|f}, inside a net relating Q and E]

Descend along the derivative:

  d LCL(q|e) / d θ_{c|f} = (1/θ_{c|f}) · [ B(c, f | q, e) − B(c, f | e) ]

+ sum over queries “[Q=q, E=e]”, conjugate gradient, …

Better Algorithm: ILQ′

Constrained optimization: θ_{c|f} ≥ 0, θ_{c=0|f} + θ_{c=1|f} = 1

New parameterization β_{c|f} (softmax):

  θ_{c|f} = e^{β_{c|f}} / Σ_{c′} e^{β_{c′|f}}

  (for each “row” rj, set β_{c0|rj} = 0 for one value c0)

  d LCL(q|e) / d β_{c|f} = [ B(c,f | q,e) − θ_{c|f} B(f | q,e) ] − [ B(c,f | e) − θ_{c|f} B(f | e) ]

[Figure: node C with parents F1, F2; CPtable P(C | f1, f2) contains the entry θ_{c|f}]
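The softmax trick can be exercised in the simplest setting I can hedge: one binary class C with one binary evidence variable F, where C’s family is fully observed, so the derivative above reduces to [c = c_obs] − θ_{c|f}. This is an illustrative special case, not the authors’ ILQ implementation:

```python
import math

# Gradient-ascent sketch of the ILQ idea: CPtable theta[f][c] = P(C=c | F=f)
# reparameterized through a softmax over beta, conditional log-likelihood
# maximized by following the (here degenerate) derivative
#   d LCL / d beta_{c|f} = [c = c_obs] - theta_{c|f}  (fully observed family).

def softmax(row):
    exps = [math.exp(b) for b in row]
    z = sum(exps)
    return [x / z for x in exps]

def ilq_sketch(data, steps=500, lr=0.1):
    """data: list of (f, c) pairs. Returns theta[f][c] maximizing the LCL."""
    beta = [[0.0, 0.0], [0.0, 0.0]]          # beta[f][c], unconstrained
    for _ in range(steps):
        grad = [[0.0, 0.0], [0.0, 0.0]]
        for f, c in data:
            theta_f = softmax(beta[f])
            for c2 in (0, 1):
                grad[f][c2] += (1.0 if c2 == c else 0.0) - theta_f[c2]
        for f in (0, 1):
            for c2 in (0, 1):
                beta[f][c2] += lr * grad[f][c2]   # ascend on LCL
    return [softmax(beta[f]) for f in (0, 1)]

# With complete data and this structure, the LCL optimum matches the observed
# frequencies: P(C=1 | F=0) -> 3/4, P(C=1 | F=1) -> 1/4.
data = [(0, 1)] * 3 + [(0, 0)] + [(1, 0)] * 3 + [(1, 1)]
theta = ilq_sketch(data)
print(round(theta[0][1], 2), round(theta[1][1], 2))  # 0.75 0.25
```

The softmax keeps each row non-negative and summing to 1 without explicit constraints, which is exactly why the slide introduces β; with hidden family members or longer evidence chains the B(·|·) terms require inference, as in the full derivative.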

Q3: How to Learn a BN MultiClassifier?

Approach 1: Minimize error ⟹ Maximize Conditional Likelihood
  (In)Complete data: ILQ

Approach 2: Fit to data ⟹ Maximize Likelihood
  Complete data: Observed Frequency Estimate (OFE)
  Incomplete data: EM / APN

Empirical Studies

Two different objectives ⟹ 2 learning algorithms:
  Maximize Conditional Likelihood: ILQ
  Maximize Likelihood: APN

Two different approaches to multiple classes:
  1 copy of structure
  k copies of structure
  k naïve-bayes

Several “datasets”: Alarm, Insurance, …

Error: “0/1” CE; MSE(Θ) = Σ_i [ P_true(qi | ei) − P_Θ(qi | ei) ]²
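The MSE measure above, as a small sketch (the probability tables below are hypothetical values, only to show the bookkeeping):

```python
# Squared gap between true and learned conditional probabilities,
# summed over the evaluation queries, as in MSE(Theta) above.

def mse(p_true, p_model, queries):
    """queries: list of (q_i, e_i) keys into the two probability tables."""
    return sum((p_true[qe] - p_model[qe]) ** 2 for qe in queries)

# Hypothetical values for two queries:
p_true  = {("hep", "jaun"): 0.30, ("flu", "cough"): 0.60}
p_model = {("hep", "jaun"): 0.25, ("flu", "cough"): 0.70}
print(round(mse(p_true, p_model, list(p_true)), 4))  # 0.0025 + 0.01 = 0.0125
```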

1- vs k- Structures

[Figure: (top) one structure over {Cancer, Menin, Gender, Age, Smoke, Height, Btest}, trained on all labeled queries; (bottom) k copies of the structure, where the Menin-copy is trained only on the Menin-labeled queries and the Cancer-copy only on the Cancer-labeled queries]

Alarm Belief Net: 37 vars, 46 links, 505 parameters

Empirical Study I: Alarm

Query distribution [HC’91] says, typically:
  8 vars Q ⊆ N appear as query
  16 vars E ⊆ N appear as evidence

  Select Q ∈ Q uniformly
  Use same set of 7 evidence variables E ⊆ E
  Assign value e for E, based on P_alarm(E=e)
  Find “value” v based on P_alarm(Q=v | E=e)

Each run uses m such queries, m = 5, 10, …, 100, …
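The query-generation loop can be sketched on a toy two-variable distribution instead of Alarm (the names and the toy table are illustrative only): sample evidence values from the true distribution, then label the query with the most likely value under P_true(Q | E=e), the talk’s q*.

```python
import random

# Toy stand-in for P_alarm: a joint over binary Q with one binary evidence E.
P_TRUE = {
    (0, 0): 0.4, (0, 1): 0.1,
    (1, 0): 0.2, (1, 1): 0.3,
}

def sample_labeled_query(rng):
    # Assign value e for E, based on its true marginal ...
    p_e0 = P_TRUE[(0, 0)] + P_TRUE[(1, 0)]
    e = 0 if rng.random() < p_e0 else 1
    # ... then find "value" v based on P_true(Q=v | E=e).
    v = max((0, 1), key=lambda q: P_TRUE[(q, e)])
    return ("Q", {"E": e}, v)

rng = random.Random(0)
queries = [sample_labeled_query(rng) for _ in range(5)]
print(len(queries))  # 5 labeled queries, one run with m=5
```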

Results (Alarm; ILQ; Small Sample): [CE and MSE plots]

Results (Alarm; ILQ; Large Sample): [CE and MSE plots]

Comments on Alarm Results

For small sample size:
  “ILQ, 1 structure” better than “ILQ, k structures”

For large sample size:
  “ILQ, 1 structure” ≈ “ILQ, k structures”
  ILQ-k has more parameters to fit, but … lots of data

APN ok, but much slower (did not converge within bounds)

Empirical Study II: Insurance

Insurance Belief Net (simplified version):
  27 vars (3 query, 8 evidence), 560 parameters

Query distribution:
  Select 1 query variable randomly from the 3
  Use all 8 evidence variables
  …

Results (Insurance; ILQ): [CE and MSE plots]

Summary of Results

Learning parameters for a given structure, to minimize CE_D(Θ) or MSE_D(Θ):

Correct structure:
  Small number of samples: ILQ-1 (APN-1) win (over ILQ-k, APN-k)
  Large number of samples: ILQ-k ≈ ILQ-1 win (over APN-1, APN-k)

Incorrect structure (naïve-bayes): ILQ wins

Future Work

Best algorithm for learning an optimal BN?
  Actually optimize CE-Err (not LCL)
  Learn STRUCTURE as well as CPtables
  Special cases where ILQ is efficient (complete data?)

Other “learning environments”
  Other prior knowledge: Query Forms
  Explicitly-Labeled Queries

Better understanding of sample complexity, without the “…” restriction

Related Work

Like (ML) classification, but…
  probabilities, not discrete
  different class variables, different evidence sets… see Caruana

“Learning to Reason” [KR’95]
  “do well on tasks that will be encountered”
  … but different performance system

Sample Complexity [FY, Hoeffgen] … different learning model

Computational Complexity [Kilian/Naor 95]
  NP-hard to find ANY distribution with minimal L1-error wrt unconditional queries; here: BN, L2, conditional


Take Home Msgs

To max performance:
  use Conditional Likelihood (ILQ), not Likelihood (APN/EM, OFE)
  especially if structure wrong, small sample, …
  (… controversial…)

To deal with MultiClassifiers:
  use 1 structure, not k
  if small sample: 1-structure gives better performance
  if large sample: same performance, … but 1-structure is smaller
  (… yes, of course…)

Relation to Attrib vs Relation:
  not “1 example for many classes of queries”
  but “1 example for 1 class of queries, BUT IN ONE COMMON STRUCTURE”

Contributions

Appropriate model for learning:
  extends standard learning environments
  labeled queries, with different class variables

Sample Complexity: need “few” labeled queries
Computational Complexity: NP-hard
Effective Algorithm: gradient descent
Empirical Evidence: works well!
  http://www.cs.ualberta.ca/~greiner/BN-results.html

⟹ Learn a MultiClassifier that works well in practice

Questions?

Backup slides: LCL vs LL (does the difference matter?) · ILQ vs APN · Query Forms

See also http://www.cs.ualberta.ca/~greiner/BN-results.html

Learning Model

Most belief net learners try to maximize LIKELIHOOD:
  LL_D(Θ) = Σ_{x ∈ D} log P_Θ(x)
… as the goal is “fit to data” D

Our goal is different: we want to minimize error, over the distribution of queries.

If never asked “What is p(jaun | btest−)?”,
we don’t care if BN(jaun | btest−) ≠ p(jaun | btest−)

Different Optimization

LL_D(Θ) = Σ_{[q*,e] ∈ D} log P_Θ(Q=q* | E=e) + Σ_{[q*,e] ∈ D} log P_Θ(E=e)
        = LCL_D(Θ) + Σ_{[q*,e] ∈ D} log P_Θ(E=e)

As Σ_{[q*,e] ∈ D} log P_Θ(E=e) is non-trivial,

  Θ_LL = argmax_Θ { LL_D(Θ) }  ≠  Θ_LCL = argmax_Θ { LCL_D(Θ) }

Discriminant analysis: Maximize Overall Likelihood vs Minimize Predictive Error

To find Θ_LCL: NP-hard, so… ILQ
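The decomposition above is just log P(q,e) = log P(q|e) + log P(e) per labeled query; a numeric check on the slides’ H → B, H → J net (enumeration inference, illustrative code):

```python
import itertools, math

# Verify LL-per-query = LCL-per-query + log P(E=e) on the slides' tiny net.

P_H = {1: 0.05, 0: 0.95}
P_B = {1: 0.95, 0: 0.03}   # P(B=1 | H=h)
P_J = {1: 0.8, 0: 0.3}     # P(J=1 | H=h)

def joint(h, b, j):
    pb = P_B[h] if b == 1 else 1 - P_B[h]
    pj = P_J[h] if j == 1 else 1 - P_J[h]
    return P_H[h] * pb * pj

def prob(assign):
    # Marginal probability of a partial assignment over (H, B, J).
    total = 0.0
    for h, b, j in itertools.product([0, 1], repeat=3):
        world = {"H": h, "B": b, "J": j}
        if all(world[v] == val for v, val in assign.items()):
            total += joint(h, b, j)
    return total

# Labeled query: Q = H with answer q* = 0, evidence E = {J=0, B=1}.
ll  = math.log(prob({"H": 0, "J": 0, "B": 1}))
lcl = math.log(prob({"H": 0, "J": 0, "B": 1}) / prob({"J": 0, "B": 1}))
print(abs(ll - (lcl + math.log(prob({"J": 0, "B": 1})))) < 1e-12)  # True
```

Maximizing LL therefore also spends capacity fitting the Σ log P(E=e) term, which the query distribution never rewards; that is the argument for LCL.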

Why Alternative Model?

A belief net is both
  a representation for a distribution
  a system for answering queries

Suppose BN must answer “What is p(hep | jaun, btest−)?”
but not “What is p(jaun | btest−)?”

So… BN is good if BN(hep | jaun, btest−) = p(hep | jaun, btest−),
even if BN(jaun | btest−) ≠ p(jaun | btest−)

Query Distr vs Tuple Distr

Distribution over tuples p(·):
  p(hep, jaun, btest−, …) = 0.07
  p(flu, cough, ~headache, …) = 0.43

Distribution over queries sq(q) = Prob(q asked):
  ask “What is p(hep | jaun, btest−)?”        30%
  ask “What is p(flu | cough, ~headache)?”    22%

Can be uncorrelated. E.g.:
  Prob[asking Cancer] = sq(“cancer”) = 100%
  even if Pr[Cancer] = p(cancer) = 0

Query Distr ≠ Tuple Distr

Suppose a GP asks all ADULT FEMALE patients “Pregnant?”

Pregnant  Adult  Gender
+         +      F
−         +      F
+         +      F

Data ⟹ P(Preg | Adult, Gender=F) = 2/3
Is this really the TUPLE distribution? P(Gender=F) = 1?
NO: it only reflects the questions asked!
Provides info re: P(Preg | Adult=+, Gender=F), but NOT about P(Adult), …
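A small simulation of the GP example, under assumed population rates (P(Gender=F)=0.5, P(Adult)=0.8, P(Preg | Adult, F)=2/3 are illustrative numbers, not from the talk): tuples are only recorded when the question is asked, i.e., for adult females.

```python
import random

# The recorded sample estimates the asked conditional fine, yet says
# nothing about the tuple marginals such as P(Gender).

rng = random.Random(1)

def draw_patient():
    gender = "F" if rng.random() < 0.5 else "M"
    adult = rng.random() < 0.8
    preg = gender == "F" and adult and rng.random() < 2 / 3
    return preg, adult, gender

recorded = [t for t in (draw_patient() for _ in range(10000))
            if t[1] and t[2] == "F"]                 # only asked tuples survive

frac_female = sum(t[2] == "F" for t in recorded) / len(recorded)
frac_preg = sum(t[0] for t in recorded) / len(recorded)
print(frac_female)                   # 1.0 -- sample says nothing about P(Gender)
print(abs(frac_preg - 2/3) < 0.05)   # True -- but estimates P(Preg | Adult, F)
```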

Query Distr ≠ Tuple Distr

Query probability Prob([Q, E=e] asked) is independent of tuple probability.
(Note: the query’s value q* IS based on P(Q=q | E=e).)

≠ P(Q=q, E=e):
  Could always ask about a 0-probability situation:
  always ask “[Pregnant=t, Gender=Male]” ⟹ sq(Pregnant=t, Gender=Male) = 1, but P(Pregnant=t, Gender=Male) = 0

≠ P(E=e):
  If sq(Q, E=ei) tracked P(E=ei), then P(Gender=Female) = P(Gender=Male) would force sq(Pregnant; Gender=Female) = sq(Pregnant; Gender=Male); it need not.

Does it matter?

If all queries involve the same query variable, it is ok to pretend sq(·) ~ p(·),
as no one ever asks about the EVIDENCE DISTRIBUTION.

E.g., in

  Pregnant  Adult  Gender
  +         +      F
  −         +      F
  +         +      F

as no one asks “What is P(Gender)?”, it doesn’t matter…

But it is problematic in a MultiClassifier, if there are other queries, e.g. sq(Gender; ·).

ILQ (cond. likelihood) vs APN (likelihood)

Wrong structure: ILQ better than APN/EM. Experiments:
  artificial data
  using Naïve Bayes (UCI)

Correct structure: ILQ often better than OFE, APN/EM. Experiments.

Discriminant analysis: Maximize Overall Likelihood vs Minimize Predictive Error

Wrong Structure I

m-th target distribution: “TAN” with E1 → E2 → … → Em

[Figure: the target TAN structure over Q, E1 … Ek with chain E1 → … → Em, vs. the wrong structure (Naïve Bayes) used for learning]

Results… (k=5, m=0..4)

Results – Wrong Structure: [CE and MSE plots]

Wrong Structure II

Learn NaiveBayes for REAL-world datasets: Chess (FLARE, DNA)

CE: [plot]
MSE: [plot]

Correct Structure

If structure is correct, ILQ, OFE / (APN, EM) should all converge to optimal.

Which is more efficient? Depends…

“Correct” Structure

Fill in parameters for the CORRECT STRUCTURE, for REAL-world datasets:
Chess (FLARE, DNA); structure learned using PowerConstructor

CE: [plot]

Summary of Results (MSE):

Dataset   ILQ      OFE / EM / APN
Flare     0.1756   0.198
DNA       0.0489   0.0557
Chess     0.0558   0.1423
Vote      0.0345   0.1057 / 0.1057

MSE results: vs OFE if data complete; vs APN/EM if data incomplete

Query Forms

MD asks “What is P(D | A, B)?” 20% of the time:
  sq(D=t; A, B) = 0.2
  sq(D=t; A, B) = Σ_i sq(D=t; A=ai, B=bi)

Challenge #1: Subdistribution
  sq(D; A=ai, B=bi) = sq(D; A, B) · P(A=ai, B=bi | “asked a D|A,B question”)
  Perhaps it is uniform? …or = P(A=ai, B=bi)? NO!
  sq(Pregnant | Gender) = 1.0, yet sq(Preg | Gend=M) = sq(Preg | Gend=F)??

Challenge #2: Need 2^k labels!
  … but in the UQT model, perhaps not needed…
  In UQT, may need to SHRINK the network! … but query FORMS may be sufficient!

Efficiency of ILQ

For each query p(q|e), for each θ_{c|f}:
  if (q,e) is d-separated from (c,f): d LCL(q|e) / d θ_{c|f} = 0, skip!
  Saves 10–90% of the work!

Current timing (PIII-500):
  ALARM: 100 millisec / query (each iteration)
  INSURANCE: 30 millisec / query (each iteration)

  d LCL(q|e) / d β_{c|f} = [ B(c,f | q,e) − θ_{c|f} B(f | q,e) ] − [ B(c,f | e) − θ_{c|f} B(f | e) ]

Results (Alarm; ILQ-1/APN-1; Large Sample): [CE and MSE plots]