Multi-label learning from batch and streaming data
Jesse Read — Télécom ParisTech / École Polytechnique
Summer School on Mining Big and Complex Data, 5 September 2016 — Ohrid, Macedonia


Page 1:

Multi-label learning from batch and streaming data

Jesse Read

Télécom ParisTech

École Polytechnique

Summer School on Mining Big and Complex Data, 5 September 2016 — Ohrid, Macedonia

Page 2:

Introduction

x = (an image)

Binary classification

y ∈ {sunset, non sunset}

Page 3:

Introduction

x = (an image)

Multi-class classification

y ∈ {sunset, people, foliage, beach, urban}

Page 4:

Introduction

x = (an image)

Multi-label classification

y ⊆ {sunset, people, foliage, beach, urban}, i.e., y ∈ {0, 1}^5, e.g., y = [1, 0, 1, 0, 0]

i.e., multiple labels per instance instead of a single label.

Page 5:

Introduction

        K = 2         K > 2
L = 1   binary        multi-class
L > 1   multi-label   multi-output†

† also known as multi-target, multi-dimensional.

Figure: For L target variables (labels), each taking one of K values.

multi-output can be cast to multi-label, just as multi-class can be cast to binary

the set of labels (L) is predefined (in contrast to tagging/keyword assignment)

Page 6:

Introduction

Table: Academic articles containing the phrase “multi-label classification” (Google Scholar, 2016)

year        in text   in title
1996-2000   23        1
2001-2005   188       18
2006-2010   1470      164
2011-2015   5280      629

Page 7:

Single-label vs. Multi-label

Table: Single-label Y ∈ {0, 1}

X1   X2    X3   X4   X5   Y
1    0.1   3    1    0    0
0    0.9   1    0    1    1
0    0.0   1    1    0    0
1    0.8   2    0    1    1
1    0.0   2    0    1    0
0    0.0   3    1    1    ?

Table: Multi-label Y ⊆ {λ1, . . . , λL}

X1   X2    X3   X4   X5   Y
1    0.1   3    1    0    {λ2, λ3}
0    0.9   1    0    1    {λ1}
0    0.0   1    1    0    {λ2}
1    0.8   2    0    1    {λ1, λ4}
1    0.0   2    0    1    {λ4}
0    0.0   3    1    1    ?

Page 8:

Single-label vs. Multi-label

Table: Single-label Y ∈ {0, 1}

X1   X2    X3   X4   X5   Y
1    0.1   3    1    0    0
0    0.9   1    0    1    1
0    0.0   1    1    0    0
1    0.8   2    0    1    1
1    0.0   2    0    1    0
0    0.0   3    1    1    ?

Table: Multi-label [Y1, . . . , YL] ∈ {0, 1}^L

X1   X2    X3   X4   X5   Y1   Y2   Y3   Y4
1    0.1   3    1    0    0    1    1    0
0    0.9   1    0    1    1    0    0    0
0    0.0   1    1    0    0    1    0    0
1    0.8   2    0    1    1    0    0    1
1    0.0   2    0    1    0    0    0    1
0    0.0   3    1    1    ?    ?    ?    ?

Page 9:

Outline

1 Introduction

2 Applications

3 Methods

4 Label Dependence

5 Multi-label Classification in Data Streams

Page 10:

Text Categorization and Tag Recommendation

For example, the IMDb dataset: textual movie plot summaries associated with genres (labels). Also: Bookmarks, Bibtex, del.icio.us datasets.

Page 11:

Text Categorization and Tag Recommendation

For example, the IMDb dataset: textual movie plot summaries associated with genres (labels). Also: Bookmarks, Bibtex, del.icio.us datasets.

Feature columns correspond to words (abandoned, accident, . . . , violent, wedding); label columns to genres (horror, romance, . . . , comedy, action).

i        X1  X2  . . .  X1000  X1001  Y1  Y2  . . .  Y27  Y28
1        1   0   . . .  0      1      0   1   . . .  0    0
2        0   1   . . .  1      0      1   0   . . .  0    0
3        0   0   . . .  0      1      0   1   . . .  0    0
4        1   1   . . .  0      1      1   0   . . .  0    1
5        1   1   . . .  0      1      0   1   . . .  0    1
...
120919   1   1   . . .  0      0      0   0   . . .  0    1

Page 12:

Labelling Images

Images are labelled to indicate

multiple concepts

multiple objects

multiple people

e.g., associating scenes with concepts ⊆ {beach, sunset, foliage, field, mountain, urban}

Page 13:

Labelling Audio

For example, labelling music with emotions, concepts, etc.

amazed-surprised

happy-pleased

relaxing-calm

quiet-still

sad-lonely

angry-aggressive

(figure: matrix over the emotion labels amazed, happy, relaxing, quiet, sad, angry)

Page 14:

Related Tasks

multi-output¹ classification: outputs are nominal

X1 X2 X3 X4 X5   rank   gender   group
x1 x2 x3 x4 x5   1      M        2
x1 x2 x3 x4 x5   4      F        2
x1 x2 x3 x4 x5   2      M        1

multi-output regression: outputs are real-valued

X1 X2 X3 X4 X5   price   age   percent
x1 x2 x3 x4 x5   37.00   25    0.88
x1 x2 x3 x4 x5   22.88   22    0.22
x1 x2 x3 x4 x5   88.23   11    0.77

label ranking, i.e., preference learning

λ3 ≻ λ1 ≻ λ4 ≻ . . . ≻ λ2

¹ aka multi-target, multi-dimensional

Page 15:

Related Areas

multi-task learning: multiple tasks, shared representation

sequential learning: predict across time indices instead of across label indices; each label may have a different input

y1 y2 y3 y4

x1 x2 x3 x4

structured output prediction: assume a particular structure among outputs, e.g., pixels, a hierarchy

(figures: a grid of nodes, e.g., pixels, and a tree hierarchy)

Page 16:

Streaming Multi-label Data

Many advanced applications must deal with data streams:

Data arrives continuously, potentially infinitely

Prediction must be made immediately

Expect concept drift

For example,

Demand prediction

Intrusion detection

Pollution detection

Page 17:

Demand Prediction

Outputs (labels) represent the demand at multiple points.

Figure: Stops in the greater Helsinki region. The Kutsuplus taxi service could be called to any of these.

We are interested in predicting, for each label [y1, . . . , yL],

p(yj = 1|x) • probability of demand at j-th node

Page 18:

Localization and Tracking

Outputs represent points in space which may contain an object (yj = 1) or not (yj = 0). Observations are given as x.


Figure: Modelled on a real-world scenario; a room with a single light source and a number of light sensors.

We are interested in predicting, for each label [y1, . . . , yL],

yj = 1 • if j-th tile occupied

Page 19:

Route/Destination Forecasting

Personal nodes of a traveller and predicted trajectory

L • number of geographic points of interest

x • observed data (e.g., GPS, sensor activity, time of day)

p(yj = 1|x) • probability an object is present at the j-th node

{xi, yi}, i = 1, . . . , N • training data

Page 20:

Outline

1 Introduction

2 Applications

3 Methods

4 Label Dependence

5 Multi-label Classification in Data Streams

Page 21:

Multi-label Classification

(graphical model: x → y1, y2, y3, y4, each label predicted independently)

yj = hj(x) = argmax_{yj ∈ {0,1}} p(yj | x) • for each index j = 1, . . . , L

and then,

y = h(x) = [y1, . . . , y4]
         = [ argmax_{y1 ∈ {0,1}} p(y1 | x), · · · , argmax_{y4 ∈ {0,1}} p(y4 | x) ]
         = [ f1(x), · · · , f4(x) ] = f(W⊤x)

This is the Binary Relevance method (BR).
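As a concrete illustration, a minimal BR sketch in Python, assuming scikit-learn is available and logistic regression as the base classifier (the class name and structure are illustrative, not from the slides):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class BinaryRelevance:
    """BR: one independent binary classifier per label."""

    def fit(self, X, Y):
        # Y is an (n, L) binary label matrix; train one model per column
        self.models = [LogisticRegression(max_iter=1000).fit(X, Y[:, j])
                       for j in range(Y.shape[1])]
        return self

    def predict(self, X):
        # stack the L per-label predictions into an (n, L) matrix
        return np.column_stack([m.predict(X) for m in self.models])
```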

Page 22:

BR Transformation

1 Transform the dataset . . .

X      Y1  Y2  Y3  Y4
x(1)   0   1   1   0
x(2)   1   0   0   0
x(3)   0   1   0   0
x(4)   1   0   0   1
x(5)   0   0   0   1

. . . into L separate binary problems (one for each label):

X      Y1        X      Y2        X      Y3        X      Y4
x(1)   0         x(1)   1         x(1)   1         x(1)   0
x(2)   1         x(2)   0         x(2)   0         x(2)   0
x(3)   0         x(3)   1         x(3)   0         x(3)   0
x(4)   1         x(4)   0         x(4)   0         x(4)   1
x(5)   0         x(5)   0         x(5)   0         x(5)   1

2 and train with any off-the-shelf binary base classifier.

Page 23:

Why Not Binary Relevance?

BR ignores label dependence, i.e.,

p(y | x) = ∏_{j=1}^{L} p(yj | x)

which may not always hold!

Example (Film Genre Classification)

p(yromance | x) ≠ p(yromance | x, yhorror)

Page 24:

Why Not Binary Relevance?

BR ignores label dependence, i.e.,

p(y | x) = ∏_{j=1}^{L} p(yj | x)

which may not always hold!

Table: Average predictive performance (5-fold CV, EXACT MATCH)

Dataset   L     BR     MCC
Music     6     0.30   0.37
Scene     6     0.54   0.68
Yeast     14    0.14   0.23
Genbase   27    0.94   0.96
Medical   45    0.58   0.62
Enron     53    0.07   0.09
Reuters   101   0.29   0.37

Page 25:

Classifier Chains

Modelling label dependence,

(chain: x → y1 → y2 → y3 → y4)

p(y | x) = p(y1 | x) ∏_{j=2}^{L} p(yj | x, y1, . . . , yj−1)

and,

y = argmax_{y ∈ {0,1}^L} p(y | x)

Page 26:

Bayes Optimal CC

Example

(figure: probability tree over label assignments, with edge weights p(yj | x, y1, . . . , yj−1))

1 p(y = [0, 0, 0]) = 0.040
2 p(y = [0, 0, 1]) = 0.040
3 p(y = [0, 1, 0]) = 0.288
4 . . .
6 p(y = [1, 0, 1]) = 0.252
7 . . .
8 p(y = [1, 1, 1]) = 0.090

return argmax_y p(y | x)

The search space of {0, 1}^L paths is too large

Page 27:

CC Transformation

Similar to BR: make L binary problems, but include the previous predictions as feature attributes,

X      Y1
x(1)   0
x(2)   1
x(3)   0
x(4)   1
x(5)   0

X      Y1   Y2
x(1)   0    1
x(2)   1    0
x(3)   0    1
x(4)   1    0
x(5)   0    0

X      Y1   Y2   Y3
x(1)   0    1    1
x(2)   1    0    0
x(3)   0    1    0
x(4)   1    0    0
x(5)   0    0    0

X      Y1   Y2   Y3   Y4
x(1)   0    1    1    0
x(2)   1    0    0    0
x(3)   0    1    0    0
x(4)   1    0    0    1
x(5)   0    0    0    1

and, again, apply any classifier (not necessarily a probabilistic one)!

Page 28:

Greedy CC

(chain: x → y1 → y2 → y3 → y4)

L classifiers for L labels. For a test instance x, classify

1 y1 = h1(x)
2 y2 = h2(x, y1)
3 y3 = h3(x, y1, y2)
4 y4 = h4(x, y1, y2, y3)

and return y = [y1, . . . , yL]
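A minimal sketch of greedy chain training and inference (Python, with scikit-learn's logistic regression as an assumed base learner; names are illustrative). It uses the true labels y1, . . . , yj−1 as extra features at training time, and the predicted ones at test time:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class GreedyClassifierChain:
    """CC: label j sees x plus the labels 1..j-1 earlier in the chain."""

    def fit(self, X, Y):
        self.models = []
        for j in range(Y.shape[1]):
            # augment the features with the true previous labels
            Xj = np.hstack([X, Y[:, :j]])
            self.models.append(LogisticRegression(max_iter=1000).fit(Xj, Y[:, j]))
        return self

    def predict(self, X):
        preds = np.zeros((X.shape[0], len(self.models)), dtype=int)
        for j, m in enumerate(self.models):
            # feed the predicted previous labels down the chain
            preds[:, j] = m.predict(np.hstack([X, preds[:, :j]]))
        return preds
```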

Page 29:

Example

(figure: probability tree; no labels assigned yet)

y = h(x) = [?, ?, ?]

(chain: x → y1 → y2 → y3)

1 y1 = h1(x) = argmax_{y1} p(y1 | x) = 1
2 y2 = h2(x, y1) = . . . = 0
3 y3 = h3(x, y1, y2) = . . . = 1

Improves over BR; similar build time (if L < D);

able to use any off-the-shelf classifier for hj; parallelizable

But, errors may be propagated down the chain

Page 30:

Example

(figure: probability tree; y1 = 1 chosen with p(y1 = 1 | x) = 0.6)

y = h(x) = [1, ?, ?]

(chain: x → y1 → y2 → y3)

1 y1 = h1(x) = argmax_{y1} p(y1 | x) = 1
2 y2 = h2(x, y1) = . . . = 0
3 y3 = h3(x, y1, y2) = . . . = 1

Improves over BR; similar build time (if L < D);

able to use any off-the-shelf classifier for hj; parallelizable

But, errors may be propagated down the chain

Page 31:

Example

(figure: probability tree; then y2 = 0 chosen with p = 0.7)

y = h(x) = [1, 0, ?]

(chain: x → y1 → y2 → y3)

1 y1 = h1(x) = argmax_{y1} p(y1 | x) = 1
2 y2 = h2(x, y1) = . . . = 0
3 y3 = h3(x, y1, y2) = . . . = 1

Improves over BR; similar build time (if L < D);

able to use any off-the-shelf classifier for hj; parallelizable

But, errors may be propagated down the chain

Page 32:

Example

(figure: probability tree; finally y3 = 1 chosen with p = 0.6, completing the path)

y = h(x) = [1, 0, 1]

(chain: x → y1 → y2 → y3)

1 y1 = h1(x) = argmax_{y1} p(y1 | x) = 1
2 y2 = h2(x, y1) = . . . = 0
3 y3 = h3(x, y1, y2) = . . . = 1

Improves over BR; similar build time (if L < D);

able to use any off-the-shelf classifier for hj; parallelizable

But, errors may be propagated down the chain


Page 34:

Monte-Carlo Search for CC

Example

(figure: probability tree with edge probabilities p(yj | x, y1, . . . , yj−1))

Sample T times . . .

p([1, 0, 1]) = 0.6 · 0.7 · 0.6 = 0.252
p([0, 1, 0]) = 0.4 · 0.8 · 0.9 = 0.288

return argmax_{yt} p(yt | x)

Tractable, with similar accuracy to (Bayes Optimal) PCC

Can use other search algorithms, e.g., beam search
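A sketch of the Monte-Carlo inference step; the per-label probability interface prob_fns is an assumption of this sketch, standing in for the chain's fitted models:

```python
import numpy as np

def mc_chain_inference(prob_fns, x, T=100, seed=0):
    """Sample T label paths down the chain; keep the most probable one.
    prob_fns[j](x, y_prev) should return p(y_j = 1 | x, y_1..y_{j-1})."""
    rng = np.random.default_rng(seed)
    best_y, best_p = None, -1.0
    for _ in range(T):
        y, p = [], 1.0
        for f in prob_fns:
            p1 = f(x, y)                 # p(y_j = 1 | x, previous labels)
            yj = int(rng.random() < p1)  # sample the j-th label
            p *= p1 if yj else 1.0 - p1  # accumulate the path probability
            y.append(yj)
        if p > best_p:                   # best path sampled so far
            best_y, best_p = y, p
    return best_y, best_p
```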

Page 35:

Does Label-order Matter?

(figure: two classifier chains over the same four labels, in different orders)

In theory, the models are equivalent, since

p(y | x) = p(y1 | x) p(y2 | y1, x) = p(y2 | x) p(y1 | y2, x)

but we are estimating p̂ from finite and noisy data; thus

p̂(y1 | x) p̂(y2 | y1, x) ≠ p̂(y2 | x) p̂(y1 | y2, x)

and in the greedy case,

p̂(y2 | ŷ1, x) ≈ p(y2 | y1, x), where ŷ1 = argmax_{y1} p̂(y1 | x)

Page 36:

Does Label-order Matter?

In theory, the models are equivalent, since

p(y | x) = p(y1 | x) p(y2 | y1, x) = p(y2 | x) p(y1 | y2, x)

but we are estimating p̂ from finite and noisy data; thus

p̂(y1 | x) p̂(y2 | y1, x) ≠ p̂(y2 | x) p̂(y1 | y2, x)

and in the greedy case,

p̂(y2 | ŷ1, x) ≈ p(y2 | y1, x), where ŷ1 = argmax_{y1} p̂(y1 | x)

The approximations cause high variance on account of error propagation. We can

1 reduce variance with an ensemble of classifier chains (see the sketch below)
2 search the space of chain orders (a huge space, but a little search makes a difference)
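A sketch of the ensemble idea, assuming the chains were fitted with the greedy chain sketch earlier, each on a random permutation of the label columns (the voting scheme shown is one simple choice, not the only one):

```python
import numpy as np

def ecc_predict(chains, orders, X, threshold=0.5):
    """Ensemble of classifier chains: average the votes of chains trained
    on different label orders, then threshold. chains[i] is assumed to
    have been fitted on Y[:, orders[i]]."""
    votes = np.zeros((X.shape[0], len(orders[0])))
    for chain, order in zip(chains, orders):
        pred = chain.predict(X)   # predictions come back in chain order
        votes[:, order] += pred   # map them to the original label order
    return (votes / len(chains) >= threshold).astype(int)
```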

Page 37:

Label Powerset Method (LP)

One multi-class problem (taking many values),

y1, y2, y3, y4

x

y = argmax_{y ∈ Y} p(y | x) • where Y ⊆ {0, 1}^L

Each value is a label vector, 2^L in total, but typically only those that occur in the training set are used (in practice, |Y| ≤ the size of the training set, and |Y| ≪ 2^L).

Page 38:

Label Powerset Method (LP)

1 Transform the dataset . . .

X      Y1  Y2  Y3  Y4
x(1)   0   1   1   0
x(2)   1   0   0   0
x(3)   0   1   1   0
x(4)   1   0   0   1
x(5)   0   0   0   1

. . . into a multi-class problem, taking up to 2^L possible values:

X      Y ∈ {0, 1}^L
x(1)   0110
x(2)   1000
x(3)   0110
x(4)   1001
x(5)   0001

2 . . . and train any off-the-shelf multi-class classifier.
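A minimal LP sketch (Python with scikit-learn assumed; note it can only ever predict label combinations seen in training, which is one of the issues raised on the next slide):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

class LabelPowerset:
    """LP: each distinct label vector becomes one class of a
    multi-class problem."""

    def fit(self, X, Y):
        # map each distinct row of Y to a class id
        self.classes, inv = np.unique(Y, axis=0, return_inverse=True)
        self.model = LogisticRegression(max_iter=1000).fit(X, inv.ravel())
        return self

    def predict(self, X):
        # map predicted class ids back to label vectors
        return self.classes[self.model.predict(X)]
```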

Page 39:

Issues with LP

complexity (up to 2^L combinations)

imbalance: few examples per class label

overfitting: how to predict new value?

Example

In the Enron dataset, 44% of labelsets are unique (to a single training example or test instance). In the del.icio.us dataset, 98% are unique.

Page 40:

Meta Labels

Improving the label-powerset approach:

decomposition of the label set into M subsets of size k (k < L)

pruning, such that, e.g., Y{1,2} ∈ {[0, 0], [0, 1], [1, 1]}

combine together with a random-subspace method and a voting scheme

(graph: X1, X2, X3 → meta-labels Y{1,2}, Y{2,3} → Y1, Y2, Y3)

Page 41:

Meta Labels

Improving the label-powerset approach:

decomposition of the label set into M subsets of size k (k < L)

pruning, such that, e.g., Y{1,2} ∈ {[0, 0], [0, 1], [1, 1]}

combine together with a random-subspace method and a voting scheme

Method                  Inference Complexity
Label Powerset          O(2^L · D)
Pruned Sets             O(P · D)
Decomposition / RAkEL   O(M · 2^k · D)
Meta Labels             O(M · P · D′)

where P < 2^L and P < 2^k, and D′ < D.

Page 42:

Summary of Methods

Two views of a multi-label problem of L labels:
1 L binary problems
2 a multi-class problem with up to 2^L classes

Problem Transformation:
1 Transform the data into subproblems (binary or multi-class)
2 Apply some off-the-shelf base classifier

or, Algorithm Adaptation:
1 Take a suitable single-label classifier (kNN, neural networks, decision trees, . . . )
2 Adapt it (if necessary) for multi-label classification

Page 43:

Outline

1 Introduction

2 Applications

3 Methods

4 Label Dependence

5 Multi-label Classification in Data Streams

Page 44:

Label Dependence in MLC

Common approach: present methods to

1 measure label dependence
2 find a structure that best represents it

and then apply classifiers, and compare the results to BR.

Example

(graphs: labels linked together in a chain/tree, vs labels grouped into subsets such as y{1,2,4} and y{2,3,4})

Link particular labels (nodes) together (CC-based methods)

Form particular label subsets (LP-based methods)

Page 45:

Label Dependence in MLC

Common approach: present methods to

1 measure label dependence
2 find a structure that best represents it

and then apply classifiers, and compare the results to BR.

Problem
Measuring label dependence is expensive, and models built on it often do not improve over models built on random dependence!

Problem
For some metrics (such as Hamming loss / label accuracy), knowledge of label dependence is theoretically unnecessary!

Page 46:

Marginal label dependence

Marginal dependence

When the joint is not the product of the marginals, i.e.,

p(y2) ≠ p(y2 | y1)
p(y1) p(y2) ≠ p(y1, y2)

(graph: Y1 — Y2)

Estimate from co-occurrence frequencies in training data
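For example, the pairwise mutual information behind heatmaps like those on the next slides can be estimated directly from the label matrix (a sketch; the smoothing constant eps is an implementation detail, not from the slides):

```python
import numpy as np

def label_mutual_information(Y, j, k, eps=1e-9):
    """Mutual information between labels j and k, estimated from
    empirical (co-)occurrence frequencies in an (n, L) 0/1 matrix Y."""
    mi = 0.0
    for a in (0, 1):
        for b in (0, 1):
            p_ab = np.mean((Y[:, j] == a) & (Y[:, k] == b)) + eps
            p_a = np.mean(Y[:, j] == a) + eps
            p_b = np.mean(Y[:, k] == b) + eps
            mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi
```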

Page 47:

Marginal label dependence

Example

(figure: matrix over the labels amazed, happy, relaxing, quiet, sad, angry)

Figure: Music dataset - Mutual Information

Page 48:

Marginal label dependence

Example

(figure: matrix over the labels beach, sunset, foliage, field, mountain, urban)

Figure: Scene dataset - Mutual Information

Page 49:

Marginal label dependence

Marginal dependence

When the joint is not the product of the marginals, i.e.,

p(y2) ≠ p(y2 | y1)
p(y1) p(y2) ≠ p(y1, y2)

(graph: Y1 — Y2)

Estimate from co-occurrence frequencies in training data

Used for regularization/constraints:
1 y = h(x) makes a prediction
2 y′ = g(y) regularizes the prediction

Page 50:

Conditional label dependence

But at classification time, we condition on the input!

Conditional dependence

. . . conditioned on input observation x.

p(y2 | y1, x) ≠ p(y2 | x)

(graph: X → Y1, Y2, with a Y1 — Y2 link)

Have to build and measure models

Indication of conditional dependence if

the performance of LP/CC exceeds that of BR

errors among the binary models are correlated

But what does this mean?

Page 51:

Conditional label dependence

But at classification time, we condition on the input!

Conditional independence

. . . conditioned on input observation x. For example,

p(y2) ≠ p(y2 | y1) but p(y2 | x) = p(y2 | y1, x)

(graphs: X → Y1, Y2 with a Y1 — Y2 link, vs X → Y1, Y2 without one)

Have to build and measure models

Indication of conditional dependence if

the performance of LP/CC exceeds that of BR

errors among the binary models are correlated

But what does this mean?

Page 52:

The LOGICAL Problem

Example (The LOGICAL Toy Problem)

X1   X2   Y1 (OR)   Y2 (AND)   Y3 (XOR)
0    0    0         0          0
1    0    1         0          1
0    1    1         0          1
1    1    1         1          0

Each label is a logical operation (independent of the others!)
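The problem is small enough to reproduce in a few lines (scikit-learn assumed); a linear BR model fits OR and AND but cannot fit XOR even on its own training data, which is the point of the slides that follow:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# the LOGICAL toy problem: each label is a logical function of the inputs
X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
Y = np.column_stack([X[:, 0] | X[:, 1],   # OR
                     X[:, 0] & X[:, 1],   # AND
                     X[:, 0] ^ X[:, 1]])  # XOR

for name, y in zip(["OR", "AND", "XOR"], Y.T):
    acc = LogisticRegression(max_iter=1000).fit(X, y).score(X, y)
    print(name, acc)   # XOR training accuracy stays below 1.0
```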

Page 53:

The LOGICAL Problem

(graphs: BR with x → or, and, xor independently; CC with a chain through or, and, xor; LP with one joint node (or, and, xor))

Figure: BR (left), CC (middle), LP (right)

Table: The LOGICAL problem, base classifier logistic regression.

Metric          BR     CC     LP
HAMMING SCORE   0.83   1.00   1.00
EXACT MATCH     0.50   1.00   1.00

Dependence is introduced by an inadequate model! Dependence depends on the model.

Page 54:

The LOGICAL Problem

(plots: decision boundaries of the models Y_OR | x1, x2; Y_AND | x1, x2; and Y_XOR | x1, x2 in the (x1, x2) plane, under the BR structure x → or, and, xor)

Figure: Binary Relevance (BR): a linear decision boundary (solid line, estimated with logistic regression) is not viable for the YXOR label

Page 55:

Solution via Structure

(plots: decision boundaries of Y_OR | x1, x2 and Y_AND | x1, x2 in the plane, and of Y_XOR | y_OR, x1, x2 in three dimensions, under the chain structure x → or → and → xor)

Figure: Classifier chains (CC): a linear model is now applicable to YXOR

Page 56:

Solution via Multi-class Decomposition

(plot: decision regions of the joint model Y_{OR,AND,XOR} | x1, x2, with classes [0,0,0], [1,0,1], [1,1,0])

Figure: Label Powerset (LP): solvable with a one-vs-one multi-class decomposition for any (e.g., linear) base classifier.

Page 57:

Solution via Conditional Independence

(plots: decision boundaries of Y_OR, Y_AND and Y_XOR in a transformed space z = φ(x), each now linearly separable)

Figure: Solution via latent structure (e.g., random RBF) to a new input space z, creating independence: p(yXOR | z, yOR, yAND) ≈ p(yXOR | z).

Page 58:

Solution via Suitable Base-classifier

x1 = 1?
├─ yes: x2 = 1?
│       ├─ yes: [1, 1, 0]
│       └─ no:  [0, 1, 1]
└─ no:  x2 = 1?
        ├─ yes: [0, 1, 1]
        └─ no:  [0, 0, 0]

Figure: Solution via a non-linear classifier (e.g., a decision tree). Leaves hold examples, where y = [yAND, yOR, yXOR]

Page 59:

Detecting Dependence

Conditional label dependence and the choice of base model are inseparable.

yj = hj(x) + εj

yk = hk(x) + εk

(plots: scatter of the error vector ε = [εOR, εXOR] for two base models)

Figure: Errors from logistic regression (left) and decision tree (right).

Only dependence is captured!

Page 60:

A fresh look at Problem Transformation

(graphs: X1, X2, X3 → Y1, Y2, Y3 directly, vs via meta-labels Y{1,2}, Y{2,3})

Figure: Standard methods can be viewed as (‘deep’/cascaded) basis functions on the label space.

Page 61:

Label Dependence: Summary

Marginal dependence for regularization

Conditional dependence
. . . depends on the model
. . . may be introduced

Should consider together:
base classifier
label structure
inner-layer structure

An open problem

Much existing research is relevant (latent-variable models, neural networks, deep learning, . . . )

Page 62:

Outline

1 Introduction

2 Applications

3 Methods

4 Label Dependence

5 Multi-label Classification in Data Streams

Page 63:

Classification in Data Streams

Setting:

sequence is potentially infinite

high speed of arrival

stream is one-way

Implications

work in limited memory

adapt to concept drift

Page 64:

Multi-label Streams Methods

1 Batch-incremental ensemble
2 Problem transformation with an incremental base learner
3 Multi-label kNN
4 Multi-label incremental decision trees
5 Neural networks

Page 65:

Batch-Incremental Ensemble

Build regular multi-label models on batches/windows of instances (typically in a [weighted] ensemble):

([x y]1, [x y]2, [x y]3, [x y]4) → h1,  ([x y]5, [x y]6, [x y]7, [x y]8) → h2,  [x y]9, [x y]10, . . .

A common approach in the literature, and can be surprisingly effective

Free choice of base classifier (e.g., C4.5, SVM)

What batch size to use? (see the sketch below)
Too small = models insufficient
Too large = slow to adapt
Too many batches = too slow
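A sketch of the scheme; the make_model factory is a hypothetical hook for any batch multi-label learner (e.g., the BinaryRelevance sketch earlier):

```python
import numpy as np
from collections import deque

class BatchIncrementalEnsemble:
    """Train one multi-label model per window of instances, keep the
    newest n_models, and predict by majority vote."""

    def __init__(self, make_model, batch_size=500, n_models=10):
        self.make_model, self.batch_size = make_model, batch_size
        self.models = deque(maxlen=n_models)  # the oldest model falls off
        self.bx, self.by = [], []

    def observe(self, x, y):
        self.bx.append(x); self.by.append(y)
        if len(self.bx) == self.batch_size:   # window full: train a model
            self.models.append(
                self.make_model().fit(np.array(self.bx), np.array(self.by)))
            self.bx, self.by = [], []

    def predict(self, X):
        # majority vote (assumes at least one model has been trained)
        votes = sum(m.predict(X) for m in self.models)
        return (2 * votes >= len(self.models)).astype(int)
```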

Page 66:

Problem Transformation with Incremental Base Learner

Use an incremental learner (Naive Bayes, SGD, Hoeffding trees) with any problem transformation method (BR, LP, CC, . . . ), as sketched below.

(graph: x → labels y1, . . . , y4)

Simple implementation

Risk of overfitting (e.g., with classifier chains)

Concept drift may invalidate the structure

Limited choice of base learner (must be incremental)
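A minimal sketch of incremental BR over a stream, assuming scikit-learn's SGDClassifier as the incremental base learner (older scikit-learn versions spell the loss "log" rather than "log_loss"):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

class StreamingBR:
    """BR with one incremental binary model per label, updated
    instance by instance via partial_fit."""

    def __init__(self, n_labels):
        self.models = [SGDClassifier(loss="log_loss") for _ in range(n_labels)]

    def update(self, x, y):
        # partial_fit needs the full class set declared up front
        for m, yj in zip(self.models, y):
            m.partial_fit(x.reshape(1, -1), [yj], classes=[0, 1])

    def predict(self, x):
        return np.array([m.predict(x.reshape(1, -1))[0] for m in self.models])
```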

Page 67:

Multi-label kNN

Maintain a dynamic buffer of instances, and compare each test instance x to its k neighbouring instances:

yj = 1( (1/k) Σ_{i : x(i) ∈ Ne(x)} yj(i) > 0.5 )

(see the sketch after this list)

efficient wrt L

. . . but not wrt D

limited buffer size

not suitable for all problems
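The prediction rule in a few lines (a sketch over a plain numpy buffer with Euclidean distance; a real implementation would index the buffer for speed):

```python
import numpy as np

def ml_knn_predict(buffer_X, buffer_Y, x, k=5):
    """Predict label j as 1 when more than half of the k nearest
    buffered instances carry it."""
    d = np.linalg.norm(buffer_X - x, axis=1)  # distances to the buffer
    nn = np.argsort(d)[:k]                    # indices of the k nearest
    return (buffer_Y[nn].mean(axis=0) > 0.5).astype(int)
```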

Page 68:

ML Incremental Decision Trees

A small sample can suffice to choose a splitting attribute (the Hoeffding bound gives guarantees). As in a regular tree, but with modified splitting criteria, e.g.,

H_ML(S) = − Σ_{j=1}^{L} Σ_{c ∈ {0,1}} P(yj = c) log2 P(yj = c)

(see the sketch below)

Examples with multiple labels collect at the leaves.

x1 = 1?
├─ yes: x2 = 1?
│       ├─ yes: [1, 1, 0]
│       └─ no:  [0, 1, 1]
└─ no:  leaf holding [0, 0, 0], [0, 1, 1]
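The multi-label entropy above reduces to a sum of per-label binary entropies, which is straightforward to compute for a candidate split (a sketch; the clipping constant only guards against log2(0)):

```python
import numpy as np

def multilabel_entropy(Y):
    """H_ML(S) for a set of examples: the sum over the L label columns
    of their binary entropies (Y is an (n, L) 0/1 matrix)."""
    p = np.clip(Y.mean(axis=0), 1e-12, 1 - 1e-12)  # P(y_j = 1) per label
    return float(np.sum(-p * np.log2(p) - (1 - p) * np.log2(1 - p)))
```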

Page 69:

ML Incremental Decision Trees

Fast, and usually competitive,

But the tree may grow very conservatively,

. . . and we need to replace it (or part of it) when the concept changes.

x1 = 1?
├─ yes: x2 = 1?
│       ├─ yes: [1, 1, 0]
│       └─ no:  [0, 1, 1]
└─ no:  leaf holding [0, 0, 0], [0, 1, 1]

Place multi-label classifiers at the leaves of the tree

and wrap it in an ensemble.

Page 70:

Neural Networks

Each label is a node. Trained with SGD:

(network: inputs x1, . . . , x5 → outputs y1, . . . , y4)

y = W⊤x
g = ∇E(W)
W(t+1)j,k = W(t)j,k + λ gj,k

Can be applied natively

One layer = BR; should use hidden layers to model label dependence / improve performance

Hyper-parameter tuning can be tedious

Relatively poor performance in empirical comparisons on standard data streams (improving now with recent advances in SGD and more common use of basis expansion)
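A sketch of one SGD step for the single-layer network above, with a sigmoid output per label and a step against the gradient of the log-loss (the function name and loss choice are illustrative, not from the slides):

```python
import numpy as np

def sgd_step(W, x, y, lam=0.1):
    """One update for y ≈ sigmoid(W.T x); W is (D, L), x is (D,), y is (L,)."""
    p = 1.0 / (1.0 + np.exp(-W.T @ x))  # predicted p(y_j = 1 | x)
    g = np.outer(x, p - y)              # gradient of the log-loss wrt W
    return W - lam * g                  # step against the error gradient
```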

Page 71:

Multi-label Data Streams: Issues

Overfitting

Class imbalance

Multi-dimensional concept drift

Labelled examples difficult to obtain (semi-supervised)

Dynamic label set

Time dependence

Page 72:

Multi-label Concept Drift

Consider the relative frequencies of the j-th and k-th labels:

[ pj    pj,k ]
[       pk   ]

(if marginal independence holds, then pj,k = pj · pk).

Possible drift:

pj increases (j-th label relatively more frequent)

pj and pk both decrease (label cardinality decreasing)

pj,k changes relative to pj · pk (a change in the marginal dependence relation between the labels)
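These statistics are cheap to maintain per window, e.g. (a sketch; comparing successive windows' outputs is left to the chosen change detector):

```python
import numpy as np

def cooccurrence_stats(Y_window):
    """Marginal frequencies p_j and pairwise co-occurrences p_{j,k} for
    a window of labels; drift in p_{j,k} relative to p_j * p_k signals
    a change in marginal dependence."""
    p = Y_window.mean(axis=0)                    # p_j
    P = (Y_window.T @ Y_window) / len(Y_window)  # p_{j,k} off the diagonal
    return p, P
```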

Page 73:

Multi-label Concept Drift

And when conditioned on the input x, we consider the relative frequencies/values of the j-th and k-th errors:

[ εj    εj,k ]
[       εk   ]

(if conditional independence holds, then εj,k ≈ εj · εk).

Possible drift:

εj increases (more errors on j-th label)

εj and εk both increase (more errors)

εj,k changes relative to εj, εk (a change in the conditional dependence relation)

Page 74:

Example

Recall the distribution of errors

(figure: scatter of the error vector ε = [εOR, εXOR])

This shape may change over time – and the structures may need to be adjusted to cope

Page 75:

Dealing with Concept Drift

Possible approaches

Just ignore it – batch models must be replaced anyway, and kNN and SGD adapt; in other cases we can use weighted ensembles / a fading factor

Monitor a predictive performance statistic with a change detector (e.g., window-based detection, ADWIN) and reset the models

Monitor the distribution with a change detector (e.g., window-based, KL divergence) and reset/recalibrate the models

(similar to single-labelled data, except the measurement is more complex)

Page 76:

Dealing with Unlabelled Instances

Ignore instances with no label

Use active learning to get good labels

Use predicted labels (self-training)

Use an unsupervised process, for example clustering or latent-variable representations.

Page 77:

Dealing with Unlabelled Instances

Use an unsupervised process, for example clustering or latent-variable representations.

1 zt = g(xt)
2 yt = h(zt)
3 update g with (xt, zt)
4 update h with (zt−1, yt−1) (if yt−1 is available)

(diagram: xt → zt → yt, with zt propagated from zt−1, and both g and h updated online; see the sketch below)

Can also be cast as one single model
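One step of the pipeline, as a sketch (the transform/update/predict interfaces of g and h are hypothetical, chosen only to mirror the four numbered steps above):

```python
def semi_supervised_step(g, h, x_t, z_prev=None, y_prev=None):
    """Unsupervised representation g feeds supervised predictor h."""
    z_t = g.transform(x_t)     # 1. z_t = g(x_t)
    y_t = h.predict(z_t)       # 2. y_t = h(z_t)
    g.update(x_t, z_t)         # 3. update g on every instance
    if y_prev is not None:     # 4. update h only when a label arrives
        h.update(z_prev, y_prev)
    return z_t, y_t
```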

Page 78:

Summary

Multi-label classification is an active area of research, relevant to many real-world problems

Methods that deal appropriately with label dependence can achieve significant gains over a naive approach

Many multi-label problems come in the form of a data stream, incurring particular challenges

Page 79:

Multi-label learning from batch and streaming data

Jesse Read

Télécom ParisTech

École Polytechnique

Summer School on Mining Big and Complex Data, 5 September 2016 — Ohrid, Macedonia