
Learning with Ambiguously Labeled Training Data

Kshitij Judah, Ph.D. student. Advisor: Prof. Alan Fern. Qualifier Oral Presentation.


Outline

Introduction and Motivation

Problem Definition

Learning with Partial Labels

Semi-Supervised Learning

Multiple-Instance Learning

Conclusion


Introduction and Motivation

[Diagram: the supervised learning setting: training data D is fed to a learning algorithm with a loss function, producing a hypothesis f that classifies a test point. (From CS534 Machine Learning slides)]


Introduction and Motivation

Obtaining labeled training data is difficult.

Reasons:
Requires substantial human effort (Information Retrieval)
Requires expensive tests (Medical Diagnosis, Remote Sensing)
Disagreement among experts (Information Retrieval, Remote Sensing)


Introduction and Motivation

Reasons (continued):
Specifying the single correct class is not possible, but pointing out incorrect classes is possible (Medical Diagnosis)
Labeling is not possible at the instance level (Drug Activity Prediction)

Objective: Present the space of learning problems and algorithms that deal with ambiguously labeled training data.


Space of Learning Problems

[Diagram: taxonomy of learning problems, organized by direct supervision (labels) versus indirect supervision (no labels) and by deterministic versus probabilistic labels. The nodes include supervised learning, semi-supervised learning, learning with partial labels, multiple-instance learning, session-based learning, reinforcement learning, and unsupervised learning.]


Problem Definition

Let X = X_1 × ... × X_n denote an instance space, with X_i being the domain of the i-th feature.

An instance x is a vector (x_1, ..., x_n) in X.

Y denotes a discrete-valued class variable with domain {y_1, ..., y_m}, the set of all classes.

Training data D is a set of examples (x, d), where d is a vector of length m such that the k-th entry is 1 if y_k is a possible class of x, and 0 otherwise.

In the case of an unambiguously labeled x, exactly one entry in d is 1.


Problem Definition (continued)

In the case of ambiguous labeling, more than one entry of d can be 1.

The learning task is to select, using D, an optimal hypothesis h from a hypothesis space H.

h is used to classify a new instance x.

Optimality refers to expected classification accuracy.
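To make the (x, d) encoding concrete, here is a tiny illustrative snippet (hypothetical data, not from the talk) showing the label vector d for an unambiguously and an ambiguously labeled instance with m = 4 classes.

```python
import numpy as np

# Domain with m = 4 classes y_1 ... y_4 (hypothetical example).
# d is a 0/1 vector of length m; d[k] = 1 means y_{k+1} is a possible class.
d_unambiguous = np.array([0, 1, 0, 0])   # exactly one entry is 1: true label is y_2
d_ambiguous   = np.array([1, 0, 1, 1])   # candidate set {y_1, y_3, y_4}; true label is one of them

# A training set D is then a list of (x, d) pairs, e.g. with 2 features per instance:
D = [
    (np.array([0.3, 1.2]), d_unambiguous),
    (np.array([2.7, 0.4]), d_ambiguous),
]
```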


Learning with Partial Labels


Problem Definition

Each training instance x has a set of candidate class labels S associated with it.

S is assumed to contain the true label of x.

This differs from multiple-label learning because each instance belongs to exactly one class.

Applications: medical diagnosis, bioinformatics, information retrieval.


Taxonomy of Algorithms for Learning with Partial Labels

Probabilistic Generative Approaches: Mixture Models + EM

Probabilistic Discriminative Approaches: Logistic Regression for Partial Labels


Probabilistic Generative Approaches

Learn the joint probability distribution P(x, y).

P(x, y) can be used either to model the data or to perform classification.

P(x, y) is learned via P(x | y) and P(y).

Classification of a new instance x is performed using Bayes rule,

P(y | x) = P(x | y) P(y) / P(x),

and assigning x to the class y that maximizes P(y | x).
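As a minimal sketch of this recipe (assuming Gaussian class-conditional densities, which the slides do not fix), the snippet below estimates P(y) and P(x | y) from fully labeled data and classifies by maximizing P(x | y) P(y).

```python
import numpy as np
from scipy.stats import multivariate_normal

def fit_generative(X, y, n_classes):
    """Estimate class priors P(y) and Gaussian class-conditionals P(x|y) by MLE."""
    priors, means, covs = [], [], []
    for k in range(n_classes):
        Xk = X[y == k]
        priors.append(len(Xk) / len(X))
        means.append(Xk.mean(axis=0))
        covs.append(np.cov(Xk, rowvar=False) + 1e-6 * np.eye(X.shape[1]))  # regularized covariance
    return priors, means, covs

def predict(x, priors, means, covs):
    """Assign x to the class maximizing P(x|y) P(y), i.e. Bayes rule classification."""
    scores = [p * multivariate_normal.pdf(x, mean=m, cov=c)
              for p, m, c in zip(priors, means, covs)]
    return int(np.argmax(scores))
```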


Probabilistic Generative Approaches

Some examples of generative models are:
Gaussian distribution
Mixture models, e.g. the Gaussian mixture model
Directed graphical models, e.g. Naïve Bayes, HMMs
Undirected graphical models, e.g. Markov networks

Applications: robotics, speech recognition, computer vision, forecasting, time series prediction, etc.


Mixture Models

In a mixture model, the probability density function of x is given by:

p(x | θ) = Σ_{j=1..M} α_j p_j(x | θ_j),

where the α_j are mixing weights that sum to 1 and the p_j are component densities.

Model parameters are often learned using maximum likelihood estimation (MLE).


Mixture Models

In the MLE approach, assuming the instances in D are independently and identically distributed (i.i.d.), the joint density of D is given by:

p(D | θ) = Π_i p(x_i | θ) = Π_i Σ_j α_j p_j(x_i | θ_j)

The goal is to find the parameters θ that maximize the log of the above quantity.


Expectation-Maximization Algorithm (EM)

The log-likelihood is difficult to optimize directly: it is a log of sums.

It is an incomplete-data likelihood function: the component that generated each instance is unknown.

Convert it to a complete-data likelihood by introducing a random variable z which tells, for each x_i, which component generated it.

If we observed z, the complete-data log-likelihood would become a sum of logs and be easy to optimize.


Expectation-Maximization Algorithm (EM)

We do not, however, observe z directly; it is a hidden variable.

EM is used to handle hidden variables.

EM is an iterative algorithm with two basic steps.

E-Step: Given the current estimate of the model parameters θ^(t) and the data D, the expected complete-data log-likelihood

Q(θ; θ^(t)) = E_z[ log p(D, z | θ) | D, θ^(t) ]

is calculated, where the expectation is with respect to the marginal distribution of z (i.e. each z_i).


Expectation-Maximization Algorithm (EM)

M-Step: Maximize the quantity Q computed in the E-step:

θ^(t+1) = argmax_θ Q(θ; θ^(t))

Repeat until convergence.
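The following is a minimal EM sketch for a one-dimensional Gaussian mixture (my own illustration; the slides keep the component densities generic). The E-step computes responsibilities and the M-step re-estimates the mixing weights, means, and standard deviations until the log-likelihood stops improving.

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, n_components, n_iters=100, tol=1e-6):
    """EM for a 1-D Gaussian mixture model (illustrative sketch)."""
    rng = np.random.default_rng(0)
    alpha = np.full(n_components, 1.0 / n_components)        # mixing weights
    mu = rng.choice(x, n_components, replace=False)          # initial means
    sigma = np.full(n_components, x.std() + 1e-3)            # initial std devs
    prev_ll = -np.inf
    for _ in range(n_iters):
        # E-step: responsibilities r[i, j] = P(z_i = j | x_i, current params)
        dens = np.array([a * norm.pdf(x, m, s) for a, m, s in zip(alpha, mu, sigma)]).T
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from responsibility-weighted data
        nk = r.sum(axis=0)
        alpha = nk / len(x)
        mu = (r * x[:, None]).sum(axis=0) / nk
        sigma = np.sqrt((r * (x[:, None] - mu) ** 2).sum(axis=0) / nk) + 1e-6
        ll = np.log(dens.sum(axis=1)).sum()                  # incomplete-data log-likelihood
        if ll - prev_ll < tol:
            break
        prev_ll = ll
    return alpha, mu, sigma
```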


Mixture Models + EM for Partial Labels

Let C_i denote the set of possible mixture components that could have generated sample x_i (those corresponding to its candidate labels).

EM is modified so that z_i is restricted to take values from C_i.

Modified E-Step: the responsibility of component j for x_i is computed as usual if j is in C_i, and is set to 0 otherwise.
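A small sketch of how that restriction could be implemented (my reading of the idea, not the authors' exact formulation): compute the usual responsibilities, zero out components outside C_i, and renormalize.

```python
import numpy as np

def partial_label_responsibilities(dens, candidate_mask):
    """
    dens: (n, K) array of alpha_j * p_j(x_i | theta_j) values.
    candidate_mask: (n, K) 0/1 array; entry (i, j) is 1 iff component j is in C_i.
    Returns the masked, renormalized responsibilities for the modified E-step.
    """
    masked = dens * candidate_mask                   # zero out components not in C_i
    return masked / masked.sum(axis=1, keepdims=True)
```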


Pros and Cons of Generative Models

Pros:
Clear probabilistic framework
Effective when the model assumptions are correct

Cons:
Complex, with a lot of parameters
Difficult to specify and verify their correctness
Unlabeled data can hurt performance if the model is incorrect
Parameter estimation is difficult
EM gets stuck in local maxima and is computationally expensive
No guarantee of better classification


Discriminative Approaches

Discriminative approaches attempt to directly learn the mapping required for classification.

They are simpler than generative models.

They are more focused on classification or regression tasks.

Two types:
Probabilistic: learn the posterior class probability P(y | x)
Non-probabilistic: learn a classifier directly

Applications: text classification, information retrieval, machine perception, time series prediction, bioinformatics, etc.


Logistic Regression (From CS534 Machine Learning slides)

Learns the conditional class distribution P(y | x).

For the two-class case {0, 1}, the probabilities are given as:

P(y = 1 | x) = 1 / (1 + exp(-(w · x + b))),  P(y = 0 | x) = 1 - P(y = 1 | x)

For the 0-1 loss function, LR is a linear classifier.

For the multi-class case, we learn weight parameters for each class.
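A minimal sketch of the two-class case (standard logistic regression, not slide-specific code): the model is trained by gradient ascent on the conditional log-likelihood.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_logistic_regression(X, y, lr=0.1, n_iters=1000):
    """Gradient ascent on the log-likelihood sum_i log P(y_i | x_i, w, b); y in {0, 1}."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_iters):
        p = sigmoid(X @ w + b)        # P(y = 1 | x) for every training instance
        grad_w = X.T @ (y - p)        # gradient of the log-likelihood w.r.t. w
        grad_b = (y - p).sum()
        w += lr * grad_w / len(y)
        b += lr * grad_b / len(y)
    return w, b
```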


Maximum Entropy Logistic Regression

Learns P(y | x) such that it has maximum entropy while being consistent with the partial labels in D.

P(y | x) is close to a uniform distribution over the classes in the candidate set and zero everywhere else.

The presence of unlabeled data results in a P(y | x) with low discrimination capacity.

Disadvantages:
Does not use unlabeled data to enhance learning
Instead uses unlabeled data in an unfavorable way that decreases the discrimination capacity of P(y | x)


Minimum Commitment Logistic Regression

Learns P(y | x) such that the classes belonging to the candidate set are predicted with high total probability.

No preference is made among distributions that satisfy the above property.

The presence of unlabeled data has no effect on the discrimination capacity of P(y | x).

Disadvantage: does not use unlabeled data to enhance learning.
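Here is a sketch of a minimum-commitment-style objective under my reading of the slide: for each example, the model is rewarded for the total probability it places on the candidate set, so the loss is the negative log of the summed candidate-class probabilities (softmax multiclass logistic regression is assumed as the model form).

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def partial_label_nll(W, X, candidate_mask):
    """
    Negative log-likelihood of the candidate sets:
      - sum_i log( sum_{y in S_i} P(y | x_i) )
    W: (n_features, n_classes) weights, X: (n, n_features),
    candidate_mask: (n, n_classes) 0/1 matrix encoding the candidate sets S_i.
    """
    P = softmax(X @ W)                                # P(y | x) for all classes
    candidate_prob = (P * candidate_mask).sum(axis=1) # total mass on the candidate set
    return -np.log(candidate_prob + 1e-12).sum()
```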


Self-Consistent Logistic Regression

Learns P(y | x) such that the entropy of the distribution is minimized.

Learning is driven by the predictions produced by the learner itself.

Advantage: can make use of unlabeled data to enhance learning.

Disadvantage: if the initial estimate is incorrect, the algorithm will go down the wrong path.


Semi-Supervised Learning


Problem Definition

Semi-supervised learning (SSL) is a special case of learning with partial labels.

D = D_L ∪ D_U, where D_L is the labeled portion and D_U is the unlabeled portion. For each x in D_L, the candidate set is exactly the true class label of x. For each x in D_U, the candidate set is the entire set of classes.

Usually |D_U| >> |D_L|, because unlabeled data is easily available compared to labeled data, e.g. in information retrieval.

SSL approaches make use of D_U in conjunction with D_L to enhance learning.

SSL has been applied to many domains such as text classification, image classification, and audio classification.


Taxonomy of Algorithms for Semi-Supervised Learning

Probabilistic Generative Approaches: Mixture Models + EM

Discriminative Approaches: Semi-supervised support vector machines (S3VMs), Self-Training, Co-Training


Mixture Models + EM for SSL

Nigam et al. proposed a probabilistic generative framework for document classification.

Assumptions: (1) documents are generated by a mixture model; (2) there is a one-to-one correspondence between mixture components and classes.

The probability of a document x is given as:

P(x | θ) = Σ_j P(c_j | θ) P(x | c_j, θ)

P(x | c_j, θ) was modeled using Naïve Bayes.


Mixture Models + EM for SSL

θ is estimated using MLE by maximizing the log-likelihood of D = D_L ∪ D_U:

l(θ | D) = Σ_{x in D_L} log P(y_x | θ) P(x | y_x, θ) + Σ_{x in D_U} log Σ_j P(c_j | θ) P(x | c_j, θ),

where y_x is the known class of a labeled document x.

EM is used to maximize l(θ | D).
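A small sketch of how the two terms of this log-likelihood combine (my illustration; Nigam et al. use multinomial Naïve Bayes component densities, which are abstracted away here as precomputed log p(x | c_j, θ) values).

```python
import numpy as np

def semi_supervised_log_likelihood(log_px_given_c, priors, labels):
    """
    log_px_given_c: (n, K) array of log p(x_i | c_j, theta).
    priors: (K,) array of P(c_j | theta).
    labels: length-n array; labels[i] = class index for labeled docs, -1 for unlabeled.
    """
    log_joint = log_px_given_c + np.log(priors)      # log p(x_i, c_j | theta)
    ll = 0.0
    for i, y in enumerate(labels):
        if y >= 0:                                   # labeled: use the known class
            ll += log_joint[i, y]
        else:                                        # unlabeled: marginalize over classes
            ll += np.logaddexp.reduce(log_joint[i])
    return ll
```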

Applications: Text classification, remote sensing, face orientation discrimination, pose estimation, audio classification


Mixture Models + EM for SSL

Problem: Unlabeled data decreased performance.

Reason: Incorrect modeling assumptions. With one mixture component per class, a class can in fact include multiple sub-topics.

Solutions:
Decrease the effect of unlabeled data
Correct the modeling assumptions: multiple mixture components per class


SVMs

[Diagram: a maximum-margin linear separator between Class -1 and Class +1, with the margin marked.]


SVMs

Idea: Do the following things:
A. Maximize the margin
B. Classify points correctly
C. Keep data points outside the margin

SVM optimization problem (soft-margin form):

min_{w, b, ξ} (1/2)||w||² + C Σ_i ξ_i   subject to   y_i (w · x_i + b) ≥ 1 - ξ_i,  ξ_i ≥ 0


SVMs

A is achieved by minimizing ||w||²; B and C are achieved using the hinge loss.

The constant C controls the trade-off between training error and generalization error.

From the SSL tutorial by Xiaojin Zhu at ICML 2007.


S3VMs

Also known as Transductive SVMs.

(Figure from the SSL tutorial by Xiaojin Zhu at ICML 2007.)


S3VMs

Idea: Do the following things:
A. Maximize the margin
B. Correctly classify the labeled data
C. Keep data points (labeled and unlabeled) outside the margin

S3VM optimization problem (one common form):

min_{w, b} (1/2)||w||² + C_L Σ_{labeled} max(0, 1 - y_i (w · x_i + b)) + C_U Σ_{unlabeled} max(0, 1 - |w · x_j + b|)


S3VMs

A is achieved by minimizing ||w||²; B is achieved using the hinge loss; C is achieved for labeled data using the hinge loss and for unlabeled data using the hat loss.

From the SSL tutorial by Xiaojin Zhu at ICML 2007.
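A small sketch of the two losses and the resulting objective (standard forms, as in Zhu's tutorial; the constants C_l and C_u are my notation): the hinge loss penalizes labeled points on the wrong side of or inside the margin, while the hat loss penalizes unlabeled points that fall inside the margin on either side.

```python
import numpy as np

def hinge_loss(y, f):
    """Labeled point: y in {-1, +1}, f = w*x + b."""
    return np.maximum(0.0, 1.0 - y * f)

def hat_loss(f):
    """Unlabeled point: penalize |f| < 1, i.e. points inside the margin."""
    return np.maximum(0.0, 1.0 - np.abs(f))

def s3vm_objective(w, b, X_l, y_l, X_u, C_l=1.0, C_u=0.1):
    """One common way to write the S3VM objective (a sketch, not the exact slide formula)."""
    f_l = X_l @ w + b
    f_u = X_u @ w + b
    return (0.5 * w @ w
            + C_l * hinge_loss(y_l, f_l).sum()
            + C_u * hat_loss(f_u).sum())
```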


S3VMs

Pros:
Same mathematical foundations as SVMs, and hence applicable to tasks where SVMs are applicable

Cons:
Optimization is difficult; finding the exact solution is NP-hard

Applications: text classification, bioinformatics, biological entity recognition, image retrieval, etc.


Self Training

A learner uses its own predictions to teach itself.

Self-training algorithm (a minimal sketch follows below):
1. Train a classifier h using the labeled data D_L
2. Use h to predict labels of the unlabeled data D_U
3. Add the most confidently labeled instances in D_U to D_L
4. Repeat until a stopping criterion is met
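The sketch below is one minimal way to realize this loop, using scikit-learn's LogisticRegression as the base classifier (my choice; any classifier that reports confidence would do).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_training(X_l, y_l, X_u, n_rounds=10, n_add=10):
    """Self-training loop: repeatedly promote the most confident predictions."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    for _ in range(n_rounds):
        if len(X_u) == 0:
            break
        h = LogisticRegression(max_iter=1000).fit(X_l, y_l)   # 1. train on labeled data
        proba = h.predict_proba(X_u)                           # 2. predict labels of the unlabeled pool
        pick = np.argsort(proba.max(axis=1))[-n_add:]          # 3. most confidently labeled instances
        X_l = np.vstack([X_l, X_u[pick]])
        y_l = np.concatenate([y_l, h.classes_[proba[pick].argmax(axis=1)]])
        X_u = np.delete(X_u, pick, axis=0)                     # 4. repeat until stopping criterion
    return LogisticRegression(max_iter=1000).fit(X_l, y_l)
```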


Self Training

Pros:
One of the simplest SSL algorithms
Can be used with any classifier

Cons:
Errors in initial predictions can build up and affect subsequent learning. A possible solution: identify and remove mislabeled examples from the self-labeled training set during each iteration, e.g. the SETRED algorithm [Ming Li and Zhi-Hua Zhou].

Applications: object detection in images, word sense disambiguation, semantic role labeling, named entity recognition


Co-Training

Idea: Two classifiers h1 and h2 collaborate with each other during learning.

h1 and h2 use disjoint subsets F1 and F2 of the feature set.

F1 and F2 are each assumed to be:
Sufficient for the classification task at hand
Conditionally independent of the other given the class


Co-Training

Example: web page classification.

F1 can be the text present on the web page.

F2 can be the anchor text on pages that link to this page.

The two feature sets are sufficient for classification and conditionally independent given the class of the web page, because each is generated by a different person who knows the class.


Co-Training

Co-Training algorithm (a minimal sketch follows below):
1. Use D_L with F1 to train h1, and D_L with F2 to train h2
2. Use h1 to predict labels of D_U
3. Use h2 to predict labels of D_U
4. Add h1's most confidently labeled instances to the training set of h2
5. Add h2's most confidently labeled instances to the training set of h1
6. Repeat until a stopping criterion is met
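The sketch below is one minimal realization of this loop (my illustration, again with scikit-learn's LogisticRegression); the two views are passed as index arrays selecting F1 and F2, and each classifier's confident predictions are added to the other's training set.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def co_training(X_l, y_l, X_u, view1, view2, n_rounds=10, n_add=5):
    """view1/view2: index arrays selecting the two disjoint feature subsets F1 and F2."""
    idx_u = np.arange(len(X_u))
    # Each classifier keeps its own growing training set (one common variant).
    sets = [(X_l.copy(), y_l.copy()), (X_l.copy(), y_l.copy())]
    views = [view1, view2]
    for _ in range(n_rounds):
        if len(idx_u) == 0:
            break
        for j in range(2):                                   # classifier j teaches classifier 1 - j
            Xj, yj = sets[j]
            h = LogisticRegression(max_iter=1000).fit(Xj[:, views[j]], yj)
            proba = h.predict_proba(X_u[idx_u][:, views[j]])
            pick = np.argsort(proba.max(axis=1))[-n_add:]    # most confident predictions
            new_y = h.classes_[proba[pick].argmax(axis=1)]
            Xo, yo = sets[1 - j]
            sets[1 - j] = (np.vstack([Xo, X_u[idx_u[pick]]]),
                           np.concatenate([yo, new_y]))
            idx_u = np.delete(idx_u, pick)
    return sets
```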


Co-Training

Pros:
Effective when the required feature split exists
Can be used with any classifier

Cons:
What if a feature split does not exist? Answer: (1) use a random feature split, or (2) use the full feature set.
How to select confidently labeled instances? Answer: use multiple classifiers instead of two and do majority voting, e.g. tri-training uses three classifiers [Zhou and Li], and democratic co-learning uses multiple classifiers [Zhou and Goldman].

Applications: web page classification, named entity recognition, semantic role labeling


Multiple-Instance Learning


Problem Definition

We have a set of objects.

Each object has multiple variants, called instances.

The set of instances of an object, B = {x_1, ..., x_n}, is called a bag.

A bag is positive if at least one of its instances is positive, and negative otherwise.

Goal: learn a classifier that maps a bag to a class label.


Why is MIL an ambiguous-label problem?

To learn a bag-level classifier, we learn an instance-level classifier.

The instance-level classifier maps an instance to a class.

D, however, does not contain labels at the instance level.


Taxonomy of Algorithms for Multiple-Instance Learning

Probabilistic Generative Approaches: Diverse Density Algorithm (DD), EM Diverse Density (EM-DD)

Probabilistic Discriminative Approaches: Multiple-instance logistic regression

Non-probabilistic Discriminative Approaches: Axis-Parallel Rectangles (APR), Multiple-instance SVM (MISVM)


Diverse Density Algorithm (DD)

[Diagram: instances from four bags, labeled 1 through 4, plotted in a two-dimensional feature space (x1, x2), with two candidate target points, Point A and Point B, marked.]


Diverse Density Algorithm (DD)

Diverse density is a probabilistic quantity: the diverse density of a point t in instance space is high when instances from many different positive bags lie near t and instances from negative bags lie far from t.

Using gradient ascent, the point of maximum diverse density can be located.

Cons: computationally expensive.

Applications: person identification, stock selection, natural scene classification
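For concreteness, here is a sketch of one common form of the diverse density of a candidate point t, following the noisy-or model of Maron and Lozano-Pérez (the slides' exact formula may differ): each positive bag should contain some instance near t, and every instance of every negative bag should be far from t.

```python
import numpy as np

def diverse_density(t, positive_bags, negative_bags):
    """
    t: candidate target point (1-D array).
    positive_bags / negative_bags: lists of (n_i, d) arrays of instances.
    Returns DD(t) under a noisy-or model with a Gaussian-like closeness kernel.
    """
    dd = 1.0
    for bag in positive_bags:
        near = np.exp(-np.sum((bag - t) ** 2, axis=1))  # closeness of each instance to t
        dd *= 1.0 - np.prod(1.0 - near)                  # positive bag: some instance is near t
    for bag in negative_bags:
        near = np.exp(-np.sum((bag - t) ** 2, axis=1))
        dd *= np.prod(1.0 - near)                        # negative bag: no instance is near t
    return dd
```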


EM Diverse Density (EM-DD)

Views knowledge of which instance is responsible for the bag label as a missing-data problem.

Uses EM to maximize diverse density.

EM-DD algorithm:
E-Step: Given the current estimate of the target location, find the most likely instance in each bag responsible for the bag's label.
M-Step: Find a new target location that maximizes diverse density given those responsible instances.


EM Diverse Density (EM-DD)

Pros:
Computationally less expensive than DD
Avoids local maxima
Shown to outperform DD and other MIL algorithms on the musk data and on artificial real-valued data


Multiple-Instance Logistic Regression

Learn a logistic regression classifier for instances.

Use a softmax function to combine the instance output probabilities into bag probabilities.

The MIL property is satisfied: a bag has a high probability of being positive if some instance in it has a high probability of being positive.
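A small sketch of the combination step (my illustration): instance probabilities from a logistic model are combined by a softmax-weighted average, which approaches the maximum as the temperature alpha grows, so one confidently positive instance can push the bag probability high.

```python
import numpy as np

def instance_probs(bag, w, b):
    """Instance-level P(positive | x) from a logistic model."""
    return 1.0 / (1.0 + np.exp(-(bag @ w + b)))

def bag_prob(bag, w, b, alpha=5.0):
    """Softmax combination of instance probabilities into a bag probability."""
    p = instance_probs(bag, w, b)
    weights = np.exp(alpha * p)
    return (weights * p).sum() / weights.sum()   # approaches max(p) for large alpha
```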


Axis-Parallel Rectangles (APR)

Proposed by Dietterich et al. for drug activity prediction.

Three algorithms: GFS elim-count APR, GFS elim-kde APR, Iterated discrimination APR


Axis-Parallel Rectangles (APR)

[Diagram: an axis-parallel rectangle in a two-dimensional feature space (x1, x2).]


Iterated Discrimination APR

An inside-out algorithm: start with a single positive instance and grow the APR to include additional positive instances.

Three basic steps (the Grow step is sketched below):
Grow: construct the smallest APR that covers at least one instance from each positive bag
Discriminate: select the most discriminating features using the APR
Expand: expand the final APR to improve generalization

Works in two phases.
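A simplified sketch of the Grow step under a greedy reading of the slide (my illustration, not Dietterich et al.'s exact procedure): starting from a seed positive instance, add from each positive bag the instance that enlarges the rectangle the least.

```python
import numpy as np

def grow_apr(positive_bags, seed_bag=0, seed_instance=0):
    """Greedy Grow step: cover at least one instance from each positive bag
    with a small axis-parallel rectangle (a simplified sketch)."""
    lower = upper = positive_bags[seed_bag][seed_instance].astype(float)
    for i, bag in enumerate(positive_bags):
        if i == seed_bag:
            continue
        # Pick the instance in this bag that enlarges the rectangle the least.
        sizes = []
        for x in bag:
            lo, hi = np.minimum(lower, x), np.maximum(upper, x)
            sizes.append(np.sum(hi - lo))
        best = bag[int(np.argmin(sizes))]
        lower, upper = np.minimum(lower, best), np.maximum(upper, best)
    return lower, upper   # the APR is the box [lower, upper] in each dimension
```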


Iterated Discrimination APR

[Diagram: the Grow step, shown in a two-dimensional feature space (x1, x2).]


Iterated Discrimination APR

Discriminate:
Greedily select the feature with the highest discrimination power.
Discrimination power depends on how many negative instances fall outside the APR and how far outside they are.

Expand:
Expand the final APR to improve generalization, using kernel density estimation.


Multiple-instance SVM (MISVM)

Based on an idea similar to S3VMs.

S3VMs: no constraint on how the unlabeled data is disambiguated.

MISVM: maintain the MIL constraint:
All instances from negative bags are negative
At least one instance from each positive bag is positive

Find the maximum-margin classifier with the MIL constraint satisfied.


Multiple-instance SVM (MISVM)

MISVM optimization problem: minimize the standard soft-margin SVM objective jointly over the classifier (w, b) and over instance label assignments that are consistent with the bag labels (a mixed-integer program).
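The exact problem is a mixed-integer program, so it is typically attacked heuristically. The sketch below follows the spirit of Andrews et al.'s mi-SVM heuristic (my illustration, using scikit-learn's SVC): alternate between training an SVM on the current instance labels and relabeling positive-bag instances, keeping negative-bag instances negative and forcing at least one positive instance per positive bag.

```python
import numpy as np
from sklearn.svm import SVC

def mi_svm(positive_bags, negative_bags, n_iters=20, C=1.0):
    X_neg = np.vstack(negative_bags)
    y_neg = -np.ones(len(X_neg))                    # negative-bag instances stay negative
    X_pos = np.vstack(positive_bags)
    y_pos = np.ones(len(X_pos))                     # initialize positive-bag instances as positive
    starts = np.cumsum([0] + [len(b) for b in positive_bags])
    for _ in range(n_iters):
        clf = SVC(kernel="linear", C=C).fit(np.vstack([X_neg, X_pos]),
                                            np.concatenate([y_neg, y_pos]))
        f = clf.decision_function(X_pos)
        new_y = np.where(f >= 0, 1.0, -1.0)
        for i in range(len(positive_bags)):         # MIL constraint: one positive per positive bag
            s, e = starts[i], starts[i + 1]
            if (new_y[s:e] == 1).sum() == 0:
                new_y[s + np.argmax(f[s:e])] = 1.0
        if np.array_equal(new_y, y_pos):            # labels stabilized
            break
        y_pos = new_y
    return clf
```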


Multiple-instance SVM (MISVM)

Two notions of margin: instance level and bag level

Figure from [Andrews et al., NIPS 2002]


Conclusion

Motivated the problem of learning from ambiguously labeled data.

Studied the space of such learning problems.

Presented a taxonomy of proposed algorithms for each problem.


Thank you