Learning with Ambiguously Labeled Training Data
Kshitij Judah, Ph.D. student
Advisor: Prof. Alan Fern
Qualifier Oral Presentation
Outline
Introduction and Motivation
Problem Definition
Learning with Partial Labels
Semi-Supervised Learning
Multiple-Instance Learning
Conclusion
Introduction and Motivation
[Diagram of supervised learning: training data D and a loss function feed into a learning algorithm, which outputs a hypothesis f that is applied to a test point. (From CS534 Machine Learning slides)]
Introduction and Motivation
Obtaining labeled training data is difficult.
Reasons:
Requires substantial human effort (Information Retrieval)
Requires expensive tests (Medical Diagnosis, Remote Sensing)
Disagreement among experts (Information Retrieval, Remote Sensing)
Introduction and Motivation
More reasons:
Specifying a single correct class is not possible, but ruling out incorrect classes is possible (Medical Diagnosis)
Labeling is not possible at the instance level (Drug Activity Prediction)
Objective: present the space of learning problems and algorithms that deal with ambiguously labeled training data.
Space of Learning Problems
[Taxonomy diagram: learning problems are organized by direct supervision (labels) versus indirect supervision (no labels), and by deterministic versus probabilistic labels. The nodes include supervised learning, learning with partial labels, semi-supervised learning, multiple-instance learning, session-based learning, reinforcement learning, and unsupervised learning.]
Problem Definition
Let $X = X_1 \times \dots \times X_n$ denote an instance space, $X_i$ being the domain of the ith feature.
An instance is a vector $x = (x_1, \dots, x_n)$.
$Y$ denotes a discrete-valued class variable with domain $\{y_1, \dots, y_m\}$, the set of all classes.
Training data D is a set of examples $(x, s)$, where $x$ is an instance and $s$ is a vector of length $m$ such that the kth entry is 1 if $y_k$ is a possible class of $x$, and 0 otherwise.
In the case of an unambiguously labeled $x$, exactly one entry in $s$ is 1.
Problem Definition
In the case of ambiguous labeling, more than one entry can be 1.
The learning task is to select, using D, an optimal hypothesis $f$ from a hypothesis space $H$.
$f$ is used to classify a new instance.
Optimality refers to expected classification accuracy.
Problem Definition (Learning with Partial Labels)
Each training instance $x$ has a set of candidate class labels $S$ associated with it.
$S$ is assumed to contain the true label of $x$.
This is different from multiple-label learning, because each instance belongs to exactly one class.
Applications: medical diagnosis, bioinformatics, information retrieval.
Taxonomy of Algorithms for Learning with Partial Labels
Probabilistic Generative Approaches: Mixture Models + EM
Probabilistic Discriminative Approaches: Logistic Regression for Partial Labels
Probabilistic Generative Approaches
Learn the joint probability distribution $P(x, y)$.
$P(x, y)$ can be used either to model the data or to perform classification.
$P(x, y)$ is learned using $P(x \mid y)$ and $P(y)$.
Classification of a new instance $x$ is performed using Bayes rule:
$P(y \mid x) = \frac{P(x \mid y)\, P(y)}{P(x)}$
and assigning $x$ to the class that maximizes $P(y \mid x)$.
Probabilistic Generative Approaches
Some examples of generative models:
Gaussian distribution
Mixture models, e.g. the Gaussian mixture model
Directed graphical models, e.g. Naïve Bayes, HMMs
Undirected graphical models, e.g. Markov networks
Applications: robotics, speech recognition, computer vision, forecasting, time series prediction, etc.
Mixture Models
In a mixture model, the probability density function of $x$ is given by:
$p(x \mid \Theta) = \sum_{j=1}^{K} \alpha_j \, p_j(x \mid \theta_j)$
where the $\alpha_j$ are mixing weights and $\Theta = (\alpha_1, \dots, \alpha_K, \theta_1, \dots, \theta_K)$.
Model parameters are often learned using maximum likelihood estimation (MLE).
Mixture Models
In the MLE approach, assuming instances in D are independently and identically distributed (i.i.d.), the joint density of D is given by
$p(D \mid \Theta) = \prod_{i=1}^{N} p(x_i \mid \Theta) = \prod_{i=1}^{N} \sum_{j=1}^{K} \alpha_j \, p_j(x_i \mid \theta_j)$
The goal is to find parameters $\Theta$ that maximize the log of the above quantity.
Expectation-Maximization Algorithm (EM)
The log-likelihood is difficult to optimize directly because it contains a log of sums.
It is an incomplete-data likelihood function: the component that generated each instance is unknown.
Convert it to a complete-data likelihood by introducing a random variable $z_i$ which tells, for each $x_i$, which component generated it.
If we observed $z = (z_1, \dots, z_N)$, the complete-data log-likelihood would be
$\log p(D, z \mid \Theta) = \sum_{i=1}^{N} \log\big(\alpha_{z_i}\, p_{z_i}(x_i \mid \theta_{z_i})\big)$
Expectation-Maximization Algorithm (EM)
We do not, however, observe $z$ directly; it is a hidden variable.
EM is used to handle hidden variables.
EM is an iterative algorithm with two basic steps.
E-Step: Given the current estimate of the model parameters $\Theta^{(t)}$ and the data D, the expected complete-data log-likelihood is calculated, where the expectation is with respect to the marginal distribution of $z$ (i.e. each $z_i$):
$Q(\Theta, \Theta^{(t)}) = E_{z \mid D, \Theta^{(t)}}\big[\log p(D, z \mid \Theta)\big]$
Expectation-Maximization Algorithm (EM)
M-Step: Maximize the expected complete-data log-likelihood $Q$ computed in the E-step:
$\Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta, \Theta^{(t)})$
Repeat until convergence.
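To make the two steps concrete, here is a minimal NumPy sketch of EM for a one-dimensional Gaussian mixture. The variable names (responsibilities `r`, weights `alpha`, means `mu`, variances `var`) are illustrative and not from the slides.

```python
import numpy as np

def em_gaussian_mixture(x, K=2, n_iter=50, seed=0):
    """Minimal EM for a 1-D Gaussian mixture model (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    N = len(x)
    alpha = np.full(K, 1.0 / K)                # mixing weights
    mu = rng.choice(x, size=K, replace=False)  # initial means
    var = np.full(K, np.var(x) + 1e-6)         # initial variances

    for _ in range(n_iter):
        # E-step: responsibilities r[i, j] = P(z_i = j | x_i, Theta)
        dens = np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        r = alpha * dens
        r /= r.sum(axis=1, keepdims=True)

        # M-step: re-estimate parameters from the expected assignments
        Nk = r.sum(axis=0)
        alpha = Nk / N
        mu = (r * x[:, None]).sum(axis=0) / Nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / Nk + 1e-6
    return alpha, mu, var

# Example usage on synthetic data drawn from two Gaussians
x = np.concatenate([np.random.normal(0, 1, 200), np.random.normal(5, 1, 200)])
print(em_gaussian_mixture(x, K=2))
```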
Mixture Models + EM for Partial Labels
For each training instance, the candidate label set $S_i$ defines the set of possible mixture components that could have generated the sample.
EM is modified so that $z_i$ is restricted to take values from $S_i$.
Modified E-Step: the responsibility of component $j$ for instance $x_i$ is computed as usual if $j \in S_i$, and set to 0 if $j \notin S_i$, with renormalization over the candidate set (see the sketch below).
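A minimal sketch of this modified E-step, assuming the responsibilities are computed as in the standard mixture-model E-step and then masked by each instance's candidate set (a 0/1 matrix `S`); the masking-and-renormalizing form is my reading of the slide, not code taken from it.

```python
import numpy as np

def partial_label_e_step(dens, alpha, S):
    """E-step restricted to candidate labels.

    dens  : (N, K) array, dens[i, j] = p_j(x_i | theta_j)
    alpha : (K,)   array of mixing weights
    S     : (N, K) 0/1 array, S[i, j] = 1 iff component j is in the
            candidate set of instance i
    Returns (N, K) responsibilities that are zero outside each candidate
    set and sum to one within it.
    """
    r = alpha * dens * S          # zero out components outside the candidate set
    r /= r.sum(axis=1, keepdims=True)
    return r
```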
Pros and Cons of Generative Models
Pros:
Clear probabilistic framework
Effective when model assumptions are correct
Cons:
Complex, with a lot of parameters
Difficult to specify and verify their correctness
Unlabeled data can hurt performance if the model is incorrect
Parameter estimation is difficult: EM gets stuck in local maxima and is computationally expensive
No guarantee of better classification
Discriminative Approaches
Discriminative approaches attempt to directly learn the mapping required for classification.
They are simpler than generative models and more focused on classification or regression tasks.
Two types:
Probabilistic: learn the posterior class probability $P(y \mid x)$
Non-probabilistic: learn a classifier directly
Applications: text classification, information retrieval, machine perception, time series prediction, bioinformatics, etc.
Logistic Regression (From CS534 Machine Learning slides)
Learns the conditional class distribution $P(y \mid x)$.
For the two-class case $y \in \{0, 1\}$, the probabilities are given as:
$P(y = 1 \mid x) = \frac{1}{1 + \exp\big(-(w^\top x + b)\big)}$, and $P(y = 0 \mid x) = 1 - P(y = 1 \mid x)$.
For the 0-1 loss function, LR is a linear classifier.
For the multi-class case, we learn weight parameters for each class.
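A small sketch of the two-class and multi-class logistic regression probabilities described above (sigmoid and softmax forms); the parameter names are illustrative.

```python
import numpy as np

def lr_binary_prob(x, w, b):
    """P(y = 1 | x) for two-class logistic regression (sigmoid)."""
    return 1.0 / (1.0 + np.exp(-(np.dot(w, x) + b)))

def lr_multiclass_probs(x, W, b):
    """P(y = k | x) via softmax over per-class weight vectors (rows of W)."""
    scores = W @ x + b
    scores -= scores.max()            # numerical stability
    e = np.exp(scores)
    return e / e.sum()
```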
Maximum Entropy Logistic Regression
Learns $P(y \mid x)$ such that it has maximum entropy while being consistent with the partial labels in D.
The learned $P(y \mid x)$ is close to a uniform distribution over the classes in the candidate set and zero everywhere else.
The presence of unlabeled data results in a $P(y \mid x)$ with low discrimination capacity.
Disadvantage: does not use unlabeled data to enhance learning; instead it uses unlabeled data in an unfavorable way that decreases the discrimination capacity of $P(y \mid x)$.
Minimum Commitment Logistic Regression
Learns $P(y \mid x)$ such that the classes belonging to the candidate set are predicted with high total probability.
No preference is made among distributions that satisfy the above property.
The presence of unlabeled data has no effect on the discrimination capacity of $P(y \mid x)$.
Disadvantage: does not use unlabeled data to enhance learning. (A sketch of this style of objective follows below.)
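A minimal sketch of one way to train logistic regression from partial labels in the minimum-commitment spirit: for each example, maximize the total predicted probability on its candidate set. This particular loss and the gradient-descent loop are my own illustration under those assumptions, not the exact formulation from the cited work.

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def train_partial_label_lr(X, S, n_iter=500, lr=0.1):
    """X: (N, d) features; S: (N, K) 0/1 candidate-label mask.
    Maximizes sum_i log( sum_{k in S_i} P(y = k | x_i) )."""
    N, d = X.shape
    K = S.shape[1]
    W = np.zeros((K, d))
    for _ in range(n_iter):
        P = softmax(X @ W.T)                       # (N, K) class probabilities
        mass = (P * S).sum(axis=1, keepdims=True)  # probability mass on the candidate set
        # gradient of -sum_i log(mass_i) with respect to the logits
        G = P - (P * S) / mass
        W -= lr * (G.T @ X) / N
    return W
```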
Self-Consistent Logistic Regression
Learns $P(y \mid x)$ such that the entropy of the predicted distribution is minimized.
Learning is driven by the predictions produced by the learner itself.
Advantage: can make use of unlabeled data to enhance learning.
Disadvantage: if the initial estimate is incorrect, the algorithm will go down the wrong path.
Problem Definition (Semi-Supervised Learning)
Semi-supervised learning (SSL) is a special case of learning with partial labels.
The data is $D = D_L \cup D_U$. For each instance in $D_L$, the candidate set contains exactly the true class label; for each instance in $D_U$, the candidate set is the set of all classes (the instance is unlabeled).
Usually $|D_U| \gg |D_L|$ because unlabeled data is easily available compared to labeled data, e.g. in information retrieval.
SSL approaches make use of $D_U$ in conjunction with $D_L$ to enhance learning.
SSL has been applied to many domains like text classification, image classification, and audio classification.
Taxonomy of Algorithms for Semi-Supervised Learning
Probabilistic Generative Approaches: Mixture Models + EM
Discriminative Approaches:
Semi-supervised support vector machines (S3VMs)
Self-Training
Co-Training
Mixture Models + EM for SSL
Nigam et al. proposed a probabilistic generative framework for document classification.
Assumptions: (1) documents are generated by a mixture model; (2) there is a one-to-one correspondence between mixture components and classes.
The probability of a document $d$ is given as:
$P(d \mid \theta) = \sum_{j} P(c_j \mid \theta)\, P(d \mid c_j, \theta)$
$P(d \mid c_j, \theta)$ was modeled using Naïve Bayes.
Mixture Models + EM for SSL
$\theta$ is estimated using MLE by maximizing the log likelihood of $D = D_L \cup D_U$, where the likelihood combines the labeled documents (with their known classes) and the unlabeled documents (summing over all classes).
EM is used to maximize the log likelihood.
Applications: text classification, remote sensing, face orientation discrimination, pose estimation, audio classification.
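A minimal sketch in the spirit of this approach: multinomial Naïve Bayes over word counts trained with EM, where labeled documents keep fixed one-hot class assignments and unlabeled documents receive soft assignments in the E-step. The initialization, smoothing choices, and matrix shapes are my own simplifications rather than the exact procedure of Nigam et al.

```python
import numpy as np

def ssl_naive_bayes_em(Xl, yl, Xu, K, n_iter=20, smooth=1.0):
    """Xl: (Nl, V) labeled word counts; yl: (Nl,) integer labels in {0..K-1};
    Xu: (Nu, V) unlabeled word counts. Returns class priors and per-class
    word distributions."""
    X = np.vstack([Xl, Xu])
    Rl = np.eye(K)[yl]                       # fixed one-hot responsibilities (labeled)
    Ru = np.full((Xu.shape[0], K), 1.0 / K)  # soft responsibilities (unlabeled)
    for _ in range(n_iter):
        R = np.vstack([Rl, Ru])
        # M-step: re-estimate priors and word probabilities (Laplace smoothing)
        prior = (R.sum(axis=0) + 1.0) / (R.shape[0] + K)
        word_counts = R.T @ X                # (K, V) expected word counts per class
        phi = (word_counts + smooth) / (word_counts.sum(axis=1, keepdims=True)
                                        + smooth * X.shape[1])
        # E-step: soft class posteriors for the unlabeled documents only
        log_post = np.log(prior) + Xu @ np.log(phi).T
        log_post -= log_post.max(axis=1, keepdims=True)
        Ru = np.exp(log_post)
        Ru /= Ru.sum(axis=1, keepdims=True)
    return prior, phi
```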
Mixture Models + EM for SSL
Problem: unlabeled data decreased performance.
Reason: incorrect modeling assumption. One mixture component per class is wrong when a class includes multiple sub-topics.
Solutions:
Decrease the effect of unlabeled data
Correct the modeling assumption: multiple mixture components per class
SVMs
Idea: do the following:
A. Maximize the margin
B. Classify points correctly
C. Keep data points outside the margin
SVM optimization problem:
$\min_{w, b, \xi} \; \frac{1}{2}\|w\|^2 + C \sum_i \xi_i \quad \text{s.t. } y_i(w^\top x_i + b) \ge 1 - \xi_i,\; \xi_i \ge 0$
SVMs
A. is achieved by minimizing $\|w\|^2$; B. and C. are achieved using the hinge loss $\max\big(0,\, 1 - y_i(w^\top x_i + b)\big)$.
C controls the trade-off between training error and generalization error.
(From the SSL tutorial by Xiaojin Zhu at ICML 2007)
S3VMs
Also known as Transductive SVMs.
(From the SSL tutorial by Xiaojin Zhu at ICML 2007)
S3VMs
Idea: do the following:
A. Maximize the margin
B. Correctly classify the labeled data
C. Keep data points (labeled and unlabeled) outside the margin
S3VM optimization problem:
$\min_{w, b} \; \frac{1}{2}\|w\|^2 + C_1 \sum_{i \in L} \max\big(0,\, 1 - y_i f(x_i)\big) + C_2 \sum_{j \in U} \max\big(0,\, 1 - |f(x_j)|\big)$, where $f(x) = w^\top x + b$
S3VMs
A. is achieved by minimizing $\|w\|^2$; B. is achieved using the hinge loss; C. is achieved for labeled data using the hinge loss and for unlabeled data using the hat loss $\max\big(0,\, 1 - |f(x)|\big)$.
(From the SSL tutorial by Xiaojin Zhu at ICML 2007)
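A small sketch contrasting the two penalties: the hinge loss used for labeled points and the hat (symmetric hinge) loss used for unlabeled points, following the forms in Zhu's tutorial.

```python
import numpy as np

def hinge_loss(f, y):
    """Labeled point: penalize if y * f(x) < 1 (wrong side or inside the margin)."""
    return np.maximum(0.0, 1.0 - y * f)

def hat_loss(f):
    """Unlabeled point: penalize if |f(x)| < 1, i.e. the point lies inside
    the margin regardless of which side it falls on."""
    return np.maximum(0.0, 1.0 - np.abs(f))

# Example: f values 2.0 and -1.0 lie outside the margin; 0.3 lies inside it
print(hinge_loss(np.array([2.0, 0.3, -1.0]), np.array([1, 1, 1])))
print(hat_loss(np.array([2.0, 0.3, -1.0])))
```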
S3VMs
Pros: same mathematical foundations as SVMs, and hence applicable to tasks where SVMs are applicable.
Cons: optimization is difficult; finding an exact solution is NP-hard.
Applications: text classification, bioinformatics, biological entity recognition, image retrieval, etc.
Self-Training
A learner uses its own predictions to teach itself.
Self-training algorithm (a minimal sketch follows below):
1. Train a classifier h using $D_L$
2. Use h to predict labels of instances in $D_U$
3. Add the most confidently labeled instances in $D_U$ to $D_L$
4. Repeat until a stopping criterion is met
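A minimal self-training loop, assuming scikit-learn is available and using logistic regression as the base classifier; the number of rounds and the number of instances added per round are illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def self_train(Xl, yl, Xu, n_rounds=10, n_add=10):
    """Iteratively add the most confident predictions on Xu to the labeled set."""
    Xl, yl, Xu = Xl.copy(), yl.copy(), Xu.copy()
    clf = LogisticRegression(max_iter=1000)
    for _ in range(n_rounds):
        clf.fit(Xl, yl)
        if len(Xu) == 0:
            break
        proba = clf.predict_proba(Xu)
        conf = proba.max(axis=1)
        # pick the n_add most confidently labeled unlabeled instances
        idx = np.argsort(-conf)[:n_add]
        Xl = np.vstack([Xl, Xu[idx]])
        yl = np.concatenate([yl, clf.classes_[proba[idx].argmax(axis=1)]])
        Xu = np.delete(Xu, idx, axis=0)
    return clf
```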
Self-Training
Pros: one of the simplest SSL algorithms; can be used with any classifier.
Cons: errors in initial predictions can build up and affect subsequent learning.
A possible solution: identify and remove mislabeled examples from the self-labeled training set during each iteration, e.g. the SETRED algorithm [Ming Li and Zhi-Hua Zhou].
Applications: object detection in images, word sense disambiguation, semantic role labeling, named entity recognition.
Co-Training
Idea: two classifiers h1 and h2 collaborate with each other during learning.
h1 and h2 use disjoint subsets F1 and F2 of the feature set.
Each feature subset is assumed to be:
Sufficient for the classification task at hand
Conditionally independent of the other given the class
Co-Training
Example: web page classification.
F1 can be the text present on the web page.
F2 can be the anchor text on pages that link to this page.
The two feature sets are sufficient for classification and conditionally independent given the class of the web page, because each is generated by a different person who knows the class.
Co-Training
Co-Training algorithm (a minimal sketch follows below):
1. Use the F1 view of $D_L$ to train h1, and the F2 view of $D_L$ to train h2
2. Use h1 to predict labels of instances in $D_U$
3. Use h2 to predict labels of instances in $D_U$
4. Add h1's most confidently labeled instances to h2's training data
5. Add h2's most confidently labeled instances to h1's training data
6. Repeat until a stopping criterion is met
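A minimal co-training sketch assuming scikit-learn: the two views are given as disjoint lists of feature columns, each classifier labels unlabeled data for the other, and the most confident predictions are moved into the other classifier's labeled pool. The base learner, round count, and batch size are illustrative choices.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X, l_idx, y_l, u_idx, view1, view2, n_rounds=10, n_add=5):
    """X: (N, d) data; l_idx/u_idx: row indices of labeled/unlabeled data;
    y_l: labels for l_idx; view1/view2: disjoint lists of feature columns."""
    labeled = {1: list(l_idx), 2: list(l_idx)}      # per-classifier labeled pools
    labels = {i: y for i, y in zip(l_idx, y_l)}
    unlabeled = list(u_idx)
    h1, h2 = GaussianNB(), GaussianNB()
    for _ in range(n_rounds):
        if not unlabeled:
            break
        h1.fit(X[np.ix_(labeled[1], view1)], [labels[i] for i in labeled[1]])
        h2.fit(X[np.ix_(labeled[2], view2)], [labels[i] for i in labeled[2]])
        for h, view, other in [(h1, view1, 2), (h2, view2, 1)]:
            if not unlabeled:
                break
            proba = h.predict_proba(X[np.ix_(unlabeled, view)])
            best = np.argsort(-proba.max(axis=1))[:n_add]
            for b in sorted(best, reverse=True):
                i = unlabeled.pop(b)
                labels[i] = h.classes_[proba[b].argmax()]
                labeled[other].append(i)            # teach the *other* classifier
    return h1, h2
```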
Co-Training
Pros: effective when the required feature split exists; can be used with any classifier.
Cons:
What if a feature split does not exist? Answer: (1) use a random feature split, or (2) use the full feature set for both classifiers.
How to select confidently labeled instances? Answer: use multiple classifiers instead of two and do majority voting, e.g. tri-training uses three classifiers [Zhou and Li], and democratic co-learning uses multiple classifiers [Zhou and Goldman].
Applications: web page classification, named entity recognition, semantic role labeling.
Problem Definition (Multiple-Instance Learning)
There is a set of objects.
Each object has multiple variants, called instances.
The set of instances belonging to an object is called a bag.
A bag is positive if at least one of its instances is positive, and negative otherwise.
Goal: learn a classifier that predicts the label of a new bag.
Why is MIL an ambiguous-label problem?
To learn a bag classifier, we learn an instance-level classifier that maps an instance to a class.
The training data, however, does not contain labels at the instance level.
Taxonomy of Algorithms for Multiple-Instance Learning
Probabilistic Generative Approaches: Diverse Density (DD), EM Diverse Density (EM-DD)
Probabilistic Discriminative Approaches: Multiple-instance logistic regression
Non-probabilistic Discriminative Approaches: Axis-Parallel Rectangles (APR), Multiple-instance SVM (MISVM)
Diverse Density Algorithm (DD)
[Figure: instances from four bags (numbered 1-4) plotted in a two-dimensional feature space with axes x1 and x2; two candidate points, Point A and Point B, are marked.]
Diverse Density Algorithm (DD)
Diverse density is a probabilistic quantity. Under the commonly used noisy-or model, the diverse density of a candidate point $t$ is
$DD(t) \propto \prod_i \Pr(t \mid B_i^+) \prod_i \Pr(t \mid B_i^-)$, with $\Pr(t \mid B_i^+) = 1 - \prod_j \big(1 - \exp(-\|B_{ij}^+ - t\|^2)\big)$ and $\Pr(t \mid B_i^-) = \prod_j \big(1 - \exp(-\|B_{ij}^- - t\|^2)\big)$.
Using gradient ascent, the point of maximum diverse density can be located.
Cons: computationally expensive.
Applications: person identification, stock selection, natural scene classification.
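A minimal sketch of the noisy-or diverse density computation at a candidate point t, following the standard formulation above; a log-space version and the gradient-ascent search over t are omitted.

```python
import numpy as np

def diverse_density(t, positive_bags, negative_bags):
    """t: (d,) candidate target point; each bag is an (n_i, d) array of instances."""
    dd = 1.0
    for bag in positive_bags:
        # noisy-or: the bag supports t if at least one of its instances is close to t
        p_inst = np.exp(-np.sum((bag - t) ** 2, axis=1))
        dd *= 1.0 - np.prod(1.0 - p_inst)
    for bag in negative_bags:
        # negative bags must have no instance close to t
        p_inst = np.exp(-np.sum((bag - t) ** 2, axis=1))
        dd *= np.prod(1.0 - p_inst)
    return dd
```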
EM Diverse Density (EM-DD)
Views the knowledge of which instance is responsible for the bag label as a missing-data problem.
Uses EM to maximize diverse density.
EM-DD algorithm:
E-Step: Given the current estimate of the target location, find in each bag the most likely instance responsible for the bag's label.
M-Step: Find a new target location that maximizes diverse density over the selected instances.
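A rough sketch of one EM-DD-style iteration under the same noisy-or instance model: the E-step keeps only the single most likely instance per positive bag for the current target estimate, and the M-step picks a new target from a set of candidate points (the original algorithm uses a gradient-based optimizer instead of this simple candidate search).

```python
import numpy as np

def instance_prob(t, inst):
    """Gaussian-like closeness of instance(s) to the candidate target t."""
    return np.exp(-np.sum((inst - t) ** 2, axis=-1))

def emdd_iteration(t, positive_bags, negative_bags, candidates):
    """One illustrative EM-DD iteration; candidates is a list of (d,) points."""
    # E-step: for each positive bag, keep the instance most likely to be
    # responsible for the bag label under the current target t
    selected = [bag[np.argmax(instance_prob(t, bag))] for bag in positive_bags]

    # M-step: choose the candidate target maximizing the single-instance
    # diverse density on the selected positives and all negative instances
    def single_instance_dd(c):
        dd = np.prod([instance_prob(c, s) for s in selected])
        for bag in negative_bags:
            dd *= np.prod(1.0 - instance_prob(c, bag))
        return dd

    scores = [single_instance_dd(c) for c in candidates]
    return candidates[int(np.argmax(scores))]
```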
EM Diverse Density (EM-DD)
Pros:
Computationally less expensive than DD
Avoids local maxima
Shown to outperform DD and other MIL algorithms on the musk data and on artificial real-valued data
Multiple-Instance Logistic Regression
Learn a logistic regression classifier for instances.
Use the softmax function to combine the instance output probabilities into bag probabilities.
MIL property satisfied: a bag has a high probability of being positive if at least one of its instances has a high probability of being positive.
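A small sketch of combining instance-level logistic regression outputs into a bag-level probability with a softmax (soft-max-of-probabilities) combination; the sharpness parameter `alpha` is an illustrative choice.

```python
import numpy as np

def instance_probs(X, w, b):
    """Instance-level P(positive | x) from a logistic regression model."""
    return 1.0 / (1.0 + np.exp(-(X @ w + b)))

def bag_prob_softmax(p_inst, alpha=5.0):
    """Soft approximation of the max: close to max(p_inst) for large alpha,
    so the bag looks positive when any one instance does (the MIL property)."""
    wts = np.exp(alpha * p_inst)
    return np.sum(wts * p_inst) / np.sum(wts)
```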
Axis-Parallel Rectangles (APR)
Proposed by Dietterich et al. for drug activity prediction.
Three algorithms:
GFS elim-count APR
GFS elim-kde APR
Iterated discrimination APR
Iterated Discrimination APR
An inside-out algorithm: start with a single positive instance and grow the APR to include additional positive instances.
Three basic steps:
Grow: construct the smallest APR that covers at least one instance from each positive bag
Discriminate: select the most discriminating features using the APR
Expand: expand the final APR to improve generalization
Works in two phases.
Iterated Discrimination APR
Discriminate: greedily select the feature with the highest discrimination power. Discrimination power depends on:
How many negative instances are outside
How far outside they are
Expand: expand the final APR to improve generalization using kernel density estimation.
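A minimal sketch of the APR representation itself: a rectangle stored as per-feature lower/upper bounds, a containment test usable for classification, and a naive "grow" that bounds one chosen instance per positive bag. The greedy back-fitting of the actual grow step and the discriminate/expand procedures are not reproduced here.

```python
import numpy as np

def apr_from_instances(instances):
    """Smallest axis-parallel rectangle covering the given instances."""
    pts = np.asarray(instances)
    return pts.min(axis=0), pts.max(axis=0)

def apr_contains(apr, x):
    """Classify an instance as positive iff it lies inside the rectangle."""
    lo, hi = apr
    return bool(np.all(x >= lo) and np.all(x <= hi))

def naive_grow(positive_bags, seed_instance):
    """Bound the seed plus the instance nearest the seed from each positive bag
    (a crude stand-in for the greedy grow step)."""
    chosen = [seed_instance]
    for bag in positive_bags:
        d = np.sum((bag - seed_instance) ** 2, axis=1)
        chosen.append(bag[np.argmin(d)])
    return apr_from_instances(chosen)
```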
Multiple-Instance SVM (MISVM)
Based on an idea similar to S3VMs.
S3VMs: no constraint on how the unlabeled data is disambiguated.
MISVM: maintains the MIL constraint:
All instances from a negative bag are negative
At least one instance from each positive bag is positive
Find the maximum-margin classifier with the MIL constraint satisfied (a heuristic sketch follows below).
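A rough sketch of the alternating heuristic commonly used for MISVM (in the spirit of Andrews et al.): initialize instance labels from bag labels, train a standard SVM, then relabel instances in positive bags from the SVM outputs while forcing at least one positive instance per positive bag, and repeat until the labels stop changing. This assumes scikit-learn's SVC and is a simplification, not the exact published algorithm.

```python
import numpy as np
from sklearn.svm import SVC

def misvm(positive_bags, negative_bags, n_iter=10, C=1.0):
    """Each bag is an (n_i, d) array of instances."""
    X_neg = np.vstack(negative_bags)
    X_pos = np.vstack(positive_bags)
    sizes = [len(b) for b in positive_bags]
    y_pos = np.ones(len(X_pos))              # start: all positive-bag instances positive
    for _ in range(n_iter):
        X = np.vstack([X_neg, X_pos])
        y = np.concatenate([-np.ones(len(X_neg)), y_pos])
        clf = SVC(kernel="linear", C=C).fit(X, y)
        f = clf.decision_function(X_pos)
        new_y = np.where(f >= 0, 1.0, -1.0)
        # MIL constraint: each positive bag must keep at least one positive instance
        start = 0
        for n in sizes:
            if not np.any(new_y[start:start + n] == 1.0):
                new_y[start + np.argmax(f[start:start + n])] = 1.0
            start += n
        if np.array_equal(new_y, y_pos):
            break
        y_pos = new_y
    return clf
```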
Multiple-Instance SVM (MISVM)
Two notions of margin: instance level and bag level.
Figure from [Andrews et al., NIPS 2002].
Conclusion
Motivated problem of learning from ambiguously labeled data
Studied space of such learning problems
Presented a taxonomy of proposed algorithms for each problem