Semi-Supervised Learning
Adaptive Computation and Machine Learning
Thomas Dietterich, Editor
Christopher Bishop, David Heckerman, Michael Jordan, and Michael Kearns, Associate Editors
Bioinformatics: The Machine Learning Approach, Pierre Baldi and Søren Brunak
Reinforcement Learning: An Introduction, Richard S. Sutton and Andrew G. Barto
Graphical Models for Machine Learning and Digital Communication, Brendan J.
Frey
Learning in Graphical Models, Michael I. Jordan
Causation, Prediction, and Search, second edition, Peter Spirtes, Clark Glymour,
and Richard Scheines
Principles of Data Mining, David Hand, Heikki Mannila, and Padhraic Smyth
Bioinformatics: The Machine Learning Approach, second edition, Pierre Baldi and
Søren Brunak
Learning Kernel Classifiers: Theory and Algorithms, Ralf Herbrich
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and
Beyond, Bernhard Scholkopf and Alexander J. Smola
Semi-Supervised Learning
Olivier Chapelle
Bernhard Scholkopf
Alexander Zien
The MIT Press
Cambridge, Massachusetts
London, England
© 2005 Massachusetts Institute of Technology
All rights reserved. No part of this book may be reproduced in any form by any electronic
or mechanical means (including photocopying, recording, or information storage and retrieval)
without permission in writing from the publisher.
Typeset by the authors using LaTeX 2ε
Library of Congress Control No. xxx
Printed and bound in the United States of America
Library of Congress Cataloging-in-Publication Data
Contents
1 A Tutorial Introduction 1
1.1 Supervised, Unsupervised, and Semi-Supervised Learning . . . . . . 1
1.2 When Can Semi-Supervised Learning Work? . . . . . . . . . . . . . 3
1.3 Classes of Algorithms and Organization of This Book . . . . . . . . 6
I Generative Models 11
2 Probabilistic Graphical Models for Semi-Supervised Learning 13
2.1 The Semi-Supervised Learning Problem . . . . . . . . . . . . . . . . 13
2.2 The Sampling Paradigm (Generative Models) . . . . . . . . . . . . . 13
2.3 The Diagnostic Paradigm. Input-Dependent Regularization . . . . . 13
2.4 Examples: Generative Models . . . . . . . . . . . . . . . . . . . . . . 13
2.5 Examples: Input-Dependent Regularization . . . . . . . . . . . . . . 13
2.6 A General Scheme for Input-Dependent Regularization. Relations to
Co-Training . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
3 Probabilistic Semi-Supervised Clustering with Constraints 15
3.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.2 Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.4 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.5 Constraint acquisition . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.6 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
4 Semi-Supervised Text Classification Using EM 17
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.2 The Basic Setting . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.3 Modeling Sub-topic Structure . . . . . . . . . . . . . . . . . . . . . . 18
4.4 Modeling Super-topic Structure . . . . . . . . . . . . . . . . . . . . . 18
4.5 Finding More Probable Models . . . . . . . . . . . . . . . . . . . . . 18
4.6 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
4.7 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 18
5 Risks of Semi-Supervised Learning:
How Unlabeled Data Can Degrade Performance of Generative
Classifiers 19
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.2 Do unlabeled data improve or degrade classification performance? . 20
5.3 The Boy-Girl classifier . . . . . . . . . . . . . . . . . . . . . . . . . . 20
5.4 Insights from asymptotic analysis . . . . . . . . . . . . . . . . . . . . 20
5.5 The value of labeled and unlabeled data . . . . . . . . . . . . . . . . 20
5.6 Understanding the risks of semi-supervised learning . . . . . . . . . . 20
5.7 Model search (and some philosophy...) in semi-supervised learning of
generative classifiers . . . . . . . . . . . . . . . . . . . . . . . . . . . 20
II Low Density Separation 21
6 Transductive Support Vector Machines 23
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.2 TSVM Optimization Problems . . . . . . . . . . . . . . . . . . . . . 24
6.3 Why use margin on the test set? . . . . . . . . . . . . . . . . . . . . 24
6.4 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 24
6.6 Problems with the TSVM objective . . . . . . . . . . . . . . . . . . . 24
6.7 Connection to related approaches . . . . . . . . . . . . . . . . . . . . 24
7 Semi-Supervised Learning using Semi-Definite Programming 25
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
7.2 Relaxing SVM-transduction . . . . . . . . . . . . . . . . . . . . . . . 26
7.3 An efficient approximation . . . . . . . . . . . . . . . . . . . . . . . . 26
7.4 General semi-supervised learning settings . . . . . . . . . . . . . . . 26
7.5 Empirical results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
7.6 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 26
8 Gaussian Processes and the Null-Category Noise Model 27
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
8.2 The Noise Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
8.3 The Process Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
8.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
8.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
9 Entropy-Based Methods 29
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
9.2 Derivation of the Criterion . . . . . . . . . . . . . . . . . . . . . . . . 30
9.3 Optimization Algorithms . . . . . . . . . . . . . . . . . . . . . . . . 30
9.4 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
9.5 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
9.6 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
III Graph-Based Methods 31
10 Label Propagation and Quadratic Criterion 33
10.1 Label propagation on a similarity graph . . . . . . . . . . . . . . . . 34
10.2 Quadratic cost criterion . . . . . . . . . . . . . . . . . . . . . . . . . 34
10.3 From transduction to induction . . . . . . . . . . . . . . . . . . . . . 34
10.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
11 The Manifold Basis of Semi-supervised Learning 35
11.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
11.2 Theoretical framework: RKHS/Manifolds, manifold representer the-
orems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
11.3 Algorithmic framework: Laplacian SVM/ Laplacian RLS . . . . . . . 35
11.4 Data-dependent kernels and linear methods . . . . . . . . . . . . . . 35
11.5 Future directions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
12 Regularization on Directed Graphs 37
12.1 Preliminaries on Directed Graphs and Markov Chains . . . . . . . . 37
12.2 Regularization Framework A: hubs and authorities . . . . . . . . . . 37
12.3 Regularization Framework B: stationary distribution . . . . . . . . . 37
12.4 Regularization Framework C: water-pipe model . . . . . . . . . . . . 37
12.5 Discrete Differential Geometry on Directed graphs . . . . . . . . . . 37
12.6 Experiments on Web Classification and Protein Prediction . . . . . . 37
13 Boosting on Manifolds 39
13.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
13.2 Marginal AdaBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
13.3 RegBoost . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
13.4 The graph Laplacian regularizer . . . . . . . . . . . . . . . . . . . . . 40
13.5 Other approaches . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
IV Change of Representation 41
14 Graph Kernels by Spectral Transforms 43
14.1 Graph Laplacian and Spectral Clustering . . . . . . . . . . . . . . . 43
14.2 Kernels by Spectral Transforms . . . . . . . . . . . . . . . . . . . . . 44
14.3 Kernel Alignment using QCQP . . . . . . . . . . . . . . . . . . . . . 44
14.4 Semi-Supervised Kernels with Order Constraints . . . . . . . . . . . 44
14.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 44
15 Spectral Methods for Dimensionality Reduction 45
15.1 introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
15.2 algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
15.3 discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
15.4 appendix (if necessary) . . . . . . . . . . . . . . . . . . . . . . . . . . 46
16 Modifying Distances 47
16.1 A simple geometric intuition . . . . . . . . . . . . . . . . . . . . . . . 48
16.2 Theoretical basis for (re)defining distances . . . . . . . . . . . . . . . 48
16.3 Defining density-sensitive distances . . . . . . . . . . . . . . . . . . . 48
16.4 Using modified distances for semi-supervised learning . . . . . . . . . 48
17 Data-Dependent Regularization 49
V Semi-Supervised Learning in Practice 51
18 Large-Scale Algorithms 53
18.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
18.2 Cost approximation . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
18.3 Subset selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
18.4 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
19 Semi-Supervised Protein Classification using Cluster Kernels 55
19.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
19.2 Representations and kernels for protein Sequences . . . . . . . . . . 55
19.3 Cluster Kernels for protein Sequences . . . . . . . . . . . . . . . . . 55
19.4 The neighborhood mismatch kernel . . . . . . . . . . . . . . . . . . . 56
19.5 The bagged mismatch kernel . . . . . . . . . . . . . . . . . . . . . . 56
19.6 The profile kernel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
19.7 Experiments . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
19.8 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 56
20 Prediction of Protein Function from Networks 57
20.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
20.2 Graph-based Semi-Supervised Learning . . . . . . . . . . . . . . . . 58
20.3 Combining Multiple Graphs . . . . . . . . . . . . . . . . . . . . . . . 58
20.4 Experiments on Function Prediction of Proteins . . . . . . . . . . . . 58
20.5 Conclusion and Outlook . . . . . . . . . . . . . . . . . . . . . . . . . 58
VI Perspectives 59
21 An Augmented PAC Model for Semi-Supervised Learning 61
21.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
21.2 A Formal Framework . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
21.3 Sample Complexity Results . . . . . . . . . . . . . . . . . . . . . . . 67
21.4 Algorithmic Results . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
21.5 Related Models and Discussion . . . . . . . . . . . . . . . . . . . . . 80
22 Metric Based Approaches for Semi-Supervised Regression 85
22.1 Metrics for Supervised Learning . . . . . . . . . . . . . . . . . . . . . 85
22.2 Model Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
22.3 Regularization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
22.4 Connections to Active Learning . . . . . . . . . . . . . . . . . . . . . 85
22.5 Classification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 85
References 86
Index 90
Notation 92
Notation and Symbols 93
1 A Tutorial Introduction
1.1 Supervised, Unsupervised, and Semi-Supervised Learning
In order to understand the nature of semi-supervised learning, it will be useful to
first take a look at supervised and unsupervised learning.
1.1.1 Supervised and Unsupervised Learning
Traditionally, there have been two fundamentally different types of tasks in machine
learning.
The first one is unsupervised learning. Let X = (x1, . . . , xn) be a set of n examples (or points), where xi ∈ X for all i ∈ [n] := {1, . . . , n}. It is often convenient to define the (n × d)-matrix X = (x_i^T)_{i∈[n]}^T that contains the data points as its rows. The
goal of unsupervised learning is to find interesting structure in the data X. It
has been argued that the problem of unsupervised learning is fundamentally that
of estimating a density which is likely to have generated X. However, there are
also weaker forms of unsupervised learning, such as quantile estimation, clustering,
outlier detection and dimensionality reduction.
The second task is supervised learning. The goal is to learn a mapping from x to y, given a training set made of pairs (xi, yi). Here, the yi ∈ Y are called the labels or targets of the examples xi. If the labels are real (regression) or integer (classification) numbers, y = (y_i)_{i∈[n]}^T denotes the column vector of labels. The task
is well defined, since a mapping can be evaluated through its predictive performance
on test examples. When Y = R or Y = Rd, the task is called regression. Most of this
book will focus on classification, i.e. when y takes values in a finite set. There are two
families of algorithms for supervised learning: generative algorithms try to model the class-conditional density p(x|y) by some unsupervised learning procedure.1 A predictive density can then be inferred by applying Bayes' rule:

p(y|x) = \frac{p(x|y)\, p(y)}{\int_{\mathcal{Y}} p(x|y)\, p(y)\, dy} .    (1.1)

1. For simplicity, we are assuming that all distributions have densities, and thus we restrict ourselves to dealing with densities.
In fact, p(x|y)p(y) = p(x, y) is the joint density of the data, from which pairs
(xi, yi) could be generated. Discriminative algorithms do not try to estimate how the xi have been generated, but instead concentrate on estimating p(y|x). Some
discriminative methods even only try to model whether p(y|x) is greater than or
less than 0.5; an example of this is the Support Vector Machine. It has been argued
that discriminative models are more directly aligned with the goal of supervised
learning and therefore tend to be more efficient in practice. The two frameworks
are discussed in more detail in Sections 2.2 and 2.3.
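To make the generative approach concrete, the following minimal sketch (in Python; the Gaussian class-conditional model and all names are illustrative assumptions, not a prescribed implementation) fits one density p(x|y) per class on labeled data and applies Bayes' rule (1.1), with a sum over classes playing the role of the integral:

import numpy as np
from scipy.stats import multivariate_normal

def fit_generative_classifier(X, y):
    """Fit one Gaussian per class and the class priors p(y) from labeled data."""
    priors, densities = {}, {}
    for c in np.unique(y):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        densities[c] = multivariate_normal(
            mean=Xc.mean(axis=0),
            cov=np.cov(Xc, rowvar=False) + 1e-6 * np.eye(X.shape[1]))
    return priors, densities

def predict_proba(x, priors, densities):
    """Bayes' rule (1.1): p(y|x) is proportional to p(x|y) p(y)."""
    joint = {c: densities[c].pdf(x) * priors[c] for c in priors}
    z = sum(joint.values())  # normalizer; the sum over classes replaces the integral over Y
    return {c: v / z for c, v in joint.items()}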
1.1.2 Semi-Supervised Learning
Semi-supervised learning (SSL) is half way between supervised and unsupervised
learning. In addition to unlabeled data, the algorithm is provided with some super-
vision information – but not necessarily for all examples. Often, this information
will be the targets associated with some of the examples. In this case, the data set X = (xi)i∈[n] can be divided into two parts: the points Xl := (x1, . . . , xl), for which
labels Yl := (y1, . . . , yl) are provided, and the points Xu := (xl+1, . . . , xl+u), the
labels of which are not known. This is the “standard” semi-supervised learning as
investigated in this book; most chapters will refer to this setting.
Other forms of partial supervision are possible. For example, there may be
constraints such as “these points have (or do not have) the same target” (cf. Abu-
Mostafa [1995]). This more general setting is considered in Chapter 3. The different setting corresponds to a different view on semi-supervised learning: In Chapter
3, SSL is seen as unsupervised learning guided by constraints. In contrast, most
other approaches see SSL as supervised learning with additional information on the
distribution of the examples x. The latter interpretation seems to be more in line
with most applications, where the goal is the same as in supervised learning: to
predict a target value for a given xi. However, this view does not readily apply if the number and nature of the classes are not known in advance but have to be inferred from the data. In contrast, SSL with constraints may still remain applicable in such situations.
A problem related to SSL has been introduced by Vapnik already several decades
ago: the so-called transductive learning. In this setting, one is given a (labeled) training set and an (unlabeled) test set. The idea of transduction is to perform predictions only for the test points. This is in contrast to inductive learning, where the goal is to output a prediction function which is defined on the entire space X.
Many methods described in this book will be transductive; in particular, this is
rather natural for inference based on graph representations of the data. This issue
will be addressed again below.
1.2 When Can Semi-Supervised Learning Work?
A natural question arises: is semi-supervised learning meaningful? More precisely:
in comparison with a supervised algorithm that uses only labeled data, can one
hope to have a more accurate prediction by taking into account the unlabeled
points? As you may have guessed from the size of the book in your hands, in
principle the answer is “yes.” However, there is an important prerequisite: that the
distribution of examples, which the unlabeled data will help elucidate, be relevant
for the classification problem.
In a more mathematical formulation, one could say that the knowledge on p(x)
that one gains through the unlabeled data has to carry information that is useful
in the inference of p(y|x). If this is not the case, semi-supervised learning will not
yield an improvement over supervised learning. It might even happen that using
the unlabeled data degrades the prediction accuracy by misguiding the inference;
this effect is investigated in detail in Chapter 5.
One should thus not be too surprised that for semi-supervised learning to work,
certain assumptions will have to hold. In this context, note that plain supervised
learning also has to rely on assumptions: One of the most popular such assumptions can be formulated as follows.2
Smoothness Assumption of Supervised Learning. If two points x1, x2 are
close, then so should be the corresponding outputs y1, y2.
Clearly, without such assumptions, it would never be possible to generalize from a
finite training set to a set of possibly infinitely many unseen test cases.
1.2.1 The Semi-Supervised Smoothness Assumption
We now propose a generalization of the Smoothness Assumption that is useful
for semi-supervised learning; we thus call it the “Semi-Supervised Smoothness
Assumption.” While in the supervised case according to our prior beliefs the output
varies smoothly with the distance, we now also take into account the density of the
inputs. The assumption is that the label function is smoother in high density regions than in low density regions:
Semi-Supervised Smoothness Assumption. If two points x1, x2 in a high-
density region are close, then so should be the corresponding outputs y1, y2.
Note that by transitivity, this assumption implies that if two points are linked by
a path of high density (e.g., if they belong to the same cluster), then their outputs
are likely to be close. If, on the other hand, they are separated by a low density
2. Strictly speaking, this assumption only refers to continuity rather than smoothness; however, the term smoothness is commonly used, possibly because in regression estimation practice, y is often modelled as a smooth function of x.
region, then their outputs need not be close.
Note that the Semi-Supervised Smoothness Assumption applies to both regres-
sion and classification. In the next section, we will show that in the case of clas-
sification, it reduces to assumptions commonly used in SSL. At present, it is less clear how useful the assumption is for regression problems. As an alternative, Chapter 22
explores how semi-supervised regression can take advantage of assumptions about
the output space, which is richer in regression than it is in classification.
1.2.2 The Cluster Assumption
Suppose we knew that the points of each class tended to form a cluster. Then the unlabeled data could aid in finding the boundary of each cluster more accurately:
one could run a clustering algorithm and use the labeled points to assign a class to
each cluster. That is in fact one of the earliest forms of semi-supervised learning (see
Chapter 2). The underlying, now classical, assumption may be stated as follows:
Cluster Assumption. If points are in the same cluster, they are likely to be of
the same class.
This assumption may be considered reasonable on the basis of the sheer existence
of classes: if there is a continuum of objects, it is unlikely that they were ever
distinguished into different classes.
Note that the Cluster Assumption does not imply that each class forms a single,
compact cluster: it only means that, usually, we do not observe objects of two
distinct classes in the same cluster.
The Cluster Assumption can easily be seen to be a special case of the Semi-Supervised Smoothness Assumption proposed above, considering that clusters are frequently defined as sets of points that can be connected by short curves which traverse only high density regions.
The cluster assumption can be formulated in an equivalent way:
Low Density Separation. The decision boundary should lie in a low density region.
The equivalence is easy to see: A decision boundary in a high density region would
cut a cluster into two different classes; many objects of different classes in the same
cluster would require the decision boundary to cut the cluster, i.e. to go through a
high density region.
Although the two formulations are conceptually equivalent, they can inspire
different algorithms, as we will argue in Section 1.3. The Low Density version
also gives additional intuition why the assumption is sensible in many real world
problems. Consider digit recognition, for instance, and suppose that one wants to
learn how to distinguish a hand-written digit “0” against digit “1.” The decision
boundary will be between a 0 and a 1, most likely a digit looking like a very
elongated zero. But the probability that someone wrote this “weird” digit is very
small.
1.2.3 The Manifold Assumption
A different but related assumption that forms the basis of several semi-supervised
learning methods is the manifold assumption:
Manifold Assumption. The (high-dimensional) data lie (roughly) on a low-dimensional manifold.
How can this be useful? A well-known problem of many statistical methods
and learning algorithms is the so-called “Curse of Dimensionality.” It is related to the fact that volume grows exponentially with the number of dimensions, and an
exponentially growing number of examples is required for statistical tasks such as
the reliable estimation of densities. This is a problem that directly affects generative
approaches that are based on density estimates in input space. A related problem
of high dimensions, which may be more severe for discriminative methods, is that
pairwise distances tend to become more similar, and thus less expressive.
If the data happen to lie on a low-dimensional manifold, however, then the
learning algorithm can essentially operate in a space of corresponding dimension,
thus avoiding the curse of dimensionality.
As above, one can argue that algorithms working with manifolds may be seen
as approximately implementing the semi-supervised smoothness assumption: such
algorithms use the metric of the manifold for computing geodesic distances. If we
view the manifold as an approximation of the high-density regions, then it becomes
clear that in this case, the semi-supervised smoothness assumption reduces to the
standard smoothness assumption of supervised learning, applied on the manifold.
Note that if the manifold is embedded into the high-dimensional input space in a
curved fashion (i.e., it is not just a subspace), geodesic distances differ from those in
the input space. By ensuring more accurate density estimates and more appropriate
distances, the manifold assumption may be useful for classification as well as for
regression.
1.2.4 Transduction
As mentioned before, some algorithms naturally operate in a transductive setting.
According to the philosophy put forward by Vapnik, high-dimensional estimation
problems should attempt to follow the following principle:
Vapnik’s Principle. When trying to solve some problem, one should not solve a
more difficult problem as an intermediate step.
Consider as an example supervised learning, where predictions of labels y corre-
sponding to some objects x are asked for. Generative models estimate the density
of x as an intermediate step, while discriminative methods directly estimate the labels.
In a similar way, if label predictions are only required for a given test set,
transduction can be argued to be more direct than the corresponding induction:
while an inductive method infers a function f : X → Y on the entire space X, and
afterwards returns the evaluations f(xi) at the test points, transduction consists
of directly estimating the finite set of test labels, i.e. a function f : Xu → Y only
defined on the test set. Note that transduction is different from SSL as we defined
it above.
Now suppose we are given a transductive algorithm which produces a solution
superior to an inductive algorithm trained on the same labeled data (but discarding
the unlabeled data). Then the performance difference might be due to one of the
following two points (or a combination thereof):
1. transduction follows Vapnik’s principle more closely than induction does, or
2. the transductive algorithm takes advantage of the unlabeled data in a way similar
to semi-supervised learning algorithms.
There is ample evidence for improvements being due to the second of these points.
We are presently not aware of empirical results that selectively support the first
point.
In fact, one could argue that prediction on an arbitrary point x is no less difficult
than prediction on all points. Any algorithm performing transduction can also be
seen as one performing induction: instead of returning the label function, it directly
implements it. The function can be evaluated at any point by adding it to the
original test set and calling the transduction algorithm.
1.3 Classes of Algorithms and Organization of This Book
Although many methods were not explicitly derived from one of the above assump-
tions, most algorithms can be seen to correspond to or implement one or more
assumptions. We try to organize the semi-supervised learning methods presented
in this book into four classes that roughly correspond to the underlying assump-
tion. Although the classification is not always unique, we hope that this organiza-
tion makes the book and its contents more accessible to the reader, by providing a
guiding scheme.
For the same reason, this book is organized in “parts.” There is one part for each
class of SSL algorithms and an extra part focussing on generative approaches. Two
further parts are devoted to applications and perspectives of SSL. In the following
we briefly introduce the ideas of each book part.
1.3.1 Generative Models
Part I presents history and state of the art of SSL with generative models. Chapter
2 starts with a thorough review of the field.
Inference using a generative model involves the estimation of the conditional
density p(x|y). In this setting, any additional information on p(x) is useful. As
a simple example, assume that p(x|y) is Gaussian. Then one can use the EM
algorithm to find the parameters of the Gaussian corresponding to each class.
The only difference to the standard EM algorithm as used for clustering is that
the “hidden variable” associated with any labeled example is actually not hidden,
but it is known and equals its class label. This procedure implements the Cluster Assumption
(cf. Section 2.2), since a given cluster belongs to only one class.
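A minimal sketch of this variant of EM, assuming one Gaussian per class and integer class labels 0, ..., K-1 (all names and the initialization are illustrative): the responsibilities of labeled points are clamped to their known class, while those of unlabeled points are re-estimated in each E-step.

import numpy as np
from scipy.stats import multivariate_normal

def semi_supervised_em(X_l, y_l, X_u, n_classes, n_iter=50):
    """EM for one Gaussian per class; labeled points keep their known (one-hot) responsibilities."""
    X = np.vstack([X_l, X_u])
    n, d = X.shape
    R = np.full((n, n_classes), 1.0 / n_classes)   # responsibilities
    R[:len(X_l)] = np.eye(n_classes)[y_l]          # clamp labeled points to their class
    for _ in range(n_iter):
        # M-step: weighted mixing proportions, means, and covariances
        pi = R.sum(axis=0) / n
        mus = [(R[:, c, None] * X).sum(axis=0) / R[:, c].sum() for c in range(n_classes)]
        covs = [np.cov(X.T, aweights=R[:, c]) + 1e-6 * np.eye(d) for c in range(n_classes)]
        # E-step: update responsibilities of the unlabeled points only
        lik = np.column_stack([pi[c] * multivariate_normal(mus[c], covs[c]).pdf(X)
                               for c in range(n_classes)])
        R[len(X_l):] = lik[len(X_l):] / lik[len(X_l):].sum(axis=1, keepdims=True)
    return pi, mus, covs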
This small example already highlights different interpretations of semi-supervised
learning with a generative model:
It can be seen as classification with additional information on the marginal
density.
It can be seen as clustering with additional information. In the standard setting,
this information would be the labels of a subset of points, but it could also come
in the more general form of constraints (see Chapter 3).
A strength of the generative approach is that knowledge on the structure of the
problem or the data can naturally be incorporated by modeling it. In Chapter 4, this
is demonstrated for the application of the EM algorithm to text data. It is observed
that, in some circumstances, unlabeled data can decrease prediction accuracy. This
effect is investigated in depth in Chapter 5.
In statistical learning, before performing inference, one chooses a class of func-
tions, or a prior over functions. One has to choose it according to what is known
in advance about the problem. In the semi-supervised learning context, if one has
some ideas about what the structure of the data tells about the target function, the
choice of this prior can be made more precise after seeing the unlabeled data: one could typically put a higher prior probability on functions that satisfy the cluster
assumption. In Chapter 8, an example is given of such a data-dependent prior.
1.3.2 Low Density Separation
Part II of this book aims at describing algorithms which try to directly implement
the Low Density Separation Assumption by pushing the decision boundary away
from the unlabeled points.
The most common approach to achieve this goal is to use a maximum margin
algorithm such as Support Vector Machines. The method of maximizing the margin
also for unlabeled points is called the Transductive SVM (TSVM). However, the corresponding problem is nonconvex and thus difficult to optimize.
One optimization algorithm for the TSVM is presented in Chapter 6. Starting
from the SVM solution as trained on the labeled data only, the unlabeled points are
labeled by SVM predictions, and the SVM is retrained on all points. This is iterated
while the weight of the unlabeled points is slowly increased. Another possibility is
the SDP relaxation suggested in Chapter 7.
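The following sketch illustrates such an iterative scheme with a linear SVM as base learner; the linear annealing schedule for the weight of the unlabeled points is a simplifying assumption and does not reproduce the exact algorithm of Chapter 6.

import numpy as np
from sklearn.svm import SVC

def tsvm_style_self_labeling(X_l, y_l, X_u, C=1.0, C_u_final=1.0, n_steps=10):
    """Label the unlabeled points with the current SVM and retrain on all points,
    slowly increasing the weight given to the (self-labeled) unlabeled points."""
    svm = SVC(kernel="linear", C=C).fit(X_l, y_l)
    X_all = np.vstack([X_l, X_u])
    for step in range(1, n_steps + 1):
        y_u = svm.predict(X_u)                       # current predictions for unlabeled points
        y_all = np.concatenate([y_l, y_u])
        w_u = C_u_final * step / n_steps             # weight of unlabeled points grows each round
        weights = np.concatenate([np.ones(len(X_l)), np.full(len(X_u), w_u)])
        svm = SVC(kernel="linear", C=C).fit(X_all, y_all, sample_weight=weights)
    return svm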
Two alternatives to the TSVM are presented that are formulated in a probabilistic
and in an information theoretic framework, respectively. In Chapter 8, binary
Gaussian Process classification is augmented by the introduction of a null-class
that occupies the space between the two regular classes. As an advantage over the TSVM, this allows for probabilistic outputs.
This advantage is shared by the entropy minimization presented in Chapter 9.
It encourages the conditional class probabilities P(y|x) to be close to either 1 or 0
at labeled and unlabeled points. As a consequence of the smoothness assumption,
the probability will tend to be close to 0 or 1 throughout any high density region,
while class boundaries correspond to intermediate probabilities.
1.3.3 Graph-Based Methods
During the last couple of years, the most active area of research in semi-supervised
learning has been in graph based methods, which are the topic of Part III of this
book. The common denominator of these methods is that the data are represented
by the nodes of a graph, the edges of which are weighted by the pairwise distances
of the incident nodes (and a missing edge corresponds to infinite distance). If the
distance of two points is computed by minimizing the aggregate path distance over
all paths connecting the two points, this can be seen as an approximation of the
geodesic distance of the two points with respect to the manifold of data points.
Thus, graph methods can be argued to build on the manifold assumption.
Most graph methods refer to the graph by utilizing the graph Laplacian. Let
G = (V,E) be a graph with real edge weights given by w : E → R. Here, the weight
w(e) of an edge e indicates the similarity of the incident nodes (and a missing edge
corresponds to zero similarity). Now the weighted adjacency matrix (or weight matrix, for short) W of the graph G = (V,E) is defined by

W_{ij} := \begin{cases} w(e) & \text{if } e = (i,j) \in E \\ 0 & \text{if } e = (i,j) \notin E \end{cases}    (1.2)
The diagonal matrix D defined by D_{ii} := \sum_j W_{ij} is called the degree matrix of G. Now there are different ways of defining the graph Laplacian, the two most prominent of which are the normalized graph Laplacian, \mathcal{L}, and the un-normalized graph Laplacian, L:

\mathcal{L} := I - D^{-1/2} W D^{-1/2}, \qquad L := D - W .    (1.3)
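As an illustration, the sketch below builds the weight matrix of (1.2) from a k-nearest-neighbor graph with Gaussian edge weights (a common but here merely assumed choice) and forms the two Laplacians of (1.3):

import numpy as np

def graph_laplacians(X, k=10, sigma=1.0):
    """Build a kNN similarity graph on the data and return W and both Laplacians."""
    d2 = np.square(X[:, None, :] - X[None, :, :]).sum(-1)    # squared Euclidean distances
    W = np.exp(-d2 / (2 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # keep only the k strongest edges per node; missing edges get zero similarity, cf. (1.2)
    weakest = np.argsort(W, axis=1)[:, :-k]
    np.put_along_axis(W, weakest, 0.0, axis=1)
    W = np.maximum(W, W.T)                                    # symmetrize
    deg = W.sum(axis=1)
    D = np.diag(deg)                                          # degree matrix
    L_unnorm = D - W                                          # un-normalized Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(deg, 1e-12)))
    L_norm = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt     # normalized Laplacian
    return W, L_norm, L_unnorm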
Many graph methods that penalize non-smoothness along the edges of a weighted
graph can in retrospect be seen as different instances of a rather general family
of algorithms, as is outlined in Chapter 10. An alternative approach is pursued in
Chapter 13, where the graph Laplacian is used to regularize boosting.
Usually the prediction consists of labels for the unlabeled nodes. For this reason,
this kind of algorithm is intrinsically transductive, i.e., it returns only the value of
the decision function on the unlabeled points and not the decision function itself.
However, there has been recent work in order to extend graph based methods to
produce inductive solutions, as discussed in Chapter 11.
Often the graph G is constructed by computing similarities of objects in some
other representation, e.g. using a kernel function on Euclidean data points. But
sometimes the original data already have the form of a graph. Examples include
the linkage pattern of web pages and the interactions of proteins (see Chapter 20).
In such cases, the directionality of the edges may be important. In Chapter 12,
principled approaches of dealing with directed graphs are developed.
1.3.4 Change of Representation
The topic of Part IV is algorithms that are not intrinsically semi-supervised, but instead perform learning in two steps:
1. Perform an unsupervised step on all data, labeled and unlabeled, but ignoring
the available labels. This can, for instance, be a change of representation, or the
construction of a new metric or a new kernel.
2. Ignore the unlabeled data and perform plain supervised learning using the new
distance, representation or kernel.
This can be seen as a direct implementation of the Semi-Supervised Smoothness
Assumption, since the representation is changed in such a way that small distances
in high density regions are conserved.
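A minimal sketch of the two-step scheme, using PCA as a stand-in for the unsupervised representation step and logistic regression for the subsequent supervised step (both choices, and all names, are merely illustrative):

import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def two_step_ssl(X_l, y_l, X_u, n_components=10):
    """Step 1: unsupervised representation learning on all points (labels ignored).
    Step 2: plain supervised learning on the labeled points in the new representation."""
    X_all = np.vstack([X_l, X_u])
    rep = PCA(n_components=n_components).fit(X_all)   # any embedding, metric, or kernel could be used here
    clf = LogisticRegression(max_iter=1000).fit(rep.transform(X_l), y_l)
    return lambda X_new: clf.predict(rep.transform(X_new))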
Note that the graph-based methods (Part III) are closely related to the ones
presented in this book part: the very construction of the graph from the data
can be seen as an unsupervised change of representation. Consequently, the first
chapter of this part discusses spectral transforms of such graphs in order to build
kernels. Spectral methods can also be used for non-linear dimensionality reduction,
as described in Chapter 15. Furthermore, in Chapter 16, metrics derived from graphs are investigated, for example those derived from shortest paths.
1.3.5 Self-Learning
For completeness, we mention the idea of self-learning, which is also known as self-
training, self-labeling, or decision directed learning. This is a wrapper-algorithm
that repeatedly uses a supervised learning method. It starts by training on the
labeled data only. In each step a part of the unlabeled points are labeled according
to the current decision function; then the supervised method is retrained using its
own predictions as additional labeled points. This idea has been around in the literature for some time (e.g., Scudder [1965], Fralick [1967], Agrawala [1970]).
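A sketch of such a wrapper, assuming a probabilistic base classifier and adding the most confidently predicted unlabeled points in each round (the batch size and stopping rule are illustrative choices):

import numpy as np
from sklearn.base import clone

def self_training(base_clf, X_l, y_l, X_u, batch=10, max_rounds=20):
    """Self-learning: repeatedly retrain a supervised classifier, using its own most
    confident predictions on the unlabeled points as additional labeled data."""
    X_l, y_l, X_u = X_l.copy(), y_l.copy(), X_u.copy()
    clf = clone(base_clf).fit(X_l, y_l)
    for _ in range(max_rounds):
        if len(X_u) == 0:
            break
        proba = clf.predict_proba(X_u)
        pick = np.argsort(-proba.max(axis=1))[:batch]          # most confident predictions first
        X_l = np.vstack([X_l, X_u[pick]])
        y_l = np.concatenate([y_l, clf.classes_[proba[pick].argmax(axis=1)]])
        X_u = np.delete(X_u, pick, axis=0)
        clf = clone(base_clf).fit(X_l, y_l)
    return clf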
An unsatisfactory aspect of self-learning is that the resulting effect depends
on the employed supervised method. If self-learning is used with empirical risk
minimization and 0-1 loss, the unlabeled data will have no effect on the solution
at all. If instead a margin-maximizing method is used, the decision boundary is pushed away from the unlabeled points. In other cases it seems to be
unclear what the self-learning is really doing, and which assumption it corresponds
to.
Correspondingly, we limit the coverage of self-learning in this book to the
Transductive SVM (cf. Chapter 6), which is probably its most successful instance.
1.3.6 Semi-Supervised Learning in Practice
Semi-supervised learning will be most useful whenever there are far more unlabeled
data than labeled. This can be the case if obtaining data points is cheap, but obtaining the labels costs a lot of time, effort, or money. This is the case in many
application areas of machine learning, for example:
In speech recognition, it costs almost nothing to record huge amounts of speech,
but labeling it requires some human to listen to it and type a transcript.
Billions of web pages are directly available for automated processing, but to
reliably classify them, humans would have to read them.
Protein sequences are nowadays determined at industrial speed (by genome se-
quencing, computational gene finding, and automatic translation), but determining the 3D structure or the function of a single protein may require years of scientific work.
Web page classification has already been introduced in Chapter 4 in the context of
generative models.
Since unlabeled data carry less information than labeled data, they are required
in large amounts in order to significantly increase the prediction accuracy. This
implies the need for fast and efficient SSL algorithms. Chapters 18 and 19 present two approaches to dealing with huge numbers of points. In Chapter 18
methods are developed for speeding up the label propagation methods introduced in
Chapter 10. In Chapter 19 cluster kernels are shown to be an efficient SSL method.
Chapter 19 also presents the first of two approaches to an important bioinformatics application of semi-supervised learning: the classification of protein sequences.
While here the predictions are based on the protein sequences themselves, Chapter
20 moves on to a somewhat more complex setting: The information is here assumed
to be present in the form of graphs that characterize the interactions of proteins.
Several such graphs exist and have to be combined in an appropriate way.
1.3.7 Perspectives
The last part of the book, Part VI, shows two of the most interesting directions of
ongoing research in SSL.
So far, this book has mostly presented algorithms for semi-supervised learning. While
the assumptions discussed above supply some intuition on when and why SSL works,
and Chapter 5 investigates when and why it can fail, it would clearly be more
satisfactory to have a thorough theoretical understanding of SSL as a whole. Chapter
21 offers a PAC-style framework that yields error bounds for SSL problems.
Furthermore, until now this book has mostly restricted itself to classification.
Chapter 22 introduces another assumption for SSL that is particularly well suited
for regression, and derives algorithms from it.
I Generative Models
2 Probabilistic Graphical Models for Semi-
Supervised Learning
Matthias Seeger
In this chapter we propose a simple taxonomy of probabilistic graphical models for
the SSL problem. We give some broad classes of algorithms for each of the families
and point to specific realizations in the literature. Finally, we shed more detailed
light on the family of methods using input-dependent regularization (or conditional
prior distributions) and show parallels to the co-training paradigm.
2.1 The Semi-Supervised Learning Problem
2.2 The Sampling Paradigm (Generative Models)
2.3 The Diagnostic Paradigm. Input-Dependent Regularization
2.4 Examples: Generative Models
2.5 Examples: Input-Dependent Regularization
2.6 A General Scheme for Input-Dependent Regularization. Relations to Co-
Training
3 Probabilistic Semi-Supervised Clustering
with Constraints
Sugato Basu
Raymond Mooney
Mikhail Bilenko
Arindam Banerjee
In certain clustering problems it is often natural to get limited supervision in
the form of pairwise constraints, i.e., pairs of instances labeled as belonging to
the same or different clusters. In recent years some algorithms have been proposed
for enhancing clustering quality by using supervision in the form of constraints.
These algorithms typically utilize the pairwise constraints to either modify the
clustering objective function or to learn the clustering distortion measure. We
propose a probabilistic model for semi-supervised clustering based on Hidden Markov
Random Fields (HMRFs) that integrates these two approaches and provides a
principled framework for incorporating constraint-based supervision into prototype-
based clustering. The model allows the use of a broad range of clustering distortion
measures, including Bregman divergences (e.g., squared Euclidean distance and I-
divergence) and directional distance measures (e.g., cosine distance), making it
applicable to several domains. We present an algorithm that performs partitional
semi-supervised clustering of data by minimizing an objective function derived
from the posterior energy of the HMRF model. We also present a method for
actively selecting informative pairwise constraints in a query-driven framework, to
get improved clustering performance with relatively small amounts of supervision.
3.1 Introduction
Description of the problem of semi-supervised clustering with constraints, overview
of different domains where constraints are a natural form of supervision
3.2 Related work
Brief overview of existing work in constraint-based and metric-based semi-
supervised clustering
3.3 Model
Motivation of HMRF model, connection between MAP estimation on HMRF and
constrained semi-supervised clustering objective function
3.4 Algorithm
Details of initialization step, E-step and M-step, proof of convergence for different
distortion measures, integration of constraint-based and metric-based approaches
3.5 Constraint acquisition
Active learning algorithm for acquiring constraints, analysis of 2-phase algorithm
3.6 Experiments
Demonstration of effectiveness of semi-supervised clustering algorithm and active
selection of constraints for different domains
4 Semi-Supervised Text Classification Using
EM
Kamal Nigam
Andrew McCallum
Tom Mitchell
One key difficulty with text classification learning algorithms is that they require
many hand-labeled examples to learn accurately. This chapter demonstrates that
supervised learning algorithms that use a small number of labeled examples and
many inexpensive unlabeled examples can create high-accuracy text classifiers. By
assuming that documents are created by a parametric generative model, Expectation-
Maximization (EM) finds local maximum a posteriori models and classifiers from
all the data—labeled and unlabeled. These generative models do not capture all the
intricacies of text. However in some domains this technique substantially improves
classification accuracy, especially when labeled data are sparse.
Two problems arise from this basic approach. First, unlabeled data can hurt
performance in domains where the generative modeling assumptions are too strongly
violated. In this case the assumptions can be made more representative in two ways:
by modeling sub-topic class structure, and by modeling super-topic hierarchical class
relationships. By doing so, model probability and classification accuracy come into
correspondence, allowing unlabeled data to improve classification performance. The
second problem is that even with a representative model, the improvements given
by unlabeled data do not sufficiently compensate for a paucity of labeled data. Here,
limited labeled data provide EM initializations that lead to low-probability models.
Performance can be significantly improved by using alternatives to EM that avoid
low-probability local maxima.
4.1 Introduction
4.2 The Basic Setting
4.2.1 A Generative Model for Text
4.2.2 Naive Bayes Text Classification
4.2.3 Learning a Naive Bayes Model with Labeled and Unlabeled Data
4.2.4 Experimental Results
4.2.5 Discussion
4.3 Modeling Sub-topic Structure
4.3.1 EM with Multiple Mixtures Per Class
4.3.2 Experimental Results
4.3.3 Discussion
4.4 Modeling Super-topic Structure
4.4.1 Estimating Parameters of a Hierarchical Model
4.4.2 Experimental Results
4.4.3 Discussion
4.5 Finding More Probable Models
4.5.1 Many Random Starting Points
4.5.2 Deterministic Annealing
4.5.3 Experimental Results
4.5.4 Discussion
4.6 Related Work
4.7 Conclusions
5 Risks of Semi-Supervised Learning:
How Unlabeled Data Can Degrade Performance
of Generative Classifiers
Fabio Cozman
Ira Cohen
Generative classifiers, even supervised ones, are often led astray by bad modeling as-
sumptions. But in semi-supervised learning we face a very embarrassing situation:
it is often the case that, in the presence of bad modeling assumptions, we would
be better off just discarding the unlabeled data and employing a supervised method,
rather than using both the labeled and unlabeled data. Typically we do not expect to be better off by discarding data; how can this possibility arise in semi-supervised
learning? In this chapter we discuss theoretical and empirical results in the liter-
ature concerning the (relative) value of labeled and unlabeled data; we explain the
performance degradation phenomenon and show several examples of it. We then
discuss the effect of model search on generative classifiers based on semi-supervised
learning, and summarize related attempts in the literature to overcome performance
degradation.
5.1 Introduction
5.2 Do unlabeled data improve or degrade classification performance?
5.3 The Boy-Girl classifier
5.4 Insights from asymptotic analysis
5.5 The value of labeled and unlabeled data
5.5.1 When the model is correct
5.5.2 When the model is incorrect
5.6 Understanding the risks of semi-supervised learning
5.6.1 A Gaussian example
5.6.2 Building examples
5.6.3 Finite sample effects
5.7 Model search (and some philosophy...) in semi-supervised learning of
generative classifiers
5.7.1 Searching for models: is it necessary?
5.7.2 Searching for models: is it possible?
5.7.3 Using unlabeled data as an auxiliary device
II Low Density Separation
6 Transductive Support Vector Machines
Thorsten Joachims
In the transductive setting proposed by V. Vapnik, instead of learning a general
rule, the goal is to make predictions only at a fixed number of test points. This
allows the learning algorithm to exploit the location of the test points, making it
a particular type of semi-supervised learning problem. Transductive Support Vector
Machines (TSVMs) implement the idea of transductive learning by including test
points in the computation of the margin. This chapter will provide some examples
for why the margin on the test examples can provide useful prior information
for learning, in particular for the problem of text classification. The resulting
optimization problems, however, are difficult to solve. The chapter reviews exact
and approximate optimization methods and discusses their properties. Finally, the
chapter discusses connections to other related semi-supervised learning approaches
like co-training and methods based on graph cuts, which can be seen as solving a
simplified variant of the TSVM optimization problem.
6.1 Introduction
6.2 TSVM Optimization Problems
6.3 Why use margin on the test set?
6.4 Algorithms
6.4.1 Exact
6.4.2 Approximate
6.5 Experiments
6.5.1 text classification
6.5.2 Use of TSVMs (Literature overview)
6.6 Problems with the TSVM objective
6.7 Connection to related approaches
6.7.1 co-training
6.7.2 graph cuts
6.7.3 other related work
7 Semi-Supervised Learning using Semi-Definite
Programming
Tijl De Bie
Nello Cristianini
We discuss the problem of support vector machine transduction, which is a combina-
torial problem with exponential computational complexity in the number of unlabeled
samples. Different approaches to such combinatorial problems exist, among which
are exhaustive search strategies (only feasible for very small sample sizes) and lo-
cal search heuristics starting from a suitably chosen start value (scalable to large
problem sizes, but sensitive to local minima).
In this chapter, we discuss an alternative approach, based on a convex relaxation
of the optimization problem associated to support vector machine transduction. The
result is a semi-definite programming problem which can be solved in polynomial
time, yielding an approximation of the optimal labeling as well as
a bound on the true optimum of the original transduction objective function. To
further decrease the computational complexity, we introduce an approximation that makes it possible to solve transduction problems with up to 1000 unlabeled samples.
Lastly, we extend the formulation to more general settings of semi-supervised
learning, where equivalence and inequivalence constraints are given on labels of some
of the samples.
7.1 Introduction
7.2 Relaxing SVM-transduction
7.2.1 Relaxation to an SDP-problem
7.2.2 Some simplifications
7.2.3 Recovering the label vector
7.3 An efficient approximation
7.3.1 The subspace trick
7.3.2 Finding a subspace close to the label vector
7.4 General semi-supervised learning settings
7.4.1 Equivalence and inequivalence label constraints
7.4.2 The subspace trick
7.5 Empirical results
7.6 Conclusions
8 Gaussian Processes and the Null-Category
Noise Model
Neil Lawrence
Michael Jordan
With Gaussian process classifiers (GPC) we aim to predict the posterior probability
of the class labels given an input position, p (yn|xn). In general we find that this
posterior distribution is unaffected by unlabelled data points during learning. Support
vector machines are strongly related to GPCs, but one notable difference is that the
decision boundary in an SVM can be influenced by unlabelled data. The source of
this discrepancy is the SVM’s margin: a characteristic which is not shared with the
GPC. The presence of the margin allows the support vector machine to seek low data
density regions for the decision boundary, effectively allowing it to incorporate the
cluster hypothesis (cf. Chapter 6). In this chapter we present the null category noise model, a probabilistic equivalent of the margin. By combining
this noise model with a GPC we are able to incorporate the cluster hypothesis
without explicitly modelling the input data density distributions or a special choice
of kernel.
8.1 Introduction
8.2 The Noise Model
8.2.1 Ordinal Regression
8.2.2 The Null Category Noise Model
8.2.3 Prediction vs Transduction
8.3 The Process Model
8.3.1 Gaussian Processes
8.3.2 Assumed Density Filtering
8.4 Results
8.5 Discussion
9 Entropy-Based Methods
Yves Grandvalet
Yoshua Bengio
In this chapter, we consider the problem of learning a decision rule from labeled
and unlabeled data. In this framework, the use of discriminative models is motivated, and an entropy regularization is devised to benefit from unlabeled data in the
supervised learning setting. The learning criterion, which is based on the penalized
likelihood principle, derives from clearly stated assumptions and can be applied
to any smoothly parametrized model of posterior probabilities. The regularization
scheme favors low density separation, without any modeling of the density of input
features. Other approaches to the semi-supervised problem are shown to be particular
or limiting cases of entropy regularization.
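A minimal sketch of such a criterion, assuming the model outputs class probabilities: the labeled part is the negative log-likelihood, the unlabeled part is the conditional entropy of the predictions, weighted by a regularization parameter (the function signature is illustrative, not the chapter's notation):

import numpy as np

def entropy_regularized_loss(proba_l, y_l, proba_u, lam=0.1, eps=1e-12):
    """Negative log-likelihood on labeled points plus lam times the entropy of the
    predicted class probabilities on unlabeled points (to be minimized)."""
    nll = -np.log(proba_l[np.arange(len(y_l)), y_l] + eps).sum()
    entropy = -(proba_u * np.log(proba_u + eps)).sum()
    return nll + lam * entropy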
9.1 Introduction
9.2 Derivation of the Criterion
9.2.1 Likelihood
9.2.2 When Are Unlabeled Examples Informative?
9.2.3 Entropy Regularization
9.3 Optimization Algorithms
9.3.1 Annealed EM and IRLS
9.3.2 Conjugate gradient
9.4 Related Work
9.4.1 Minimum Entropy in Pattern Recognition
9.4.2 Self-Training
9.4.3 Information Regularization
9.4.4 Transductive Support Vector Machines
9.5 Experiments
9.5.1 Artificial Data
9.5.2 Real Data
9.6 Discussion
III Graph-Based Methods
10 Label Propagation and Quadratic Criterion
Yoshua Bengio
Olivier Delalleau
Nicolas Le Roux
(edited by Misha Belkin, Dengyong Zhou, Xiaojin Zhu)
Various graph-based algorithms for semi-supervised learning have been proposed in
the recent literature. They rely on the idea of building a graph whose nodes are
data points (labeled and unlabeled) and edges represent similarities between points.
Known labels are used to propagate information through the graph in order to label
all nodes. In this chapter, we show how these different algorithms can be cast into
a common framework where one minimizes a quadratic cost criterion whose closed-
form solution is found by solving a linear system of size n. The cost criterion
naturally leads to an extension of such algorithms to the inductive setting, where one
obtains test samples one at a time: the derived induction formula can be evaluated
in O(n) time, which is much more efficient than solving again the linear system
(which in general costs O(kn^2) time for a sparse graph where each data point has
k neighbors).
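A minimal sketch of such a closed-form solution, assuming a symmetric weight matrix with the labeled nodes listed first and binary labels in {-1, +1}: the labeled nodes are clamped and a linear system is solved for the unlabeled ones (a simplified harmonic-style variant, not the exact formulation of this chapter).

import numpy as np

def propagate_labels(W, y_l):
    """Closed-form label propagation: clamp labeled nodes, solve for the rest."""
    n_labeled = len(y_l)
    D = np.diag(W.sum(axis=1))
    L = D - W                                    # un-normalized graph Laplacian
    L_uu = L[n_labeled:, n_labeled:]
    L_ul = L[n_labeled:, :n_labeled]
    # small ridge term added for numerical stability
    f_u = np.linalg.solve(L_uu + 1e-6 * np.eye(L_uu.shape[0]), -L_ul @ y_l)
    return np.sign(f_u)                          # predicted labels for the unlabeled nodes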
10.1 Label propagation on a similarity graph
10.2 Quadratic cost criterion
10.2.1 Regularization on graphs
10.2.2 Link with label propagation
10.2.3 Optimization framework
10.3 From transduction to induction
10.4 Discussion
11 The Manifold Basis of Semi-supervised
Learning
Misha Belkin
Vikas Sindhwani
11.1 Introduction
11.1.1 Manifold methods and manifolds in real-world data
11.1.2 Regularizing on manifolds
11.1.3 From manifolds to the ambient space
11.2 Theoretical framework: RKHS/Manifolds, manifold representer theorems
11.3 Algorithmic framework: Laplacian SVM/ Laplacian RLS
11.4 Data-dependent kernels and linear methods
11.5 Future directions
12 Regularization on Directed Graphs
Dengyong Zhou
Jiayuan Huang
Bernhard Schoelkopf
Thomas Hofmann
Given a directed graph in which some of the nodes are labeled, we investigate the
question of how to exploit the link structure of the graph to infer the labels of the
remaining unlabeled nodes. To that end, we propose a family of regularization frameworks for functions defined over the nodes of a directed graph that force the classification function to change slowly on densely linked subgraphs, and we compare these frameworks with each other on real-world Web classification problems.
12.1 Preliminaries on Directed Graphs and Markov Chains
12.2 Regularization Framework A: hubs and authorities
12.3 Regularization Framework B: stationary distribution
12.4 Regularization Framework C: water-pipe model
12.5 Discrete Differential Geometry on Directed graphs
12.6 Experiments on Web Classification and Protein Prediction
13 Boosting on Manifolds
Balazs Kegl
Ligen Wang
In this chapter we combine two powerful ideas, boosting and manifold learning. The
algorithm is based on RegBoost, an extension of AdaBoost that allows us to incorpo-
rate a data-dependent complexity regularizer into the base learning step of AdaBoost.
If the regularization term does not depend on the labels, then the algorithm can be
used for semi-supervised learning in a natural manner: in each iteration, the error
term is calculated using the labeled points, and the regularization term is calculated
using all the data. The convergence of the algorithm can be controlled by a simple
regularization parameter. The convergence theorem also suggests various interpre-
tations that indicate why the algorithm would indeed achieve regularization. The
particular regularizer that we use is the graph Laplacian, which captures the natural
cluster or manifold structure of the data. For large data sets, computing the Lapla-
cian can be prohibitive, so, for simple base classifier sets such as decision trees or
decision stumps, we propose a simple regularizer that approximates the Laplacian
in linear time.
13.1 Introduction
13.2 Marginal AdaBoost
13.2.1 Convergence
13.3 RegBoost
13.3.1 The goal of the base learner
13.3.2 Theoretical justifications
13.4 The graph Laplacian regularizer
13.4.1 An approximation in linear time
13.4.2 Experiments
13.5 Other approaches
IV Change of Representation
14 Graph Kernels by Spectral Transforms
Xiaojin Zhu
Jaz Kandola
Many graph-based semi-supervised learning methods can be viewed as regularizers,
i.e. kernels, over the graph. Such regularization penalizes unsmooth functions on
the graph. The central quantity in regularization is the spectral decomposition of the
graph Laplacian, a matrix derived from the edge weights. The eigenvectors with small
eigenvalues are smooth and ideally represent large cluster structures within the
data. The ones with large eigenvalues are rugged and considered noise.
Functions are compared against these eigenvectors to determine their smoothness.
One can alter the eigenvalues to emphasize different eigenvectors. Such a change, or
spectral transform, produces different regularizers. Many familiar kernels, for
example the diffusion kernel, can be produced by a spectral transform.
The question naturally arises: can we find a good spectral transform automati-
cally? The answer is yes. We learn a nonparametric spectral transform based on
convex optimization, to maximize the kernel alignment to the labeled data. We in-
corporate order constraints to encode our preference on smooth eigenvectors. This
results in flexible kernels and avoids the need to choose among different parametric
spectral transforms. Our approach relies on a quadratically constrained quadratic
program (QCQP), and is computationally feasible for large datasets.
14.1 Graph Laplacian and Spectral Clustering
In this section I plan to describe the graph Laplacian (normalized and unnor-
malized), with a little bit of spectral graph theory (Chung). I will emphasize the
importance of the smallest eigenvectors, linking them to spectral clustering.
14.2 Kernels by Spectral Transforms
I will describe regularization on the graph, and how spectral transforms produce
different kernels. (I suspect this and the previous sections are also discussed
elsewhere in the book). In the end the question is how to automatically find a
good transform. This section will draw material from Chapelle et al., "Cluster
Kernels", and Smola and Kondor, "Kernels and Regularization on Graphs".
14.3 Kernel Alignment using QCQP
I introduce kernel alignment as a convex measure for spectral transform. The
problem is then turned into a QCQP. This and the next section draw from Zhu et
al., "Nonparametric Transforms".
14.4 Semi-Supervised Kernels with Order Constraints
Finally I introduce order constraints on the eigenvalues. The solution is a nonpara-
metric spectral transform, which produces a semi-supervised kernel.
14.5 Experimental Results
15 Spectral Methods for Dimensionality Reduction
Lawrence Saul
Kilian Weinberger
Fei Sha
Jihun Ham
Daniel Lee
How can we detect low dimensional structure in high dimensional data? If the data
is mainly confined to a low dimensional subspace, then simple linear methods can
be used to discover the subspace and estimate its dimensionality. More generally,
though, if the data lies on (or near) a low dimensional manifold, then its structure
may be highly nonlinear, and linear methods are bound to fail.
The last few years have witnessed several advances in the problem of nonlin-
ear dimensionality reduction. Given high dimensional data sampled from a low di-
mensional manifold, researchers now have several frameworks for estimating the
data’s intrinsic dimensionality and computing a faithful low dimensional embed-
ding. Surprisingly, the main computations for "manifold learning" are based on
highly tractable optimizations, such as nearest-neighbor searches, least squares fits,
semidefinite programming, and eigenvalue problems. This chapter will survey the
numerous algorithms in machine learning that can be viewed as spectral methods
for linear and nonlinear dimensionality reduction.
15.1 Introduction
15.2 Algorithms
15.2.1 PCA
15.2.2 Multidimensional scaling
15.2.3 Isomap
15.2.4 Maximum variance unfolding
15.2.5 Locally linear embedding (LLE)
15.2.6 Laplacian eigenmaps
15.2.7 Hessian LLE
15.2.8 Local tangent space alignment
15.2.9 Conformal reweighting
15.2.10 Kernel PCA
15.3 Discussion
15.3.1 Comparison of algorithms
15.3.2 Open questions
15.4 Appendix (if necessary)
15.4.1 Rayleigh-Ritz variational method for eigenvalues
15.4.2 Semidefinite programming
16 Modifying Distances
Pascal Vincent
Yoshua Bengio
Matthias Hein
Alexander Zien
In this chapter, we examine how we can implement the following intuitive idea:
modify the geometry of input space, more specifically the way we compute pairwise
distances, to induce a “natural closeness” for points likely to be in the same cluster
(according to the cluster assumption, i.e. if they are linked by a region of high
density) and a “natural separatedness” for points of different clusters (if they can
only be connected by traversing regions of low density). The idea is thus to “warp
space” in such a way as to “shrink” distances in high density regions and to “blow
up” (comparatively speaking) distances in low density regions. After motivating
and illustrating this geometric intuition in more detail, we briefly introduce the
theoretical bases that will allow us to properly devise these new distances: we
formally introduce metric spaces, shortest path metrics and Riemannian metrics.
We then show several approaches that have been proposed in the literature, to
define local metrics that depend on the data’s estimated density. We also detail
the necessary approximation techniques used to make these distance computations
feasible in practice. Finally, we show how these techniques can be applied in the
context of semi-supervised learning, by taking into account unlabeled points to obtain
the modified distances, and from those distances either define density-sensitive
kernels to be used in kernel methods, or perform a preprocessing step that will map
the initial data to a new space in which clusters are easier to separate.
16.1 A simple geometric intuition
16.2 Theoretical basis for (re)defining distances
16.2.1 Metric space
16.2.2 Path metrics
16.2.3 Riemannian metrics
16.3 Defining density-sensitive distances
16.3.1 General Riemannian metrics based on an actual density estimate
16.3.2 Approximations used in shortest path computations
16.3.3 Other simple data-dependent path metrics proposed in the literature
16.3.4 Illustration of the resulting geometric transformations
16.4 Using modified distances for semi-supervised learning
16.4.1 Forming kernels from the modified distances
16.4.2 Preprocessing through Multidimensional Scaling to map the
data to a new space
16.4.3 Experiments showing improved classification accuracy
17 Data-Dependent Regularization
Tommi Jaakkola
Adrian Corduneanu
V Semi-Supervised Learning in Practice
18 Large-Scale Algorithms
Olivier Delalleau
Yoshua Bengio
Nicolas Le Roux
In chapter 10, it is shown how a number of graph-based semi-supervised learning
algorithms can be seen as the minimization of a specific cost function, leading to
a linear system with n equations and unknowns. Solving such a linear system will
in general require on the order of O(kn²) time and O(kn) memory (for a sparse
graph where each data point has k neighbors), which can be prohibitive on large
datasets. We present in this chapter a subset selection algorithm that can be used
to reduce the original system to one of size m ≪ n. The idea is to solve for the
labels of a subset S ⊂ X of only m points, while still retaining information from
the rest of the data by approximating their label with a linear combination of the
labels in S (using the induction formula presented in chapter 10). This can give
an improvement of up to O(n²/m²) in time and O(n/m) in memory usage, thus
allowing one to take advantage of significantly bigger unlabeled datasets than with
the original algorithms.
18.1 Introduction
18.2 Cost approximation
18.2.1 Reducing the cost using a subset
18.2.2 Resolution
18.2.3 Computational issues
18.3 Subset selection
18.3.1 Random selection
18.3.2 Smart data sampling
18.4 Discussion
19 Semi-Supervised Protein Classification using
Cluster Kernels
Jason Weston
Protein classification is a classic case of the semi-supervised learning problem.
Biologists have built huge databases containing hundreds of thousands of proteins
(amino acid sequences), but the 3D structure and structural class (label) of only
a few thousand of these proteins is actually known. Kernel methods have proven
effective at the problem of protein classification (classifying proteins into known
families and superfamilies) by defining similarity metrics between sequences.
In this chapter, we describe simple and scalable cluster kernel techniques for
incorporating unlabeled data into the representation of protein sequences. We show
that these methods greatly improve the classification performance of standard string
kernels and outperform naive approaches for using unlabeled data, such as adding
close homologs of the positive examples to the training data. These methods also
achieve equal or superior performance to previously presented cluster kernel methods
while achieving far greater computational efficiency.
19.1 Introduction
19.2 Representations and kernels for protein sequences
19.3 Cluster kernels for protein sequences
(review section, this should mostly refer to elsewhere in the book I guess)
19.4 The neighborhood mismatch kernel
19.5 The bagged mismatch kernel
19.6 The profile kernel
19.7 Experiments
19.7.1 Transductive Setting
19.7.2 Semi-supervised setting
19.8 Discussion
20 Prediction of Protein Function from Networks
Hyunjung Shin
Koji Tsuda
Bernhard Scholkopf
In some application areas, it is natural to represent domain knowledge using graphs.
Frequently there exist multiple graphs for the same set of nodes, representing infor-
mation from different sources, and no single graph is sufficient to predict class labels
of unlabeled nodes reliably. One way to enhance reliability is to integrate multiple
graphs, since individual graphs are partly independent and partly complementary
to each other for prediction. In this chapter, we propose a novel algorithm to as-
sign weights to multiple graphs within graph-based semi-supervised learning. Both
predicting class labels and searching for combining weights are formulated into one
convex optimization problem. The proposed method is applied to functional class
prediction of yeast proteins. When compared with individual graphs, the combined
graph with optimized weights performs significantly better than any single graph.
When compared with a state-of-the-art kernel method, semidefinite programming
combined with support vector machines (SDP/SVM), it shows comparable accuracy
in a remarkably short time. Compared with fixed-weight integration (without
optimizing the weights), it selects important graphs over redundant ones, which
implies the desirable property of integration with selectivity.
20.1 Introduction
20.2 Graph-based Semi-Supervised Learning
20.3 Combining Multiple Graphs
20.3.1 Regularized Version
20.3.2 Optimization
20.4 Experiments on Function Prediction of Proteins
20.4.1 Comparison with Individual Graphs
20.4.2 Comparison with SDP/SVM
20.4.3 Comparison with Fixed Weight Integration
20.5 Conclusion and Outlook
VI Perspectives
21 An Augmented PAC Model for Semi-Supervised Learning
Maria-Florina Balcan
Avrim Blum
The standard PAC-learning model has proven to be a useful theoretical framework
for thinking about the problem of supervised learning. However, it does not tend
to capture many of the assumptions underlying semi-supervised learning methods.
In this chapter we describe an augmented version of the PAC model designed with
semi-supervised learning in mind, that can be used to help think about the problem of
learning from labeled and unlabeled data and many of the different approaches taken.
The model provides a unified framework for analyzing when and why unlabeled data
can help, in which one can discuss both sample-complexity and algorithmic issues.
Our model can be viewed as an extension of the standard PAC model, where in
addition to a concept class C, one also proposes a compatibility function: a type of
compatibility that one believes the target concept should have with the underlying
distribution of data. For example, it could be that one believes the target should
cut through a low-density region of space, or that it should be self-consistent in
some way as in co-training. This belief is then explicitly represented in the model.
Unlabeled data is then potentially helpful in this setting because it allows one to
estimate compatibility over the space of hypotheses, and to reduce the size of the
search space from the whole set of hypotheses C down to those that, according to
one’s assumptions, are a-priori reasonable with respect to the distribution.
After proposing the model, we then analyze sample-complexity issues in this
setting: that is, how much of each type of data one should expect to need in order
to learn well, and what are the basic quantities that these numbers depend on. We
provide examples of sample-complexity bounds both for uniform convergence and
ε-cover based algorithms, as well as several algorithmic results.
21.1 Introduction
As we have already seen in the previous chapters, there has been growing interest in
practice in using unlabeled data together with labeled data in machine learning, and
a number of different approaches have been developed. However, the assumptions
these methods are based on are often quite distinct and not captured by standard
theoretical models.
One difficulty from a theoretical point of view is that standard discriminative
learning models do not really capture how and why unlabeled data can be of
help. In particular, in the PAC model there is purposefully a complete disconnect
between the data distribution D and the target function f being learned [Valiant,
1984, Blumer et al., 1989, Kearns and Vazirani, 1994]. The only prior belief is
that f belongs to some class C: even if D is known fully, any function f ∈ C
is still possible. For instance, it is perfectly natural (and common) to talk about
the problem of learning a concept class such as DNF formulas [Linial et al., 1989,
Verbeurgt, 1990] or an intersection of halfspaces [Baum, 1990, Blum and Kannan,
1997, Vempala, 1997, Klivans et al., 2002] over the uniform distribution; but clearly
in this case unlabeled data is useless — you can just generate it yourself. For learning
over an unknown distribution (the standard PAC setting), unlabeled data can help
somewhat, by allowing one to use distribution-specific sample-complexity bounds,
but this does not seem to fully capture the power of unlabeled data in practice.
In generative-model settings, one can easily talk theoretically about the use
of unlabeled data, e.g., [Castelli and Cover, 1995, 1996]. However, these results
typically make strong assumptions that essentially imply that there is only one
natural distinction to be made for a given (unlabeled) data distribution. For
instance, a typical generative-model setting would be that we assume positive
examples are generated by one Gaussian, and negative examples are generated by
another Gaussian. In this case, given enough unlabeled data, we could recover the
Gaussians and would need labeled data only to tell us which Gaussian is the positive
one and which is the negative one.1 This is too strong an assumption for most real-
world settings. Instead, we would like our model to allow for a distribution over
data (e.g., documents we want to classify) where there are a number of plausible
distinctions we might want to make.2 In addition, we would like a general framework
that can be used to model many different uses of unlabeled data.
We present here a PAC-style framework that bridges between these positions
and we believe can be used to help think about many of the ways unlabeled data
1. Castelli and Cover [1995, 1996] do not assume Gaussians in particular, but they do assume the distributions are distinguishable, which from our perspective has the same issue.
2. In fact, there has been recent work in the generative model setting on the practical side that goes in this direction (see [Nigam et al., 2000, Nigam, 2001]). We discuss connections to generative models further in Section 21.5.2.
is typically used, including approaches discussed in other chapters. This framework
extends the PAC model in a way that allows one to express not only the form
of target function one is considering, but also relationships that one hopes the
target function and underlying distribution will possess. We then analyze sample-
complexity issues in this setting: that is, how much of each type of data one should
expect to need in order to learn well, and also give examples of algorithmic results
in this model.
Specifically, the idea of the proposed model is to augment the PAC notion of a
concept class, which is a set of functions (like linear separators or decision trees),
with a notion of compatibility between a function and the data distribution that we
hope the target function will satisfy. Then, rather than talking of “learning a concept
class C,” we will talk of “learning a concept class C under compatibility notion χ.”
For example, suppose we believe there should exist a good linear separator, and that
furthermore, if the data happens to cluster, then this separator probably does not
slice through the middle of any such clusters. Then we would want a compatibility
notion that penalizes functions that do, in fact, slice through clusters. In this
framework, the extent to which unlabeled data helps depends on two quantities:
first, the extent to which the true target function satisfies the given assumption,
and second, the extent to which the distribution allows this assumption to rule
out alternative hypotheses. For instance, if the data does not cluster at all, then
all functions equally satisfy this compatibility notion and the assumption ends up
not helping. From a Bayesian perspective, one can think of this as a PAC model
for a setting in which one’s prior is not just over functions, but also over how the
function and underlying distribution relate to each other.
To make our model formal, we will need to ensure that the degree of compatibility
be something that can be estimated from a finite sample. To do this, we will require
that the compatibility notion χ actually be a function from C × X to [0, 1], where
the compatibility of a function f with the data distribution D is Ex∼D[χ(f, x)].
The degree of incompatibility is then something we can think of as a kind of
“unlabeled error rate” that measures how a-priori unreasonable we believe some
proposed hypothesis to be. For instance, in the example above of a “margin-style”
compatibility, we could define χ(f, x) to be an increasing function of the distance of
x to the separator f . In this case, the unlabeled error rate, 1−χ(f,D), is a measure
of the probability mass close to the proposed separator. In co-training, where each
example x has two “views” (x = 〈x1, x2〉), the underlying belief is that the true
target c∗ can be decomposed into functions 〈c∗1, c∗2〉 over each view such that for
most examples, c∗1(x1) = c∗2(x2). In this case, we can define χ(〈f1, f2〉, 〈x1, x2〉) = 1
if f1(x1) = f2(x2), and 0 if f1(x1) ≠ f2(x2). Then the compatibility of a hypothesis
〈f1, f2〉 with an underlying distribution D is Pr〈x1,x2〉∼D[f1(x1) = f2(x2)].
This setup allows us to analyze the ability of a finite unlabeled sample to reduce
our dependence on labeled examples, as a function of the compatibility of the target
function (i.e., how correct we were in our assumption) and various measures of the
“helpfulness” of the distribution. In particular, in our model, we find that unlabeled
data can help in several distinct ways.
If the target function is highly compatible with D, then if we have enough
unlabeled data to estimate compatibility over all f ∈ C, we can in principle reduce
the size of the search space from C down to just those f ∈ C whose estimated
compatibility is high. For instance, if D is “helpful”, then the set of such functions
will be much smaller than the entire set C.
By providing an estimate of D, unlabeled data can allow us to use a more refined
distribution-specific notion of “hypothesis space size” such as Annealed VC-entropy
[Devroye et al., 1996], Rademacher complexities [Koltchinskii, 2001, Bartlett and
Mendelson, 2002, Boucheron et al., 2004] or the size of the smallest ε-cover [Benedek
and Itai, 1991], rather than VC-dimension [Blumer et al., 1989, Kearns and Vazirani,
1994]. In fact, for natural cases (such as those above) we find that the sense in
which unlabeled data reduces the “size” of the search space is best described in
these distribution-specific measures.
Finally, if the distribution is especially nice, we may find that not only does the
set of “compatible” f ∈ C have a small ε-cover, but also the elements of the cover
are far apart. In that case, if we assume the target function is fully compatible, we
may be able to learn from even fewer labeled examples than the 1/ε needed just to
verify a good hypothesis! (Though here D is effectively committing to the target
as in generative models.)
Our framework also allows us to address the issue of how much unlabeled data
we should expect to need. Roughly, the “VCdim/ε²” form of standard PAC sample
complexity bounds now becomes a bound on the number of unlabeled examples we
need. However, technically, the set whose VC-dimension we now care about is not
C but rather a set defined by both C and χ: that is, the overall complexity depends
both on the complexity of C and the complexity of the notion of compatibility (see
Section 21.3.1.2). One consequence of our model is that if the target function and
data distribution are both well behaved with respect to the compatibility notion,
then the sample-size bounds we get for labeled data can substantially beat what
one could hope to achieve through pure labeled-data bounds, and we illustrate this
with a number of examples through the chapter.
21.2 A Formal Framework
In this section we formally introduce what we mean by a notion of compatibility,
and illustrate it through a number of examples including margins and co-training.
We assume that examples (both labeled and unlabeled) come according to a
fixed unknown distribution D over an instance space X, and they are labeled
by some unknown target function c∗. As in the standard PAC model, a concept
class or hypothesis space is a set of functions over the instance space X, and we
will often make the assumption (the “realizable case”) that the target function
belongs to a given class C. For a given hypothesis f , the (true) error rate of f is
defined as err(f) = errD(f) = Prx∼D[f(x) ≠ c∗(x)]. For any two hypotheses
f1, f2 ∈ C, the distance with respect to D between f1 and f2 is defined as
d(f1, f2) = dD(f1, f2) = Prx∼D[f1(x) ≠ f2(x)]. We will use êrr(f) to denote the
empirical error rate of f on a given labeled sample and d̂(f1, f2) to denote the
empirical distance between f1 and f2 on a given unlabeled sample.
We define a notion of compatibility to be a mapping from a hypothesis f and a
distribution D to [0, 1] indicating how “compatible” f is with D. In order for this to
be estimable from a finite sample, we require that compatibility be an expectation
over individual examples. (Though one could imagine more general notions with
this property as well.) Specifically, we define:
Definition 21.1 A legal notion of compatibility is a function χ : C × X → [0, 1]
where we (overloading notation) define χ(f,D) = Ex∼D[χ(f, x)]. Given a sample
S, we define χ(f, S) to be the empirical average over the sample.
Remark 21.2 One could also allow compatibility functions over k-tuples of exam-
ples, in which case our (unlabeled) sample-complexity bounds would simply increase
by a factor of k. For settings in which D is actually known in advance (e.g., trans-
ductive learning, see Section 21.5.1) we can drop this requirement entirely and allow
any notion of compatibility χ(f,D) to be legal.
Definition 21.3 Given compatibility notion χ, the incompatibility of f with D is
1 − χ(f,D). We will also call this its unlabeled error rate, errunl(f), when χ and
D are clear from context. For a given sample S, we use êrrunl(f) to denote the
empirical average over S.
Finally, we need a notation for the set of functions whose incompatibility is at
most some given value τ .
Definition 21.4 Given threshold τ, we define CD,χ(τ) = {f ∈ C : errunl(f) ≤ τ}.
So, e.g., CD,χ(1) = C. Similarly, for a sample S, we define CS,χ(τ) = {f ∈ C :
êrrunl(f) ≤ τ}.
We now illustrate our compatibility notion through a number of examples:
Example 1. Suppose examples are points in Rd and C is the class of linear
separators. A natural belief in this setting is that data should be “well-separated”:
not only should the target function separate the positive and negative examples,
but it should do so by some reasonable margin γ. This is the assumption used by
Transductive SVM (see [Joachims, 1999] and also Chapter 6 in this book). In this
case, if we are given γ up front, we could define χ(f, x) = 1 if x is farther than
distance γ from the hyperplane defined by f , and χ(f, x) = 0 otherwise. So, the
incompatibility of f with D is the probability mass within distance γ of f · x = 0. Or we
could define χ(f, x) to be a smooth function of the distance of x to the separator,
if we do not want to commit to a specific γ in advance. (In contrast, defining
compatibility of a hypothesis based on the largest γ such that D has probability
mass exactly zero within distance γ of the separator would not fit our model: it
cannot be written as an expectation over individual examples and indeed would
not be a good definition since one cannot distinguish “zero” from “exponentially
close to zero” from a small sample of unlabeled data).
Example 2. In co-training [Blum and Mitchell, 1998], we assume examples come
as pairs 〈x1, x2〉, and our goal is to learn a pair of functions 〈f1, f2〉. For instance,
if our goal is to classify web pages, x1 might represent the words on the page itself
and x2 the words attached to links pointing to this page from other pages. The
hope that underlies co-training is that the two parts of the example are consistent,
which then allows the co-training algorithm to bootstrap from unlabeled data. For
example, iterative co-training uses a small amount of labeled data to get some initial
information (e.g., if a link with the words “my advisor” points to a page then that
page is probably a faculty member’s home page) and then when it finds an unlabeled
example where one half is confident (e.g., the link says “my advisor”), it uses that to
label the example for training its hypothesis over the other half. This approach and
several variants have been used for a variety of learning problems, including named
entity classification [Collins and Singer, 1999], text classification [Nigam and Ghani,
2000, Ghani, 2001], natural language processing [Pierce and Cardie, 2001], large
scale document classification [Park and Zhang, 2003], and visual detectors [Levin
et al., 2003].3 As mentioned in Section 21.1, the assumptions underlying co-training
fit naturally into our framework. In particular, we can define the incompatibility of
some hypothesis 〈f1, f2〉 with distribution D as Pr〈x1,x2〉∼D[f1(x1) ≠ f2(x2)].
Example 3. In transductive graph-based methods, we are given a set of unlabeled
examples connected in a graph g, where the interpretation of an edge is that we
believe the two endpoints of the edge should have the same label. Given a few
labeled vertices, various graph-based methods then attempt to use them to infer
labels for the remaining points. If we are willing to view D as a distribution over
edges (a uniform distribution if g is unweighted), then as in co-training we can
define the incompatibility of some hypothesis f as the probability mass of edges
that are cut by f , which then motivates various cut-based algorithms. For instance,
if we require f to be boolean, then the mincut method of Blum and Chawla [2001]
finds the most-compatible hypothesis consistent with the labeled data; if we allow f
to be fractional and define 1 − χ(f, 〈x1, x2〉) = (f(x1) − f(x2))², then the algorithm
of Zhu et al. [2003a] finds the most-compatible consistent hypothesis. If we do not
wish to view D as a distribution over edges, we could have D be a distribution over
vertices and broaden Definition 21.1 to allow for χ to be a function over pairs of
examples. In fact, as mentioned in Remark 21.2, since we have perfect knowledge
of D in this setting we can allow any compatibility function χ(f,D) to be legal. We
discuss more connections with graph-based methods in Section 21.5.1.
Example 4. As a special case of co-training, suppose examples are pairs of points
in Rd, C is the class of linear separators, and we believe the two points in each
3. For more discussion regarding co-training see also Chapter 2 in this book.
pair should both be on the same side of the target function. (So, this is a version
of co-training where we require f1 = f2.) The motivation is that we want to use
pairwise information as in Example 3, but we also want to use the features of
each data point. For instance, in the word-sense disambiguation problem studied
by Yarowsky [1995], the goal is to determine which of several dictionary definitions
is intended for some target word in a piece of text (e.g., is “plant” being used to
indicate a tree or a factory?). The local context around each word can be viewed
as placing it into Rd, but the edges correspond to a completely different type of
information: the belief that if a word appears twice in the same document, it is
probably being used in the same sense both times. In this setting, we could use the
same compatibility function as in Example 3, but rather than having the concept
class C be all possible functions, we reduce C to just linear separators.
Example 5. In a related setting to co-training, considered by Leskes [2005], ex-
amples are single points in X but we have a pair of hypothesis spaces 〈C1, C2〉 (or
more generally k-tuple 〈C1, . . . ,Ck〉), and the goal is to find a pair of hypotheses
〈f1, f2〉 ∈ C1 × C2 with low error over labeled data and that agree over the distri-
bution. For instance, if data is sufficiently “well-separated”, one might expect there
to exist both a good linear separator and a good decision tree, and one would like
to use this assumption to reduce the need for labeled data. In this case one could
define compatibility of 〈f1, f2〉 with D as Prx∼D[f1(x) = f2(x)], or the similar
notion given in [Leskes, 2005].
21.3 Sample Complexity Results
We now present several sample-complexity bounds that fall out of this framework,
showing how unlabeled data, together with a suitable compatibility notion, can
reduce the need for labeled examples.
The basic structure of all of these results is as follows. First, given enough unla-
beled data (where “enough” will be a function of some measure of the complexity
of C and possibly of χ as well), we can uniformly estimate the true compatibilities
of all functions in C by their empirical compatibilities over the sample. Then, by
using this quantity to give a preference ordering over the functions in C, we can re-
duce “C” down to “the set of functions in C whose incompatibility is not much larger
than that of the true target function” in bounds for the number of labeled examples needed
for learning. The specific bounds differ in terms of the exact complexity measures
used (and a few other issues such as stratification and realizability) and we provide
examples illustrating when certain complexity measures can be significantly more
powerful than others. In particular, ε-cover bounds (Section 21.3.2) can provide
especially good bounds for co-training and graph-based settings.
21.3.1 Uniform Convergence Bounds
We begin with uniform convergence bounds (later in Section 21.3.2 we give tighter
ε-cover bounds that apply to algorithms of a particular form). For clarity, we begin
with the case of finite hypothesis spaces where we measure the “size” of a set
of functions by just the number of functions in the set. We then discuss several
issues that arise when considering infinite hypothesis spaces, such as what is an
appropriate measure for the “size” of the set of compatible functions, and the
need to account for the complexity of the compatibility notion itself. Note that in
the standard PAC model, one typically talks of either the realizable case, where we
assume that c∗ ∈ C, or the agnostic case where we do not (see [Kearns and Vazirani,
1994]). In our setting, we have the additional issue of unlabeled error rate, and can
either make an a-priori assumption that the target function’s unlabeled error is low,
or else aim for a more “Occam-style” bound in which we have a stream of labeled
examples and halt once they are sufficient to justify the hypothesis produced.
21.3.1.1 Finite hypothesis spaces
We first give a bound for the “doubly realizable” case.
Theorem 21.5 If we see mu unlabeled examples and ml labeled examples, where
mu ≥ (1/ε)[ln |C| + ln(2/δ)] and ml ≥ (1/ε)[ln |CD,χ(ε)| + ln(2/δ)],
then with probability at least 1 − δ, all f ∈ C with êrr(f) = 0 and êrrunl(f) = 0
have err(f) ≤ ε.
Proof The probability that a given hypothesis f with errunl(f) > ε has
êrrunl(f) = 0 is at most (1 − ε)^mu < δ/(2|C|) for the given value of mu. Therefore,
by the union bound, the number of unlabeled examples is sufficient to ensure
that with probability 1 − δ/2, only hypotheses in CD,χ(ε) have êrrunl(f) = 0. The
number of labeled examples then similarly ensures that with probability 1 − δ/2,
none of those whose true error is at least ε have an empirical error of 0, yielding
the theorem.
So, if the target function indeed is perfectly correct and compatible, then Theo-
rem 21.5 gives sufficient conditions on the number of examples needed to ensure
that an algorithm that optimizes both quantities over the observed data will, in
fact, achieve a PAC guarantee. To emphasize this, we will say that an algorithm
efficiently PACunl-learns the pair (C, χ) if it is able to achieve a PAC guarantee
using time and sample sizes polynomial in the bounds of Theorem 21.5.
We can think of Theorem 21.5 as bounding the number of labeled examples we
need as a function of the “helpfulness” of the distribution D with respect to our
notion of compatibility. That is, in our context, a helpful distribution is one in
which CD,χ(ε) is small, and so we do not need much labeled data to identify a good
function among them. We can get a similar bound in the situation when the target
function is not fully compatible:
Theorem 21.6 Given t ∈ [0, 1], if we see mu unlabeled examples and ml labeled
examples, where
mu ≥ (2/ε²)[ln |C| + ln(4/δ)] and ml ≥ (1/ε)[ln |CD,χ(t + 2ε)| + ln(2/δ)],
then with probability at least 1 − δ, all f ∈ C with êrr(f) = 0 and êrrunl(f) ≤ t + ε
have err(f) ≤ ε, and furthermore all f ∈ C with errunl(f) ≤ t have êrrunl(f) ≤ t + ε.
In particular, this implies that if errunl(c∗) ≤ t and err(c∗) = 0 then with high
probability the f ∈ C that optimizes êrr(f) and êrrunl(f) has err(f) ≤ ε.
Proof Same as Theorem 21.5 except apply Hoeffding bounds (see Devroye et al.
[1996]) to the unlabeled error rates.
Finally, we give a simple Occam/luckiness type of bound for this setting. Given
a sample S, let us define descS(f) = ln |CS,χ(êrrunl(f))|. That is, descS(f) is
the description length of f (in “nats”) if we sort hypotheses by their empirical
compatibility and output the index of f in this ordering. Similarly, define
ε-descD(f) = ln |CD,χ(errunl(f) + ε)|. This is an upper bound on the description
length of f if we sort hypotheses by an ε-approximation to their true compatibility.
Then we can get a bound as follows:
Theorem 21.7 For any set S of unlabeled data, given ml labeled examples, with
probability at least 1 − δ, all f ∈ C satisfying êrr(f) = 0 and descS(f) ≤ εml − ln(1/δ)
have err(f) ≤ ε. Furthermore, if |S| ≥ (2/ε²)[ln |C| + ln(2/δ)], then with probability
at least 1 − δ, all f ∈ C satisfy descS(f) ≤ ε-descD(f).
The point of this theorem is that an algorithm can use observable quantities to
determine if it can be confident. Furthermore, if we have enough unlabeled data,
the observable quantities will be no worse than if we were learning a slightly less
compatible function using an infinite-size unlabeled sample.
Note that if we begin with a non-distribution-dependent ordering of hypotheses,
inducing some description length desc(f), and our compatibility assumptions turn
out to be wrong, then it could well be that descD(c∗) > desc(c∗). In this case our
use of unlabeled data would end up hurting rather than helping.
21.3.1.2 Infinite hypothesis spaces
To reduce notation, we will assume in the rest of this chapter that χ(f, x) ∈ {0, 1}
so that χ(f,D) = Prx∼D[χ(f, x) = 1]. However, all our sample complexity results
can be easily extended to the general case.
For infinite hypothesis spaces, the first issue that arises is that in order to achieve
uniform convergence of unlabeled error rates, the set whose complexity we care
about is not C but rather χ(C) = {χf : f ∈ C} where we define χf(x) = χ(f, x). For
instance, suppose examples are just points on the line, and C = {fa : fa(x) = 1
iff x ≤ a}. In this case, VCdim(C) = 1. However, we could imagine a compatibility
function such that χ(fa, x) depends on some complicated relationship between the
real numbers a and x. In this case, VCdim(χ(C)) is much larger, and indeed we
would need many more unlabeled examples to estimate compatibility over all of C.
A second issue is that we need an appropriate measure for the “size” of the set
of surviving functions. VC-dimension tends not to be a good choice: for instance,
if we consider the case of Example 1 (margins), then even if data is concentrated
in two well-separated “blobs”, the set of compatible separators still has as large a
VC-dimension as the entire class even though they are all very similar with respect
to D. Instead, it is better to consider distribution dependent complexity measures
such as annealed VC-entropy and Rademacher averages. We give in the following
several examples of bounds of this type. For this we introduce some notation.
Specifically, for any C, we denote by C[m,D] the expected number of splits of m
points (drawn i.i.d.) from D with concepts in C. Also, for a given (fixed) S ⊆ X,
we will denote by S the uniform distribution over S, and by C[m,S] the expected
number of splits of m points (drawn i.i.d.) from S with concepts in C. Also for
a given sample S of n points from D, let’s denote by Rn(C) [S] the conditional
Rademacher average of C given by S. Then we can get bounds as follows:
Theorem 21.8 An unlabeled sample of size
mu = O((VCdim(χ(C))/ε²) log(1/ε) + (1/ε²) log(2/δ))
and a labeled sample of size
ml > (2/ε)[log(2s) + log(2/δ)], where s = CD,χ(t + 2ε)[2ml, D],
is sufficient so that with probability 1 − δ, all f ∈ C with êrr(f) = 0 and êrrunl(f) ≤
t + ε have err(f) ≤ ε, and furthermore all f ∈ C have |errunl(f) − êrrunl(f)| ≤ ε.
This is the analog of Theorem 21.6 for the infinite case. In particular, this implies
that if err(c∗) = 0 and errunl(c∗) ≤ t, then with high probability the f ∈ C that
optimizes êrr(f) and êrrunl(f) has err(f) ≤ ε.
Proof Sketch: By standard VC-bounds [Devroye et al., 1996, Vapnik, 1998], the
number of unlabeled examples is sufficient to ensure that with probability 1−δ/2 we
can estimate, within ε, Prx∈D[χf (x) = 1] for all χf ∈ χ(C). Since χf (x) = χ(f, x),
this implies we can estimate, within ε, the unlabeled error rate errunl(f) for
all f ∈ C, and so the set of hypotheses with êrrunl(f) ≤ t + ε is contained in
CD,χ(t + 2ε).
The bound on the number of labeled examples follows from [Devroye et al., 1996]
(where it is shown that the expected number of partitions can be used instead of
the maximum in the standard VC proof). This bound ensures that with probability
1− δ/2, none of the functions in CD,χ(t + 2ε) whose true (labeled) error is at least
ε have an empirical (labeled) error of 0.
We can also give a bound where we specify the number of labeled examples as a
function of the unlabeled sample; this is useful because we can imagine our learning
algorithm performing some calculations over the unlabeled data and then deciding
how many labeled examples to purchase.
Theorem 21.9 Given t ≥ 0, an unlabeled sample S of size
O((max[VCdim(C), VCdim(χ(C))]/ε²) log(1/ε) + (1/ε²) log(2/δ))
is sufficient so that if we label ml examples drawn uniformly at random from S,
where
ml > (4/ε)[log(2s) + log(2/δ)] and s = CS,χ(t + ε)[2ml, S],
then with probability ≥ 1 − δ, all f ∈ C with êrr(f) = 0 and êrrunl(f) ≤ t + ε have
err(f) ≤ ε. Furthermore all f ∈ C have |errunl(f) − êrrunl(f)| ≤ ε.
Proof Standard VC-bounds (in the same form as for Theorem 21.8) imply that
the number of labeled examples ml is sufficient to guarantee the conclusion of the
theorem with “err(f)” replaced by “errS(f)” (the error with respect to S) and “ε”
replaced with “ε/2”. The number of unlabeled examples is enough to ensure that,
with probability ≥ 1−δ/2, for all f ∈ C, |err(f)−errS(f)| ≤ ε/2. Combining these
two statements yields the theorem.
So, if err(c∗) = 0 and errunl(c∗) ≤ t, then with high probability the f ∈ C that
optimizes êrr(f) and êrrunl(f) has err(f) ≤ ε. If we assume errunl(c∗) = 0 then
we can use CS,χ(0) instead of CS,χ(t + ε).
Notice that for the case of Example 1, in the worst case (over distributions D)
this will essentially recover the standard margin sample-complexity bounds. In
particular, CS,χ(0) contains only those separators that split S with margin ≥ γ,
and therefore, s is no greater than the maximum number of ways of splitting 2ml
points with margin γ. However, if the distribution is nice, then the bounds can be
much better because there may be many fewer ways of splitting S with margin γ.
For instance, in the case of two well-separated “blobs” discussed above, if S is large
enough, we would have just s = 4.
We can finally give a stratified version of Theorem 21.9 as follows:
Theorem 21.10 An unlabeled sample S of size
O((max[VCdim(C), VCdim(χ(C))]/ε²) log(1/ε) + (1/ε²) log(2/δ))
is sufficient so that with probability ≥ 1 − δ we have that simultaneously for every
k ≥ 0 the following is true: if we label mk examples drawn uniformly at random
from S, where
mk > (4/ε)[log(2s) + log(2(k + 1)(k + 2)/δ)] and s = CS,χ((k + 1)ε)[2mk, S],
then all f ∈ C with êrr(f) = 0 and êrrunl(f) ≤ (k + 1)ε have err(f) ≤ ε.
This theorem is an analog of Theorem 21.7 and it essentially justifies a stratifica-
tion based on the estimated unlabeled error rates. We can also imagine having data
dependent bounds for both labeled and unlabeled data, and also doing a double
stratification, with respect to both labeled and unlabeled error rates. In particular,
we can derive a bound as follows:
Theorem 21.11 An unlabeled sample S of size
O((max[VCdim(C), VCdim(χ(C))]/ε²) log(1/ε) + (1/ε²) log(2/δ))
is sufficient so that with probability ≥ 1 − δ we have that simultaneously for every
i ≥ 0, k ≥ 0 the following is true: if we label mk,i examples drawn uniformly at
random from S, where
mk,i > (8/ε²)[log(2s) + log(4(k + 1)(k + 2)(i + 1)(i + 2)/δ)] and s = CS,χ((k + 1)ε)[2mk,i, S],
then all f ∈ C with err(f) ≤ (i + 1)ε and errunl(f) ≤ (k + 1)ε have err(f) ≤ i · ε.
We can similarly derive a bound using Rademacher averages as follows.4
Theorem 21.12 An unlabeled sample S of size
O((max[VCdim(C), VCdim(χ(C))]/ε²) log(1/ε) + (1/ε²) log(2/δ))
is sufficient so that with probability ≥ 1 − δ we have that simultaneously for every
i ≥ 0, k ≥ 0 the following is true: if the number of labeled examples mk,i satisfies
mk,i > (16/ε²)[2Rmk(CS,χ((k + 1)ε)[Sk]) + 2 log(8(k + 1)(k + 2)(i + 1)(i + 2)/δ)],
then all f ∈ C with err(f) ≤ (i + 1)ε and errunl(f) ≤ (k + 1)ε have err(f) ≤ i · ε.
21.3.2 ε-Cover-based Bounds
The bounds in the previous section are for uniform convergence: they provide
guarantees for any algorithm that optimizes well on the observed data. In this
section, we consider stronger bounds based on ε-covers that can be obtained for
4. For different versions of these bounds using recent stronger bounds ([Boucheron et al., 2000], [Boucheron et al., 2004]) see [Balcan and Blum, 2005].
algorithms that behave in a specific way: they first use the unlabeled examples to
choose a “representative” set of compatible hypotheses, and then use the labeled
sample to choose among these. Bounds based on ε-covers exist in the classical PAC
setting, but in our framework these bounds and algorithms of this type are especially
natural and convenient.
Recall that a set Cε ⊆ 2^X is an ε-cover for C with respect to D if for every f ∈ C
there is an f′ ∈ Cε which is ε-close to f. That is, Prx∼D(f(x) ≠ f′(x)) ≤ ε.
To illustrate how this can produce stronger bounds, consider the setting of
Example 3 (graph-based algorithms) where the graph g consists of two cliques of
n/2 vertices, connected together by o(n²) edges (in particular, the number of edges
connecting the cliques is small compared to εn²). Suppose the target function labels
one of the cliques as positive and one as negative, and we define compatibility of
a hypothesis to be the fraction of edges cut by it (so the target function indeed
has unlabeled error rate less than ε). Now, given any set SL of ml ≪ εn labeled
examples, there is always a highly-compatible hypothesis consistent with SL that
just separates the positive points in SL from the entire rest of the graph: the number
of edges cut will be at most nml ≪ εn². However, such a hypothesis clearly has
high true error since it is so unbalanced. So, we do not have uniform convergence.
On the other hand, the set of functions of unlabeled error rate less than ε/4 has
a small ε-cover: in particular, any partition of g that cuts less than εn²/4 edges
must be ε-close to (a) the all-positive function, (b) the all-negative function, (c) the
target function c∗, or (d) the complement of the target function 1− c∗. So, ε-cover
bounds require much fewer labeled examples.5
For another case where ε-cover bounds can beat uniform-convergence bounds,
imagine examples are pairs of points in {0, 1}^d, C is the class of linear separators,
and compatibility is determined by whether both points are on the same side
of the separator (i.e., the case of Example 4). Now suppose for simplicity that
the target function just splits the hypercube on the first coordinate, and the
distribution is uniform over pairs having the same first coordinate (so the target is
fully compatible). It is not hard to show that given polynomially many unlabeled
examples SU and (1/4) log d labeled examples SL, with high probability there will exist
high-error functions consistent with SL and compatible with SU .6 So, we do not
5. Effectively, ε-cover bounds allow one to rule out a hypothesis that, say, just separates the positive points in SL from the rest of the graph by noting that this hypothesis is very close (with respect to D) to the all-negative hypothesis, and that hypothesis has a high labeled-error rate.
6. Proof: Let V be the set of all variables that (a) appear in every positive example of SL and (b) appear in no negative example of SL. Over the draw of SL, each variable has a (1/2)^(2|SL|) = 1/√d chance of belonging to V, so with high probability V has size at least (1/2)√d. Now, consider the hypothesis corresponding to the conjunction of all variables in V. This correctly classifies the examples in SL, and w.h.p. it classifies every other example in SU negative because each example in SU has only a 1/2^|V| chance of satisfying every variable in V, and the size of SU is much less than 2^|V|. So, this means it is compatible with SU and consistent with SL, even though its true error is high.
yet have uniform convergence. In contrast, the cover-size of the set of functions
compatible with SU is constant, so ε-cover based bounds allow learning from just
a constant number of labeled examples.
In particular, we can give an ε-cover based bound as follows.
Theorem 21.13 If t is an upper bound for errunl(c∗) and p is the size of a
minimum ε-cover for CD,χ(t + 4ε), then using mu unlabeled examples and ml
labeled examples for
mu = O((VCdim(χ(C))/ε²) log(1/ε) + (1/ε²) log(2/δ)) and ml = O((1/ε) ln(p/δ)),
we can with probability 1 − δ identify a hypothesis which is 10ε-close to c∗.
Proof Sketch: First, given the unlabeled sample SU , define Hε ⊆ C as follows: for
every labeling of SU that is consistent with some f in C, choose a hypothesis in
C for which errunl(f) is smallest among all the hypotheses corresponding to that
labeling. Next, we obtain Cε by eliminating from Hε those hypotheses f with the
property that errunl(f) > t + 3ε. We then apply a greedy procedure on Cε, and we
obtain Gε = {g1, . . . , gs}, as follows:
Initialize Hε^1 = Cε and i = 1.
1. Let gi = argmin_{f ∈ Hε^i} errunl(f).
2. Using unlabeled data, determine Hε^(i+1) by crossing out from Hε^i those hypotheses
f with the property that d̂(gi, f) < 3ε.
3. If Hε^(i+1) = ∅ then set s = i and stop; else, increase i by 1 and go to 1.
Our bound on mu is sufficient to ensure that, with probability ≥ 1 − δ/2, Hε is
an ε-cover of C, which implies that, with probability ≥ 1− δ/2, Cε is an ε-cover for
CD,χ(t). It is then possible to show Gε is, with probability ≥ 1− δ/2, a 5ε-cover for
CD,χ(t) of size at most p. The idea here is that by greedily creating a 3ε-cover of
Cε with respect to distribution SU , we are creating a 4ε-cover of Cε with respect to
D, which is a 5ε-cover of CD,χ(t) with respect to D. Furthermore, we are doing this
using no more functions than would a greedy 2ε-cover procedure for CD,χ(t + 4ε)
with respect to D, which is no more than the optimal ε-cover of CD,χ(t + 4ε).
Now to learn c∗ we use labeled data and we do empirical risk minimization on
Gε. By standard bounds (see for instance [Benedek and Itai, 1991]), the number of
labeled examples is enough to ensure that with probability ≥ 1− δ/2 the empirical
optimum hypothesis in Gε has true error at most 10ε. This implies that overall,
with probability ≥ 1 − δ, we find a hypothesis of error at most 10ε.
As an interesting case where unlabeled data helps substantially, consider a
co-training setting where the target c∗ is fully compatible and D satisfies the
conditional independence given the label property. As shown by Blum and Mitchell
[1998], one can boost any weak hypothesis from unlabeled data in this setting
(assuming one has enough labeled data to produce a weak hypothesis).7 We can
actually show that given enough unlabeled data, in fact we can learn from just
a single labeled example. Specifically, it is possible to show that for any concept
classes C1 and C2, we have:
Theorem 21.14 Assume that err(c∗) = errunl(c∗) = 0 and D satisfies indepen-
dence given the label. Then using mu unlabeled examples and ml labeled examples
we can find a hypothesis that with probability 1 − δ has error at most ε, provided
that
mu = O((1/ε) · [(VCdim(C1) + VCdim(C2)) · ln(1/ε) + ln(1/δ)])
and
ml = O(log_{1/ε}(1/δ)).
Proof Sketch: For convenience we will show a bound with 6ε instead of ε, 3δ
instead of δ, and we will assume for simplicity the setting of Example 3, where
c∗ = c∗1 = c∗2 and also that D1 = D2 = D (the general case is handled similarly,
but just requires more notation). We first characterize the hypotheses with true
unlabeled error rate at most ε. Recall that χ(f,D) = Pr〈x1,x2〉∼D[f(x1) = f(x2)],
and for concreteness assume f predicts using x1 if f(x1) ≠ f(x2). Consider f ∈ C
with errunl(f) ≤ ε and let’s define p− = Prx∈D[c∗(x) = 0], p+ = Prx∈D[c∗(x) = 1],
and for i, j ∈ {0, 1} define pij = Prx∈D[f(x) = i, c∗(x) = j]. We clearly have
err(f) = p10 + p01. From errunl(f) = Pr(x1,x2)∼D[f(x1) ≠ f(x2)] ≤ ε, using
the independence given the label of D, we get 2p10p00/p− + 2p01p11/p+ ≤ ε. This implies
that the almost compatible hypothesis f must be of one of the following four types:
1. f is “close to c∗” or more exactly err(f) ≤ 2ε.
2. f is “close to the opposite of c∗” or more exactly err(f) ≥ 1 − 2ε.
3. f “predicts almost always negative” or more exactly p10 + p11 ≤ 3ε.
4. f “predicts almost always positive” or more exactly p01 + p00 ≤ 3ε.
Now, consider f1 to be the constant positive function, f0 to be the constant negative
function. The unlabeled sample SU is sufficient to ensure that with probability ≥ 1 − δ,
every hypothesis with zero estimated unlabeled error has true unlabeled error at
most ε. Therefore, by our previous analysis, there are only four kinds of hypotheses
consistent with unlabeled data: those close to c∗, those close to its complement
c̄∗, those close to f0, and those close to f1. Furthermore, c∗, c̄∗, f0, and f1 are
compatible with the unlabeled data.
We now check if there exists a hypothesis g ∈ C with êrrunl(g) = 0 such that
d̂(f1, g) ≥ 4ε and d̂(f0, g) ≥ 4ε. If such a hypothesis g exists, then we know that one of
7. Related sample complexity results are given in [Dasgupta et al., 2001].
g, ḡ, where ḡ is the opposite of g, is 2ε-close to c∗. If not, we must have p+ ≤ 6ε
or p− ≤ 6ε, in which case we know that one of f0, f1 is 6ε-close to c∗. So, we
have a set of two functions, opposite to each other, one of which is 6ε-close
to c∗. We now use labeled data to pick one of these to output, using Lemma 21.15
below.
Lemma 21.15 Consider ε < 1/8. Let Cε = {f, f̄} be a subset of C containing
two opposite hypotheses with the property that one of them is ε-close to c∗. Then,
ml > 6 log_{1/ε}(1/δ) labeled examples are sufficient so that with probability ≥ 1 − δ,
the concept in Cε that is ε-close to c∗ in fact has lower empirical error.
Proof Easy calculation: if ml > 6 log_{1/ε}(1/δ), then
Σ_{k=0}^{⌊ml/2⌋} (ml choose k) ε^(ml−k) (1 − ε)^k ≤ δ.
In particular, by reducing ε to poly(δ), we can reduce the number of labeled
examples needed ml to 1. In fact, this result can be extended to the case considered
in [Balcan et al., 2004], in which D+ and D− merely satisfy constant expansion.
This example illustrates that if data is especially well behaved with respect to
the compatibility notion, then our bounds on labeled data can be extremely good.
More interestingly, as shown in section 21.4.2, for the case of linear separators and
independence given the label, the bounds of Theorem 21.14 on labeled examples can
be achieved by a polynomial-time algorithm.
21.4 Algorithmic Results
In this section we give several examples of efficient algorithms in our model.
21.4.1 A simple case
We give here a simple example to illustrate the bounds in Section 21.3.1.1, and for
which we can give a polynomial-time algorithm that takes advantage of them. Let
the instance space X = {0, 1}^d, and for x ∈ X, let vars(x) be the set of variables
set to 1 by x. Let C be the class of monotone disjunctions (e.g., x1 ∨ x3 ∨ x6),
and for f ∈ C, let vars(f) be the set of variables disjoined by f . Now, suppose we
say an example x is compatible with function f if either vars(x) ⊆ vars(f) or else
vars(x) ∩ vars(f) = φ. This is a very strong notion of “margin”: it says, in essence,
that every variable is either a positive indicator or a negative indicator, and no
example should contain both positive and negative indicators.
Given this setup, we can give a simple PACunl-learning algorithm for this pair
(C, χ). We begin by using our unlabeled data to construct a graph on d vertices (one
per variable), putting an edge between two vertices i and j if there is any example
x in our unlabeled sample with i, j ∈ vars(x). We now use our labeled data to
label the components. If the target function is fully compatible, then no component
will get multiple labels (if some component does get multiple labels, we halt with
failure). Finally, we produce the hypothesis f such that vars(f) is the union of the
positively-labeled components. This is fully compatible with the unlabeled data and
has zero error on the labeled data, so by Theorem 21.5, if the sizes of the data sets
are as given in the bounds, with high probability the hypothesis produced will have
error ≤ ε.
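The procedure just described is simple enough to state directly in code. The sketch below is our own illustration (names such as learn_monotone_disjunction are ours): it builds the variable co-occurrence graph from the unlabeled examples, finds its connected components, labels each component using the labeled data, and outputs the disjunction of the positively labeled components, halting with failure if some component receives conflicting labels.

```python
# Illustrative sketch (ours) of the PACunl-learning algorithm for monotone
# disjunctions under the compatibility notion vars(x) ⊆ vars(f) or
# vars(x) ∩ vars(f) = ∅.  Examples are 0/1 tuples of length d.

def components_from_unlabeled(unlabeled, d):
    # Union-find over the d variables; variables that co-occur in some
    # unlabeled example are merged into one component.
    parent = list(range(d))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    def union(i, j):
        parent[find(i)] = find(j)
    for x in unlabeled:
        on = [i for i, v in enumerate(x) if v == 1]
        for i in on[1:]:
            union(on[0], i)
    return [find(i) for i in range(d)]

def learn_monotone_disjunction(labeled, unlabeled, d):
    comp = components_from_unlabeled(unlabeled, d)
    votes = {}                                  # component -> set of observed labels
    for x, y in labeled:
        for i, v in enumerate(x):
            if v == 1:
                votes.setdefault(comp[i], set()).add(y)
    if any(len(s) > 1 for s in votes.values()):
        raise ValueError("a component received multiple labels: target not fully compatible")
    positive = {c for c, s in votes.items() if s == {1}}
    vars_f = {i for i in range(d) if comp[i] in positive}
    return lambda x: int(any(x[i] == 1 for i in vars_f))
```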
Notice that if we want to view the algorithm as “purchasing” labeled data, then
we can simply examine the graph, count the number of connected components k,
and then request (1/ε)[k ln 2 + ln(2/δ)] labeled examples. (Here, 2^k = |CS,χ(0)|.) By the
proof of Theorem 21.5, with high probability 2^k ≤ |CD,χ(ε)|, so we are purchasing no more
than the number of labeled examples in the theorem statement.
Also, it is interesting to see the difference between a “helpful” and “non-helpful”
distribution for this problem. An especially non-helpful distribution would be the
uniform distribution over all examples x with |vars(x)| = 1, in which there are d
components. In this case, unlabeled data does not help at all, and one still needs
Ω(d) labeled examples (or even Ω(d/ε) if the distribution is non-uniform, as in the
VC-dimension lower bounds of [Ehrenfeucht et al., 1989]). On the other hand, a helpful
distribution is one such that with high probability the number of components is
small, such as the case of features appearing independently given the label.
21.4.2 Co-training with linear separators
We now consider the case of co-training where the hypothesis class is the class of
linear separators. For simplicity we focus first on the case of Example 4: the target
function is a linear separator in Rd and each example is a pair of points, both of
which are assumed to be on the same side of the separator (i.e., an example is a
line-segment that does not cross the target hyperplane). We then show how our
results can be extended to the more general setting.
As in the previous example, a natural approach is to try to solve the “consistency”
problem: given a set of labeled and unlabeled data, our goal is to find a separator
that is consistent with the labeled examples and compatible with the unlabeled ones
(i.e., it gets the labeled data correct and doesn’t cut too many edges). Unfortunately,
this consistency problem is NP-hard: given a graph g embedded in Rd with two
distinguished points s and t, it is NP-hard to find the linear separator that cuts
the minimum number of edges, even if the minimum is 0 [Flaxman, 2003]. For
this reason, we will make an additional assumption, that the two points in an
example are each drawn independently given the label. That is, there is a single
distribution D over Rd, and with some probability p+, two points are drawn
i.i.d. from D+ (D restricted to the positive side of the target function) and with
probability 1 − p+, the two are drawn i.i.d. from D− (D restricted to the negative
side of the target function). Note that our sample complexity results in Section
21.3.2 extend to weaker assumptions such as the distributional expansion introduced
by Balcan et al. [2004], but we need true independence for our algorithmic results.
Blum and Mitchell [1998] have also given positive algorithmic results for co-training
when (a) the two halves of an example are drawn independently given the label
(which we are assuming now), (b) the underlying function is learnable via Statistical
Query algorithms8 (which is true for linear separators by Blum et al. [1998]), and
(c) we have enough labeled data to produce a weakly-useful hypothesis on one of
the halves to begin with. A weakly-useful predictor is defined to be a predictor f
such that:
Pr[f(x) = 1|c∗(x) = 1] > Pr[f(x) = 1|c∗(x) = 0] + ε.
It is equivalent to the usual notion of a “weak hypothesis” (see [Kearns and Vazirani,
1994]) when the target function is balanced, but requires that the hypothesis give more
information when the target function is unbalanced; see [Blum and Mitchell, 1998].
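As a small illustration (ours, not the chapter's), the weak-usefulness condition can be checked empirically by estimating the two conditional probabilities from a sample of (x, c∗(x)) pairs.

```python
# Illustrative check (ours) of the weakly-useful predictor condition
#   Pr[f(x)=1 | c*(x)=1] > Pr[f(x)=1 | c*(x)=0] + eps,
# with both probabilities estimated from a finite labeled sample.

def is_weakly_useful(f, sample, eps):
    pos = [x for x, y in sample if y == 1]
    neg = [x for x, y in sample if y == 0]
    p1 = sum(f(x) for x in pos) / len(pos)   # estimate of Pr[f=1 | c*=1]
    p0 = sum(f(x) for x in neg) / len(neg)   # estimate of Pr[f=1 | c*=0]
    return p1 > p0 + eps
```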
We show here how we can run the algorithm in [Blum and Mitchell, 1998] with
only a single labeled example, thus obtaining an efficient algorithm in our model.
It is worth noticing that in the process, we also simplify the results of Blum et al.
[1998] somewhat.
Theorem 21.16 There is a polynomial-time algorithm (in d and b, where b is the
number of bits per example) to learn a linear separator under the above assumptions,
from a polynomial number of unlabeled examples and a single labeled example.
Proof Sketch: Assume for convenience that the target separator passes through the
origin, and let us denote the separator by c∗ · x = 0. We will also assume for
convenience that PrD(c∗(x) = 1) ∈ [ε/2, 1− ε/2]; that is, the target function is not
overwhelmingly positive or overwhelmingly negative (if it is, this is actually an easy
case, but it makes the arguments more complicated). Define the margin of some
point x as the distance of x/|x| to the separating plane, or equivalently, the cosine
of the angle between c∗ and x.
We begin by drawing a large unlabeled sample S = {〈x_1^i, x_2^i〉}; denote by S_j the
set {x_j^i}, for j = 1, 2. (We describe our algorithm as working with the fixed
unlabeled sample S, since we just need to apply standard VC-dimension arguments
to get the desired result.) The first step is to perform a transformation T on S1
to ensure that some reasonable (1/poly) fraction of T (S1) has margin at least
1/poly, which we can do via the Outlier Removal Lemma of Blum et al. [1998]
and Dunagan and Vempala [2001].9 The Outlier Removal Lemma states that one
can algorithmically remove an ε′ fraction of S1 and ensure that for the remainder,
for any vector w, max_{x∈S1}(w · x)^2 ≤ poly(n, b, 1/ε′) E_{x∈S1}[(w · x)^2], where b is the
number of bits needed to describe the input points. We reduce the dimensionality
(if necessary) to get rid of any of the vectors for which the above quantity is zero.
We then determine a linear transformation (as described in Blum et al. [1998]) so
that in the transformed space, for all unit-length w, E_{x∈T(S1)}[(w · x)^2] = 1.
8. For a detailed description of the Statistical Query model see [Kearns, 1998] and [Kearns and Vazirani, 1994].
9. If the reader is willing to allow running time polynomial in the margin of the data set, then this part of the argument is not needed.
Since the maximum is bounded, this guarantees that at least a 1/poly fraction of
the points in T (S1) have at least a 1/poly margin with respect to the separating
hyperplane.
To avoid cumbersome notation in the rest of the discussion, we drop our use
of “T” and simply use S and c∗ to denote the points and separator in the
transformed space. (If the distribution originally had a reasonable probability mass
at a reasonable margin from c∗, then T could be the identity anyway.)
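The transformation T can be thought of as putting the (outlier-filtered) sample into isotropic position. A minimal sketch of that single step, assuming numpy and ignoring the outlier-removal phase itself, is the whitening below; after it, E_{x∈T(S1)}[(w · x)^2] = 1 for every unit-length w.

```python
# Minimal sketch (ours) of the isotropic transformation step: with M the empirical
# second-moment matrix of S1 and A = M^{-1/2} (restricted to non-degenerate
# directions), the transformed sample Z = X A^T satisfies E[(w.z)^2] = 1 for every
# unit vector w.  The preceding outlier-removal phase is not shown.
import numpy as np

def isotropic_transform(S1):
    X = np.asarray(S1, dtype=float)              # rows are the points of S1
    M = X.T @ X / len(X)                         # empirical second-moment matrix
    vals, vecs = np.linalg.eigh(M)
    keep = vals > 1e-12                          # drop directions with (w.x)^2 = 0
    A = np.diag(vals[keep] ** -0.5) @ vecs[:, keep].T
    Z = X @ A.T                                  # transformed sample T(S1)
    return Z, A

# Sanity check: the second-moment matrix of the transformed sample is the identity.
rng = np.random.default_rng(0)
S1 = rng.normal(size=(1000, 5)) @ rng.normal(size=(5, 5))
Z, A = isotropic_transform(S1)
print(np.allclose(Z.T @ Z / len(Z), np.eye(Z.shape[1]), atol=1e-8))
```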
In the second step, we argue that a random halfspace has at least a 1/poly chance
of being a weak predictor on S1. ([Blum et al., 1998] uses the Perceptron algorithm
to get weak learning; here, we need something simpler since we do not yet have any
labeled data.) Specifically, consider a point x such that the angle between x and c∗
is π/2 − γ, and imagine that we draw f at random subject to f · c∗ ≥ 0 (half of the
f’s will have this property). Then

Pr_f [f(x) ≠ c∗(x) | f · c∗ ≥ 0] = (π/2 − γ)/π = 1/2 − γ/π.
Since at least a 1/poly fraction of the points in S1 have at least a 1/poly margin
this implies that:
Prf,x[f(x) = 1|c∗(x) = 1] > Prf,x[f(x) = 1|c∗(x) = 0] + 1/poly.
This means that a 1/poly probability mass of functions f must in fact be weakly-
useful predictors.
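A quick Monte Carlo check of the geometric fact used here (our illustration, assuming numpy): for a point x at angle π/2 − γ from c∗, a random unit vector f conditioned on f · c∗ ≥ 0 misclassifies x with probability about 1/2 − γ/π.

```python
# Monte Carlo illustration (ours) of
#   Pr_f[f(x) != c*(x) | f.c* >= 0] = 1/2 - gamma/pi
# for a point x whose angle with c* is pi/2 - gamma.
import numpy as np

rng = np.random.default_rng(1)
d, gamma, trials = 10, 0.3, 200_000

c_star = np.zeros(d); c_star[0] = 1.0
# place x in the (e1, e2)-plane at angle pi/2 - gamma from c*
x = np.array([np.cos(np.pi / 2 - gamma), np.sin(np.pi / 2 - gamma)] + [0.0] * (d - 2))

F = rng.normal(size=(trials, d))
F = F[F @ c_star >= 0]                         # condition on f . c* >= 0
disagree = ((F @ x >= 0) != (x @ c_star >= 0)).mean()
print(disagree, 0.5 - gamma / np.pi)           # the two numbers should be close
```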
The final step of the algorithm is as follows. Using the above observation, we pick
a random f , and plug it into the bootstrapping theorem of [Blum and Mitchell,
1998] (which, given unlabeled pairs 〈x_1^i, x_2^i〉 ∈ S, will use f(x_1^i) as a noisy label of
x_2^i, feeding the result into an SQ algorithm), repeating this process poly(n) times.
With high probability, our random f was a weakly-useful predictor on at least one
of these steps, and we end up with a low-error hypothesis. For the rest of the runs
of the algorithm, we have no guarantees. We now observe the following. First of
all, any function f with small err(f) must have small errunl(f). Secondly, because
of the assumption of independence given the label, as shown in theorem 21.14, the
only functions with low unlabeled error rate are functions close to c∗, close to ¬c∗,
close to the “all positive” function, or close to the “all negative” function.
So, if we simply examine all the hypotheses produced by this procedure, and pick
some h with a low unlabeled error rate that is at least ε/2-far from both the “all-positive”
and “all-negative” functions, then either h or ¬h is close to c∗. We can now just draw
a single labeled example to determine which case holds.
We can easily extend our algorithm to the standard co-training setting (where c_1^* can
be different from c_2^*) as follows: we repeat the procedure in a symmetric way, and
then, in order to find a good pair of functions, we just try all combinations of pairs
of functions to find one with a small unlabeled error rate that is not close to the “all positive”
or “all negative” functions; finally, we use a constant number of labeled examples to
produce a low-error hypothesis (and here we use only one part of the example and
only one of the functions in the pair).
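The overall procedure behind Theorem 21.16 can be summarized in a rough sketch. The version below is our own heavy simplification, not the authors' implementation: it omits outlier removal, the isotropic transformation, and the Statistical Query formalism, and uses scikit-learn's LogisticRegression as a stand-in for the SQ learner.

```python
# Rough sketch (ours, heavily simplified) of the single-labeled-example procedure:
# try random halfspaces, bootstrap each one Blum-Mitchell style by using its
# predictions on x1 as noisy labels for x2, keep bootstrapped hypotheses with a
# low pair-disagreement (unlabeled) error that are not nearly constant, and
# orient a surviving hypothesis with a single labeled example.
import numpy as np
from sklearn.linear_model import LogisticRegression

def learn_from_one_label(x1, x2, labeled_x, labeled_y, rounds=50, eps=0.05, seed=0):
    rng = np.random.default_rng(seed)
    d = x1.shape[1]
    candidates = []
    for _ in range(rounds):
        f = rng.normal(size=d)                    # random halfspace through the origin
        noisy = (x1 @ f >= 0).astype(int)         # f(x1^i) as a noisy label for x2^i
        if noisy.min() == noisy.max():
            continue                              # degenerate labeling, skip this round
        clf = LogisticRegression(max_iter=1000).fit(x2, noisy)
        h = lambda X, c=clf: c.predict(X)
        unl_err = np.mean(h(x1) != h(x2))         # empirical unlabeled error on the pairs
        frac_pos = h(np.vstack([x1, x2])).mean()
        if unl_err <= eps and eps / 2 < frac_pos < 1 - eps / 2:
            candidates.append(h)
    for h in candidates:                          # orient with the one labeled example
        if h(labeled_x.reshape(1, -1))[0] == labeled_y:
            return h
        return lambda X, g=h: 1 - g(X)            # otherwise output its complement
    return None
```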
21.5 Related Models and Discussion
21.5.1 A Transductive Analog of our Model
We can also talk about a transductive analog of our (inductive) model that
incorporates many of the existing transductive methods for learning with labeled
and unlabeled data. In a transductive setting one assumes that the unlabeled sample
S is given, a random small subset is labeled, and the goal is to predict well on
the rest of S. In order to make use of unlabeled examples, we will again express
the relationship we hope the target function has with the distribution through a
compatibility notion χ. However, since in this case the compatibility between a
given hypothesis and D is completely determined by S (which is known), we will
not need to require that compatibility be an expectation over unlabeled examples.
Given this setup, from the sample complexity point of view we only care about how
much labeled data we need, and algorithmically we need to find a highly compatible
hypothesis with low error on the labeled data.
Rather than presenting general theorems, we instead focus on the modeling
aspect and give here several examples in the context of graph-based semi-supervised
algorithms for binary classification. In these methods one usually assumes that there
is a weighted graph g defined over S, which is given a priori and encodes the prior
knowledge. In the following we denote by W the weighted adjacency matrix of g
and by CS the set of all binary functions over S.
Minimum cut Suppose for f ∈ CS we define the incompatibility of f to be the
weight of the cut in g determined by f. This is the implicit notion of compatibility
considered in [Blum and Chawla, 2001], and algorithmically the goal is to find the
most compatible hypothesis that gets the labeled data correct, which can be solved
efficiently using network flow. From a sample-complexity point of view, the number
of labeled examples we need is proportional to the VC-dimension of the class of
hypotheses that are at least as compatible as the target function, which is known
to be O(k/λ) (see [Kleinberg, 2000], [Kleinberg et al., 2004]), where k is the number
of edges cut by c∗ and λ is the size of the global minimum cut in the graph. Also note
that the Randomized Mincut algorithm (considered by Blum et al. [2004]), which
is an extension of the basic mincut approach, is actually motivated by a PAC-Bayes
sample complexity analysis of the problem.
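As an illustration of the mincut formulation (our sketch, using networkx rather than any code from the chapter), one can connect a source to the positively labeled nodes and a sink to the negatively labeled nodes with infinite-capacity edges and take a minimum s-t cut; the source side of the cut is then the most compatible hypothesis consistent with the labels.

```python
# Illustrative mincut-based transduction sketch (ours), in the spirit of
# [Blum and Chawla, 2001]: find the labeling of S that is consistent with the
# labeled nodes and cuts the least edge weight in the graph g.
import networkx as nx

def mincut_labels(W, labeled):
    """W: dict {(i, j): weight} over node pairs; labeled: dict {node: 0 or 1}."""
    g = nx.Graph()
    for (i, j), w in W.items():
        g.add_edge(i, j, capacity=w)
    s, t = "source", "sink"
    for node, y in labeled.items():
        g.add_edge(s if y == 1 else t, node, capacity=float("inf"))
    cut_value, (side_s, side_t) = nx.minimum_cut(g, s, t)
    return {node: int(node in side_s) for node in g.nodes if node not in (s, t)}

# Two chains joined by one weak edge; one label per chain.
W = {(0, 1): 1.0, (1, 2): 1.0, (2, 3): 0.1, (3, 4): 1.0, (4, 5): 1.0}
print(mincut_labels(W, {0: 1, 5: 0}))   # nodes 0-2 -> 1, nodes 3-5 -> 0
```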
Normalized Cut Consider the normalized cut setting of Joachims [2003] and for
f ∈ CS define size(f) to be the weight of the cut in g determined by f , and
let fneg and fpos be the number of points in S on which f predicts negative and
positive, respectively. For f ∈ CS, define the incompatibility of f to be
size(f)/(fneg · fpos).
Note that this is the implicit compatibility function used in Joachims [2003], and
again, algorithmically the goal would be to find a highly compatible hypothesis
that gets the labeled data correct. Unfortunately, the corresponding optimization
problem is in this case NP-hard. Still, several “approximate” solutions have been
considered, leading to different semi-supervised learning algorithms; for instance
Joachims considers in [Joachims, 2003] a spectral relaxation that leads to the “SGT
algorithm”; another relaxation, based on semidefinite programming, is considered
by Bie and Cristianini [2004].10
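For concreteness, the normalized-cut incompatibility of a candidate labeling can be computed directly (our illustration; the optimization over labelings itself, as noted above, is NP-hard and is only approximated in practice):

```python
# Illustrative computation (ours) of the normalized-cut incompatibility
#   size(f) / (f_neg * f_pos)
# for a candidate binary labeling f over the nodes of the graph g.
import numpy as np

def normalized_cut_incompatibility(W, f):
    W = np.asarray(W, dtype=float)      # weighted adjacency matrix of g
    f = np.asarray(f, dtype=int)        # candidate labels in {0, 1}
    cut = W[np.ix_(f == 1, f == 0)].sum()          # size(f): weight across the cut
    f_pos, f_neg = int((f == 1).sum()), int((f == 0).sum())
    return cut / (f_neg * f_pos)

W = np.array([[0, 1, 0.1, 0],
              [1, 0, 0.1, 0],
              [0.1, 0.1, 0, 1],
              [0, 0, 1, 0]])
print(normalized_cut_incompatibility(W, [1, 1, 0, 0]))   # cuts only the two weak edges
```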
Harmonic Function We can also model the algorithms introduced in [Zhu et al.,
2003a], [Zhu et al., 2003b] as follows. If we consider f to be a probabilistic
prediction function defined over S, then the incompatibility of f is given by
∑_{i,j} w_{ij} (f(i) − f(j))^2 = f^T L f, where L is the un-normalized Laplacian of g.
Similarly, we can model the algorithm introduced by Zhou et al. [2004] by noticing
that the incompatibility of f is given by f^T L f, where L is the normalized Laplacian
of g. More generally, all the graph kernel methods can be viewed in our framework
if we consider that the incompatibility of f is given by ‖f‖_K = f^T K f, where K is
a kernel derived from the graph (see for instance [Zhu et al., 2003c]).
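A small sketch of these quantities (ours, assuming numpy), computing f^T L f for both the un-normalized Laplacian and the standard normalized Laplacian built from the weighted adjacency matrix W:

```python
# Illustrative computation (ours) of the harmonic-function style incompatibilities:
# f^T L f with the un-normalized Laplacian L = D - W, and f^T L_norm f with the
# normalized Laplacian L_norm = I - D^{-1/2} W D^{-1/2}.
import numpy as np

def laplacian_incompatibilities(W, f):
    W = np.asarray(W, dtype=float)
    f = np.asarray(f, dtype=float)
    D = np.diag(W.sum(axis=1))
    L = D - W
    D_inv_sqrt = np.diag(1.0 / np.sqrt(np.diag(D)))
    L_norm = np.eye(len(W)) - D_inv_sqrt @ W @ D_inv_sqrt
    return f @ L @ f, f @ L_norm @ f

W = np.array([[0, 1, 0.1, 0],
              [1, 0, 0.1, 0],
              [0.1, 0.1, 0, 1],
              [0, 0, 1, 0]])
print(laplacian_incompatibilities(W, [1, 1, 0, 0]))
```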
21.5.2 Connections to Generative Models
It is also interesting to consider how generative models fit into our model. As
mentioned in Section 21.1, a typical assumption in a generative setting is that D is
a mixture with the probability density function p(x|θ) = p0 · p0(x|θ0)+ p1 · p1(x|θ1)
(see for instance [Ratsaby and Venkatesh, 1995], [Castelli and Cover, 1995, 1996]).
That means that the labeled examples are generated according to the following
mechanism: a label y ∈ {0, 1} is drawn according to the class distribution
{p0, p1}, and then a corresponding random feature vector is drawn according to
the class-conditional density py. The assumption typically used is that the mixture
is identifiable. Identifiability ensures that the Bayes optimal decision border
{x : p0 · p0(x|θ0) = p1 · p1(x|θ1)} can be deduced if p(x|θ) is known, and therefore one
can construct an estimate of the Bayes border by using the estimate p(x|θ̂) instead of p(x|θ).
Essentially once the decision border is estimated, a small labeled sample suffices to
learn (with high confidence and small error) the appropriate class labels associated
with the two disjoint regions generated by the estimate of the Bayes decision border.
To see how we can incorporate this setting in our model, consider for illustration the
setting in Ratsaby and Venkatesh [1995]; there they assume that p0 = p1, and that
the class conditional densities are d-dimensional Gaussians with unit covariance and
unknown mean vectors θi ∈ Rd. The algorithm used is the following: the unknown
parameter vector θ = (θ0, θ1) is estimated from unlabeled data using a maximum
likelihood estimate; this determines a hypothesis which is a linear separator that
passes through the point (θ0 + θ1)/2 and is orthogonal to the vector θ1 − θ0; finally
each of the two decision regions separated by the hyperplane is labeled according
10. For a more detailed discussion on this see also Chapter 7 in this book.
to the majority of the labeled examples in the region. Given this setting, a natural
notion of compatibility we can consider is the expected log-likelihood function
(where the expectation is taken with respect to the unknown distribution specified
by θ). Specifically, we can identify a legal hypothesis fθ with the set of parameters
θ = (θ0, θ1) that determine it, and then we can define χ(fθ,D) = Ex∈D[log(p(x|θ))].
Ratsaby and Venkatesh [1995] show that if the unlabeled sample is large enough,
then all hypotheses specified by parameters θ′ that are close enough to θ will have
the property that their empirical compatibilities are close to their true
compatibilities. This then implies (together with other observations about the Gaussian
mixture) that the maximum likelihood estimate will be close enough to θ, up to
permutations. (This actually motivates χ as a good compatibility function in our
model.)
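The algorithm described above is easy to sketch for the two-Gaussian case. The code below is our illustration, assuming numpy and scikit-learn's GaussianMixture as the maximum-likelihood estimator (the text assumes unit covariances; here a spherical covariance is fitted instead), with the empirical compatibility taken as the average log-likelihood on the unlabeled sample.

```python
# Illustrative sketch (ours) of the Ratsaby-Venkatesh style procedure:
# estimate theta = (theta0, theta1) from unlabeled data by maximum likelihood,
# take the separator through (theta0 + theta1)/2 orthogonal to theta1 - theta0,
# and label the two sides by majority vote of the labeled examples.
import numpy as np
from sklearn.mixture import GaussianMixture

def fit_generative_separator(X_unlabeled, X_labeled, y_labeled):
    gm = GaussianMixture(n_components=2, covariance_type="spherical",
                         random_state=0).fit(X_unlabeled)
    theta0, theta1 = gm.means_
    w = theta1 - theta0                           # normal of the separating hyperplane
    midpoint = (theta0 + theta1) / 2
    side = (np.asarray(X_labeled) - midpoint) @ w > 0
    y_labeled = np.asarray(y_labeled)
    pos_votes = y_labeled[side]                   # labels falling on the "positive" side
    pos_side_is_class1 = (pos_votes.mean() >= 0.5) if len(pos_votes) else (y_labeled.mean() < 0.5)

    def classify(X):
        s = (np.asarray(X) - midpoint) @ w > 0
        return (s == pos_side_is_class1).astype(int)

    compatibility = gm.score(X_unlabeled)         # empirical expected log-likelihood
    return classify, compatibility
```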
More generally, if we deal with another parametric family (but in the
same setting), we can use the same compatibility notion; however, we will need
to impose certain constraints on the distributions allowed in order to ensure that
the compatibility is actually well defined (the expected log-likelihood is bounded).
As mentioned in Section 21.1, this kind of generative setting is really at the extreme
of our model. The assumption that the distribution that generates the data is really
a mixture implies that if we knew the distribution, then there are only two possible
concepts left (and this makes the unlabeled data extremely useful).
21.5.3 Connections to the Luckiness Framework
It is worth noticing that there is a strong connection between our approach and
the luckiness framework (see [Shawe-Taylor et al., 1998], [Mendelson and Philips,
2003]). In both cases, the idea is to define an ordering of hypotheses that depends
on the data, in the hope that we will be “lucky” and find that not too many other
functions are as compatible as the target. There are two main differences, however.
The first is that the luckiness framework (being designed for supervised learning
only) uses labeled data both for estimating compatibility and for learning: this is a
more difficult task, and as a result our bounds on labeled data can be significantly
better. For instance, in Example 4 described in Section 21.2 for any non-degenerate
distribution, a dataset of d/2 pairs can with probability 1 be completely shattered by
fully-compatible hypotheses, so the luckiness framework does not help. In contrast,
with a larger (unlabeled) sample, one can potentially reduce the space of compatible
functions quite significantly, and learn from o(d) or even O(1) labeled examples
depending on the distribution; see Sections 21.3.2 and 21.4. Secondly, the luckiness
framework talks about compatibility between a hypothesis and a sample, whereas
we define compatibility with respect to a distribution. This allows us to talk about
the amount of unlabeled data needed to estimate true compatibility. There are also
a number of differences at the technical level of the definitions.
21.5.4 Conclusions
There has been widespread disagreement and no clear understanding about
when semi-supervised learning helps and by how much. In this chapter, we have
provided a PAC-style model that incorporates both labeled and unlabeled data,
that captures many of the ways unlabeled data is typically used, and provides a
unified framework for thinking about when and why unlabeled data can help. The
main high-level implication of our analysis is that unlabeled data is useful if (a)
we have a good notion of compatibility so that the target function indeed has a
low unlabeled error rate, (b) the distribution D is helpful in the sense that not
too many other hypotheses also have a low unlabeled error rate, and (c) we have
enough unlabeled data to estimate unlabeled error rates well. One consequence of our
model is that if the target function and data distribution are both well behaved with
respect to the compatibility notion, then the sample-size bounds we get for labeled
data can substantially beat what one could hope to achieve through pure labeled-
data bounds, and we have illustrated this with a number of examples through the
chapter.
22 Metric Based Approaches for Semi-Supervised
Regression
Dale Schuurmans
Finnegan Southey
Semi-supervised learning methods typically require an explicit relationship to be
asserted between the labelled and unlabelled data, as illustrated, for example, by
the neighbourhoods used in graph-based methods. We present semi-supervised model
selection and regularization methods that require instead only that the labelled and
unlabelled data are drawn from the same distribution. From this assumption a metric
can be constructed over hypotheses based solely on their predictions for unlabelled
data. This metric can then be used to detect untrustworthy training error estimates,
leading to model selection strategies that select the richest hypothesis class while
providing theoretical guarantees against overfitting. The general approach is then
adapted to regularization for supervised regression. This regularization adapts not
only to the hypothesis class but also to the specific data provided, allowing for better
performance than regularizers that account only for class complexity.
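As a rough illustration of the central idea of a prediction-based metric (our sketch, not code from the chapter), the distance between two hypotheses can be estimated from their predictions on the unlabelled sample alone, for example as the root-mean-square disagreement; such a distance lets one compare a fitted hypothesis against reference models without using any labels.

```python
# Rough sketch (ours) of a metric over hypotheses computed only from their
# predictions on unlabelled data, e.g. d(f, g) = sqrt(E_unlabelled[(f(x) - g(x))^2]).
import numpy as np

def prediction_metric(f, g, X_unlabelled):
    X = np.asarray(X_unlabelled)
    return float(np.sqrt(np.mean((f(X) - g(X)) ** 2)))
```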
22.1 Metrics for Supervised Learning
22.2 Model Selection
22.3 Regularization
22.4 Connections to Active Learning
22.5 Classification
References
Y. S. Abu-Mostafa. Machines that learn from hints. Scientific American, 272(4):64–69, 1995.
A. K. Agrawala. Learning with a probabilistic teacher. IEEE Transactions on Information Theory, 16:373–379, 1970.
M.-F. Balcan and A. Blum. An augmented PAC model for semi-supervised learning. Manuscript, 2005.
M.-F. Balcan, A. Blum, and K. Yang. Co-training and expansion: Towards bridging theory and practice. In NIPS, 2004.
P. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research, pages 463–482, 2002.
E. B. Baum. Polynomial time algorithms for learning neural nets. In Proceedings of the Third Annual Workshop on Computational Learning Theory, pages 258–272, 1990.
G. M. Benedek and A. Itai. Learnability with respect to a fixed distribution. Theoretical Computer Science, 86:377–389, 1991.
T. De Bie and N. Cristianini. Convex transduction with the normalized cut. Manuscript, 2004.
A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML 2001), pages 19–26, 2001.
A. Blum and R. Kannan. Learning an intersection of k halfspaces over a uniform distribution. Journal of Computer and Systems Sciences, 54(2):371–380, 1997.
A. Blum and T. M. Mitchell. Combining labeled and unlabeled data with co-training. In Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, 1998.
A. Blum, A. Frieze, R. Kannan, and S. Vempala. A polynomial-time algorithm for learning noisy linear threshold functions. Algorithmica, 22:35–52, 1998.
A. Blum, J. Lafferty, R. Reddy, and M. R. Rwebangira. Semi-supervised learning using randomized mincuts. In ICML ’04, 2004.
A. Blumer, A. Ehrenfeucht, D. Haussler, and M. K. Warmuth. Learnability and the Vapnik-Chervonenkis dimension. Journal of the ACM, 36(4):929–965, 1989.
S. Boucheron, G. Lugosi, and P. Massart. A sharp concentration inequality with applications. Random Structures and Algorithms, 16:277–292, 2000.
S. Boucheron, O. Bousquet, and G. Lugosi. Theory of classification: A survey of recent advances. Manuscript, 2004.
V. Castelli and T. M. Cover. On the exponential value of labeled samples. Pattern Recognition Letters, 16:105–111, 1995.
V. Castelli and T. M. Cover. The relative value of labeled and unlabeled samples in pattern recognition with an unknown mixing parameter. IEEE Transactions on Information Theory, 42(6):2102–2117, 1996.
M. Collins and Y. Singer. Unsupervised models for named entity classification. In Proceedings of the Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, pages 189–196, 1999.
S. Dasgupta, M. L. Littman, and D. McAllester. PAC generalization bounds for co-training. In T. G. Dietterich, S. Becker, and Z. Ghahramani, editors, Advances in Neural Information Processing Systems 14, Cambridge, MA, 2001. MIT Press.
L. Devroye, L. Gyorfi, and G. Lugosi. A Probabilistic Theory of Pattern Recognition. Springer-Verlag, 1996.
J. Dunagan and S. Vempala. Optimal outlier removal in high-dimensional spaces. In Proceedings of the 33rd ACM Symposium on Theory of Computing, 2001.
A. Ehrenfeucht, D. Haussler, M. Kearns, and L. Valiant. A general lower bound on the number of examples needed for learning. Information and Computation, 82:246–261, 1989.
A. Flaxman. Personal communication, 2003.
S. C. Fralick. Learning to recognize patterns without a teacher. IEEE Transactions on Information Theory, 13:57–64, 1967.
R. Ghani. Combining labeled and unlabeled data for text classification with a large number of categories. In Proceedings of the IEEE International Conference on Data Mining, 2001.
T. Joachims. Transductive learning via spectral graph partitioning. In Proceedings of the International Conference on Machine Learning (ICML 2003), 2003.
T. Joachims. Transductive inference for text classification using support vector machines. In Proceedings of the Sixteenth International Conference on Machine Learning (ICML 1999), pages 200–209, Bled, Slovenia, 1999. Morgan Kaufmann.
M. Kearns. Efficient noise-tolerant learning from statistical queries. Journal of the ACM (JACM), pages 983–1006, 1998.
M. Kearns and U. Vazirani. An Introduction to Computational Learning Theory. MIT Press, 1994.
J. Kleinberg. Detecting a network failure. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science, pages 231–239, 2000.
J. Kleinberg, M. Sandler, and A. Slivkins. Network failure detection and graph connectivity. In Proceedings of the 41st IEEE Symposium on Foundations of Computer Science, pages 231–239, 2004.
A. R. Klivans, R. O’Donnell, and R. Servedio. Learning intersections and thresholds of halfspaces. In Proceedings of the 43rd Symposium on Foundations of Computer Science, pages 177–186, 2002.
V. Koltchinskii. Rademacher penalties and structural risk minimization. IEEE Transactions on Information Theory, pages 1902–1914, 2001.
B. Leskes. The value of agreement, a new boosting algorithm. In COLT, pages 51–56, 2005.
A. Levin, P. Viola, and Y. Freund. Unsupervised improvement of visual detectors using co-training. In Proceedings of the Ninth IEEE International Conference on Computer Vision (ICCV 2003), pages 626–633, Nice, France, 2003. IEEE.
N. Linial, Y. Mansour, and N. Nisan. Constant depth circuits, Fourier transform, and learnability. In Proceedings of the Thirtieth Annual Symposium on Foundations of Computer Science, pages 574–579, Research Triangle Park, North Carolina, October 1989.
S. Mendelson and P. Philips. Random subclass bounds. In Proceedings of the 16th Annual Conference on Computational Learning Theory (COLT), 2003.
K. Nigam. Using unlabeled data to improve text classification. Doctoral dissertation, Technical Report CMU-CS-01-126, Carnegie Mellon University, 2001.
K. Nigam and R. Ghani. Analyzing the effectiveness and applicability of co-training. In Proc. ACM CIKM Int. Conf. on Information and Knowledge Management, pages 86–93, 2000.
K. Nigam, A. McCallum, S. Thrun, and T. M. Mitchell. Text classification from labeled and unlabeled documents using EM. Machine Learning, 39(2/3):103–134, 2000.
S. Park and B. Zhang. Large scale unstructured document classification using unlabeled data and syntactic information. In PAKDD 2003, LNCS vol. 2637, pages 88–99. Springer, 2003.
D. Pierce and C. Cardie. Limitations of Co-Training for natural language learning from large datasets. In Proc. Conference on Empirical Methods in NLP, pages 1–9, 2001.
J. Ratsaby and S. Venkatesh. Learning from a mixture of labeled and unlabeled examples with parametric side information. In Proceedings of the Eighth Annual Conference on Computational Learning Theory, pages 412–417, 1995.
H. J. Scudder. Probability of error of some adaptive pattern-recognition machines. IEEE Transactions on Information Theory, 11:363–371, 1965.
J. Shawe-Taylor, P. L. Bartlett, R. C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Transactions on Information Theory, 44(5):1926–1940, 1998.
L. G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.
V. N. Vapnik. Statistical Learning Theory. John Wiley and Sons Inc., 1998.
S. Vempala. A random sampling based algorithm for learning the intersection of half-spaces. In Proceedings of the 38th Symposium on Foundations of Computer Science, pages 508–513, 1997.
K. A. Verbeurgt. Learning DNF under the uniform distribution in quasi-polynomial time. In COLT, pages 314–326, 1990.
D. Yarowsky. Unsupervised word sense disambiguation rivaling supervised methods. In Meeting of the Association for Computational Linguistics, pages 189–196, 1995.
D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In NIPS, 2004.
X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In The Twentieth International Conference on Machine Learning, pages 912–912, Washington, DC, USA, 2003a. AAAI Press.
X. Zhu, Z. Ghahramani, and J. Lafferty. Combining active learning and semi-supervised learning using Gaussian fields and harmonic functions. In ICML-2003 Workshop on The Continuum from Labeled to Unlabeled Data in Machine Learning, pages 912–912, Washington, DC, USA, 2003b.
X. Zhu, Z. Ghahramani, and J. Lafferty. Semi-supervised learning: From Gaussian fields to Gaussian processes. Technical report, Carnegie Mellon University, 2003c.
Index
characteristic function, 93
Dirac δ, 93
notation table, 91
symbol list, 91
Notation and Symbols
Sets of Numbers
N the set of natural numbers, N = {1, 2, . . .}
R the set of reals
[n] compact notation for {1, . . . , n}
x ∈ [a, b] interval a ≤ x ≤ b
x ∈ (a, b] interval a < x ≤ b
x ∈ (a, b) interval a < x < b
|C| cardinality of a set C (for finite sets, the number of elements)
Data
X the input domain
d (used if X is a vector space) dimension of X
M number of classes (for classification)
l, u number of labeled, unlabeled training examples
n total number of examples, n = l + u.
i, j indices, often running over [l] or [n]
xi input patterns xi ∈ X
yi classes yi ∈ [M ] (for regression: target values yi ∈ R)
X a sample of input patterns, X = (x1, . . . , xn)
Y a sample of output targets, Y = (y1, . . . , yn)
Xl labeled part of X, Xl = (x1, . . . , xl)
Yl labeled part of Y , Yl = (y1, . . . , yl)
Xu unlabeled part of X, Xu = (xl+1, . . . , xl+u)
Yu unlabeled part of Y , Yu = (yl+1, . . . , yl+u)
Kernels
H feature space induced by a kernel
Φ feature map, Φ : X → H
k (positive definite) kernel
K kernel matrix or Gram matrix, Kij = k(xi, xj)
Vectors, Matrices and Norms
1 vector with all entries equal to one
I identity matrix
A> transposed matrix (or vector)
A−1 inverse matrix (in some cases, pseudo-inverse)
tr (A) trace of a matrix
det (A) determinant of a matrix
〈x,x′〉 dot product between x and x′
‖·‖ 2-norm, ‖x‖ := √〈x,x〉
‖·‖p p-norm, ‖x‖p := (∑_{i=1}^{N} |xi|^p)^{1/p}, N ∈ N ∪ {∞}
‖·‖∞ ∞-norm, ‖x‖∞ := sup_{i=1,...,N} |xi|, N ∈ N ∪ {∞}
Functions
ln logarithm to base e
log2 logarithm to base 2
f a function, often from X or [n] to R, RM or [M ]
F a family of functions
Lp(X) function spaces, 1 ≤ p ≤ ∞
Probability
P· probability of a logical formula
P(C) probability of a set (event) C
p(x) density evaluated at x ∈ X
E [·] expectation of a random variable
Var [·] variance of a random variable
N(µ, σ2) normal distribution with mean µ and variance σ2
Graphs
g graph g = (V,E) with nodes V and edges E
G set of graphs
W weighted adjacency matrix of a graph (Wij 6= 0 ⇔ (i, j) ∈ E)
D (diagonal) degree matrix of a graph, Dii = ∑_j Wij
L normalized graph Laplacian, L = I − D−1/2WD−1/2
L un-normalized graph Laplacian, L = D − W
SVM-related
ρf (x, y) margin of function f on the example (x, y), i.e., y · f(x)
ρf margin of f on the training set, i.e., min_{i=1,...,m} ρf (xi, yi)
h VC dimension
C regularization parameter in front of the empirical risk term
λ regularization parameter in front of the regularizer
w weight vector
b constant offset (or threshold)
αi Lagrange multiplier or expansion coefficient
βi Lagrange multiplier
α,β vectors of Lagrange multipliers
ξi slack variables
ξ vector of all slack variables
Q Hessian of a quadratic program
Miscellaneous
IA characteristic (or indicator) function on a set A
i.e., IA(x) = 1 if x ∈ A and 0 otherwise
δij Kronecker δ (δij = 1 if i = j, 0 otherwise)
δx Dirac δ, satisfying ∫ δx(y) f(y) dy = f(x)
O(g(n)) a function f(n) is said to be O(g(n)) if there exist constants C > 0
and n0 ∈ N such that |f(n)| ≤ Cg(n) for all n ≥ n0
o(g(n)) a function f(n) is said to be o(g(n)) if for every constant c > 0
there exists n0 ∈ N such that |f(n)| ≤ cg(n) for all n ≥ n0
rhs/lhs shorthand for “right/left hand side”
the end of a proof