Co-training
LING 572, Fei Xia, 02/21/06
Overview
• Proposed by Blum and Mitchell (1998)
• Important work:
  – (Nigam and Ghani, 2000)
  – (Goldman and Zhou, 2000)
  – (Abney, 2002)
  – (Sarkar, 2002)
  – …
• Used in document classification, parsing, etc.
Outline
• Basic concept: (Blum and Mitchell, 1998)
• Relation with other SSL algorithms: (Nigam and Ghani, 2000)
An example
• Web-page classification: e.g., find home pages of faculty members.
  – Page text: words occurring on that page
    e.g., “research interest”, “teaching”
  – Hyperlink text: words occurring in hyperlinks that point to that page
    e.g., “my advisor”
  (A small sketch of building these two views follows.)
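To make the two views concrete, here is a minimal sketch of how the page-based and hyperlink-based feature sets might be built. CountVectorizer and the field names page_text / anchor_text are illustrative assumptions, not details from the original paper.

```python
# Hypothetical sketch: build the two feature views for web-page classification.
from sklearn.feature_extraction.text import CountVectorizer

pages = [
    {"page_text": "research interests teaching publications students",
     "anchor_text": "my advisor"},
    {"page_text": "syllabus homework midterm final exam",
     "anchor_text": "course home page"},
]

# View #1 (page-based): bag of words from the page itself
view1 = CountVectorizer().fit_transform([p["page_text"] for p in pages])
# View #2 (hyperlink-based): bag of words from hyperlinks pointing to the page
view2 = CountVectorizer().fit_transform([p["anchor_text"] for p in pages])
```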
Two views
• Features can be split into two sets:
  – The instance space: X = X1 × X2
  – Each example: x = (x1, x2)
• D: the distribution over X
• C1: the set of target functions over X1; C2: the set of target functions over X2
• The target function is f = (f1, f2), with f1 ∈ C1 and f2 ∈ C2
Assumption #1: compatibility
• The instance distribution D is compatible with the target function f = (f1, f2) if, for any x = (x1, x2) with non-zero probability, f(x) = f1(x1) = f2(x2).
• The compatibility of f with D:
  p = 1 − Pr_D[ (x1, x2) : f1(x1) ≠ f2(x2) ]
Each set of features is sufficient for classification
Co-training algorithm (cont)
• Why use U’, in addition to U?
  – Using U’ yields better results.
  – Possible explanation: this forces h1 and h2 to select examples that are more representative of the underlying distribution D that generates U.
• Choosing p and n: the ratio p/n should match the ratio of positive to negative examples in D.
• Choosing the number of iterations and the size of U’.
  (A rough sketch of the whole loop is given below.)
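A minimal sketch of the co-training loop under the settings mentioned above (p, n, |U’|, number of iterations), assuming scikit-learn Naive Bayes classifiers, binary 0/1 labels, and numpy/scipy feature matrices. The helper structure, the way U’ is refilled, and the tie-breaking details are assumptions for illustration, not the paper's exact procedure.

```python
# Sketch of the Blum & Mitchell (1998) co-training loop (illustrative, simplified).
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def co_train(X1, X2, y, labeled, unlabeled, p=1, n=3, u_prime=75, iters=30):
    """X1, X2: the two feature views (rows = examples).
    y: numpy array of 0/1 labels, trusted only on the indices in `labeled`.
    labeled / unlabeled: lists of example indices."""
    rng = np.random.default_rng(0)
    L = list(labeled)                      # labeled pool (grows each round)
    U = list(unlabeled)                    # unlabeled pool
    y = y.copy()
    for _ in range(iters):
        if not U:
            break
        # Refill the small working pool U' by sampling from U; this keeps the
        # selected examples representative of the distribution D that generates U.
        Uprime = list(rng.choice(U, size=min(u_prime, len(U)), replace=False))
        h1 = MultinomialNB().fit(X1[L], y[L])   # view-1 (e.g., page-based) classifier
        h2 = MultinomialNB().fit(X2[L], y[L])   # view-2 (e.g., hyperlink-based) classifier
        newly = set()
        for h, X in ((h1, X1), (h2, X2)):
            proba = h.predict_proba(X[Uprime])[:, 1]
            # Each classifier labels its p most confident positives and
            # n most confident negatives from U'.
            pos = [Uprime[i] for i in np.argsort(-proba)[:p]]
            neg = [Uprime[i] for i in np.argsort(proba)[:n]]
            for i in pos:
                y[i] = 1
            for i in neg:
                y[i] = 0
            newly.update(pos + neg)
        L.extend(newly)                         # move newly labeled examples into L
        U = [i for i in U if i not in newly]
    # Final classifiers trained on the enlarged labeled set
    return MultinomialNB().fit(X1[L], y[L]), MultinomialNB().fit(X2[L], y[L])
```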
Intuition behind the co-training algorithm
• h1 adds examples to the labeled set that h2 will be able to use for learning, and vice versa.
• If the conditional independence assumption holds, then on average each added document will be as informative as a random document, and the learning will progress.
Experiments: setting
• 1051 web pages from 4 CS departments, manually labeled into a number of categories: e.g., “course home page”.
  – 263 pages (25%) as test data
  – The remaining 75% of pages:
    • Labeled data: 3 positive and 9 negative examples
    • Unlabeled data: the rest (776 pages)
• Two views:
  – View #1 (page-based): words in the page
  – View #2 (hyperlink-based): words in the hyperlinks
• Learner: Naïve Bayes
Experiment: results
Error rates (%):

                        Page-based    Hyperlink-based    Combined
                        classifier    classifier         classifier
  Supervised training       12.9           12.4             11.1
  Co-training                6.2           11.6              5.0

  (p = 1, n = 3, number of iterations: 30, |U’| = 75)
Questions
• Can co-training algorithms be applied to datasets without natural feature divisions?
• How sensitive are the co-training algorithms to the correctness of the assumptions?
• What is the relation between co-training and other SSL methods (e.g., self-training)?
EM
• Pool the features together.
• Use initial labeled data to get initial parameter estimates.
• In each iteration use all the data (labeled and unlabeled) to re-estimate the parameters.
• Repeat until convergence.
  (A rough sketch of this loop is given below.)
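A rough sketch of this EM loop for a Naive Bayes learner with pooled features. Representing the probabilistic labels via sample weights (each unlabeled document duplicated once per class, weighted by its class posterior) and using a fixed iteration count are implementation choices here, not something the slides prescribe.

```python
# Sketch of semi-supervised EM with pooled features (illustrative).
# Assumes bag-of-words count matrices (e.g., from CountVectorizer).
import numpy as np
from scipy.sparse import vstack
from sklearn.naive_bayes import MultinomialNB

def em_nb(X_lab, y_lab, X_unlab, iters=10):
    classes = np.unique(y_lab)
    # Initial parameter estimates from the labeled data only
    model = MultinomialNB().fit(X_lab, y_lab)
    for _ in range(iters):
        # E-step: probabilistically label all unlabeled documents
        post = model.predict_proba(X_unlab)          # shape [n_unlab, n_classes]
        # M-step: re-estimate parameters from labeled + soft-labeled data.
        # Each unlabeled document appears once per class, weighted by P(c | d).
        X_all = vstack([X_lab] + [X_unlab] * len(classes))
        y_all = np.concatenate([y_lab] +
                               [np.full(X_unlab.shape[0], c) for c in classes])
        w_all = np.concatenate([np.ones(X_lab.shape[0])] +
                               [post[:, k] for k in range(len(classes))])
        model = MultinomialNB().fit(X_all, y_all, sample_weight=w_all)
    return model
```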
Experimental results: WebKB course database
EM performs better than co-training. Both are close to the supervised method when trained on more labeled data.
Another experiment: The News 2*2 dataset
• A semi-artificial dataset
• Conditional independence assumption holds.
Co-training outperforms EM and the “oracle” result.
Co-training vs. EM
• Co-training splits features, EM does not.
• Co-training incrementally uses the unlabeled data.
• EM probabilistically labels all the unlabeled data at each round; it uses the unlabeled data iteratively.
Co-EM: EM with feature split
• Repeat until convergence:
  – Train the A-feature-set classifier using the labeled data and the unlabeled data with B’s labels.
  – Use classifier A to probabilistically label all the unlabeled data.
  – Train the B-feature-set classifier using the labeled data and the unlabeled data with A’s labels.
  – B re-labels the data for use by A.
  (A rough sketch of one round is given below.)
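A rough sketch of co-EM: the feature split of co-training combined with EM-style probabilistic labeling of all the unlabeled data each round. Initializing classifier B from the labeled data alone, the soft labels via sample weights, and the fixed iteration count are assumptions made here for illustration.

```python
# Sketch of co-EM with two feature views (illustrative).
import numpy as np
from scipy.sparse import vstack
from sklearn.naive_bayes import MultinomialNB

def soft_fit(X_lab, y_lab, X_unlab, post, classes):
    """Fit NB on labeled data plus unlabeled data soft-labeled by posteriors `post`."""
    X_all = vstack([X_lab] + [X_unlab] * len(classes))
    y_all = np.concatenate([y_lab] +
                           [np.full(X_unlab.shape[0], c) for c in classes])
    w_all = np.concatenate([np.ones(X_lab.shape[0])] +
                           [post[:, k] for k in range(len(classes))])
    return MultinomialNB().fit(X_all, y_all, sample_weight=w_all)

def co_em(X1_lab, X2_lab, y_lab, X1_unlab, X2_unlab, iters=10):
    classes = np.unique(y_lab)
    # Start from a view-B classifier trained on the labeled data only
    h_b = MultinomialNB().fit(X2_lab, y_lab)
    for _ in range(iters):
        # Train the A-view classifier using B's probabilistic labels ...
        h_a = soft_fit(X1_lab, y_lab, X1_unlab,
                       h_b.predict_proba(X2_unlab), classes)
        # ... then B re-labels all the unlabeled data with A's probabilistic labels
        h_b = soft_fit(X2_lab, y_lab, X2_unlab,
                       h_a.predict_proba(X1_unlab), classes)
    return h_a, h_b
```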
Random feature split
Co-training: 3.7%  5.5%
Co-EM:       3.3%  5.1%
When the conditional independence assumption does not hold, but there is sufficient redundancy among the features, co-training still works well.
Assumptions
• Assumptions made by the underlying classifier (supervised learner):
  – Naïve Bayes: words occur independently of each other, given the class of the document.
  – Co-training uses the classifier to rank the unlabeled examples by confidence.
  – EM uses the classifier to assign probabilities to each unlabeled example.
• Assumptions made by the SSL method:
  – Co-training: the conditional independence assumption.
  – EM: maximizing likelihood correlates with reducing classification errors.
Summary of (Nigam and Ghani, 2000)
• Comparison of four SSL methods: self-training, co-training, EM, co-EM.
• The performance of the SSL methods depends on how well the underlying assumptions are met.
• Random splitting features is not as good as natural splitting, but it still works if there is sufficient redundancy among features.
Variations of co-training
• Goldman and Zhou (2000) use two learners of different types, but both take the whole feature set.
• Zhou and Li (2005) use three learners: if two agree, the data is used to teach the third learner.
• Balcan et al. (2005) relax the conditional independence assumption to a much weaker expansion condition.
An alternative?
• L → L1, L → L2
• U → U1, U → U2
• Repeat:
  – Train h1 using L1 on feature set 1
  – Train h2 using L2 on feature set 2
  – Classify U2 with h1 and let U2’ be the subset with the most confident scores; L2 + U2’ → L2, U2 − U2’ → U2
  – Classify U1 with h2 and let U1’ be the subset with the most confident scores; L1 + U1’ → L1, U1 − U1’ → U1
Yarowsky’s algorithm
• one-sense-per-discourse
View #1: the ID of the document that a word is in
• one-sense-per-collocation
View #2: local context of word in the document
• Yarowsky’s algorithm is a special case of co-training (Blum & Mitchell, 1998)
• Is this correct? No, according to (Abney, 2002).
Summary of co-training
• The original paper: (Blum and Mitchell, 1998)
  – Two “independent” views: split the features into two sets.
  – Train a classifier on each view.
  – Each classifier labels data that can be used to train the other classifier.
• Extensions:
  – Relax the conditional independence assumption.
  – Instead of using two views, use two or more classifiers trained on the whole feature set.
Summary of SSL
• Goal: use both labeled and unlabeled data.
• Many algorithms: EM, co-EM, self-training, co-training, …
• Each algorithm is based on some assumptions.
• SSL works well when the assumptions are satisfied.