Robustness through Prior Knowledge: Using Explanation-Based Learning to Distinguish Handwritten Chinese Characters
TRANSCRIPT
[Slide 1]
Robustness through Prior Knowledge: Using Explanation-Based Learning to Distinguish
Handwritten Chinese Characters
Gerald DeJong, Computer Science, University of Illinois at Urbana-Champaign
[email protected]
Qiang Sun, Shiau Hong Lim, Li-Lun Wang
[Slide 2]
Challenges of Noisy Unstructured Text Data
• Noise – working with real input
  – Bottom-up limitations
  – Some true noise
  – Some self-induced variability
  – More reliant on prior structure
• Lack of structure – problem complexity
  – Top-down limitations
  – Highly structured = little variability
  – More reliant on input (noisy or otherwise)
[Slide 3]
Noise
• True noise
  – Missing information
  – Extra information
  – Random / Normal(?)
• Induced noise
  – Imperfect representation
    • Pixelization
    • Staircasing
    • Extra / missing blobs or pixels
  – Variability
    • Unmodeled / approximated world dynamics
    • Ignored parameters / covariates
    • Not random
    • Convenient to pretend it is true noise…
[Slide 4]
Structured vs. Unstructured

Relatively unstructured:

Call me Ishmael. Some years ago - never mind how long precisely - having little or no money in my purse, and nothing particular to interest me on shore, I thought I would sail about a little and see the watery part of the world. It is a way I have of driving off the spleen, and regulating the circulation…

Very structured:

Name: Ishmael
Finances: Low
Problem: Bored, Spleen
Date: Recent?

With more structure, less induced noise.
[Slide 5]
Unstructured: Deal with the Noise
• With structure → a programming problem
• Without structure → a learning problem
• Learn signal from noise via training examples
  – Each training example contains little information
  – Is there enough information?
  – Task dependent
• Difficulty: the subtlety of the required processing
• Two statistical NLP question types:
  – “How large is Brazil?”
  – “Will the Fed raise interest rates?”
  – The second requires integrating lots of partial evidence
[Slide 6]
Machine Learning as an Empirically Guided Search through a Hypothesis Space

[Figure: labeled + and − points in the example space X with training set Z, mapped to candidate hypotheses in the hypothesis space H]
[Slide 7]
What Makes a Learning Problem Hard?
• Expressiveness of the hypothesis space H
• Large / diverse / complex H:
  – More bad hypotheses can masquerade as good
  – More training examples are required for the desired confidence
• Want high confidence that a learner will produce a good approximation of the true concept
• Cost: more information ⇒ more training examples
[Slide 8]
Explanation-Based Learning: Information Beyond Training Examples
• Utilize existing domain knowledge
• Treat training examples as illustrations of a deeper pattern
• Explain how the assigned class label may arise from an example’s properties
• Explanations suggest the deeper patterns
• Calibrate and confirm using other training examples
[Slide 9]
Two Kinds of Prior Knowledge
• Solution Knowledge is directly relevant to a specific classification task.
– Can be readily used to bias a learning system.
– But it requires the expert to already know the solution and to possess expertise about the machine learner and its bias space.
• Domain Knowledge is more abstract and not tied to any particular classification task.
– “The same pen will leave similar-width strokes.”
– Only indirectly helpful for telling a “3” from a “6”
– Easy for human experts to articulate.
– Difficult to express in a statistical learner’s bias vocabulary
[Slide 10]
Solution vs. Domain Knowledge
• 3 vs. 8:
  – Right half: little information
  – Left half: much more information
• Solution knowledge: “pay attention to the left half”
• Domain knowledge:
  – Prior idealized stroke representations
  – Conjecture differential information
  – Calibrate & verify with training data
• EBL: derive solution knowledge by using domain knowledge interacting with training examples

[Images: a handwritten 3 and a handwritten 8]
[Slide 11]
The Explanation-Based Learning Approach

Transform Domain Knowledge into Solution Knowledge:
• Conjecture explanations for some training labels using Domain Knowledge.
• Evaluate explanation quality using the rest of the training set.
• Assemble statistically confirmed explanations into Solution Knowledge.
• Adjust the statistical learner’s bias to reflect the new Solution Knowledge.

[Diagram: domain knowledge + training examples → EBL → solution knowledge → inductive learner → classifier]
[Slide 12]
SVM Background (Support Vector Machines)
• Generic: few parameters to manipulate
• Linear AND nonlinear
  – Linear in a high-dimensional dot-product space
  – Nonlinear in the input feature space
• Expressiveness: nonlinear
• Cost: linear (+ convex optimization)
• Two cute nuggets:
  – Large margin: prefer low capacity / reduce overfitting
  – Kernel function (the kernel “trick”): compact, efficient, expressive
[Slide 13]
Handwritten Digits: an ML Success Story(?)
• Pixel input, e.g., 32 × 32 pixels, 8 bits each
• x = 1024 dimensions, 256 values
• Multi-class classifiers:
  – Ten one-vs-all index classifiers
  – Four Boolean encoders
  – All pairs w/ voting
  – …
• Generic ANNs work poorly
• Generic SVMs work better
• Specially designed ANNs work well*
  (*well: < 0.5% overall error; LeCun et al., ’98; Simard et al., ’03)

We are interested in generic solutions. (A baseline sketch follows.)
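As a concrete reference point for the “generic” baseline, here is a minimal sketch using scikit-learn. It substitutes the library’s bundled 8×8 digits for the 32×32 images above, and the kernel settings and train/test split are illustrative assumptions, not the experiments reported in these slides.

```python
# Minimal sketch: a generic polynomial-kernel SVM on raw pixel input.
# Assumptions: sklearn's 8x8 digits stand in for 32x32 images; settings are illustrative.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)              # each row is a flattened pixel vector
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

clf = SVC(kernel="poly", degree=3, coef0=1.0)    # complete cubic kernel (gamma x.y + 1)^3
clf.fit(X_tr, y_tr)                              # multi-class handled internally (one-vs-one)
print("test error:", 1.0 - clf.score(X_te, y_te))
```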
[Slide 14]
Class Information
• Let x be the vector of image pixels: x = {x1, x2, x3, … , x1024}
• Distributed:
  – No crucial input pixel
  – Class c: relations among many pixels
• x is sufficient:
  – Given the input x, the label is not ambiguous (at least to people)
  – Entropy(c | x) ≈ 0
• The separator is a function of the input pixels
• It must be nonlinear: interactions / relations among pixels determine the class assignment
[Slide 15]
What’s the Best Separating Hyperplane?

[Figure: + and − points with a candidate separating hyperplane]
[Slide 16]
What’s the Best Separating Hyperplane?

[Figure: the same + and − points with another candidate separating hyperplane]
[Slide 17]
What’s the Best Separating Hyperplane?

[Figure: the same + and − points with yet another candidate separating hyperplane]
[Slide 18]
What’s the Best Separating Hyperplane?

[Figure: + and − points; the maximum-margin hyperplane, its margin m, and the support vectors]

• Margin m
• Can use the radius r of the smallest enclosing sphere
• Capacity is related to (r/m)² (estimated in the sketch below)
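To make these quantities concrete, a hedged sketch on toy two-dimensional data (the data, kernel choice, and C value are assumptions): it reads off the margin m = 1/‖w‖ of a fitted linear SVM, estimates an enclosing-sphere radius r, and forms the capacity proxy (r/m)².

```python
# Sketch: margin, enclosing-sphere radius, and the (r/m)^2 capacity proxy.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 0.5, (50, 2)), rng.normal(2, 0.5, (50, 2))])
y = np.array([-1] * 50 + [1] * 50)

clf = SVC(kernel="linear", C=1e3).fit(X, y)           # large C ~ near hard-margin
m = 1.0 / np.linalg.norm(clf.coef_[0])                # geometric margin m = 1 / ||w||
r = np.linalg.norm(X - X.mean(axis=0), axis=1).max()  # crude enclosing-sphere radius
print("margin m:", m, " radius r:", r, " (r/m)^2:", (r / m) ** 2)
print("support vectors per class:", clf.n_support_)
```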
[Slide 19]
Kernel Methods
• Map to a new higher-dimensional space
  – Can be very high
  – Can be infinite
• Kernel functions
  – Introduce high dimensionality
  – Computation is independent of dimensionality
  – Defined w/ the dot product of input image vectors (information on the cosine between image vectors)
• A kernel function defines a distance metric over the space of example images (see the sketch below)
• Points not linearly separable: soft margin, margin distributions, …
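The distance-metric point admits a one-line formula: in the feature space a kernel K implicitly defines, the squared distance between two inputs is K(x,x) − 2K(x,z) + K(z,z). A small sketch, with an illustrative Gaussian kernel:

```python
# Sketch: the distance metric a kernel induces over example images.
import numpy as np

def rbf(x, z, sigma=1.0):
    return np.exp(-np.sum((x - z) ** 2) / (2 * sigma ** 2))

def kernel_distance(x, z, k=rbf):
    # squared feature-space distance: k(x,x) - 2 k(x,z) + k(z,z)
    return np.sqrt(k(x, x) - 2 * k(x, z) + k(z, z))

x, z = np.array([1.0, 0.0]), np.array([0.0, 1.0])
print(kernel_distance(x, z))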
[Slide 20]
SVMs for Digit Images
• K(x, y) = (x · y)³ or (x · y + 1)³
• The dot product gives a scalar; cube it. Consider how this works… (worked through below)
• Before: 32² features (about 10³)
• Now: ~(32²)³ features (about 10⁹)
• New feature = monomial = correlation among three pixels
• VC(linear separators) ~ # dimensions
• Overfitting problem?
  – Not if the margin is large
  – Monitor the number of support vectors
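A tiny sketch of how this works, with 3-vectors standing in for the 1024-pixel inputs: the cubic kernel value (x · y)³ coincides with an ordinary dot product over all degree-3 monomials, exactly the features the slide counts.

```python
# Sketch: (x . y)^3 equals a dot product in the space of all degree-3 monomials.
import itertools
import numpy as np

def cubic_features(x):
    # all ordered products x_i * x_j * x_k -- the implicit feature map, in miniature
    return np.array([x[i] * x[j] * x[k]
                     for i, j, k in itertools.product(range(len(x)), repeat=3)])

x = np.array([1.0, 2.0, -1.0])
y = np.array([0.5, 1.0, 3.0])
print((x @ y) ** 3)                           # kernel value, computed in input space
print(cubic_features(x) @ cubic_features(y))  # same number, computed the expensive way
```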
[Slide 21]
Mercer’s Condition / Representer Theorem
• Mercer: the kernel matrix is positive semidefinite (a numerical check follows).
• The desired hyperplane can be represented as a linear weighted sum of kernel “distances” to the support vectors:

$$f(\mathbf{x}) = \sum_{i=1}^{m} \alpha_i y_i K(\mathbf{s}_i, \mathbf{x}) + b$$

• The kernel defines the distance metric.
• The hypothesis space is represented efficiently by using some of the training examples – the support vectors.
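A quick numerical illustration of Mercer’s condition (the sample points and kernel below are assumptions): the Gram matrix of a valid kernel on any finite sample is positive semidefinite, so its smallest eigenvalue is nonnegative up to round-off.

```python
# Sketch: checking positive semidefiniteness of a kernel (Gram) matrix.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))            # 20 illustrative sample points
K = (X @ X.T + 1.0) ** 3                # complete cubic kernel Gram matrix
print("min eigenvalue:", np.linalg.eigvalsh(K).min())  # >= 0 (up to round-off)
```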
[Slide 22]
Distinguishing Handwritten Sevens vs. Twos and Eights

[Images: handwritten twos, eights, and sevens]

• Handwritten, 32 × 32 gray-scale pixels
• The input feature space is inappropriate
• Map inputs to a high-dimensional space
• Many more features; nonlinear combinations
• Linearly separable in the new space
[Slide 23]
Mercer Kernels

Usually start with a kernel rather than features:
• (s · x)^d – homogeneous polynomials
• (s · x + 1)^d – complete polynomials
• exp(−‖s − x‖² / 2σ²) – Gaussian / RBF

Closure: if K and k are Mercer kernels and c ≥ 0, then K + k, cK, K + c, and Kk are Mercer kernels as well (checked numerically below).
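A sketch of these closure rules in action (the component kernels, data, and constant are illustrative): each composite Gram matrix remains positive semidefinite, as Mercer requires.

```python
# Sketch: composite kernels K + k, cK, K + c, and K*k stay valid (PSD Gram matrices).
import numpy as np

def poly_gram(X, d=3):
    return (X @ X.T) ** d                                 # homogeneous polynomial kernel

def rbf_gram(X, sigma=1.0):
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)   # pairwise squared distances
    return np.exp(-sq / (2 * sigma ** 2))                 # Gaussian / RBF kernel

X = np.random.default_rng(0).normal(size=(10, 4))
K, k, c = poly_gram(X), rbf_gram(X), 2.0
for name, G in [("K+k", K + k), ("cK", c * K), ("K+c", K + c), ("K*k", K * k)]:
    print(name, "PSD:", np.linalg.eigvalsh(G).min() > -1e-9)
```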
[Slide 24]
Problems: SVMs & Statistical Learning Generally
• Little information from each training example
  – The signal must show through the noise
  – Need many training examples
  – Thousands are needed for handwritten digits
• Much information is ignored (weak bias vocabulary)
• Compare w/ humans:
  – Given a novel simple shape of similar complexity
  – They master it with several tens (perhaps a hundred) of training examples
  – Exceedingly small non-fatigue error rate
• Chinese characters are much more difficult than digits
[Slide 25]
Two Related Classification Problems

|        | No. examples | Error      |
|--------|--------------|------------|
| Humans | < 100 ?      | negligible |
| SVMs   | 60,000       | 1.2%       |
[Slide 26]
Two Related Classification Problems

|        | No. examples | Error      |
|--------|--------------|------------|
| Humans | < 100 ?      | negligible |
| SVMs   | 60,000       | 1.2%       |

Now apply a fixed permutation over the pixels…
[Slide 27]
Two Related Classification Problems

Original pixels:

|        | No. examples | Error      |
|--------|--------------|------------|
| Humans | < 100 ?      | negligible |
| SVMs   | 60,000       | 1.2%       |

With a fixed permutation over the pixels:

|        | No. examples | Error |
|--------|--------------|-------|
| Humans | NA           | 50%   |
| SVMs   | 60,000       | 1.2%  |

To an SVM these are the same problem. Apparently the SVM ignores information crucial to people.
[Slide 28]
Strokes Make the Difference
• Explanatory hidden features:
  – Humans know that strokes mediate between pixels and class labels.
  – Statistical machine learners find the pattern using pixel-level inputs alone, without knowing about strokes.
• What can this example tell us?
  – Statistical learning algorithms are advanced enough to extract complex patterns from data.
  – But simple prior knowledge (e.g., the existence of strokes) may help to find relevant patterns faster and more accurately.
• Inventing latent features is hard for statistics.
[Slide 29]
Domain Knowledge
• What can we say about strokes?
  – Within an image they are written by the same person using the same writing instrument…
  – They are made by a succession of simple pen movements…
  – They give rise to the pixels…
  – Much information! (suppose it did not hold)
• This is not easily captured in the native bias vocabulary (it is not solution knowledge).
• Knowledge about strokes is imperfect, so building a bottom-up stroke extractor is error-prone.
[Slide 30]
Primary Domain: Distinguishing Handwritten Chinese Characters
• More complex than digits or Western characters (64 × 63 pixels).
• Thousands of different characters ⇒ few training examples available for each (200 labeled images for us).
• Domain knowledge includes an ideal prototype stroke representation for each character.
[Slide 31]
Handwritten Chinese Characters
• We selected ten characters in three classes.
• This yields forty-five pairwise classification problems.
• Classification difficulty varies significantly by classification problem.
[Slide 32]
Hough Transform
• Old (but good) idea
• Maps image points ⟨x, y⟩ to line parameters ⟨m, b⟩, given y = mx + b
• The Hough transform makes a poor line detector
• BUT explaining is easy and reliable (the class label determines the ideal prototype stroke representation)
• We know the lines:
  – approximate parameters,
  – geometric constraints
• Find / hallucinate the Hough peaks to optimize the fit (see the sketch below)
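A minimal sketch of the ⟨x, y⟩ → ⟨m, b⟩ voting scheme above. The toy “stroke,” the grid resolution, and the peak-picking are assumptions; a real system would add the approximate parameters and geometric constraints the slide mentions.

```python
# Sketch: Hough transform in (slope, intercept) space for y = m*x + b.
import numpy as np

def hough_mb(points, m_bins, b_bins):
    acc = np.zeros((len(m_bins), len(b_bins)), dtype=int)
    for x, y in points:
        for i, m in enumerate(m_bins):
            b = y - m * x                       # intercept implied by this slope
            j = np.abs(b_bins - b).argmin()     # nearest intercept bin
            acc[i, j] += 1                      # the pixel votes for (m, b)
    return acc

pts = [(x, 2 * x + 1) for x in range(10)]       # toy "stroke" on the line y = 2x + 1
m_bins, b_bins = np.linspace(-4, 4, 81), np.linspace(-10, 10, 81)
acc = hough_mb(pts, m_bins, b_bins)
i, j = np.unravel_index(acc.argmax(), acc.shape)
print("Hough peak near m =", m_bins[i], ", b =", b_bins[j])
```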
[Slide 33]
Feature Kernel Functions
• Design special-purpose kernel functions
• Adapt the “distance” metric to fit the task
• Emphasize expected high-information-content pixels (one possible construction is sketched below)
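One plausible way to realize such a feature kernel function, sketched under stated assumptions: weight each pixel by an “information” mask before applying a standard polynomial kernel, so the induced distance emphasizes the expected high-information pixels. The mask and the weighting scheme below are illustrative, not the authors’ exact construction.

```python
# Sketch: a pixel-weighted polynomial kernel (one reading of the FKF idea).
import numpy as np
from sklearn.svm import SVC

def make_fkf(weights, degree=3):
    w = np.asarray(weights, dtype=float)
    def fkf(A, B):
        # weighted dot products between all row pairs, then the cubic kernel
        return ((A * w) @ (B * w).T) ** degree
    return fkf

n_pixels = 64                                           # illustrative image size
weights = np.where(np.arange(n_pixels) < 32, 1.0, 0.1)  # e.g., emphasize the "left half"
clf = SVC(kernel=make_fkf(weights))                     # scikit-learn accepts callable kernels
```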
[Slide 34]
Explaining Chinese Characters
• A pixel is judged to be informative if it is likely to be part of an informative stroke feature.
• Stroke features are informative if they are distinctive between the ideal prototype characters.
• Interaction between training examples and the prior domain knowledge is crucial.
[Slide 35]
Constructing Explanations

[Example characters: 五 and 互]

• From domain knowledge, the top and bottom horizontal strokes are unlikely to be informative.
• Explanation: apply a linear Hough transformation to identify lines in the image, and associate pixels in the image with strokes.
• Prototype stroke representations greatly aid in identifying the pixel–stroke correspondence in training examples (but not test examples).
• High-information pixels correspond to distinctive stroke-level features.
[Slide 36]
What is an Explanation for the Feature Kernel Function Approach?
• An account of where the class information is expected to be found within the input image pixels
• Uniform emphasis over the disk containing 90% of the probability mass of the fitted Gaussian (see the worked radius below)
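As a worked detail, assuming the fitted Gaussian is isotropic with standard deviation σ in two dimensions, the radius of the disk holding 90% of its probability mass follows from the Rayleigh law:

$$P(\|p - \mu\| \le r) = 1 - e^{-r^2 / 2\sigma^2} = 0.9 \;\;\Longrightarrow\;\; r = \sigma\sqrt{2\ln 10} \approx 2.15\,\sigma$$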
[Slide 37]
Experiments: Feature Kernel Function vs. Conventional (Cubic Polynomial SVM)

FKF: similar performance with nearly an order of magnitude less training.

[Scatter plot: performance by problem for the 45 problems]
All problems improve; FKF never hurts. The lower slope suggests that the hardest problems are helped most.
[Slide 38]
Experiments: Feature Kernel Function vs. Conventional (Cubic Polynomial SVM)

[Learning curves by problem difficulty, as judged by SVM accuracy: A) hardest, B) middle, C) easiest third]
[Slide 39]
Experiments: Feature Kernel Function vs. Conventional (Cubic Polynomial SVM)

For each problem at full training, FKF always uses fewer support vectors.

The interaction between prior knowledge and training examples is crucial.
[Slide 40]
Explanation-Augmented Support Vector Machine
• EA-SVM: another approach
• Previous approach adapted the kernel function
• EA-SVM alters the SVM algorithm; uses standard kernel function
• Explanations are integrated directly as a bias
[Slide 41]
EA-SVM: What is an Explanation?
• An explanation is a generalization of a training example, a proposed equivalence class of examples.
• The same explanation implies the same label for the same reason, and should be treated the same by the classifier.
• For an SVM, examples with the same explanation should have the same margin.
• A perfect explanation is a hyperplane to which the classifier should be parallel.
• Explanations are not perfect.
• So prefer a decision surface that is more nearly parallel to confirmed explanations.
• Penalize non-parallelness.

[Figure: examples x1, x2, x3; an explanation constrains the decision surface]
[Slide 42]
Formalizing the Constraints Mathematically

• Let an explanation justify the label for a given example x using only a subset e of the features. The explained example v is defined as:

$$v_i = \begin{cases} x_i & \text{if } i \in e \\ \text{‘*’} & \text{otherwise} \end{cases}$$

The special symbol ‘*’ indicates that this feature does not participate in the inner-product evaluation. With numerical features one can simply use the value zero.

• The constraints can be expressed as:

$$\mathbf{w} \cdot \mathbf{x} + b = \mathbf{w} \cdot \mathbf{v} + b$$

or equally:

$$\mathbf{w} \cdot (\mathbf{x} - \mathbf{v}) = 0$$

• Geometrically, this requires the classifier hyperplane to be parallel to the direction x − v. (A small sketch follows.)
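A small sketch of this construction, using zero as the numerical stand-in for ‘*’ as suggested above; the vectors and feature subset are toy assumptions:

```python
# Sketch: building the explained example v and checking w . (x - v).
import numpy as np

def explained_example(x, e):
    v = np.zeros_like(x)            # 0 plays the role of '*' for numerical features
    idx = list(e)
    v[idx] = x[idx]                 # v_i = x_i for i in e
    return v

x = np.array([0.7, 0.0, 0.9, 0.3])
v = explained_example(x, e={0, 2})  # explanation uses only features 0 and 2
w = np.array([1.0, -2.0, 0.5, 4.0])
# residual is 0 exactly when the hyperplane is parallel to x - v
print("w . (x - v) =", w @ (x - v))
```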
[Slide 43]
EA-SVMs: Explanation-Augmented Support Vector Machines
• Incorporate high-quality explanations into a conventional SVM.
• The classifier reflects information from both examples and domain knowledge.
• The optimal classifier blends:
  – maximal conventional margin on the training examples
  – maximal parallelism to high-quality explanations
• We use soft constraints for each.
• Similar analyses, using two sets of slack variables.
• Linear blending via cross-validation.
[Slide 44]
The EA-SVM Optimization Problem

• Perfect knowledge:

$$\min_{\mathbf{w}} \; \tfrac{1}{2}\|\mathbf{w}\|^2 \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0, \;\; \mathbf{w} \cdot \mathbf{x}_i - \mathbf{w} \cdot \mathbf{v}_i = 0 \quad \forall i$$

• Imperfect knowledge:
  – Introduce positive new slack variables η_i; the optimization problem becomes:

$$\min_{\mathbf{w},\, \boldsymbol{\eta}} \; \tfrac{1}{2}\|\mathbf{w}\|^2 + K \sum_i \eta_i \quad \text{subject to} \quad y_i(\mathbf{w} \cdot \mathbf{x}_i + b) - 1 \ge 0, \;\; -\eta_i \le \mathbf{w} \cdot \mathbf{x}_i - \mathbf{w} \cdot \mathbf{v}_i \le \eta_i, \;\; \eta_i \ge 0 \quad \forall i$$

  – K, the confidence parameter, is determined by cross-validation; it blends empirical and explanation information. (A solver sketch follows.)
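A minimal primal-form sketch of the linear case of this program, solved with cvxpy. The toy data, the explained examples V, and the value of K are assumptions; this illustrates the stated optimization rather than reproducing the authors’ implementation.

```python
# Sketch: linear EA-SVM primal -- margin constraints plus soft parallelism constraints.
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 40, 6, 10.0
X = np.vstack([rng.normal(2, 0.5, (n // 2, d)), rng.normal(-2, 0.5, (n // 2, d))])
y = np.array([1.0] * (n // 2) + [-1.0] * (n // 2))
V = X.copy()
V[:, d // 2:] = 0.0                     # explanations: only the first d/2 features matter

w, b = cp.Variable(d), cp.Variable()
eta = cp.Variable(n, nonneg=True)       # slack on the parallelism constraints
prob = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w) + K * cp.sum(eta)),
    [cp.multiply(y, X @ w + b) >= 1,    # y_i (w . x_i + b) - 1 >= 0
     cp.abs((X - V) @ w) <= eta])       # |w . (x_i - v_i)| <= eta_i
prob.solve()
print("objective:", prob.value)
```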
[Slide 45]
Solutions for EA-SVM
• With perfect knowledge:

$$\mathbf{w} = \sum_i \alpha_i y_i \mathbf{x}_i + \sum_i \delta_i (\mathbf{x}_i - \mathbf{v}_i)$$

where $\alpha_i \ge 0$ and $\sum_i \alpha_i y_i = 0$.

• With imperfect knowledge, w takes the same form, where

$$\alpha_i \ge 0, \quad \sum_i \alpha_i y_i = 0, \quad -K \le \delta_i \le K$$

• When the confidence parameter K goes to infinity, the second solution reduces to the first one.
• When K and the δ_i are 0, the problem ignores the explanations and reduces to a standard SVM.
[Slide 46]
Formal Analysis: Why EA-SVM Works
• The EA-SVM algorithm minimizes an error bound whose capacity term h is examined on the next slide.
• Interesting symbols in the expression for h:
  – R_V: the radius of the ball that contains all the explained examples; we expect R_V < R.
  – D: the penalty incurred when a separator ⟨u, b⟩ violates the parallelism constraints imposed by the explanations.
  – λ: determined by cross-validation to minimize h.
[Slide 47]
A Simple Prediction
• A closer look at h:

$$h \le 64.5\,\frac{R_V^2 + \lambda^2 D^2}{\rho^2}$$

• With perfect knowledge, D = 0:

$$h \le 64.5\, R_V^2 / \rho^2$$

• Without knowledge:

$$h \le 64.5\, R^2 / \rho^2$$

• EA-SVM has the most to offer when the ratio R_V / R is small, which means explanations use few important features to justify the label. Intuitively: the learning problem is difficult, but the domain knowledge is informative.
[Slide 48]
Experiment 1: Does Explanation-Augmentation Help?

[Scatter plot: EA-SVM error vs. SVM error, both axes 0 to 0.2]

Results for 45 classifiers on pairs of Chinese characters. Points below the diagonal mean EA-SVM makes fewer errors than the SVM.
[Slide 49]
Experiment 2: Difficult Problems Benefit More

[Learning curves: error rate vs. training size (5 to 160) for EA-SVM and SVM on easy and difficult tasks; scatter plot of improvement vs. task difficulty]

EA-SVM vs. SVM: similar on easy tasks; on difficult tasks, EA-SVM wins at all training levels.

Task difficulty is highly correlated with the improvement of EA-SVM over the conventional SVM.
[Slide 50]
Exp 3: Robustness and the Effect of Knowledge Quality

[Scatter plots of EA-SVM error vs. SVM error under three conditions: expert knowledge, random knowledge, and opposite knowledge]

EA-SVM benefits from good knowledge, and is not hurt by incorrect knowledge.
[Slide 51]
Exp 4: Additional (Non-image) Domains
• Protein explanations: only known motif sequences are important for protein categorization.
• Text explanations: only words related to the category label are important.
• ROC (protein) and F1 (text) scores show EA-SVM improvement.

[Scatter plots: A. protein, ROC_EA-SVM vs. ROC_SVM; B. text, F1_EA-SVM vs. F1_SVM]
[Slide 52]
Previous Work on Incorporating Knowledge into SVMs (Solution Knowledge)
• Incorporating transformation invariance into SVMs:
  – Virtual support vectors (Schölkopf, 1996)
  – Invariant kernel functions (Schölkopf, 2002)
  – Jittered SVM (DeCoste & Schölkopf, 2002)
  – Tangent propagation (Simard, 1992, 1998)
• Locally-improved kernel functions exploit the spatial-locality property (Schölkopf, 1998).
• Convolutional networks (LeCun et al., 1998; Simard et al., 2003).
• Knowledge-based SVMs and kernels incorporate prior rules (Fung, Mangasarian & Shavlik, 2002, 2003; Mangasarian, Shavlik & Wild, 2004).
• Extracting high-level character features from the pixel representation (Teow, 2000; Shi, 2003; Kadir, 2004, …).
[Slide 53]
Conclusion
• Inductive learning algorithms can benefit from domain knowledge.
• This work illustrates a novel direction for using knowledge: combining EBL ideas into a statistical learner.
• With Domain Knowledge, the expert need not also be an expert in the learning algorithms.
• The EBL components are extremely simple; more can be done.
• The role of Domain Knowledge, rather than Solution Knowledge, demands further study; this is an important and little-explored direction.
• Next step: IJCAI-07 poster, “Explanation-Based Feature Construction” (Shiau Hong Lim).