Decision List
LING 572, Fei Xia, 1/12/06
Outline
• Basic concepts and properties
• Case study
Definitions
• A decision list (DL) is an ordered list of conjunctive rules.
– Rules can overlap, so the order is important.
• A k-DL: the length of every rule is at most k.
• A decision list determines an example’s class by using the first matched rule.
An example
A simple DL:
1. If X1=v11 && X2=v21 then c1
2. If X2=v21 && X3=v34 then c2
Classify an example x = (v11, v21, v34): rule 1 matches first, so the class is c1 (see the sketch below).
This DL is a 2-DL: every rule has at most two conjuncts.
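A minimal sketch of this classification in Python; the attribute names, values, and class labels mirror the example above, and the dictionary encoding of rules is an illustrative choice, not part of the original slides.

```python
# A minimal sketch of classifying the example above with the 2-DL.

example = {"X1": "v11", "X2": "v21", "X3": "v34"}

decision_list = [
    ({"X1": "v11", "X2": "v21"}, "c1"),   # rule 1
    ({"X2": "v21", "X3": "v34"}, "c2"),   # rule 2
]

def classify(dl, x):
    """Return the class of the first rule whose condition x satisfies."""
    for condition, label in dl:
        if all(x.get(attr) == val for attr, val in condition.items()):
            return label
    return None  # no rule matched (a complete DL would end with a default rule)

print(classify(decision_list, example))   # -> "c1": rule 1 fires first
```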
Rivest’s paper
• It assumes that all attributes (including the goal attribute) are binary.
• It shows that DLs are easily learnable from examples.
Assignment and formula
• Input attributes: x1, …, xn
• An assignment gives each input attribute a value (1 or 0): e.g., 10001
• A boolean formula (function) maps each assignment to a value (1 or 0):
– Ex: $x_1 x_2 \vee \bar{x}_2 x_3$
• Two formulae are equivalent if they give the same value for the same input.
• Total number of different formulae: $2^{2^n}$ (see the worked count below)
Classification problem: learn a formula given a partial table
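A quick sanity check of the count (my own worked example, not from the slides):

```latex
% n binary attributes give 2^n distinct assignments; a formula fixes a value
% (0 or 1) for each assignment independently, so
\#\text{formulae} = 2^{2^n}
\qquad\text{e.g. } n=2:\; 2^2 = 4 \text{ assignments},\; 2^4 = 16 \text{ formulae}.
```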
CNF and DNF
• Literal: $x_i$ or $\bar{x}_i$
• Term: conjunction (“and”) of literals
• Clause: disjunction (“or”) of literals
• CNF (conjunctive normal form): the conjunction of clauses.
• DNF (disjunctive normal form): the disjunction of terms.
• k-CNF and k-DNF
– Ex. CNF: $(x_1 \vee x_4) \wedge (x_5 \vee x_2 \vee x_3)$
– Ex. DNF: $x_1 x_4 \vee x_5 x_2 x_3$
A slightly different definition of DT
• A decision tree (DT) is a binary tree where each internal node is labeled with a variable, and each leaf is labeled with 0 or 1.
• k-DT: the depth of a DT is at most k.
• A DT defines a boolean formula: take the disjunction over the paths whose leaf is labeled 1, with one term per path.
• An example (see the illustration below)
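As an illustration (this tree is my own example, not the slide’s figure), consider a depth-2 DT that tests x1 at the root, x2 on the x1=1 branch, and x3 on the x1=0 branch, with 1-leaves reached by (x1=1, x2=1) and (x1=0, x3=0). Reading off one term per path to a 1-leaf gives:

```latex
f(x) \;=\; x_1 x_2 \;\vee\; \bar{x}_1 \bar{x}_3
```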
Decision list
• A decision list is a list of pairs (f1, v1), …, (fr, vr), where the fi are terms and fr = true.
• A decision list defines a boolean function: given an assignment x, DL(x) = vj, where j is the least index s.t. fj(x) = 1.
Relations among different representations
• CNF, DNF, DT, DL
• k-CNF, k-DNF, k-DT, k-DL
– For any k < n, k-DL is a proper superset of the other three.
– Compared to a DT, a DL has a simpler structure, but the complexity of the decisions allowed at each node is greater.
k-CNF and k-DNF are proper subsets of k-DL
• k-DNF is a subset of k-DL:
– Each term t of the DNF is converted into a decision rule (t, 1), with a default rule (true, 0) at the end (see the sketch after this list).
• k-CNF is a subset of k-DL:
– Every k-CNF is the complement of a k-DNF: k-CNF and k-DNF are duals of each other.
– The complement of a k-DL is also a k-DL.
• Neither k-CNF nor k-DNF is a subset of the other.
– Ex: the 1-DNF $x_1 \vee x_2 \vee \dots \vee x_n$
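A minimal sketch of the k-DNF → k-DL construction from the first bullet above. The encoding of a term as a set of (variable index, required value) literals is an illustrative choice, not the paper’s notation.

```python
# Each DNF term t becomes a rule (t, 1); the list ends with a default rule (true, 0).

def term_fires(term, x):
    """True iff assignment x (a tuple of 0/1) satisfies every literal in the term."""
    return all(x[i] == val for i, val in term)

def dnf_to_dl(terms):
    """Build a decision list equivalent to the DNF given by `terms`."""
    rules = [(term, 1) for term in terms]
    rules.append((frozenset(), 0))        # empty term == "true": the default rule
    return rules

def dl_value(dl, x):
    for term, v in dl:
        if term_fires(term, x):
            return v

# Example: the 2-DNF  x1 x2  OR  (not x2) x3  over variables x1, x2, x3
dnf = [frozenset({(0, 1), (1, 1)}), frozenset({(1, 0), (2, 1)})]
dl = dnf_to_dl(dnf)
print(dl_value(dl, (1, 1, 0)))   # 1: the first term fires
print(dl_value(dl, (0, 0, 0)))   # 0: only the default rule fires
```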
k-DT is a proper subset of k-DL
• k-DT is a subset of k-DNF:
– Each leaf labeled with “1” maps to a term of the k-DNF.
• k-DT is a subset of k-CNF:
– Each leaf labeled with “0” maps to a clause of the k-CNF.
k-DT is a subset of k-CNF ∩ k-DNF
k-DT, k-CNF, k-DNF and k-DL
[Venn diagram: k-DT lies inside both k-CNF and k-DNF, and all three lie inside k-DL.]
Learnability
• Positive examples vs. negative examples of the concept being learned.
– In some domains, positive examples are easier to collect.
• A sample is a set of examples.
• A boolean function is consistent with a sample if it does not contradict any example in the sample.
Two properties of a learning algorithm
• A learning algorithm is economical if it requires few examples to identify the correct concept.
• A learning algorithm is efficient if it requires little computational effort to identify the correct concept.
We prefer algorithms that are both economical and efficient.
Hypothesis space
• Hypothesis space F: a set of concepts that are being considered.
• Ideally, the concept being learned is in the hypothesis space of the learning algorithm.
• The goal of a learning algorithm is to select the right concept from F given the training data.
• Discrepancy between two functions f and g: $\Delta_n(f,g) = P_n(\{x \mid f(x) \neq g(x)\})$
• Ideally, we want $\Delta_n(f,g)$ to be as small as possible (at most some accuracy parameter $\epsilon$).
• To deal with ‘bad luck’ in drawing examples according to $P_n$, we define a confidence parameter $\delta$: we only require $Pr(\Delta_n(f,g) \le \epsilon) \ge 1 - \delta$.
“Polynomially learnable”
• A set of boolean functions $F_n$ is polynomially learnable if there exists an algorithm A and a polynomial function $p(\cdot,\cdot,\cdot)$ s.t. for all $n$, all distributions $P_n$ on $X_n$, all $f \in F_n$, and all $\epsilon, \delta > 0$:
– when given a sample of f of size $m \ge p(\log|F_n|, 1/\epsilon, 1/\delta)$ drawn according to $P_n$, A will with probability at least $1 - \delta$ output a $g \in F_n$ s.t. $\Delta_n(f,g) \le \epsilon$;
– furthermore, A’s running time is polynomially bounded in n and m.
• k-DL is polynomially learnable.
How to build a decision list
• Decision tree → Decision list (convert a learned DT into a DL)
• Greedy, iterative algorithm that builds DLs directly.
The algorithm in (Rivest, 1987)
1. If the example set S is empty, halt.
2. Examine each term of length at most k until a term t is found s.t. all examples in S which make t true are of the same type v.
3. Add (t, v) to the decision list and remove those examples from S.
4. Repeat 1-3. (a sketch of this loop follows below)
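A small sketch of the four steps above for binary attributes, assuming examples are (assignment, label) pairs; the term encoding and helper names are my own, not Rivest’s notation.

```python
from itertools import combinations, product

def all_terms(n_vars, k):
    """Yield every term (conjunction of literals) of length <= k, including 'true'."""
    for length in range(k + 1):
        for idxs in combinations(range(n_vars), length):
            for vals in product((0, 1), repeat=length):
                yield tuple(zip(idxs, vals))      # literal = (variable index, required value)

def term_fires(term, x):
    return all(x[i] == v for i, v in term)

def learn_k_dl(examples, n_vars, k):
    """examples: list of (assignment tuple, label). Returns a list of (term, label) rules."""
    dl, S = [], list(examples)
    while S:                                      # step 1: halt when S is empty
        for term in all_terms(n_vars, k):         # step 2: try terms of length <= k
            covered = [lab for x, lab in S if term_fires(term, x)]
            if covered and len(set(covered)) == 1:
                dl.append((term, covered[0]))     # step 3: add (t, v) to the list ...
                S = [(x, lab) for x, lab in S     # ... and remove the covered examples
                     if not term_fires(term, x)]
                break
        else:
            return None   # no term qualifies: S is not consistent with any k-DL
    return dl                                     # step 4: loop until S is empty
```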
The general greedy algorithm
• RuleList = [], E = training_data
• Repeat until E is empty or the gain is small (sketched in code below):
– f = Find_best_feature(E)
– Let E’ be the examples covered by f
– Let c be the most common class in E’
– Add (f, c) to RuleList
– E = E – E’
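A sketch of the greedy loop above in Python. Here Find_best_feature is assumed to pick the feature whose covered examples have the lowest class entropy (one possible quality measure, mentioned later under “In practice”); features are arbitrary predicates and all names are illustrative.

```python
from collections import Counter
from math import log2

def class_entropy(labels):
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())

def find_best_feature(features, examples):
    """Pick the feature whose covered examples have minimum class entropy."""
    best, best_score = None, float("inf")
    for f in features:
        covered = [lab for x, lab in examples if f(x)]
        if covered and class_entropy(covered) < best_score:
            best, best_score = f, class_entropy(covered)
    return best

def learn_dl_greedy(features, training_data):
    rule_list, E = [], list(training_data)          # RuleList = [], E = training data
    while E:                                        # (a "gain is small" test could also stop the loop)
        f = find_best_feature(features, E)
        if f is None:
            break
        E_cov = [lab for x, lab in E if f(x)]       # E' = examples covered by f
        c = Counter(E_cov).most_common(1)[0][0]     # c = most common class in E'
        rule_list.append((f, c))                    # add (f, c) to RuleList
        E = [(x, lab) for x, lab in E if not f(x)]  # E = E - E'
    return rule_list
```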
Problems of the greedy algorithm
• The interpretation of a rule depends on the preceding rules.
• Each iteration reduces the number of training examples.
• Poor rule choices at the beginning of the list can significantly reduce the accuracy of the learned DL.
Several papers on alternative algorithms
Summary of (Rivest, 1987)
• Gives a formal definition of DL.
• Shows the relations among k-DL, k-CNF, k-DNF and k-DT.
• Proves that k-DL is polynomially learnable.
• Gives a simple greedy algorithm to build a k-DL.
Outline
• Basic concepts and properties
• Case study
In practice
• Input attributes and the goal are not necessarily binary.
– Ex: the previous word
• A term → a feature (it is not necessarily a conjunction of literals).
– Ex: the word appears in a k-word window
• Only some feature types are considered, instead of all possible features.
– Ex: previous word and next word
• Greedy algorithm: a quality measure is needed to pick the next rule.
– Ex: a feature with minimum entropy
Case study: accent restoration
• Task: restore accents in Spanish and French → a special case of WSD
• Ex: ambiguous de-accented forms:
– cesse → cesse, cessé
– cote → côté, côte, cote, coté
• Algorithm: build a DL for each ambiguous de-accented form: e.g., one for cesse, another one for cote
• Attributes: words within a window
The algorithm
• Training:
– Find the list of de-accented forms that are ambiguous.
– For each ambiguous form, build a decision list.
• Testing: check each word in a sentence
– if it is ambiguous, restore the accented form according to its DL.
Step 1: Identify forms that are ambiguous
Step 2: Collect training contexts
Context: the previous three and the next three words.
Strip the accents from the data. Why?
Step 3: Measure collocational distributions
Feature types are pre-defined.
Collocations
Step 4: Rank decision rules by log-likelihood
$\left|\log \frac{P(\text{word class}_1 \mid \text{collocation})}{P(\text{word class}_2 \mid \text{collocation})}\right|$
There are many alternatives.
Step 5: Prune DLs
• Pruning:
– Cross-validation
– Remove redundant rules: the “WEEKDAY” rule precedes the “domingo” rule.
Building a DL
• For a de-accented form w, find all possible accented forms.
• Collect training contexts:
– collect k words on each side of w
– strip the accents from the data
• Measure collocational distributions:
– use pre-defined attribute combinations
– Ex: “-1 w”, “+1 w, +2 w”
• Rank decision rules by log-likelihood.
• Optional pruning and interpolation. (the whole pipeline is sketched in code below)
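A rough sketch of this pipeline for one ambiguous de-accented form. The collocation templates, the smoothing, and the two-class log-likelihood ratio used for ranking are my assumptions (the slides only say “log-likelihood”); all names are illustrative.

```python
from collections import Counter, defaultdict
from math import log

TEMPLATES = [(-1,), (1,), (-1, 1), (1, 2)]          # e.g. "-1 w", "+1 w", "-1 w +1 w", "+1 w +2 w"

def collocations(context, center):
    """Pre-defined collocational features around position `center` in `context`."""
    feats = []
    for offsets in TEMPLATES:
        words = tuple(context[center + o] for o in offsets
                      if 0 <= center + o < len(context))
        if len(words) == len(offsets):              # keep only fully-instantiated templates
            feats.append((offsets, words))
    return feats

def build_dl(instances, center, alpha=0.1):
    """instances: list of (context token list, correct accented form)."""
    counts = defaultdict(Counter)                   # feature -> counts of accented forms
    for context, form in instances:
        for feat in collocations(context, center):
            counts[feat][form] += 1
    rules = []
    for feat, c in counts.items():
        (best_form, n1), *rest = c.most_common(2) + [(None, 0)]
        n2 = rest[0][1]
        ll = abs(log((n1 + alpha) / (n2 + alpha)))  # smoothed two-class log-likelihood ratio
        rules.append((ll, feat, best_form))
    rules.sort(key=lambda r: r[0], reverse=True)    # strongest evidence first
    prior = Counter(form for _, form in instances).most_common(1)[0][0]
    rules.append((0.0, None, prior))                # default rule: the most common form
    return rules
```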
Experiments
Prior (baseline): choose the most common form.
Global probabilities vs. Residual probabilities
• Two ways to calculate the log-likelihood:
– Global probabilities: using the full data set
– Residual probabilities: using the residual training data (the examples not yet covered by the preceding rules)
• Residual probabilities are more relevant, but there is less data and they are more expensive to compute.
• Interpolation: use both.
• In practice, global probabilities work better.
Combining vs. Not combining evidence
• Each decision is based on a single piece of evidence.
– Run-time efficiency and easy modeling
– It works well, at least for this task, but why?
• Combining all available evidence rarely produces different results
• “The gross exaggeration of prob from combining all of these non-independent log-likelihoods is avoided”
Summary of case study
• It allows a wider context (compared to n-gram methods)
• It allows the use of multiple, highly non-independent evidence types (compared to Bayesian methods)
• A “kitchen-sink approach of the best kind”
Advanced topics
Probabilistic DL
• DL: a rule is (f, v)
• Probabilistic DL: a rule is (f, v1/p1 v2/p2 … vn/pn)
Entropy of a feature q
• A feature q splits the training set T into S (the examples on which q fires) and T − S (the examples on which it does not); S is further partitioned by class into S1, S2, …, Sn.
• $entropy(q) = entropy(S) = -\sum_i p_i \log p_i$, where $p_i = |S_i| / |S|$ (computed in the sketch below)
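A small sketch of computing this measure, assuming features are predicates and examples are (x, label) pairs (my own encoding):

```python
from collections import Counter
from math import log2

# The entropy of feature q is the class entropy of S, the examples on which
# q fires, with p_i = |S_i| / |S|.

def feature_entropy(q, examples):
    """examples: list of (x, label); q: a predicate over x."""
    S = [label for x, label in examples if q(x)]          # the examples where q fires
    if not S:
        return float("inf")                               # q covers nothing: useless
    return -sum((n / len(S)) * log2(n / len(S))           # -sum_i p_i log p_i
                for n in Counter(S).values())

# Toy check: q covers 3 examples of class "a" and 1 of class "b"
data = [((1,), "a"), ((1,), "a"), ((1,), "a"), ((1,), "b"), ((0,), "b")]
print(round(feature_entropy(lambda x: x[0] == 1, data), 3))   # ~0.811
```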
Algorithms for building DL
• AQ algorithm (Michalski, 1969)
• CN2 algorithm (Clark and Niblett, 1989)
• Segal and Etzioni (1994)
• Goodman (2002)
• …
Summary of decision lists
• Rules are easily understood by humans (but remember that order matters).
• DLs tend to be relatively small, and fast and easy to apply in practice.
• DL is related to DT, CNF, DNF, and TBL.
• Learning: greedy algorithm and other improved algorithms
• Extension: probabilistic DL
– Ex: if A & B then (c1, 0.8) (c2, 0.2)