Randomized Variable Elimination
David J. Stracuzzi and Paul E. Utgoff
Agenda
• Background
• Filter and wrapper methods
• Randomized Variable Elimination
• Cost function
• RVE algorithm when r is known (RVE)
• RVE algorithm when r is not known (RVErS)
• Results
• Questions
Variable Selection Problem
• Choosing relevant attributes from a set of attributes.
• Producing the subset of variables, from a large set of input variables, that best predicts the target function.
• A forward selection algorithm starts with an empty set and searches for variables to add.
• A backward selection algorithm starts with the entire set of variables and keeps removing irrelevant variable(s).
• In some cases, a forward selection algorithm also removes variables in order to recover from previous poor selections.
• Caruana and Freitag (1994) experimented with greedy search methods and found that allowing the search to both add and remove variables outperforms simple forward and backward searches.
• Filter and wrapper methods for variable selection.
Filter methods
• Use statistical measures to evaluate the quality of variable subsets.
• Subsets of variables are evaluated with respect to a specific quality measure.
• Statistical evaluation of variables requires very little computational cost compared to running the learning algorithm.
• FOCUS (Almuallim and Dietterich, 1991) searches for the smallest subset that completely discriminates between target classes.
• Relief (Kira and Rendell, 1992) ranks variables according to a distance measure.
• In filter methods, variables are evaluated independently, not in the context of the learning problem.
Wrapper methods
• Use the performance of the learning algorithm to evaluate the quality of a subset of input variables.
• The learning algorithm is executed on the candidate variable set, and the resulting hypothesis is tested for accuracy, as sketched below.
• Advantage: since wrapper methods evaluate variables in the context of the learning problem, they tend to outperform filter methods.
• Disadvantage: the cost of repeatedly executing the learning algorithm can become problematic.
• John, Kohavi, and Pfleger (1994) coined the term "wrapper", but the technique was used earlier (Devijver and Kittler, 1982).
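To make the contrast with filter methods concrete, here is a minimal sketch of a wrapper evaluation; `learn` and `error` are hypothetical stand-ins for the learning algorithm and its error estimate, not code from the paper:

```python
def best_subset(learn, error, candidate_subsets):
    """Wrapper-style selection: score each candidate variable subset by
    actually training the learner on it. Accurate but costly, since the
    learner runs once per candidate subset."""
    scored = [(error(learn(subset)), subset) for subset in candidate_subsets]
    return min(scored, key=lambda pair: pair[0])  # (lowest error, subset)
```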
Randomized Variable Elimination
• Falls under the category of wrapper methods.
• First, a hypothesis is produced for the entire set of n variables.
• A subset is formed by randomly selecting k variables.
• A hypothesis is then produced for the remaining (n − k) variables.
• The accuracies of the two hypotheses are compared.
• Removal of any relevant variable should cause an immediate decline in performance.
• Uses a cost function to achieve a balance between successive failures and the cost of running the learning algorithm several times.
Probability of selecting ‘k’ variables
• The probability of successfully selecting k irrelevant variables at random is given by

$$p^-(n, r, k) = \prod_{i=0}^{k-1} \frac{n - r - i}{n - i}$$

where n is the number of remaining variables and r the number of relevant variables.
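The product form translates directly into a few lines of Python; this is a plain transcription of the formula, not code from the paper:

```python
def p_minus(n, r, k):
    """Probability that k variables drawn uniformly at random without
    replacement from n remaining variables (r of them relevant) are
    all irrelevant: prod_{i=0}^{k-1} (n - r - i) / (n - i)."""
    p = 1.0
    for i in range(k):
        p *= (n - r - i) / (n - i)
    return p
```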
Expected number of failures
• The expected number of consecutive failures before a success at selecting k irrelevant variables is given by

$$E^-(n, r, k) = \frac{1 - p^-(n, r, k)}{p^-(n, r, k)}$$

• This is the number of consecutive trials in which at least one of the r relevant variables is randomly selected along with irrelevant variables.
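Continuing the sketch above: each trial succeeds independently with probability p⁻, so the failure count is geometric, and the expectation follows directly (reusing `p_minus` from the previous sketch):

```python
def expected_failures(n, r, k):
    """Expected number of consecutive failed trials before k irrelevant
    variables are successfully selected, for a geometric process with
    success probability p_minus(n, r, k)."""
    p = p_minus(n, r, k)
    return (1.0 - p) / p
```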
Cost of removing k variables
• The expected cost of successfully removing k variables from n remaining, given r relevant variables, is

$$I(n, r, k) = E^-(n, r, k)\, M(L, n-k) + M(L, n-k) = M(L, n-k)\left(E^-(n, r, k) + 1\right)$$

where M(L, n) represents an upper bound on the cost of running algorithm L on n inputs.
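A hedged sketch of the cost computation, reusing `expected_failures` above; the upper-bound function `M` is a caller-supplied assumption (here a hypothetical quadratic-cost learner), not a value fixed by the paper:

```python
def cost_I(n, r, k, M):
    """Expected cost of successfully removing k of n remaining variables
    (r of them relevant): E^-(n,r,k) failures plus one success, each
    requiring one run of L on n - k inputs. M(n) is a caller-supplied
    upper bound on the cost of running L on n inputs."""
    return M(n - k) * (expected_failures(n, r, k) + 1.0)

# Example with a hypothetical quadratic-cost learner:
M = lambda n: n ** 2
print(cost_I(100, 10, 5, M))
```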
Optimal cost of removing irrelevant variables
• The optimal cost of removing the irrelevant variables from n remaining, given r relevant, is defined recursively as

$$I_{sum}(n, r) = \min_{k} \left( I(n, r, k) + I_{sum}(n - k, r) \right)$$
Optimal value for ‘k’
• The optimal value is computed as

$$k_{opt}(n, r) = \arg\min_{k} \left( I(n, r, k) + I_{sum}(n - k, r) \right)$$

• It is the value of k for which the expected cost of removing variables is minimized.
Algorithm for computing k and cost values
• Given: L, N, r

    Isum[r+1…N] ← 0
    kopt[r+1…N] ← 0
    for i ← r+1 to N do
        bestCost ← ∞
        for k ← 1 to i − r do
            temp ← I(i, r, k) + Isum[i − k]
            if temp < bestCost then
                bestCost ← temp
                bestK ← k
        Isum[i] ← bestCost
        kopt[i] ← bestK
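A runnable Python version of this bottom-up table computation, under the same assumptions as the sketches above; the cost function `I` is passed in by the caller:

```python
import math

def compute_tables(N, r, I):
    """Bottom-up computation of Isum[i] and kopt[i] for i = r+1 .. N,
    where I(i, r, k) is the expected-cost function defined above."""
    Isum = {r: 0.0}   # base case: nothing left to remove
    kopt = {}
    for i in range(r + 1, N + 1):
        best_cost, best_k = math.inf, 1
        for k in range(1, i - r + 1):
            temp = I(i, r, k) + Isum[i - k]
            if temp < best_cost:
                best_cost, best_k = temp, k
        Isum[i] = best_cost
        kopt[i] = best_k
    return Isum, kopt

# Example, reusing cost_I and the hypothetical quadratic-cost bound M:
Isum, kopt = compute_tables(N=100, r=10,
                            I=lambda n, r, k: cost_I(n, r, k, M))
```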
Randomized Variable Elimination (RVE) when r is known
• Given: L, n, r, tolerance

    compute tables Isum(i, r) and kopt(i, r)
    h ← hypothesis produced by L on n inputs
    while n > r do
        k ← kopt(n, r)
        select k variables at random and remove them
        h′ ← hypothesis produced by L on n − k inputs
        if e(h′) − e(h) ≤ tolerance then
            n ← n − k
            h ← h′
        else
            replace the selected k variables
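The loop translates into a short sketch; the `learn`/`error` interface is hypothetical, standing in for whatever learning algorithm is wrapped, and `kopt` is the table from `compute_tables` above:

```python
import random

def rve(learn, error, variables, r, tolerance, kopt):
    """Sketch of RVE with r known. Assumes learn(vars) trains and returns
    a hypothesis, and error(h) estimates its error on validation data."""
    h = learn(variables)
    while len(variables) > r:
        k = kopt[len(variables)]
        removed = random.sample(variables, k)        # pick k at random
        candidate = [v for v in variables if v not in removed]
        h_new = learn(candidate)
        if error(h_new) - error(h) <= tolerance:     # no loss: keep removal
            variables, h = candidate, h_new
        # else: the k variables are "replaced" by leaving variables as-is
    return h, variables
```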
RVE example
• Plot of the expected cost of running RVE (Isum(N, r = 10)) along with the cost of removing inputs individually, and the estimated number of updates M(L, n).
• L is a learner that fits a Boolean function using a single perceptron unit.
Randomized Variable Elimination including a search for ‘r’ (RVErS)
• Given: L, c1, c2, n, rmax, rmin, tolerance

    compute tables Isum(i, r) and kopt(i, r) for rmin ≤ r ≤ rmax
    r ← (rmax + rmin) / 2
    success, fail ← 0
    h ← hypothesis produced by L on n inputs
    repeat
        k ← kopt(n, r)
        select k variables at random and remove them
        h′ ← hypothesis produced by L on n − k inputs
        if e(h′) − e(h) ≤ tolerance then
            n ← n − k
            h ← h′
            success ← success + 1
            fail ← 0
        else
            replace the selected k variables
            fail ← fail + 1
            success ← 0
        if n ≤ rmin then
            r, rmax, rmin ← n
        else if fail ≥ c1E⁻(n, r, k) then
            rmin ← r
            r ← (rmax + rmin) / 2
            success, fail ← 0
        else if success ≥ c2(r − E⁻(n, r, k)) then
            rmax ← r
            r ← (rmax + rmin) / 2
            success, fail ← 0
    until rmin ≥ rmax and fail ≥ c1E⁻(n, r, k)
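A sketch of the full procedure under the same hypothetical `learn`/`error` interface. Two details are assumptions added here so the sketch terminates, not details taken from the slides: the binary-search updates are guarded by rmin < rmax, and the n = r boundary (where E⁻ is unbounded) falls back to k = 1 with a single failure treated as conclusive:

```python
import random

def rvers(learn, error, variables, r_min, r_max, c1, c2, tolerance,
          kopt, E_minus):
    """Sketch of RVErS: RVE with a binary search for the unknown r.
    kopt(n, r) and E_minus(n, r, k) follow the earlier sketches."""
    r = (r_max + r_min) // 2
    success = fail = 0
    h = learn(variables)

    def fail_threshold(n, r, k):
        # E^-(n, r, k) is unbounded when n == r (success is impossible);
        # this sketch then treats a single failure as conclusive.
        return c1 * E_minus(n, r, k) if n > r else 1

    while True:
        n = len(variables)
        k = kopt(n, r) if n > r else 1   # boundary handling: assumption
        removed = random.sample(variables, k)
        candidate = [v for v in variables if v not in removed]
        h_new = learn(candidate)
        if error(h_new) - error(h) <= tolerance:
            variables, h = candidate, h_new          # keep the removal
            success, fail = success + 1, 0
        else:                                        # put the k variables back
            fail, success = fail + 1, 0
        n = len(variables)
        if n <= r_min:                               # bounds collapse onto n
            r = r_max = r_min = n
        elif r_min < r_max and fail >= fail_threshold(n, r, k):
            r_min = r                                # r was underestimated
            r = (r_max + r_min) // 2
            success = fail = 0
        elif r_min < r_max and success >= c2 * (r - E_minus(n, r, k)):
            r_max = r                                # r was overestimated
            r = (r_max + r_min) // 2
            success = fail = 0
        if r_min >= r_max and fail >= fail_threshold(n, r, k):
            return h, variables
```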
Variable Selection results using naïve Bayes and C4.5 algorithms (LED data set)

Learner  Selection    Iters  Subset Evals  Inputs  Percent Error  Time (sec)  Search Cost
Bayes    none             -             -   150     30.3 ± 3.0        0.09         275000
Bayes    r_max = 25     127           172    22.7   26.9 ± 3.9       19          39700000
Bayes    r_max = 75     293           293    17.4   26.0 ± 3.3       50         109000000
Bayes    r_max = 150    434           434    25.6   25.9 ± 2.6       86         202000000
Bayes    k = 1          423           423    23.7   27.0 ± 2.1       85         204000000
Bayes    forward         14          2006    13     26.6 ± 2.9      141         154000000
Bayes    backward        14          1870   138     30.1 ± 2.6      667        1950000000
Bayes    filter         150           150    23.7   27.1 ± 2.1       34          84900000
C4.5     none             -             -   150     43.9 ± 4.5        0.5          54800
C4.5     r_max = 25      85            85    51.1   42.0 ± 3.0       89          10100000
C4.5     r_max = 75     468           468    25.8   42.5 ± 4.5      363          37000000
C4.5     r_max = 150    541           541    25.2   40.8 ± 5.7      440          44800000
C4.5     k = 1          510           510    32.4   42.5 ± 2.7      439            952000
C4.5     forward          9          1286     7.8   27.0 ± 3.2      196         133000000
C4.5     backward        61          7218    90.9   43.5 ± 3.5    11481         133000000
C4.5     filter         150           150     7.1   27.3 ± 3.5      156          16900000
Variable Selection results using naïve Bayes and C4.5 algorithms (soybean data set)

Learner  Selection    Iters  Subset Evals  Inputs  Percent Error  Time (sec)  Search Cost
Bayes    none             -             -    35      7.8 ± 2.4        0.02          24000
Bayes    r_max = 15     142           142    12.6    8.9 ± 4.2        5.9         6470000
Bayes    r_max = 25     135           135    11.9   10.5 ± 5.8        5.8         6260000
Bayes    r_max = 35     132           132    11.2    9.8 ± 5.1        5.8         6170000
Bayes    k = 1           88            88    12.3    9.6 ± 5.0        4.6         4670000
Bayes    forward         13           382    12.3    7.3 ± 2.9        8.9        10900000
Bayes    backward        19           472    18      7.9 ± 4.6       37.5        36300000
Bayes    filter          35            35    31.3    7.8 ± 2.6        2           1970000
C4.5     none             -             -    35      8.6 ± 4.0        0.04           1210
C4.5     r_max = 15     118           118    16.3    9.5 ± 4.6       13.5          278000
C4.5     r_max = 25     158           158    14.7   10.1 ± 4.1       18.6          190000
C4.5     r_max = 35     139           139    16.3    9.1 ± 3.7       17.3          386000
C4.5     k = 1          117           117    16.1    9.3 ± 3.5       14.9          352000
C4.5     forward         16           435    14.8    9.1 ± 4.0       33.7          322000
C4.5     backward        18           455    19.1   10.4 ± 4.4       69           1750000
C4.5     filter          35            35    30.8    8.5 ± 3.7        3.7           60600
My implementation
• Integrate with Weka
• Extend the NaiveBayes and J48 algorithms
• Obtain results for some of the UCI datasets used
• Compare results with those reported by the authors
• Work in progress
References
• H. Almuallim and T. G. Dietterich. Learning with many irrelevant features. In Proceedings of the Ninth National Conference on Artificial Intelligence, Anaheim, CA, 1991. MIT Press.
• R. Caruana and D. Freitag. Greedy attribute selection. In Machine Learning: Proceedings of the Eleventh International Conference, New Brunswick, NJ, 1994. Morgan Kaufmann.
• K. Kira and L. Rendell. A practical approach to feature selection. In D. Sleeman and P. Edwards, editors, Machine Learning: Proceedings of the Ninth International Conference, San Mateo, CA, 1992. Morgan Kaufmann.
References (contd…)
• G. H. John, R. Kohavi, and K. Pfleger. Irrelevant features and the subset selection problem. In Machine Learning: Proceedings of the Eleventh International Conference, pages 121-129, New Brunswick, NJ, 1994. Morgan Kaufmann.
• P. A. Devijver and J. Kittler. Pattern Recognition: A Statistical Approach. Prentice Hall International, 1982.