Optimization
DESCRIPTION
Optimization TRANSCRIPT
-
Optimization Methods in Data Mining
-
Overview
Optimization: Mathematical Programming, Combinatorial Optimization
Support Vector Machines: steepest descent search; classification, clustering, etc.
Neural Nets, Bayesian Networks: optimize parameters
Genetic Algorithm: feature selection, classification, clustering
-
What is Optimization?
Formulation: decision variables, objective function, constraints
Solution: iterative algorithm, improving search
Problem -> Model (Formulation) -> Solution (Algorithm)
-
Combinatorial Optimization
Finitely many solutions to choose from:
Select the best rule from a finite set of rules
Select the best subset of attributes
Too many solutions to consider them all
Solutions:
Branch-and-bound (better than Weka exhaustive search)
Random search
-
Random Search
Select an initial solution x(0) and let k=0
Loop:
Consider the neighbors N(x(k)) of x(k)
Select a candidate x from N(x(k))
Check the acceptance criterion
If accepted then let x(k+1) = x, otherwise let x(k+1) = x(k)
Until the stopping criterion is satisfied
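The loop above can be sketched in Python. This is a minimal skeleton, not from the slides: the neighborhood, acceptance, and stopping functions are placeholders you supply.

```python
import random

def random_search(x0, neighbors, accept, stop, rng=random.Random(0)):
    """Generic random-search skeleton following the slide.

    neighbors(x) -> list of candidate solutions near x
    accept(x, cand) -> True if the candidate should replace x
    stop(k, x) -> True when the search should terminate
    """
    x, k = x0, 0
    while not stop(k, x):
        cand = rng.choice(neighbors(x))   # select a candidate from N(x(k))
        if accept(x, cand):               # acceptance criterion
            x = cand
        k += 1
    return x

# Toy usage: minimize f(x) = (x-3)^2 over the integers, neighbors x-1 and x+1,
# greedy acceptance (only improving moves), fixed iteration budget.
f = lambda x: (x - 3) ** 2
best = random_search(0,
                     neighbors=lambda x: [x - 1, x + 1],
                     accept=lambda x, c: f(c) < f(x),
                     stop=lambda k, x: k >= 200)
```

With greedy acceptance this is plain hill-climbing; the algorithms on the next slide differ only in the `accept` rule.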
-
Common Algorithms
Simulated Annealing (SA): accept inferior solutions with a given probability that decreases as time goes on
Tabu Search (TS): restrict the neighborhood with a list of solutions that are tabu (that is, cannot be visited) because they were visited recently
Genetic Algorithm (GA): neighborhoods based on genetic similarity; most used in data mining applications
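The SA acceptance idea (accept a worsening move with a probability that falls as the temperature drops) is usually written in the Metropolis form; a minimal sketch, assuming a minimization problem:

```python
import math
import random

def sa_accept(delta, temperature, rng=random.Random(1)):
    """Metropolis criterion: always accept improvements; accept a
    worsening move of size delta with probability exp(-delta/T)."""
    if delta <= 0:            # improving move (minimization)
        return True
    return rng.random() < math.exp(-delta / temperature)
```

As the temperature T is lowered over time, inferior moves are accepted less and less often, which is exactly the behavior the slide describes.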
-
Genetic Algorithms
Maintain a population of solutions rather than a single solution
Members of the population have a certain fitness (usually just the objective)
Survival of the fittest through selection, crossover, and mutation
-
GA Formulation
Use binary strings (bits) to encode solutions: 0 1 1 0 1 0 0 1 0
Terminology:
Chromosome = solution
Parent chromosome
Children or offspring
-
Problems Solved
Data mining problems that have been addressed using genetic algorithms:
Classification
Attribute selection
Clustering
-
Classification Example
Binary encoding of attribute values:
Outlook:  Sunny = 1 0 0, Overcast = 0 1 0, Rainy = 0 0 1
Windy:    Yes = 1 0, No = 0 1
-
Representing a RuleIf windy=yes then play=yesIf outlook=overcast and windy=yes then play=no
-
Single-Point Crossover
(figure: two parent strings split at a crossover point; the tails are swapped to form two offspring)
-
Two-Point Crossover
(figure: the segment between two crossover points is swapped between the parents to form two offspring)
-
Uniform Crossover
(figure: each bit of the offspring is taken from either parent at random)
Problem?
-
Mutation
(figure: one offspring bit is inverted relative to the parent; the mutated bit is highlighted)
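The three crossover operators and mutation can be sketched as bit-string operations. A minimal illustration, with chromosomes as Python lists of 0/1:

```python
import random

def single_point_crossover(p1, p2, rng):
    """Swap the tails after one random crossover point."""
    point = rng.randrange(1, len(p1))
    return p1[:point] + p2[point:], p2[:point] + p1[point:]

def two_point_crossover(p1, p2, rng):
    """Swap the segment between two random crossover points."""
    i, j = sorted(rng.sample(range(1, len(p1)), 2))
    return p1[:i] + p2[i:j] + p1[j:], p2[:i] + p1[i:j] + p2[j:]

def uniform_crossover(p1, p2, rng):
    """Each offspring bit is taken from either parent at random."""
    mask = [rng.random() < 0.5 for _ in p1]
    c1 = [a if m else b for m, a, b in zip(mask, p1, p2)]
    c2 = [b if m else a for m, a, b in zip(mask, p1, p2)]
    return c1, c2

def mutate(chrom, rng):
    """Invert one randomly selected bit (returns a new chromosome)."""
    i = rng.randrange(len(chrom))
    return chrom[:i] + [1 - chrom[i]] + chrom[i + 1:]
```

Note that all three crossovers conserve the multiset of bits across the pair of offspring, which is one way to see the "Problem?" with uniform crossover: it ignores the positional building blocks that single- and two-point crossover tend to preserve.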
-
Selection
Which strings in the population should be operated on?
Rank and select the n fittest ones
Assign probabilities according to fitness and select probabilistically, say P(x_i) = f(x_i) / sum_j f(x_j)
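Probabilistic, fitness-proportionate selection is commonly implemented as a roulette wheel; a minimal sketch, assuming the usual proportional rule P(x_i) = f(x_i) / sum_j f(x_j):

```python
import random

def roulette_select(population, fitness, rng):
    """Fitness-proportionate (roulette-wheel) selection."""
    weights = [fitness(x) for x in population]
    total = sum(weights)
    r = rng.random() * total          # spin the wheel
    acc = 0.0
    for x, w in zip(population, weights):
        acc += w
        if r <= acc:
            return x
    return population[-1]             # guard against rounding
```

An individual with nine times the fitness of another should be drawn roughly nine times as often.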
-
Creating a New Population
Create a population Pnew with p individuals
Survival:
Allow individuals from the old population to survive intact
Rate: (1-r)% of the population
How to select the individuals that survive: deterministic/random
Crossover:
Select fit individuals and create new ones
Rate: r% of the population. How to select?
Mutation:
Slightly modify any of the above individuals
Mutation rate: m
Fixed number of mutations versus probabilistic mutations
-
GA Algorithm
Randomly generate an initial population P
Evaluate the fitness f(xi) of each individual in P
Repeat:
Survival: probabilistically select (1-r)p individuals from P and add them to Pnew, according to their fitness
Crossover: probabilistically select rp/2 pairs from P, apply the crossover operator, and add the offspring to Pnew
Mutation: uniformly choose m percent of the members and invert one randomly selected bit in each
Update: P <- Pnew
Evaluate: compute the fitness f(xi) of each individual in P
Return the fittest individual from P
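The whole loop can be sketched end-to-end. This is a compact sketch, not the slides' exact procedure: survival is done by rank rather than probabilistically, and the OneMax fitness used below (count of 1-bits) is a stand-in objective for testing.

```python
import random

def genetic_algorithm(fitness, n_bits, p=20, r=0.6, m=0.1,
                      generations=50, rng=random.Random(0)):
    """GA loop: (1-r)p survivors, r*p crossover offspring,
    then m*p single-bit mutations per generation."""
    def crossover(a, b):
        pt = rng.randrange(1, n_bits)
        return a[:pt] + b[pt:], b[:pt] + a[pt:]

    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(p)]
    for _ in range(generations):
        ranked = sorted(pop, key=fitness, reverse=True)
        new = ranked[: int((1 - r) * p)]           # survival (by rank)
        while len(new) < p:                        # crossover
            a, b = rng.sample(ranked[: p // 2], 2) # parents from the fitter half
            c1, c2 = crossover(a, b)
            new += [c1, c2]
        for _ in range(int(m * p)):                # mutation
            ind = rng.choice(new)
            i = rng.randrange(n_bits)
            ind[i] = 1 - ind[i]
        pop = new[:p]
    return max(pop, key=fitness)
```

On OneMax with 10 bits the returned individual should be all (or nearly all) ones after 50 generations.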
-
Analysis of GA: Schemas
Does GA converge? Does GA move towards a good solution? Local optima?
Holland (1975): analysis based on schemas
Schema: a string combination of 0s, 1s, and *s
Example: 0*10 represents {0010, 0110}
-
The Schema Theorem (all the theory on one slide)
E[m(s, t+1)] >= m(s, t) * (f(s, t) / f_avg(t)) * [1 - pc * d(s)/(l-1)] * (1 - pm)^o(s)
where:
m(s, t) = number of instances of schema s at time t
f(s, t) = average fitness of individuals in schema s at time t; f_avg(t) = average fitness of the population
pc = probability of crossover, pm = probability of mutation
o(s) = number of defined bits in schema s
d(s) = distance between the outermost defined bits in s; l = string length
-
Interpretation
Fit schemas grow in influence
What is missing: crossover? mutation?
How about time t+1?
Other approaches: Markov chains, statistical mechanics
-
GA for Feature Selection
Feature selection: select a subset of attributes (features)
Reason: too many attributes; redundant or irrelevant attributes
The set of all subsets of attributes is very large
Little structure to the search space
Random search methods apply
-
EncodingNeed a bit code representationHave some n attributesEach attribute is either in (1) or out (0) of the selected set
-
Fitness
Wrapper approach:
Apply a learning algorithm, say a decision tree, to the individual x = {outlook, humidity}
Let fitness equal the error rate (minimize)
Filter approach:
Let fitness equal the entropy (minimize)
Other diversity measures can also be used
A simplicity measure?
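A minimal sketch of the filter approach, assuming "entropy" means the expected class entropy within the groups induced by the selected attributes (one common reading; the slide does not pin the formula down):

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def filter_fitness(instances, labels, subset):
    """Expected class entropy after grouping the instances by their
    values on the selected attributes (lower is better)."""
    groups = defaultdict(list)
    for inst, y in zip(instances, labels):
        groups[tuple(inst[a] for a in subset)].append(y)
    n = len(labels)
    return sum(len(g) / n * entropy(g) for g in groups.values())
```

On a toy weather set where outlook determines play but windy does not, the subset {outlook} gets fitness 0 and {windy} gets fitness 1, so minimization prefers the informative attribute.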
-
Crossover
(figure: single-point crossover applied to attribute-subset bit strings)
-
In Weka
-
Clustering Example
Crossover on cluster encodings:
{10,20}{30,40} x {20,40}{10,30} -> {10,20,40}{30} and {20}{10,30,40}
Create two clusters for:
ID   Outlook   Temperature  Humidity  Windy  Play
10   Sunny     Hot          High      True   No
20   Overcast  Hot          High      False  Yes
30   Rainy     Mild         High      False  Yes
40   Rainy     Cool         Normal    False  Yes
-
Discussion
GA is a flexible and powerful random search methodology
Efficiency depends on how well you can encode the solutions so that they work with the crossover operator
In data mining, attribute selection is the most natural application
-
Attribute Selection in Unsupervised Learning
Attribute selection typically uses a measure, such as accuracy, that is directly related to the class attribute
How do we apply attribute selection to unsupervised learning such as clustering?
Need a measure:
compactness of clusters
separation among clusters
Multiple measures
-
Quality Measures
Compactness:
F_within = (1/Z) * sum over clusters k of sum over instances x_i in cluster C_k of ||x_i - z_k||^2 / d
where z_k is the centroid of cluster C_k, d is the number of attributes, and Z is a normalization constant to make the measure comparable
-
More Quality Measures
Cluster Separation
-
Final Quality Measures
Adjustment for bias
Complexity
-
Wrapper Framework
Loop:
Obtain an attribute subset
Apply the k-means algorithm
Evaluate cluster quality
Until stopping criterion satisfied
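The two pieces this loop needs, k-means and a cluster-quality measure, can be sketched as follows. The within-cluster sum of squared distances stands in for the compactness measure; note that without normalization it trivially favors small attribute subsets, which is why the quality measures above include a normalization constant.

```python
import math

def kmeans(points, centroids, iters=20):
    """Plain k-means on numeric tuples; returns (assignment, centroids)."""
    for _ in range(iters):
        assign = [min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
                  for p in points]
        centroids = [
            [sum(p[d] for p, a in zip(points, assign) if a == c) /
             max(1, assign.count(c))          # crude guard for empty clusters
             for d in range(len(points[0]))]
            for c in range(len(centroids))]
    return assign, centroids

def compactness(points, assign, centroids):
    """Within-cluster sum of squared distances (minimize)."""
    return sum(math.dist(p, centroids[a]) ** 2
               for p, a in zip(points, assign))
```

The wrapper loop then repeats: pick an attribute subset, project the data onto it, run `kmeans`, score the result with `compactness` (suitably normalized), and keep the best subset seen.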
-
Problem
What is the optimal attribute subset?
What is the optimal number of clusters?
Try to find both simultaneously
-
Example
Find an attribute subset and the optimal number of clusters (Kmin = 2, Kmax = 3) for:
ID    Sepal Length   Sepal Width   Petal Length   Petal Width
10    5.0            3.5           1.6            0.6
20    5.1            3.8           1.9            0.4
30    4.8            3.0           1.4            0.3
40    5.1            3.8           1.6            0.2
50    4.6            3.2           1.4            0.2
60    6.5            2.8           4.6            1.5
70    5.7            2.8           4.5            1.3
80    6.3            3.3           4.7            1.6
90    4.9            2.4           3.3            1.0
100   6.6            2.9           4.6            1.3
-
Formulation
Define an individual: the first four bits indicate whether each attribute (Sepal Length, Sepal Width, Petal Length, Petal Width) is selected, and the last bit encodes the number of clusters (0 -> k = 2, 1 -> k = 3)
Initial population:
0 1 0 1 1
1 0 0 1 0
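The bit layout is not spelled out on the slide, but it can be inferred from the two worked examples that follow (0 1 0 1 1 means three clusters on {Sepal Width, Petal Width}; 1 0 0 1 0 means two clusters on {Sepal Length, Petal Width}). Under that reading, a decoder looks like:

```python
def decode(individual):
    """Decode a 5-bit individual: the first four bits select attributes,
    the last bit sets the number of clusters (0 -> k=2, 1 -> k=3).
    Layout inferred from the worked examples, not stated explicitly."""
    names = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"]
    attrs = [n for n, bit in zip(names, individual[:4]) if bit == 1]
    k = 3 if individual[4] == 1 else 2
    return attrs, k
```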
-
Evaluate Fitness
Start with 0 1 0 1 1: three clusters and {Sepal Width, Petal Width}
Apply k-means with k=3
ID    Sepal Width   Petal Width
10    3.5           0.6
20    3.8           0.4
30    3.0           0.3
40    3.8           0.2
50    3.2           0.2
60    2.8           1.5
70    2.8           1.3
80    3.3           1.6
90    2.4           1.0
100   2.9           1.3
-
K-MeansStart with random centroids: 10, 70, 80
(scatter plot of the ten instances, Sepal Width vs Petal Width)
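The slide's k-means run can be reproduced in code; the data and the initial centroids (instances 10, 70, and 80) are taken from the example.

```python
import math

# (Sepal Width, Petal Width) for the ten instances
data = {10: (3.5, 0.6), 20: (3.8, 0.4), 30: (3.0, 0.3), 40: (3.8, 0.2),
        50: (3.2, 0.2), 60: (2.8, 1.5), 70: (2.8, 1.3), 80: (3.3, 1.6),
        90: (2.4, 1.0), 100: (2.9, 1.3)}

def run_kmeans(data, centroids):
    """Alternate assignment and recentering until the assignment is stable."""
    assign = None
    while True:
        new = {i: min(range(len(centroids)),
                      key=lambda c: math.dist(p, centroids[c]))
               for i, p in data.items()}
        if new == assign:                 # no change: terminate
            return assign, centroids
        assign = new
        centroids = [
            tuple(sum(data[i][d] for i in data if assign[i] == c) /
                  sum(1 for i in data if assign[i] == c) for d in (0, 1))
            for c in range(len(centroids))]

assign, centroids = run_kmeans(data, [data[10], data[70], data[80]])
```

Running this reproduces the clusters on the following slides: {10, 20, 30, 40, 50}, {60, 70, 90, 100}, and {80}.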
-
New Centroids
No change in assignment, so terminate the k-means algorithm
(scatter plot with the updated centroids C1 = (2.73, 1.28) and C3 = (3.46, 0.34))
-
Quality of Clusters
Centers:
Center 1 at (3.46, 0.34): {10, 20, 30, 40, 50}
Center 2 at (3.30, 1.60): {80}
Center 3 at (2.73, 1.28): {60, 70, 90, 100}
Evaluation
-
Next Individual
Now look at 1 0 0 1 0: two clusters and {Sepal Length, Petal Width}
Apply k-means with k=2
ID    Sepal Length   Petal Width
10    5.0            0.6
20    5.1            0.4
30    4.8            0.3
40    5.1            0.2
50    4.6            0.2
60    6.5            1.5
70    5.7            1.3
80    6.3            1.6
90    4.9            1.0
100   6.6            1.3
-
K-MeansSay we select 20 and 90 as initial centroids:
(scatter plot of the ten instances, Sepal Length vs Petal Width, with initial centroids at instances 20 and 90)
-
Recalculate Centroids
(scatter plot with the recalculated centroids (4.92, 0.34) and (6.0, 1.34))
-
Recalculate Again
No change in assignment, so terminate the k-means algorithm
-
Quality of Clusters
Centers:
Center 1 at (4.92, 0.45): {10, 20, 30, 40, 50, 90}
Center 2 at (6.28, 1.43): {60, 70, 80, 100}
Evaluation
-
Compare IndividualsWhich is fitter?
-
Evaluating Fitness
Can scale the measures (if necessary)
Then weight them together, e.g., as a weighted sum F = sum_j w_j F_j of the quality measures
Alternatively, we can use Pareto optimization
-
Mathematical Programming
Continuous decision variables
Constrained versus unconstrained
Form of the objective function:
Linear Programming (LP)
Quadratic Programming (QP)
General Mathematical Programming (MP)
-
Linear Program
min c'x subject to Ax <= b, x >= 0
-
Two-Dimensional Problem
The optimum is always at an extreme point
-
Simplex Method
-
Quadratic Programming
(plot of f(x) = 0.2 + (x-1)^2 for x in [0, 2]; the minimum value 0.2 is attained at x = 1)
-
General MP
The derivative being zero is a necessary but not sufficient condition
-
Constrained Problem?
-
General MP
We write a general mathematical program in matrix notation as:
min f(x) subject to g(x) <= 0
-
Karush-Kuhn-Tucker (KKT) Conditions
grad f(x*) + lambda' grad g(x*) = 0
lambda' g(x*) = 0
g(x*) <= 0, lambda >= 0
-
Convex Sets
A set C is convex if any line segment connecting two points in the set lies completely within the set, that is,
lambda*x1 + (1-lambda)*x2 in C for all x1, x2 in C and all 0 <= lambda <= 1
-
Convex Hull
The convex hull co(S) of a set S is the intersection of all convex sets containing S
A set V in R^n is a linear variety if lambda*x1 + (1-lambda)*x2 in V for all x1, x2 in V and all lambda in R
-
Hyperplane
A hyperplane in R^n is an (n-1)-dimensional variety
-
Convex Hull Example
(figure: Play and No Play instances in the Temperature-Humidity plane, with the convex hull of each class)
-
Finding the Closest Points
Formulate as a QP: minimize ||c - d||^2 over c in co(Play instances) and d in co(No Play instances)
-
Support Vector Machines
(figure: a separating hyperplane between the Play and No Play classes in the Temperature-Humidity plane)
-
Example
ID    Sepal Width   Petal Width
10    3.5           0.6
20    3.8           0.4
30    3.0           0.3
40    3.8           0.2
50    3.2           0.2
60    2.8           1.5
70    2.8           1.3
80    3.3           1.6
90    2.4           1.0
100   2.9           1.3
-
Separating Hyperplane
(scatter plot of the ten instances with a separating line between the two groups)
-
Assume Separating Planes
Constraints: w.xi + b >= +1 for instances with yi = +1, and w.xi + b <= -1 for instances with yi = -1
Distance to each plane: 1/||w||, so the margin between the planes is 2/||w||
-
Optimization Problem
min (1/2)||w||^2 subject to yi(w.xi + b) >= 1 for all i
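In practice this QP is handed to a solver. As a rough sketch, the soft-margin version of the same objective can instead be minimized by subgradient descent; the class labels below (first five instances -1, last five +1, matching the two visual groups) are an assumption for illustration, since the slides do not label the points.

```python
def svm_subgradient(points, labels, C=10.0, lr=0.003, epochs=5000):
    """Minimize 0.5*||w||^2 + C*sum(max(0, 1 - y*(w.x + b))) by
    full-batch subgradient descent -- a stand-in for an exact QP solver."""
    w0 = w1 = b = 0.0
    for _ in range(epochs):
        g0, g1, gb = w0, w1, 0.0          # gradient of the 0.5*||w||^2 term
        for (x0, x1), y in zip(points, labels):
            if y * (w0 * x0 + w1 * x1 + b) < 1:   # point violates the margin
                g0 -= C * y * x0
                g1 -= C * y * x1
                gb -= C * y
        w0, w1, b = w0 - lr * g0, w1 - lr * g1, b - lr * gb
    return (w0, w1), b

# (Sepal Width, Petal Width) instances with assumed labels
pts = [(3.5, 0.6), (3.8, 0.4), (3.0, 0.3), (3.8, 0.2), (3.2, 0.2),
       (2.8, 1.5), (2.8, 1.3), (3.3, 1.6), (2.4, 1.0), (2.9, 1.3)]
ys = [-1, -1, -1, -1, -1, 1, 1, 1, 1, 1]
(w0, w1), b = svm_subgradient(pts, ys)
```

With a large penalty C this converges near the hard-margin solution, and the learned plane separates the two groups.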
-
How Do We Solve MPs?
-
Improving Search
Direction-step approach
New solution: x(k+1) = x(k) + lambda*d, where d is the search direction and lambda the step length
-
Steepest Descent
Search direction equal to the negative gradient: d = -grad f(x(k))
Finding lambda is a one-dimensional optimization problem of minimizing f(x(k) - lambda*grad f(x(k)))
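A minimal sketch, using a fixed step length lambda rather than the exact one-dimensional line search described above:

```python
def steepest_descent(grad_f, x0, lam=0.1, tol=1e-8, max_iter=10000):
    """Iterate x(k+1) = x(k) - lambda * grad f(x(k)) until the
    gradient is (numerically) zero or the budget is spent."""
    x = list(x0)
    for _ in range(max_iter):
        g = grad_f(x)
        if sum(gi * gi for gi in g) < tol:   # gradient ~ 0: stop
            return x
        x = [xi - lam * gi for xi, gi in zip(x, g)]
    return x

# Usage: minimize f(x, y) = (x - 1)^2 + (y + 2)^2
grad = lambda x: [2 * (x[0] - 1), 2 * (x[1] + 2)]
x_star = steepest_descent(grad, [0.0, 0.0])
```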
-
Newton's Method
Taylor series expansion: f(x) ~ f(x(k)) + grad f(x(k))'(x - x(k)) + (1/2)(x - x(k))' H(x(k)) (x - x(k))
The right-hand side is minimized at x(k+1) = x(k) - H(x(k))^(-1) grad f(x(k))
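In one dimension the update reduces to x(k+1) = x(k) - f'(x(k)) / f''(x(k)). On the quadratic f(x) = 0.2 + (x-1)^2 from the earlier QP slide, the Taylor expansion is exact, so Newton's method lands on the minimizer x = 1 in a single step:

```python
def newton_1d(grad, hess, x0, iters=20):
    """One-dimensional Newton iteration: x <- x - f'(x) / f''(x)."""
    x = x0
    for _ in range(iters):
        x = x - grad(x) / hess(x)
    return x

# f(x) = 0.2 + (x - 1)^2, so f'(x) = 2(x - 1) and f''(x) = 2
x_star = newton_1d(lambda x: 2 * (x - 1), lambda x: 2.0, 5.0)
```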
-
Discussion
Computing the inverse Hessian is difficult:
Quasi-Newton methods
Conjugate gradient methods
Does not account for constraints:
Penalty methods
Lagrangian methods, etc.
-
Non-Separable
Add an error term to the constraints: yi(w.xi + b) >= 1 - si, si >= 0
and penalize the total error: min (1/2)||w||^2 + C * sum_i si
-
Wolfe Dual
max sum_i alpha_i - (1/2) sum_i sum_j alpha_i alpha_j yi yj (xi.xj)
subject to 0 <= alpha_i <= C and sum_i alpha_i yi = 0
Simple constraints; the dot product xi.xj is the only place the data appears
-
Extension to Non-Linear
Kernel functions: K(xi, xj) = phi(xi).phi(xj)
The mapping phi takes the data into a high-dimensional Hilbert space; the kernel takes the place of the dot product in the Wolfe dual
-
Some Possible Kernels
Polynomial: K(x, y) = (x.y + 1)^d
Radial basis function: K(x, y) = exp(-||x - y||^2 / (2*sigma^2))
Sigmoid: K(x, y) = tanh(kappa*(x.y) - delta)
-
In Weka
weka.classifiers.SMO
Support vector machine (for nominal data only)
Does both linear and non-linear models
-
Optimization in DM
(summary diagram of optimization methods in data mining)
-
Bayesian Classification
Naïve Bayes assumes independence between attributes:
Simple computations
Best classifier if the assumption is true
Bayesian Belief Networks:
Joint probability distributions
Directed acyclic graphs
Nodes are random variables (attributes)
Arcs represent the dependencies
-
Example: Bayesian Network
Nodes: Family History, Smoker, Lung Cancer, Emphysema, Positive X-Ray, Dyspnea
-
Conditional Probabilities
P(Y = y | parents of Y), where Y is a random variable and y an outcome of the random variable
The node representing the class attribute is called the output node
-
How Do We Learn?
Network structure:
Given/known
Inferred or learned from the data
Variables:
Observable
Hidden (missing values / incomplete data)
-
Case 1: Known Structure and Observable Variables
Straightforward, similar to Naïve Bayes
Compute the entries of the conditional probability table (CPT) of each variable
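A minimal sketch of Case 1: each CPT entry is just a conditional relative frequency counted from the data. The toy records below are invented for illustration, so the numbers will not match the CPT on the following slide.

```python
from collections import Counter

def learn_cpt(records, var, parents):
    """Known structure, fully observable data: each CPT entry is the
    conditional relative frequency P(var = y | parents = u)."""
    joint = Counter((tuple(r[p] for p in parents), r[var]) for r in records)
    parent_totals = Counter(tuple(r[p] for p in parents) for r in records)
    return {(u, y): n / parent_totals[u] for (u, y), n in joint.items()}

# Hypothetical records over FH (family history), S (smoker), LC (lung cancer)
records = [
    {"FH": 1, "S": 1, "LC": 1}, {"FH": 1, "S": 1, "LC": 1},
    {"FH": 1, "S": 1, "LC": 1}, {"FH": 1, "S": 1, "LC": 0},
    {"FH": 0, "S": 0, "LC": 0}, {"FH": 0, "S": 0, "LC": 0},
]
cpt = learn_cpt(records, "LC", ["FH", "S"])
```

Here `cpt[((1, 1), 1)]` is P(LC | FH, S) estimated as 3/4, and each row of the resulting table sums to 1 over the outcomes of LC.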
-
Case 2: Known Structure and Some Hidden VariablesStill need to learn the CPT entriesLet S be a set of s training instances
Let wijk be the CPT entry for variable Yi=yij having parents Ui=uik.
-
CPT Example
       FH,S   FH,~S   ~FH,S   ~FH,~S
LC     0.8    0.5     0.7     0.1
~LC    0.2    0.5     0.3     0.9
-
Objective
Must find the values of the wijk
The objective is to maximize the likelihood of the data, that is, maximize ln P_w(S) = sum over instances d in S of ln P_w(x_d)
How do we do this?
-
Non-Linear MP
Compute gradients from the training data:
d ln P_w(S) / d wijk = sum over instances d in S of P(Yi = yij, Ui = uik | x_d) / wijk
Move in the direction of the gradient:
wijk <- wijk + eta * d ln P_w(S) / d wijk, where eta is the learning rate
-
Case 3: Unknown Network Structure
Need to find/learn the optimal network structure for the data
What type of optimization problem is this?
Combinatorial optimization (GA etc.)