Optimization Methods in Data Mining (Fall 2004)


  • Optimization Methods in Data Mining

  • Overview: Optimization methods covered include mathematical programming, combinatorial optimization, support vector machines, and steepest descent search. Data mining applications: classification, clustering, etc.; neural nets and Bayesian networks (optimizing parameters); genetic algorithms for feature selection, classification, and clustering.

  • What is Optimization? Formulation: decision variables, objective function, constraints. Solution: an iterative algorithm that performs an improving search. Overall process: problem, then model (formulation), then solution (algorithm).

  • Combinatorial Optimization: Finitely many solutions to choose from, e.g., select the best rule from a finite set of rules, or select the best subset of attributes. Often there are too many solutions to consider them all. Solution approaches: branch-and-bound (better than Weka's exhaustive search) and random search.

  • Random Search: Select an initial solution x(0) and let k = 0. Loop: consider the neighbors N(x(k)) of x(k); select a candidate x from N(x(k)); check the acceptance criterion; if accepted let x(k+1) = x, otherwise let x(k+1) = x(k). Repeat until the stopping criterion is satisfied.
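    A minimal sketch of this loop in Python; the neighborhood function, acceptance criterion, and stopping rule are placeholders to be supplied for a concrete problem.

    import random

    def random_search(x0, neighbors, accept, max_iter=1000):
        # Generic random search skeleton: neighbors(x) returns the neighborhood N(x),
        # accept(candidate, current) implements the acceptance criterion.
        x = x0
        for k in range(max_iter):              # stopping criterion: iteration budget
            candidate = random.choice(neighbors(x))
            if accept(candidate, x):
                x = candidate
        return x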

  • Common Algorithms: Simulated Annealing (SA): accept inferior solutions with a given probability that decreases as time goes on. Tabu Search (TS): restrict the neighborhood with a list of solutions that are tabu (cannot be revisited) because they were visited recently. Genetic Algorithm (GA): neighborhoods based on genetic similarity; the most used in data mining applications.
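    For example, the SA acceptance criterion can be sketched as follows (minimization; the geometric cooling schedule noted in the comment is an illustrative choice, not prescribed by the slides).

    import math, random

    def sa_accept(f_new, f_old, temperature):
        # Always accept improvements; accept worse solutions with probability
        # exp(-increase / temperature), which shrinks as the temperature falls.
        if f_new <= f_old:
            return True
        return random.random() < math.exp(-(f_new - f_old) / temperature)

    # The temperature is decreased over time, e.g. T(k) = T0 * alpha**k with 0 < alpha < 1.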

  • Genetic Algorithms: Maintain a population of solutions rather than a single solution. Members of the population have a certain fitness (usually just the objective value). Survival of the fittest is implemented through selection, crossover, and mutation.

  • GA Formulation: Use binary strings (bits) to encode solutions, e.g., 0 1 1 0 1 0 0 1 0. Terminology: chromosome = solution; parent chromosome; children or offspring.

  • Problems Solved: Data mining problems that have been addressed using genetic algorithms include classification, attribute selection, and clustering.

  • Classification Example: Encode attribute values as bits.

    Outlook: Sunny = 1 0 0, Overcast = 0 1 0, Rainy = 0 0 1
    Windy:   Yes = 1 0, No = 0 1

  • Representing a Rule: If windy = yes then play = yes. If outlook = overcast and windy = yes then play = no.

  • Single-Point Crossover: [Diagram showing a crossover point and how two parents produce two offspring.]

  • Two-Point Crossover: [Diagram showing two crossover points, parents, and offspring.]

  • Uniform Crossover: [Diagram of parents and offspring; each offspring bit is taken from either parent.] Problem?

  • Mutation: [Diagram of parent and offspring with one mutated bit.]
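    A minimal sketch of these operators on bit strings (lists of 0s and 1s), written in Python for illustration.

    import random

    def single_point_crossover(p1, p2):
        point = random.randrange(1, len(p1))          # crossover point
        return p1[:point] + p2[point:], p2[:point] + p1[point:]

    def two_point_crossover(p1, p2):
        i, j = sorted(random.sample(range(1, len(p1)), 2))
        return (p1[:i] + p2[i:j] + p1[j:],
                p2[:i] + p1[i:j] + p2[j:])

    def uniform_crossover(p1, p2):
        c1 = [random.choice(pair) for pair in zip(p1, p2)]
        c2 = [random.choice(pair) for pair in zip(p1, p2)]
        return c1, c2

    def mutate(individual):
        child = list(individual)
        pos = random.randrange(len(child))            # mutated bit
        child[pos] = 1 - child[pos]
        return child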

  • Selection: Which strings in the population should be operated on? Either rank the individuals and select the n fittest ones, or assign probabilities according to fitness and select probabilistically, say P(x_i) = f(x_i) / sum_j f(x_j).
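    A sketch of the fitness-proportional (roulette wheel) variant:

    import random

    def roulette_select(population, fitness):
        # Select one individual with probability proportional to its fitness.
        scores = [fitness(x) for x in population]
        total = sum(scores)
        r = random.uniform(0, total)
        running = 0.0
        for x, s in zip(population, scores):
            running += s
            if running >= r:
                return x
        return population[-1]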

  • Creating a New Population: Create a population Pnew with p individuals. Survival: allow individuals from the old population to survive intact; rate: a fraction (1 - r) of the population; the survivors can be selected deterministically or randomly. Crossover: select fit individuals and create new ones; rate: a fraction r of the population; how to select them? Mutation: slightly modify any of the above individuals; mutation rate m; fixed number of mutations versus probabilistic mutations.

  • GA Algorithm: Randomly generate an initial population P. Evaluate the fitness f(x_i) of each individual in P. Repeat: Survival: probabilistically select (1 - r)p individuals from P and add them to Pnew, with selection probability proportional to fitness. Crossover: probabilistically select rp/2 pairs from P, apply the crossover operator, and add the offspring to Pnew. Mutation: uniformly choose m percent of the members and invert one randomly selected bit in each. Update: P <- Pnew. Evaluate: compute the fitness f(x_i) of each individual in P. Return the fittest individual from P.
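    Putting the pieces together, a compact sketch of this loop (reusing roulette_select, single_point_crossover, and mutate from the sketches above; r corresponds to the crossover rate and m to the mutation rate):

    import random

    def genetic_algorithm(init_population, fitness, generations=100, r=0.6, m=0.05):
        P = list(init_population)
        p = len(P)
        for _ in range(generations):
            P_new = []
            # Survival: carry over (1 - r) * p individuals, chosen by fitness
            for _ in range(int((1 - r) * p)):
                P_new.append(roulette_select(P, fitness))
            # Crossover: fill the rest of Pnew from selected pairs
            while len(P_new) < p:
                a = roulette_select(P, fitness)
                b = roulette_select(P, fitness)
                c1, c2 = single_point_crossover(a, b)
                P_new.extend([c1, c2][:p - len(P_new)])
            # Mutation: invert one bit in roughly m * p individuals
            for _ in range(int(m * p)):
                i = random.randrange(p)
                P_new[i] = mutate(P_new[i])
            P = P_new
        return max(P, key=fitness)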

  • Analysis of GA: Schemas. Does GA converge? Does GA move towards a good solution, or get stuck in local optima?

    Holland (1975) gave an analysis based on schemas. A schema is a string combination of 0s, 1s, and *s; for example, 0*10 represents {0010, 0110}.

  • The Schema Theorem (all the theory on one slide): E[m(s, t+1)] >= m(s, t) * (u(s, t) / f_bar(t)) * [1 - p_c * d(s)/(l - 1)] * (1 - p_m)^o(s), where m(s, t) is the number of instances of schema s at time t, u(s, t) is the average fitness of individuals in schema s at time t, f_bar(t) is the average fitness of the population, p_c and p_m are the probabilities of crossover and mutation, o(s) is the number of defined bits in schema s, d(s) is the distance between the defined bits in s, and l is the string length.

  • Interpretation: Fit schemas grow in influence. What is missing? Crossover? Mutation? How about time t + 1? Other approaches: Markov chains, statistical mechanics.

  • GA for Feature Selection: Feature selection means selecting a subset of attributes (features); the reason is that there are too many attributes, or they are redundant or irrelevant.

    The set of all subsets of attributes is very large and has little structure to search, which makes random search methods a natural fit.

  • Encoding: Need a bit-code representation. With n attributes, each attribute is either in (1) or out (0) of the selected set.

  • Fitness: Wrapper approach: apply a learning algorithm, say a decision tree, to the individual x = {outlook, humidity} and let fitness equal the error rate (minimize). Filter approach: let fitness equal the entropy (minimize); other diversity measures can also be used; perhaps a simplicity measure?
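    A sketch of the wrapper fitness using scikit-learn as the learner; the decision tree and the 3-fold cross-validation are illustrative choices, since the slides only say "a learning algorithm, say a decision tree".

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    def wrapper_fitness(bits, X, y):
        # Error rate of a decision tree trained on the selected attribute columns.
        # bits is a 0/1 list over the columns of X; lower fitness is better.
        mask = np.array(bits, dtype=bool)
        if not mask.any():                      # empty subset: worst possible fitness
            return 1.0
        accuracy = cross_val_score(DecisionTreeClassifier(), X[:, mask], y, cv=3).mean()
        return 1.0 - accuracy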

  • Crossover: [Diagram of a crossover point applied to two attribute-subset bit strings.]

  • In Weka

  • Clustering Example: Crossover on cluster assignments, e.g., parents {10,20}{30,40} and {20,40}{10,30} produce offspring {10,20,40}{30} and {20}{10,30,40}. Create two clusters for:

    ID   Outlook    Temperature  Humidity  Windy  Play
    10   Sunny      Hot          High      True   No
    20   Overcast   Hot          High      False  Yes
    30   Rainy      Mild         High      False  Yes
    40   Rainy      Cool         Normal    False  Yes

  • Discussion: GA is a flexible and powerful random search methodology. Its efficiency depends on how well you can encode the solutions in a way that works with the crossover operator. In data mining, attribute selection is the most natural application.

  • Attribute Selection in Unsupervised Learning: Attribute selection typically uses a measure, such as accuracy, that is directly related to the class attribute. How do we apply attribute selection to unsupervised learning such as clustering? We need a measure: compactness of clusters, separation among clusters, or multiple measures combined.

  • Quality Measures: Compactness: how close the instances of each cluster are to its centroid; the measure involves the centroids, the instances, the clusters, and the number of attributes, with a normalization constant to keep it on a comparable scale.

  • More Quality Measures: Cluster separation.

  • Final Quality Measures: Adjustment for bias; complexity.

  • Wrapper Framework: Loop: obtain an attribute subset; apply the k-means algorithm; evaluate the cluster quality; until the stopping criterion is satisfied.

  • Problem: What is the optimal attribute subset? What is the optimal number of clusters? Try to find both simultaneously.

  • Example: Find an attribute subset and the optimal number of clusters (Kmin = 2, Kmax = 3) for:

    ID   Sepal Length  Sepal Width  Petal Length  Petal Width
    10   5.0           3.5          1.6           0.6
    20   5.1           3.8          1.9           0.4
    30   4.8           3.0          1.4           0.3
    40   5.1           3.8          1.6           0.2
    50   4.6           3.2          1.4           0.2
    60   6.5           2.8          4.6           1.5
    70   5.7           2.8          4.5           1.3
    80   6.3           3.3          4.7           1.6
    90   4.9           2.4          3.3           1.0
    100  6.6           2.9          4.6           1.3

  • Formulation: Define an individual as a bit string in which the first four bits indicate whether Sepal Length, Sepal Width, Petal Length, and Petal Width are selected, and the last bit encodes the number of clusters (0 for k = 2, 1 for k = 3).

    Initial population: 0 1 0 1 1 and 1 0 0 1 0.
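    A sketch of decoding such an individual under this interpretation (which is inferred from the two worked examples that follow):

    ATTRIBUTES = ["Sepal Length", "Sepal Width", "Petal Length", "Petal Width"]
    K_MIN = 2

    def decode(individual):
        # Split a 5-bit individual into the attribute subset and the cluster count.
        subset = [name for name, bit in zip(ATTRIBUTES, individual[:4]) if bit == 1]
        k = K_MIN + individual[4]
        return subset, k

    # decode([0, 1, 0, 1, 1]) -> (["Sepal Width", "Petal Width"], 3)
    # decode([1, 0, 0, 1, 0]) -> (["Sepal Length", "Petal Width"], 2)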

  • Evaluate Fitness: Start with 0 1 0 1 1, i.e., three clusters and {Sepal Width, Petal Width}. Apply k-means with k = 3 to:

    ID   Sepal Width  Petal Width
    10   3.5          0.6
    20   3.8          0.4
    30   3.0          0.3
    40   3.8          0.2
    50   3.2          0.2
    60   2.8          1.5
    70   2.8          1.3
    80   3.3          1.6
    90   2.4          1.0
    100  2.9          1.3

  • K-Means: Start with random centroids: instances 10, 70, and 80. [Scatter plot of Petal Width versus Sepal Width for the ten instances.]
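    A compact k-means sketch in Python (NumPy) for this setup; the initial centroids are taken to be instances 10, 70, and 80 as on the slide.

    import numpy as np

    def kmeans(X, initial_centroids, max_iter=100):
        # Plain k-means: assign points to the nearest centroid, recompute centroids
        # as cluster means, and stop when the assignment no longer changes.
        centroids = X[initial_centroids].astype(float)
        assignment = None
        for _ in range(max_iter):
            distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
            new_assignment = distances.argmin(axis=1)
            if np.array_equal(new_assignment, assignment):
                break                              # no change in assignment: terminate
            assignment = new_assignment
            for j in range(len(centroids)):
                if (assignment == j).any():
                    centroids[j] = X[assignment == j].mean(axis=0)
        return centroids, assignment

    # (Sepal Width, Petal Width) data; rows 0, 6, 7 correspond to instances 10, 70, 80
    X = np.array([[3.5, 0.6], [3.8, 0.4], [3.0, 0.3], [3.8, 0.2], [3.2, 0.2],
                  [2.8, 1.5], [2.8, 1.3], [3.3, 1.6], [2.4, 1.0], [2.9, 1.3]])
    centroids, labels = kmeans(X, initial_centroids=[0, 6, 7])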

  • New Centroids: No change in assignment, so terminate the k-means algorithm. [Scatter plot with the recomputed centroids, approximately (3.46, 0.34) and (2.73, 1.28), marked.]

  • Quality of Clusters: Centers: Center 1 at (2.73, 1.28): {60, 70, 90, 100}; Center 2 at (3.30, 1.60): {80}; Center 3 at (3.46, 0.34): {10, 20, 30, 40, 50}. Evaluation follows from these clusters.

  • Next Individual: Now look at 1 0 0 1 0, i.e., two clusters and {Sepal Length, Petal Width}. Apply k-means with k = 2 to:

    ID   Sepal Length  Petal Width
    10   5.0           0.6
    20   5.1           0.4
    30   4.8           0.3
    40   5.1           0.2
    50   4.6           0.2
    60   6.5           1.5
    70   5.7           1.3
    80   6.3           1.6
    90   4.9           1.0
    100  6.6           1.3

  • K-Means: Say we select instances 20 and 90 as the initial centroids. [Scatter plot of Petal Width versus Sepal Length for the ten instances, with the two initial centroids marked.]

  • Recalculate Centroids: [Scatter plot with the recomputed centroids, approximately (4.92, 0.34) and (6.0, 1.34), marked.]

  • Recalculate Again: No change in assignment, so terminate the k-means algorithm.

  • Quality of Clusters: Centers: Center 1 at (4.92, 0.45): {10, 20, 30, 40, 50, 90}; Center 2 at (6.28, 1.43): {60, 70, 80, 100}. Evaluation follows from these clusters.

  • Compare Individuals: Which is fitter?

  • Evaluating Fitness: The quality measures can be scaled (if necessary) and then weighted together, e.g., as a weighted sum.

    Alternatively, we can use Pareto optimization.

  • Mathematical Programming: Continuous decision variables; constrained versus unconstrained; the form of the objective function distinguishes Linear Programming (LP), Quadratic Programming (QP), and general Mathematical Programming (MP).

  • Linear Program

  • Two Dimensional Problem: The optimum is always at an extreme point of the feasible region.
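    As a sketch, here is a small two-dimensional LP solved with SciPy; the particular objective and constraints are made up for illustration, and the reported solution lies at an extreme point of the feasible region.

    from scipy.optimize import linprog

    # maximize 3x + 2y  subject to  x + y <= 4,  x + 3y <= 6,  x, y >= 0
    # linprog minimizes, so negate the objective coefficients
    result = linprog(c=[-3, -2],
                     A_ub=[[1, 1], [1, 3]],
                     b_ub=[4, 6],
                     bounds=[(0, None), (0, None)])
    print(result.x)   # optimal (x, y), an extreme point: (4, 0)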

  • Simplex Method

  • Quadratic Programming: [Plot of the quadratic f(x) = 0.2 + (x - 1)^2, which attains its minimum value 0.2 at x = 1.]

  • General MP: A derivative of zero is a necessary but not sufficient condition for optimality.

  • Constrained Problem?

  • General MP: We write a general mathematical program in matrix notation as, e.g., minimize f(x) subject to g(x) <= 0, where x is the vector of decision variables.

  • Karush-Kuhn-Tucker (KKT) Conditions
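    For the general MP above (minimize f(x) subject to g(x) <= 0), the standard first-order KKT conditions can be stated as:

    \nabla f(x^*) + \lambda^\top \nabla g(x^*) = 0, \qquad
    g(x^*) \le 0, \qquad
    \lambda \ge 0, \qquad
    \lambda_i \, g_i(x^*) = 0 \ \ \text{for all } i.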

  • Convex Sets: A set C is convex if any line connecting two points in the set lies completely within the set, that is, for all x, y in C and 0 <= lambda <= 1, lambda*x + (1 - lambda)*y is in C.

  • Convex Hull: The convex hull co(S) of a set S is the intersection of all convex sets containing S. A set V in R^n is a linear variety if, for all x, y in V and all real lambda, lambda*x + (1 - lambda)*y is in V.

  • Hyperplane: A hyperplane in R^n is an (n - 1)-dimensional linear variety.

  • Convex Hull Example: [Plot of the Play and No Play instances in the Temperature-Humidity plane, with the convex hull of each class drawn.]

  • Finding the Closest Points: Formulate as a QP: find the pair of points, one in each convex hull, that minimizes the distance between them.

  • Support Vector Machines: [Plot of the Play and No Play instances in the Temperature-Humidity plane with a separating hyperplane between the classes.]

  • Example:

    ID   Sepal Width  Petal Width
    10   3.5          0.6
    20   3.8          0.4
    30   3.0          0.3
    40   3.8          0.2
    50   3.2          0.2
    60   2.8          1.5
    70   2.8          1.3
    80   3.3          1.6
    90   2.4          1.0
    100  2.9          1.3

  • Separating Hyperplane: [Scatter plot of Petal Width versus Sepal Width for the ten instances, with a hyperplane separating the two groups.]

  • Assume Separating Planes: Constraints (in the usual canonical form): w.x_i + b >= +1 for instances in one class and w.x_i + b <= -1 for instances in the other.

    Distance from each of these planes to the separating hyperplane: 1/||w||, so the margin between them is 2/||w||.

  • Optimization Problem
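    Combining the constraints and the margin above gives the standard maximum-margin QP, where y_i in {+1, -1} encodes the class of instance x_i:

    \min_{w, b} \ \tfrac{1}{2}\|w\|^2
    \quad \text{subject to} \quad
    y_i \left( w^\top x_i + b \right) \ge 1, \quad i = 1, \dots, n.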

  • How Do We Solve MPs?

  • Improving Search: Direction-step approach. New solution: x(k+1) = x(k) + lambda * d, where d is the search direction and lambda the step length.

  • Steepest Descent: The search direction is the negative gradient, d = -grad f(x(k)).

    Finding the step length lambda is a one-dimensional optimization problem: minimize f(x(k) - lambda * grad f(x(k))) over lambda.
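    A sketch of steepest descent with a crude line search; the quadratic test function and the grid of candidate step lengths are illustrative choices.

    import numpy as np

    def steepest_descent(f, grad, x0, iterations=50):
        # Move along the negative gradient; choose the step length by minimizing
        # f along that direction over a small grid of candidate values.
        x = np.asarray(x0, dtype=float)
        steps = np.linspace(0.001, 1.0, 200)
        for _ in range(iterations):
            d = -grad(x)                                   # search direction
            lam = min(steps, key=lambda s: f(x + s * d))   # 1-D minimization
            x = x + lam * d
        return x

    # Example: minimize f(x) = 0.2 + (x1 - 1)^2 + (x2 - 2)^2
    f = lambda x: 0.2 + (x[0] - 1) ** 2 + (x[1] - 2) ** 2
    grad = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] - 2)])
    print(steepest_descent(f, grad, x0=[0.0, 0.0]))        # approx. [1, 2]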

  • Newton's Method: Taylor series expansion: f(x) is approximated by f(x(k)) + grad f(x(k))'(x - x(k)) + (1/2)(x - x(k))' H(x(k)) (x - x(k)), where H is the Hessian.

    The right-hand side is minimized at x(k+1) = x(k) - H(x(k))^{-1} grad f(x(k)).
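    A minimal Newton step in Python for a twice-differentiable function; np.linalg.solve avoids forming the inverse Hessian explicitly, and the quadratic example (the same one used above) converges in a single step.

    import numpy as np

    def newton_method(grad, hessian, x0, iterations=10):
        # Repeatedly apply the Newton update x <- x - H(x)^{-1} grad(x).
        x = np.asarray(x0, dtype=float)
        for _ in range(iterations):
            x = x - np.linalg.solve(hessian(x), grad(x))
        return x

    grad = lambda x: np.array([2 * (x[0] - 1), 2 * (x[1] - 2)])
    hessian = lambda x: np.array([[2.0, 0.0], [0.0, 2.0]])
    print(newton_method(grad, hessian, x0=[0.0, 0.0]))     # [1, 2]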

  • Discussion: Computing the inverse Hessian is difficult; quasi-Newton and conjugate gradient methods avoid it. These methods do not account for constraints; penalty methods, Lagrangian methods, etc., handle those.

  • Non-separable: Add an error (slack) term to the constraints, e.g., y_i(w.x_i + b) >= 1 - xi_i with xi_i >= 0, and penalize the total error in the objective.

  • Wolfe Dual: Simple constraints; this is the only place the data appears (through dot products).
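    For the linear, separable case, the Wolfe dual takes the standard form below; note that the data x_i enters only through the dot products x_i^T x_j:

    \max_{\alpha} \ \sum_i \alpha_i - \tfrac{1}{2} \sum_i \sum_j \alpha_i \alpha_j \, y_i y_j \, x_i^\top x_j
    \quad \text{subject to} \quad \alpha_i \ge 0, \quad \sum_i \alpha_i y_i = 0.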

  • Extension to Non-Linear: Kernel functions. A mapping into a high-dimensional Hilbert space takes the place of the dot product in the Wolfe dual.

  • Some Possible Kernels
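    Two kernels commonly used with support vector machines are the polynomial and Gaussian (radial basis function) kernels:

    K(x, z) = (x^\top z + 1)^d, \qquad
    K(x, z) = \exp\!\left( -\frac{\|x - z\|^2}{2\sigma^2} \right).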

  • In Weka: Weka.classifiers.smo; support vector machine for nominal data only; does both linear and non-linear models.

  • Optimization in DM: [Summary diagram of where optimization appears in data mining.]

  • Bayesian Classification: Naive Bayes assumes independence between attributes; simple computations; the best classifier if the assumption is true. Bayesian belief networks model joint probability distributions using directed acyclic graphs in which nodes are random variables (attributes) and arcs represent the dependencies.

  • Example: Bayesian Network: [Network with nodes Family History, Smoker, Lung Cancer, Emphysema, Positive X-Ray, and Dyspnea, and arcs for the dependencies among them.]

  • Conditional Probabilities: [Conditional probability table annotated with the random variable and its possible outcomes.] The node representing the class attribute is called the output node.

  • How Do We Learn? Network structure: either given/known, or inferred/learned from the data. Variables: either observable or hidden (missing values / incomplete data).

  • Case 1: Known Structure and Observable Variables: Straightforward, similar to Naive Bayes: compute the entries of the conditional probability table (CPT) of each variable.

  • Case 2: Known Structure and Some Hidden Variables: We still need to learn the CPT entries. Let S be a set of s training instances, and let w_ijk be the CPT entry for variable Y_i = y_ij having parents U_i = u_ik.

  • CPT Example (Lung Cancer given Family History and Smoker):

          FH,S   FH,~S   ~FH,S   ~FH,~S
    LC    0.8    0.5     0.7     0.1
    ~LC   0.2    0.5     0.3     0.9

  • Objective: Must find the values of the CPT entries w_ijk.

    The objective is to maximize the likelihood of the data, that is, the probability P_w(S) of the training set S under the current CPT entries w.

    How do we do this?

  • Non-Linear MP: Compute the gradients of the likelihood with respect to the w_ijk from the training data, then move in the direction of the gradient using a learning rate (gradient ascent).
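    A heavily hedged sketch of this gradient-ascent update; posterior(i, j, k, d) is a hypothetical helper that would use Bayesian-network inference to compute P(Y_i = y_ij, U_i = u_ik | d) for training instance d, and the renormalization keeps each CPT column a valid distribution.

    def update_cpt(w, training_data, posterior, learning_rate=0.01):
        # One gradient-ascent step on the CPT entries w[i][j][k].
        # posterior(i, j, k, d) is assumed to return P(Y_i = y_ij, U_i = u_ik | d).
        for i in w:
            for j in w[i]:
                for k in w[i][j]:
                    gradient = sum(posterior(i, j, k, d) / w[i][j][k]
                                   for d in training_data)
                    w[i][j][k] += learning_rate * gradient
            # Renormalize so that, for each parent configuration k, sum_j w[i][j][k] = 1
            for k in {k for j in w[i] for k in w[i][j]}:
                total = sum(w[i][j][k] for j in w[i] if k in w[i][j])
                for j in w[i]:
                    if k in w[i][j]:
                        w[i][j][k] /= total
        return w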

  • Case 3: Unknown Network Structure: We need to find/learn the optimal network structure for the data.

    What type of optimization problem is this? Combinatorial optimization (GA, etc.).