basic data mining

Basic Concepts of Data Mining, Clustering and Genetic Algorithms
Tsai-Yang Jea
Department of Computer Science and Engineering
SUNY at Buffalo

Data Mining Motivation
Mechanical production of data creates a need for mechanical consumption of data

Large databases = vast amounts of information

Difficulty lies in accessing it

KDD and Data Mining
KDD: Extraction of knowledge from data

non-trivial extraction of implicit, previously unknown & potentially useful knowledge from data

Data Mining: Discovery stage of the KDD process

Data Mining Techniques
Query tools
Statistical techniques
Visualization
On-line analytical processing (OLAP)
Clustering
Classification
Decision trees
Association rules
Neural networks
Genetic algorithms
Any technique that helps to extract more out of data is useful

What Is Clustering?
Clustering is a kind of unsupervised learning.
Clustering is a method of grouping data that share similar trends and patterns.
Clustering groups a large set of data into clusters, which are smaller sets of similar data.
Example: [figure]
After clustering: [figure]
Thus, clustering means grouping data, or dividing a large data set into smaller data sets that share some similarity.

The Usage of Clustering
Engineering sciences such as pattern recognition and artificial intelligence have long used the concepts of cluster analysis. Typical examples to which clustering has been applied include handwritten characters, samples of speech, fingerprints, and pictures.
In the life sciences (biology, botany, zoology, entomology, cytology, microbiology), the objects of analysis are life forms such as plants, animals, and insects. Cluster analysis may range from developing complete taxonomies to classifying species into subspecies, which can themselves be subdivided further.
Cluster analysis is also widely used in the information, policy and decision sciences. Its applications to documents include votes on political issues, surveys of markets, surveys of products, surveys of sales programs, and R & D.

A Clustering Example
[Figure: four customer clusters, labeled Cluster 1 to Cluster 4, with the following profiles]
Income: High, Children: 1, Car: Luxury
Income: Low, Children: 0, Car: Compact
Car: Sedan, Children: 3, Income: Medium
Income: Medium, Children: 2, Car: Truck

Different ways of representing clusters
[Figure: diagrams of cluster representations, including panel (b)]

K Means Clustering (Iterative distance-based clustering)
K means clustering is an effective algorithm to extract a given number of clusters of patterns from a training set. Once done, the cluster locations can be used to classify patterns into distinct classes.

K means clustering (Cont.)
Select the k cluster centers randomly.
Store the k cluster centers.
Loop until the change in cluster means is less than the amount specified by the user.
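The steps on this slide can be sketched in a few lines of Python. The slide names only the outer steps, so the loop body used here (assign each point to its nearest center, then recompute each center as the mean of its members) is the usual K-means iteration and is filled in as an assumption. The names `k_means`, `tol`, `max_iter` and `seed` are illustrative, not from the slides.

```python
import random


def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, tol=1e-6, max_iter=100, seed=0):
    """Return (centers, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Select the k cluster centers randomly (here: k distinct input points).
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centers[c]))
            for p in points
        ]
        # Recompute each center as the mean of its members.
        new_centers = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            # Assumption: an empty cluster keeps its previous center.
            new_centers.append(mean_point(members) if members else centers[c])
        # Stop once no center moved by more than the user-specified tolerance.
        shift = max(squared_distance(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift <= tol ** 2:
            break
    return centers, labels


# Four objects with two attributes each, clustered into two groups.
points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
centers, labels = k_means(points, k=2)
```

On this small data set the two nearby pairs end up in separate clusters whichever starting centers are drawn. On harder data the result can depend on the seed.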

The drawbacks of K-means clustering
The final clusters represent only a local optimum, not a global one, and completely different final clusters can arise from differences in the initial randomly chosen cluster centers. (fig. 1)
We have to know in advance how many clusters there will be.

Drawback of K-means clustering (Cont.)
[Figure 1]

Clustering with Genetic Algorithms
Introduction to genetic algorithms
Elements of GAs
Genetic representation
Genetic operators

Introduction to GAs
Inspired by biological evolution.
Many operators mimic processes of biological evolution, including:
Natural selection
Crossover
Mutation

Elements of GAs
Individual (chromosome): a feasible solution to an optimization problem
Population: a set of individuals, maintained in each generation

Elements of GAs (Cont.)
Genetic operators (crossover, mutation)
Fitness function: takes a single chromosome as input and returns a measure of the goodness of the solution represented by the chromosome.

Genetic Representation
The most important starting point for developing a genetic algorithm
Each gene has its own special meaning
Based on this representation, we can define the fitness evaluation function, the crossover operator, and the mutation operator.

Genetic Representation (Cont.)
Example 1
[Figure: a chromosome with labels for a gene and its allele value]

Genetic Representation (Cont.)
Example 2 (in a clustering problem)
Each chromosome represents a set of clusters; each gene represents an object; each allele value represents a cluster. Genes with the same allele value are in the same cluster.
1 2 1 4 3 5 5
A B C D E F G
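As a small illustration, the chromosome on this slide can be decoded into clusters as follows. This sketch assumes the seven allele values are listed in the same order as objects A to G. The function name `decode_chromosome` is hypothetical.

```python
from collections import defaultdict


def decode_chromosome(objects, chromosome):
    """Group objects by allele value: equal alleles mean the same cluster."""
    clusters = defaultdict(list)
    for obj, allele in zip(objects, chromosome):
        clusters[allele].append(obj)
    return dict(clusters)


objects = ["A", "B", "C", "D", "E", "F", "G"]
chromosome = [1, 2, 1, 4, 3, 5, 5]  # assumed order: gene i belongs to object i
clusters = decode_chromosome(objects, chromosome)
# clusters == {1: ['A', 'C'], 2: ['B'], 4: ['D'], 3: ['E'], 5: ['F', 'G']}
```

Under that reading, A and C share a cluster, F and G share a cluster, and B, D and E are each alone.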

Crossover
Exchange features of two individuals to produce two offspring (children)
Selected mates may have good properties that help them survive into the next generations
So we can expect that exchanging features may produce other good individuals

Crossover (cont.)
Single-point crossover
Two-point crossover
Uniform crossover
[Figure: bit-string examples of each crossover type; the uniform case uses a crossover template]
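The three crossover types can be sketched on bit strings as below. The bit strings from the original figure did not survive, so the parents used here are made up for illustration. The function names and the convention that a "1" in the template means "child 1 takes the bit from parent a" are assumptions.

```python
def single_point_crossover(a, b, point):
    """Swap the tails of two equal-length strings after one cut point."""
    return a[:point] + b[point:], b[:point] + a[point:]


def two_point_crossover(a, b, start, end):
    """Swap the middle segment [start:end) between the two parents."""
    return (
        a[:start] + b[start:end] + a[end:],
        b[:start] + a[start:end] + b[end:],
    )


def uniform_crossover(a, b, template):
    """Pick each bit by a template: '1' takes from a, '0' takes from b."""
    child1 = "".join(x if t == "1" else y for x, y, t in zip(a, b, template))
    child2 = "".join(y if t == "1" else x for x, y, t in zip(a, b, template))
    return child1, child2


parent_a, parent_b = "11111111", "00000000"
print(single_point_crossover(parent_a, parent_b, 3))       # ('11100000', '00011111')
print(two_point_crossover(parent_a, parent_b, 2, 5))       # ('11000111', '00111000')
print(uniform_crossover(parent_a, parent_b, "10101010"))   # ('10101010', '01010101')
```

In practice the cut points and the template would be drawn at random. They are fixed here so the effect of each operator is easy to see.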

Mutation
Usually changes a single bit in a bit string
This operator should happen with very low probability.
[Figure: a bit string with a randomly chosen mutation point]
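A minimal sketch of single-bit mutation, assuming chromosomes are strings of "0" and "1". The name `mutate`, the default `rate` of 0.01 and the `rng` parameter are illustrative choices; the slides give no specific probability.

```python
import random


def mutate(bits, rate=0.01, rng=random):
    """With probability `rate`, flip one randomly chosen bit of the string."""
    if rng.random() >= rate:
        return bits  # the common case: no mutation this time
    point = rng.randrange(len(bits))  # random mutation point
    flipped = "1" if bits[point] == "0" else "0"
    return bits[:point] + flipped + bits[point + 1:]


rng = random.Random(1)
original = "11011001"
mutated = mutate(original, rate=1.0, rng=rng)  # rate=1.0 forces a mutation
```

With `rate=1.0` the result always differs from the input in exactly one position. With a realistic low rate most calls return the string unchanged.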

Typical Procedures
Crossover mates are probabilistically selected based on their fitness value.
[Figure: an old generation becomes a new generation: individuals are selected probabilistically, the crossover point is selected randomly, and the mutation point is random]
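Probabilistic selection by fitness is often done with fitness-proportional ("roulette wheel") sampling. The slides do not name a specific scheme, so that choice is an assumption, as is the function name `select_mates`. The sketch relies on `random.choices` from the Python standard library, which samples with replacement according to the given weights.

```python
import random


def select_mates(population, fitness, n, rng=random):
    """Draw n individuals, each with probability proportional to its fitness."""
    # Weights must be non-negative and must not all be zero.
    weights = [fitness(individual) for individual in population]
    return rng.choices(population, weights=weights, k=n)


def count_ones(bits):
    """Toy fitness for illustration: the number of 1 bits in the string."""
    return bits.count("1")


rng = random.Random(0)
population = ["0000", "1111", "0011"]
mates = select_mates(population, count_ones, n=50, rng=rng)
```

Here "0000" has fitness zero and is never selected, while "1111" is drawn about twice as often as "0011".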

How to apply GA to a clustering problem
Preparing the chromosomes

Defining genetic operators
Fusion: takes two unique allele values and combines them into a single allele value, combining two clusters into one.

Fission: takes a single allele value and gives it a different random allele value, breaking a cluster apart.
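Fusion and fission can be sketched on the allele-list representation as follows. The slides do not say how fission chooses which genes move, so this sketch assumes a random non-empty proper subset of the chosen cluster receives a fresh allele value. The function names and the `rng` parameter are hypothetical.

```python
import random


def fusion(chromosome, rng=random):
    """Merge two distinct allele values into one, joining two clusters."""
    alleles = sorted(set(chromosome))
    if len(alleles) < 2:
        return list(chromosome)  # only one cluster: nothing to merge
    keep, drop = rng.sample(alleles, 2)
    return [keep if gene == drop else gene for gene in chromosome]


def fission(chromosome, rng=random):
    """Split one cluster by giving some of its genes a new allele value."""
    alleles = sorted(set(chromosome))
    target = rng.choice(alleles)
    members = [i for i, gene in enumerate(chromosome) if gene == target]
    if len(members) < 2:
        return list(chromosome)  # a single-object cluster cannot be split
    new_allele = max(alleles) + 1  # assumption: use an unused allele value
    # Move a random, non-empty, proper subset of the cluster's genes.
    count = rng.randrange(1, len(members))
    moved = set(rng.sample(members, count))
    return [new_allele if i in moved else gene for i, gene in enumerate(chromosome)]


rng = random.Random(0)
fused = fusion([1, 2, 1, 4, 3, 5, 5], rng)  # five clusters become four
split = fission([1, 1, 1, 1], rng)          # one cluster becomes two
```

Together the two operators let the number of clusters shrink or grow during the search, so k does not have to be fixed in advance.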

Defining fitness functions

Example (Cont.)
[Figure: an old generation becomes a new generation through crossover, mutation, fusion, and fission]
Select the chromosomes according to the fitness function.

Finally

Why do we use data mining?
The amount of data in our lives seems to go on increasing. Most international organizations produce more information in a week than many people could read in a lifetime. Computers make it too easy to save things that we would have thrown away before, because inexpensive multi-gigabyte disks make it too easy to postpone decisions about what to do with all this material: we simply buy another disk and keep it all. A lot of electronic equipment records our decisions, our choices in the supermarket, our financial habits, our comings and goings. The World Wide Web overwhelms us with information; meanwhile, every choice we make is recorded.

Now the problem is how to analyze this huge and growing database and find something meaningful and useful in it. Data mining is the automatic (or semi-automatic) process of discovering patterns in data.

KDD is knowledge discovery in databases. The knowledge discovered by KDD has to be non-trivial. For example, suppose that in a supermarket database the pork and beef come from the same supplier, and that supplier is in Texas. Then we may discover that all pork and beef come from Texas. This is trivial knowledge.

So, data mining is the process of finding trends, patterns, and relations in data. The objective of this process is to sort through large quantities of data and discover new information. The benefit of data mining is to turn this newfound knowledge into actionable results, such as increasing a customer's likelihood to buy, or decreasing the number of fraudulent claims.

In today's presentation, I will cover the basic concepts of clustering and genetic algorithms, and how to use genetic algorithms in a clustering problem. Feng Chen will cover the visualization part.

For example, in the case of fraudulent claims, the records may naturally separate into two classes. One of the categories may correspond to normal claims and the other may correspond to fraudulent claims. Of course, there may be some legitimate claims that are mislabeled as fraudulent, and vice versa.

For example, in direct marketing you may want to examine market segments. If you have customer segments by age range (ages 0-5, 6-10, 11-15, 16-20, etc.), it may be interesting to cluster them into coarser groups like babies, children, teenagers, etc. Clustering helps determine which groups should be together. Retailers want to know where similarities exist in their customer base so they can create and understand different groups to which they can sell and market. They will use a database with rows of customer information and attempt to create customer segments. Clustering techniques look for similarities and differences within a data set and group similar rows together into clusters or segments. For example, a data set may contain many affluent customers with no children and may also have customers with lower incomes and one parent in the family. During the discovery process, this difference can be used to separate the data into two natural groupings. If more such similarities and differences exist, the data set could be further subdivided.

Case (a) is the simplest: it involves associating a cluster number with each instance, which might be depicted by laying the instances out in two dimensions and partitioning the space to show each cluster.

Case (b): some algorithms allow one instance to belong to more than one cluster, so the diagram might lay the instances out in two dimensions and draw overlapping subsets representing each cluster.

Case (c): some algorithms associate each instance with a probability, or degree of membership, with which it belongs to each of the clusters.

Case (d): some algorithms provide a hierarchical structure of clusters, each of which divides into its own subclusters at the next level down, and so on.

This is the "simplest" version of K means clustering. There are many alternative algorithms. In some books, the number of starting points is randomly selected. Then we can use "combination" and "split" to reach the k we need.

Here is an example that clusters 4 objects into two clusters. Each object has two attributes.

A genetic algorithm searches in a manner similar to natural selection in nature. By mimicking survival of the fittest, genetic algorithms try to evolve a solution to a search problem. A population of possible solutions is used. The fitter members of this population (i.e. those that are nearer to the solution) are more likely to mate and produce the next generation. As the generations pass, the members of the population should get fitter and fitter (i.e. closer and closer to the solution).

A chromosome is a feasible solution. A set of chromosomes is a set of feasible solutions. A population is a set of chromosomes, that is, a set of feasible solutions.

Genetic algorithms accept their input coded as a finite-length string (or chromosome). Each of the elements in the chromosome is a gene, and each gene has an allele value.

The fitness function depends on the nature of the problem. In a clustering problem we can define the fitness function in terms of the distance within clusters and the distance between clusters.
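One possible fitness function along these lines is sketched below. The notes say only that within-cluster and between-cluster distances can be used, so the specific formula (mean distance between centroids divided by mean distance of points to their own centroid) is an assumption, as are the names `clustering_fitness` and `eps`. The sketch uses `math.dist`, which requires Python 3.8 or later.

```python
import math
from itertools import combinations


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def clustering_fitness(points, chromosome, eps=1e-9):
    """Higher is better: far-apart centroids and tight clusters score high."""
    clusters = {}
    for point, allele in zip(points, chromosome):
        clusters.setdefault(allele, []).append(point)
    centers = {allele: centroid(members) for allele, members in clusters.items()}
    # Within-cluster distance: mean distance from each point to its centroid.
    within = sum(
        math.dist(point, centers[allele])
        for point, allele in zip(points, chromosome)
    ) / len(points)
    pairs = list(combinations(centers.values(), 2))
    if not pairs:
        return 0.0  # a single cluster has no between-cluster distance
    # Between-cluster distance: mean distance over all pairs of centroids.
    between = sum(math.dist(c1, c2) for c1, c2 in pairs) / len(pairs)
    return between / (within + eps)


points = [(1.0, 1.0), (1.5, 2.0), (8.0, 8.0), (9.0, 9.5)]
good = clustering_fitness(points, [1, 1, 2, 2])  # nearby points grouped together
bad = clustering_fitness(points, [1, 2, 1, 2])   # distant points grouped together
```

Here `good` comes out much larger than `bad`. This ratio also rewards very small clusters, so a practical fitness function would probably penalize the number of clusters as well.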
