DATA MINING. Patrick J. Gallagher. March 21, 2006. What is Data Mining?


Page 1: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

DATA MINING

Patrick J. Gallagher

March 21, 2006

Page 2: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

What is Data Mining?

Page 3: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

DEFINITION:

The automated extraction of hidden predictive information from (large) databases.

Page 4: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

DATA MINING TECHNIQUES

CLASSICAL:
1. Statistics
2. Nearest Neighbor
3. Clustering

NEXT GENERATION:
1. Decision Trees
2. Neural Networks
3. Rule Induction

Page 5: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

THE CLASSICS

The techniques discussed will be those that have been used for decades.

They are also used almost all of the time on existing business problems.

Page 6: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

1. STATISTICS

WHAT IS IT?

Page 7: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

STATISTICS: a branch of mathematics concerning the collection and the description of data.

Page 8: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

Born of real-world problems from business, biology and gambling.

Knowing statistics helps the average business person make better decisions by allowing them to figure out risk and uncertainty when all the facts either aren’t known or can’t be collected.

Has been around for a long, long time (easily a century).

Page 9: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

HISTOGRAMS

STATISTICAL SUMMARIZATION

Page 10: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

Database Information

ID  Name   Prediction  Age  Balance  Income  Eyes   Gender
1   Amy    NO          62   $0       Medium  Brown  F
2   Al     NO          53   $1,800   Medium  Green  M
3   Betty  NO          47   $16,543  High    Brown  F
4   Bob    YES         32   $45      Medium  Green  M
5   Carla  YES         21   $2,300   High    Blue   F
6   Carl   NO          27   $5,400   High    Brown  M
7   Donna  YES         50   $165     Low     Blue   F
8   Don    YES         46   $0       High    Blue   M
9   Edna   YES         27   $500     Low     Blue   F
10  Ed     NO          68   $1,200   Low     Blue   M

Page 11: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

This histogram shows the number of customers with various eye colors. As you can see, the histogram can show important information about the database.
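The eye-color histogram can be reproduced from the ten records on the Database Information slide with a simple count. The following is a minimal sketch in Python; the variable names and the text-bar rendering are illustrative assumptions and are not part of the original slides.

```python
from collections import Counter

# Eye color of each of the ten customers, in ID order (1 through 10),
# copied from the Database Information slide.
eyes = ["Brown", "Green", "Brown", "Green", "Blue",
        "Brown", "Blue", "Blue", "Blue", "Blue"]

# Count how many customers have each eye color.
eye_counts = Counter(eyes)

# Print a text histogram, most common color first.
for color, count in eye_counts.most_common():
    print(f"{color:<6} {'#' * count} ({count})")
```

Each bar length is the number of customers with that eye color, which is all a histogram of a categorical predictor needs.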

Page 12: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

WHAT QUESTIONS CAN STATISTICS ANSWER?

Page 13: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

What patterns are there in my database?

What is the chance that an event will occur?

What patterns are significant?

What is a high-level summary of the data that gives me some idea of what is contained in my database?

Page 14: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

NOT ALL HISTOGRAMS ARE THIS SIMPLE

Page 15: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

Complex histograms provide more information (Predictors)

Page 16: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

SUMMARY STATISTICSSUMMARY STATISTICS

Max - maximum value of the predictor

Min - minimum value of the predictor

Mean - average value of the predictor

Median - value of a given predictor that divides the database as nearly as possible into two databases with equal numbers of records

Mode - most common value of the predictor

Variance - measure of how spread out the values are from the average value
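As a hedged illustration, the summary statistics listed above can be computed for the Age predictor of the ten-record customer table with Python's standard `statistics` module. Treating the ten records as the whole population (and therefore using `pvariance` rather than the sample variance) is an assumption made for this sketch.

```python
import statistics

# Age of each customer, in ID order, from the Database Information slide.
ages = [62, 53, 47, 32, 21, 27, 50, 46, 27, 68]

summary = {
    "max": max(ages),
    "min": min(ages),
    "mean": statistics.mean(ages),
    "median": statistics.median(ages),
    "mode": statistics.mode(ages),
    # Population variance: average squared distance from the mean.
    "variance": statistics.pvariance(ages),
}

for name, value in summary.items():
    print(f"{name:>8}: {value}")
```

With an even number of records the median falls between the two middle values, which matches the "as nearly as possible" wording of the definition.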

Page 17: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

STATISTICS FOR PREDICTION

Prediction = ?

A = Regression

B = Simulation

C = Decision

Page 18: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

PREDICTION = REGRESSION

“A”

Page 19: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

LINEAR REGRESSION

One predictor and one prediction. The relationship between the two can be mapped in a two-dimensional space, with the records plotted by prediction value along the Y axis and predictor value along the X axis.

Seeks to build a predictive model that is a line mapping each predictor value to a prediction value.

Page 20: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

Sample Linear Regression Predictive Model

The line will take a given value for a predictor and map it into a given value for a prediction.

Equation is: Prediction = a + b * predictor

The trick with predictive modeling is to find the model that best minimizes the error.
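A minimal sketch of fitting the line Prediction = a + b * predictor follows. The slides do not say which error is minimized; this sketch assumes the usual sum of squared errors (ordinary least squares). The function names and the sample data are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) that minimize the sum of squared errors of a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by the variance of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    a = mean_y - b * mean_x
    return a, b


def predict(a, b, predictor):
    """Map a predictor value to a prediction value with the fitted line."""
    return a + b * predictor


# Hypothetical records that lie exactly on the line 2 + 3 * x.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [5.0, 8.0, 11.0, 14.0]

a, b = fit_line(xs, ys)
print(a, b)
print(predict(a, b, 5.0))
```

Real data will not lie exactly on a line, so the fitted `a` and `b` are the values that make the total squared distance between the line and the records as small as possible.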

Page 21: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

2. NEAREST NEIGHBOR

What does it mean?

Page 22: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

Nearest Neighbor

In order to predict what a prediction value is in one record, look for records with similar predictor values in the historical database and use the prediction value from the record that is “nearest” to the unclassified record.

Page 23: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

NEAREST NEIGHBOR EXAMPLE?

Page 24: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

EXAMPLE

Determining that people in your neighborhood have an income of over $100,000 per year

NEAREST NEIGHBOR ASSUMES

Your income is also over $100,000

Page 25: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

The prediction value for one record is determined by looking for similar predictor values in the historical database and using the prediction value from the record that is nearest to the unclassified record (ex: salaries of people in your neighborhood).

These techniques are among the easiest to use and understand because they work in a way similar to the way a person thinks.

They are among the oldest techniques used in data mining.
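The nearest neighbor idea can be sketched in a few lines of Python. The historical records, the two predictors (age and years at the current address), and the use of Euclidean distance are all hypothetical choices made for illustration; the slides do not specify them.

```python
import math


def nearest_neighbor_predict(history, query):
    """Return the prediction value of the historical record nearest to query.

    history is a list of (predictor_tuple, prediction_value) pairs and
    query is a predictor tuple of the same length.
    """
    nearest = min(history, key=lambda record: math.dist(record[0], query))
    return nearest[1]


# Hypothetical historical database: (age, years at address) -> income band.
history = [
    ((25, 1), "under $100,000"),
    ((45, 10), "over $100,000"),
    ((50, 12), "over $100,000"),
    ((30, 2), "under $100,000"),
]

print(nearest_neighbor_predict(history, (47, 11)))
print(nearest_neighbor_predict(history, (28, 2)))
```

In practice the predictors would usually be rescaled first so that one predictor with a large numeric range does not dominate the distance.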

Page 26: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

NEAREST NEIGHBOR PREDICTION TECHNIQUE

USES

Business

Stock Market Data

Page 27: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

PREDICTION IN NEAREST NEIGHBOR MEANS:

Objects that are “NEAR” to each other will have similar prediction values as well.

Thus if you know the prediction value of one of the objects, you can predict it for its nearest neighbor.

Page 28: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

BUSINESS

Text Retrieval: This particular technique is used to find other documents that share important characteristics with those documents that have been marked as interesting.

Page 29: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

STOCK MARKET DATA

The input data is just a long series of stock prices over time, without any particular record that could be considered to be an object.

Example:
Predictor Values: 10, 12, 14, 15, 10, 13, 11, 14, 15
Prediction Value: 11 (10th number)
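One way to read the stock example is that a sliding window turns the raw price series into records: nine consecutive prices act as the predictor values and the tenth is the prediction value. The sketch below assumes that reading; the function name and the `window` parameter are hypothetical.

```python
def make_windows(series, window=9):
    """Turn a price series into (predictor_values, prediction_value) records.

    Each record uses `window` consecutive prices as predictors and the
    price that immediately follows them as the prediction.
    """
    records = []
    for start in range(len(series) - window):
        predictors = series[start:start + window]
        prediction = series[start + window]
        records.append((predictors, prediction))
    return records


# The ten prices from the slide: nine predictor values plus the tenth number.
prices = [10, 12, 14, 15, 10, 13, 11, 14, 15, 11]

records = make_windows(prices)
for predictors, prediction in records:
    print(predictors, "->", prediction)
```

On a longer series the same function yields one record per window position, and those records can then be compared with a nearest neighbor search.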

Page 30: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

3. CLUSTERING

What does it mean?

Page 31: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

CLUSTERING

Clustering is a method by which like records are grouped together in order to give the end user a high-level view of what is going on in the database and the business.

Page 32: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

CLUSTERING

In the real world

Two clustering systems are the PRIZM™ system from Claritas Corporation and MicroVision™ from Equifax Corporation. These companies have grouped the population by demographic information into segments that they believe are useful for direct marketing and sales.

Page 33: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

Name                  Income         Age                  Education      Vendor
Blue Blood Estates    Wealthy        35-54                College        Claritas PRIZM™
Shot Gun and Pickups  Middle         35-64                High School    Claritas PRIZM™
Southside City        Poor           Mix                  Grade School   Claritas PRIZM™
Living Off the Land   Middle - Poor  School Age Families  Low            Equifax MicroVision™
University USA        Very Low       Young - Mix          Medium - High  Equifax MicroVision™
Sunset Years          Medium         Seniors              Medium         Equifax MicroVision™

Page 34: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

CLUSTERING VS NEAREST NEIGHBOR

Nearest Neighbor

- Used for prediction as well as consolidation.
- Space is defined by the problem to be solved. (Supervised learning technique)
- Generally only uses distance metrics to determine nearness.

Clustering

- Used mostly for consolidating data into a high-level view and general grouping of records into like behaviors.
- Space is defined as default n-dimensional space, or is defined by the user, or is a predefined space driven by past experience. (Unsupervised learning technique)
- Can use other metrics besides distance to determine nearness of two records, for example linking two points together.

Page 35: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

What are the two main types of Clustering techniques?

Page 36: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

HIERARCHICAL

NON-HIERARCHICAL

CLUSTERING

Page 37: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

HIERARCHY of CLUSTERS

Page 38: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

NON-HIERARCHICAL CLUSTERING

1. Single Pass Methods

2. Reallocation Methods

Page 39: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

Hierarchical Clustering

It is created either by starting at the top and subdividing (divisive clustering) or by starting at the bottom with as many clusters as there are records and merging (agglomerative clustering).

Has an advantage over non-hierarchical clustering in that the clusters are defined solely by the data and that the number of clusters can be increased or decreased by simply moving up and down the hierarchy.
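The bottom-up (agglomerative) variant can be sketched as follows: start with one cluster per record and repeatedly merge the two closest clusters. Measuring closeness by the smallest gap between any two members (single linkage) and clustering on a single numeric predictor are assumptions chosen to keep the sketch short; the slides do not prescribe them.

```python
def agglomerative(points, num_clusters):
    """Merge single-record clusters until only num_clusters remain.

    Cluster distance is the smallest absolute difference between any
    member of one cluster and any member of the other (single linkage).
    """
    clusters = [[p] for p in points]
    while len(clusters) > num_clusters:
        best = None  # (distance, index_i, index_j) of the closest pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(abs(a - b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        # Merge cluster j into cluster i, then drop cluster j.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


# Customer ages from the Database Information slide.
ages = [21, 27, 27, 32, 46, 47, 50, 53, 62, 68]

for cluster in agglomerative(ages, 3):
    print(sorted(cluster))
```

Stopping the merging earlier or later is what "moving up and down the hierarchy" amounts to: asking for 2 or 5 clusters reuses the same merge order.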

Page 40: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

NEXT GENERATION

Represents the most often used techniques that have been developed over the past two decades of research.

They can be used either for discovering new information within large databases or for building predictive models.

Page 41: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

NEXT GENERATION

1. DECISION TREES

2. NEURAL NETWORKS

3. RULE INDUCTION

Page 42: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

1. DECISION TREES

What are they?

Page 43: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

DECISION TREES

A predictive model that, as its name implies, can be viewed as a tree.

Specifically, each branch of the tree is a classification question, and the leaves of the tree are partitions of the dataset with their classification.

Page 44: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

DECISION TREE EXAMPLE

Page 45: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

DECISION TREE HISTORY

Similar technologies have been around for almost 20 years, and early versions of the algorithms date back to the 1960s.

Originally, these techniques were developed for statisticians to automate the process of determining which fields in their database were actually useful or correlated with the particular problem that they were trying to understand.

Page 46: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

DECISION TREE USES

EXPLORATION - looks at the predictors and values that are chosen for each split of the tree. Oftentimes these predictors provide usable insights or propose questions that need to be answered.

DATA PREPROCESSING - can be used on the first pass of data mining to create a subset of useful predictors that can then be used in neural networks, nearest neighbor and normal statistical routines.

PREDICTION - originally treated as a by-product by statisticians, who used decision trees mainly for exploratory analysis.

Page 47: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

DECISION TREE ALGORITHMS

ID3

CART

CHAID

Page 48: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

ID3

Developed in the late 1970s by J. Ross Quinlan.

One of the first decision tree algorithms. Based on previous inference systems and concept learning systems from the preceding decades.

Initially used for game-playing strategies for chess games.

Picks predictors and splitting values based on the “gain” in information that the split or splits provide.

Gain is the difference between the entropy of the original segment and the accumulated entropies of the resulting split segments.
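The gain calculation can be sketched directly from that description. This sketch assumes the "accumulated" entropy of the split segments is the size-weighted sum of their entropies, which is the usual reading; the function names are hypothetical. The data is the Prediction column of the ten-record customer table, split by eye color.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((count / total) * math.log2(count / total)
                for count in Counter(labels).values())


def gain(labels, segments):
    """Entropy of the original segment minus the weighted entropy of the split."""
    total = len(labels)
    weighted = sum(len(seg) / total * entropy(seg) for seg in segments)
    return entropy(labels) - weighted


# Prediction column of the customer table, in ID order.
predictions = ["NO", "NO", "NO", "YES", "YES", "NO", "YES", "YES", "YES", "NO"]

# The same predictions grouped by eye color (Brown, Green, Blue).
by_eyes = [
    ["NO", "NO", "NO"],                  # Brown: Amy, Betty, Carl
    ["NO", "YES"],                       # Green: Al, Bob
    ["YES", "YES", "YES", "YES", "NO"],  # Blue: Carla, Donna, Don, Edna, Ed
]

print(round(entropy(predictions), 3))
print(round(gain(predictions, by_eyes), 3))
```

A split that leaves every segment with a single class has the largest possible gain, and a split that leaves the class mix unchanged has a gain of zero.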

Page 49: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

ID3 to C4.5 ENHANCEMENTS

Predictors with missing values can still be used.

Predictors with continuous values can be used.

Pruning is introduced.

Rule derivation.

Page 50: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

CART

Stands for Classification and Regression Trees.

Data exploration and prediction algorithm developed by Leo Breiman, Jerome Friedman, Richard Olshen and Charles Stone.

Each predictor is picked based on how well it teases apart the records with different predictions.

Page 51: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

CHAID

Stands for Chi Square Automatic Interaction Detector.

Similar to CART: it builds a decision tree.

Different from CART: in the way it chooses its splits.
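The slide does not spell out how CHAID chooses its splits. As its name suggests, it relies on chi-square tests of the relationship between a predictor and the prediction. The sketch below only computes the chi-square statistic for a contingency table; the function name and the counts are hypothetical, and a full CHAID implementation (category merging, significance thresholds) is well beyond this illustration.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table of observed counts.

    table[i][j] is the number of records with predictor category i
    and prediction class j.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Count expected if predictor and prediction were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    return statistic


# Hypothetical counts: rows are two predictor categories,
# columns are the prediction classes YES and NO.
observed = [[10, 20],
            [20, 10]]

print(round(chi_square(observed), 3))
```

A larger statistic means the predictor's categories separate the prediction classes more strongly than chance alone would explain, which is what makes the predictor a candidate for a split.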

Page 52: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

2. NEURAL NETWORKS

What is it?

Page 53: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

NEURAL NETWORK

Computer programs that implement sophisticated pattern detection and machine learning algorithms to build predictive models from large historical databases.

Page 54: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

Artificial neural networks derive their name from their historical development, which started off with the premise that machines could be made to “think” if scientists found ways to mimic the structure and functioning of the human brain on the computer.

The greatest breakthroughs in neural networks in recent years have been in their application to more mundane real-world problems like customer response prediction or fraud detection.

They are considered to “learn”: they detect patterns and make better predictions in ways analogous to the ways humans do.

Page 55: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

NEURAL NETWORK USES

Clustering

Outlier Analysis

Example: Wine Distributor
One wine distributor's store stands out as making significantly lower profit. Upon further examination, the distributor was found to be delivering product but not collecting payment.

Feature Extraction

Page 56: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

Neural Networks (Components)

Node - corresponds to the neuron in the human brain.

Link - corresponds to the connections between neurons.
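A single node and its incoming links can be sketched as a weighted sum followed by a squashing function. The slides do not name an activation function; the logistic (sigmoid) function used here, the bias term, and all weights are assumptions chosen for illustration.

```python
import math


def node_output(inputs, weights, bias=0.0):
    """Output of one node: a weighted sum of its inputs squashed into (0, 1).

    Each weight plays the role of a link's strength; the sigmoid
    activation is an assumed, commonly used choice.
    """
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1.0 / (1.0 + math.exp(-total))


# Hypothetical network: two input values feed two hidden nodes,
# whose outputs feed a single output node.
inputs = [0.47, 0.65]
hidden = [
    node_output(inputs, [0.7, -0.2]),
    node_output(inputs, [-0.4, 0.9]),
]
output = node_output(hidden, [1.5, -1.1], bias=0.1)

print(output)
```

Training a network means adjusting the link weights so that the output node's value moves closer to the known prediction values in the historical data.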

Page 57: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

NEURAL NETWORK Sample

Page 58: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

NEURAL NETWORK TYPES

Back Propagation - refers to the propagation of the error backwards from the output nodes through the hidden layers and to the input layer.

Kohonen Feature Maps - developed in the 1970s; feed-forward neural networks, generally with no hidden layer.
- Used for unsupervised learning and clustering.

Radial Basis Function - represents a hybrid between nearest neighbor and neural network classification.
- Used for supervised learning.

Page 59: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

3. RULE INDUCTION

What is it?

Page 60: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

RULE INDUCTION

Is one of the major forms of data mining and the most common form of knowledge discovery in unsupervised learning systems.

It mines for a rule that is “interesting”.

Page 61: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

It is a massive undertaking where all possible patterns are systematically pulled out of the data, and then an accuracy and a significance are added to them that tell the user how strong the pattern is and how likely it is to occur again.

Rule induction systems are highly automated and are probably the best of the data mining techniques for exposing all possible predictive patterns in a database.

Page 62: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

Neural Network VS Rule Induction

NEURAL NETWORKS

Extremely proficient at saying exactly what must be done in a predictive task, with little explanation.
Example - Who do I give credit to and who do I deny credit to?

RULE INDUCTION

When used for prediction, they are like having a committee of trusted advisors, each with a slightly different opinion as to what to do but with relatively well grounded reasoning and a good explanation for why it should be done.

Page 63: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

What is a RULE?

“If this and this and this then this.”

EXAMPLES

- If paper plates then plastic forks.

- If dip then potato chips.
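A rule of the form "if this then this" can be scored against a set of transactions. In the sketch below, accuracy is how often the rule is right when it applies, and coverage (how often it applies at all) is used as a simple stand-in for the "significance" mentioned earlier; that substitution, the function name, and the shopping baskets are all assumptions made for illustration.

```python
def rule_stats(baskets, antecedent, consequent):
    """Return (accuracy, coverage) of the rule: if antecedent then consequent.

    accuracy: share of baskets containing the antecedent that also
              contain the consequent.
    coverage: share of all baskets that contain the antecedent.
    """
    antecedent = set(antecedent)
    matches = [basket for basket in baskets if antecedent <= set(basket)]
    if not matches:
        return 0.0, 0.0
    hits = sum(1 for basket in matches if consequent in basket)
    return hits / len(matches), len(matches) / len(baskets)


# Hypothetical shopping baskets.
baskets = [
    {"paper plates", "plastic forks", "dip"},
    {"paper plates", "plastic forks"},
    {"paper plates", "napkins"},
    {"dip", "potato chips"},
]

print(rule_stats(baskets, ["paper plates"], "plastic forks"))
print(rule_stats(baskets, ["dip"], "potato chips"))
```

A rule induction system would generate many candidate rules like these and report only those whose scores make them "interesting".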

Page 64: DATA MINING Patrick J. Gallagher March 21, 2006. What is Data Mining?

Presenter

Dr. Balaji Padmanabhan
Assistant Professor of Operations and Information Management
The Wharton School, University of Pennsylvania

Teaches:
- Enabling (Information) Technologies
- Data Mining / Decision Support Systems
- Introduction to the Computer as an Analysis Tool