Role of Mathematics / Statistics in DATA MINING

By Dr.S.Sridhar, Ph.D.(JNUD), RACI(Paris, NICE), RMR(USA), RZFM(Germany), RIEEEProc., RIETCom., LMISTE, LMCSI
Vice Chancellor-DSI-RAK


Post on 16-Aug-2015


Data Mining Definition

• Finding hidden information in a database
• Fit data to a model
• Similar terms

Data Mining Algorithm

• Objective: Fit data to a model
  – Descriptive
  – Predictive
• Preference: technique to choose the best model
• Search: technique to search the data
  – “Query”

Database Processing vs. Data Mining Processing

Database processing:
• Query
  – Well defined
  – SQL
• Data
  – Operational data
• Output
  – Precise
  – Subset of database

Data mining processing:
• Query
  – Poorly defined
  – No precise query language
• Data
  – Not operational data
• Output
  – Fuzzy
  – Not a subset of database

Query Examples

Database:
– Find all credit applicants with last name of KUMAR.
– Identify customers who have purchased more than INR 10,000 in the last month.
– Find all customers who have purchased milk.

Data Mining:
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)
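The contrast between the two kinds of query can be sketched in Python. The transactions, customer IDs, and the 0.6 cutoff below are hypothetical, chosen only for illustration, and the mining function is a plain co-occurrence count rather than a full association rule algorithm.

```python
from collections import Counter

# Hypothetical transactions: (customer ID, set of items bought together).
transactions = [
    ("c1", {"milk", "bread", "butter"}),
    ("c2", {"milk", "bread"}),
    ("c3", {"bread", "jam"}),
    ("c4", {"milk", "bread", "jam"}),
    ("c5", {"tea"}),
]

def customers_who_bought(item):
    """Database-style query: returns an exact, well-defined subset of the data."""
    return sorted(cust for cust, items in transactions if item in items)

def frequently_bought_with(item, min_confidence=0.6):
    """Mining-style query: items that co-occur with `item` in at least
    `min_confidence` of the baskets containing `item`."""
    baskets = [items for _, items in transactions if item in items]
    counts = Counter(other for items in baskets for other in items if other != item)
    return {other: n / len(baskets)
            for other, n in counts.items()
            if n / len(baskets) >= min_confidence}

print(customers_who_bought("milk"))
print(frequently_bought_with("milk"))
```

The first result is a subset of the stored rows. The second is a derived pattern that is not stored anywhere in the database and depends on the chosen cutoff.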

Data Mining Models and Tasks

Basic Data Mining Tasks

• Classification maps data into predefined groups or classes.
  – Supervised learning
  – Pattern recognition
  – Prediction
• Regression is used to map a data item to a real-valued prediction variable.
• Clustering groups similar data together into clusters.
  – Unsupervised learning
  – Segmentation
  – Partitioning

Basic Data Mining Tasks (cont’d)

• Summarization maps data into subsets with associated simple descriptions.
  – Characterization
  – Generalization
• Link Analysis uncovers relationships among data.
  – Affinity Analysis
  – Association Rules
  – Sequential Analysis determines sequential patterns.

Ex: Time Series Analysis

• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior

Data Mining vs. KDD

• Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
• Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.

KDD Process

• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format. Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in meaningful manner.

Data Mining Development

• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines

• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis

• Neural Networks
• Decision Tree Algorithms

• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures

• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques

KDD Issues

• Human Interaction
• Overfitting
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality

KDD Issues (cont’d)

• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application

Data Mining Metrics

• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time

Database Perspective on Data Mining

• Scalability
• Real World Data
• Updates
• Ease of Use

Visualization Techniques

• Graphical
• Geometric
• Icon-based
• Pixel-based
• Hierarchical

DB & OLTP Systems

• Schema
  – (ID, Name, Address, Salary, JobNo)
• Data Model
  – ER
  – Relational
• Transaction
• Query:

    SELECT Name
    FROM T
    WHERE Salary > 100000

DM: Only imprecise queries

Fuzzy Sets and Logic

• Fuzzy Set: Set membership function is a real-valued function with output in the range [0,1].
• f(x): Probability x is in F.
• 1-f(x): Probability x is not in F.
• Ex:
  – T = {x | x is a person and x is tall}
  – Let f(x) be the probability that x is tall
  – Here f is the membership function

DM: Prediction and classification are fuzzy.
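A minimal sketch of a membership function for the set T of tall people. The linear ramp and the 160 cm / 190 cm / 175 cm values are assumptions made for illustration; the slides do not specify a particular f.

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    """Fuzzy membership f(x) in [0, 1]: 0 at or below `low`, 1 at or above
    `high`, and a linear ramp in between (hypothetical breakpoints)."""
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)

def crisp_tall(height_cm, threshold=175.0):
    """Ordinary (crisp) set membership for comparison: either 0 or 1."""
    return 1.0 if height_cm >= threshold else 0.0

for h in (150.0, 175.0, 200.0):
    f = tall_membership(h)
    print(h, "f(x) =", f, "1 - f(x) =", 1.0 - f, "crisp =", crisp_tall(h))
```

With a crisp set, a person of 174.9 cm is simply "not tall"; with the fuzzy set, the same person is tall to a degree just under 0.5.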

Fuzzy Sets

Classification/Prediction is Fuzzy

[Figure: accept/reject decision as a function of loan amount, shown for a simple (crisp) classifier and a fuzzy classifier]

Information Retrieval

• Information Retrieval (IR): retrieving desired information from textual data.
• Library Science
• Digital Libraries
• Web Search Engines
• Traditionally keyword based
• Sample query: Find all documents about “data mining”.

DM: Similarity measures; mine text/Web data.

IR Query Result Measures and Classification

[Figure: IR classification]

Relational View of Data

ProdID  LocID       Date    Quantity  UnitPrice
123     Dallas      022900  5         25
123     Houston     020100  10        20
150     Dallas      031500  1         100
150     Dallas      031500  5         95
150     Fort Worth  021000  5         80
150     Chicago     012000  20        75
200     Seattle     030100  5         50
300     Rochester   021500  200       5
500     Bradenton   022000  15        20
500     Chicago     012000  10        25

Cube View of Data

Aggregation Hierarchies

Star Schema

Data Warehousing

• “Subject-oriented, integrated, time-variant, nonvolatile” (William Inmon)
• Operational Data: Data used in day-to-day needs of company.
• Informational Data: Supports other functions such as planning and forecasting.
• Data mining tools often access data warehouses rather than operational data.

DM: May access data in warehouse.

OLAP Operations

• Single Cell
• Multiple Cells
• Slice
• Dice
• Roll Up
• Drill Down
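Three of these operations can be sketched on the sales rows from the Relational View of Data slide. This is a small in-memory illustration, not how an OLAP engine is implemented; the function names are made up for this example.

```python
from collections import defaultdict

# Rows copied from the "Relational View of Data" slide:
# (ProdID, LocID, Date, Quantity, UnitPrice)
SALES = [
    (123, "Dallas", "022900", 5, 25),
    (123, "Houston", "020100", 10, 20),
    (150, "Dallas", "031500", 1, 100),
    (150, "Dallas", "031500", 5, 95),
    (150, "Fort Worth", "021000", 5, 80),
    (150, "Chicago", "012000", 20, 75),
    (200, "Seattle", "030100", 5, 50),
    (300, "Rochester", "021500", 200, 5),
    (500, "Bradenton", "022000", 15, 20),
    (500, "Chicago", "012000", 10, 25),
]

def slice_location(rows, loc):
    """Slice: fix the location dimension to a single value."""
    return [row for row in rows if row[1] == loc]

def dice(rows, locs, prods):
    """Dice: restrict several dimensions to subsets of their values."""
    return [row for row in rows if row[1] in locs and row[0] in prods]

def roll_up_by_product(rows):
    """Roll up: total quantity per product, aggregating away location and date."""
    totals = defaultdict(int)
    for prod, _loc, _date, qty, _price in rows:
        totals[prod] += qty
    return dict(totals)

print(slice_location(SALES, "Dallas"))
print(dice(SALES, {"Dallas", "Chicago"}, {150}))
print(roll_up_by_product(SALES))
```

Drill down is the reverse of roll up: moving from the per-product totals back to the detailed rows behind them.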

Statistics

• Simple descriptive models
• Statistical inference: generalizing a model created from a sample of the data to the entire dataset.
• Exploratory Data Analysis:
  – Data can actually drive the creation of the model
  – Opposite of traditional statistical view.
• Data mining targeted to business user

DM: Many data mining methods come from statistical techniques.

Pattern Matching (Recognition)

• Pattern Matching: finds occurrences of a predefined pattern in the data.
• Applications include speech recognition, information retrieval, time series analysis.

DM: Type of classification.

Data Mining Techniques Outline

Goal: Provide an overview of basic data mining techniques

• Statistical
  – Point Estimation
  – Models Based on Summarization
  – Bayes Theorem
  – Hypothesis Testing
  – Regression and Correlation
• Similarity Measures
• Decision Trees
• Neural Networks
  – Activation Functions
• Genetic Algorithms

Point Estimation

• Point Estimate: estimate a population parameter.
• May be made by calculating the parameter for a sample.
• May be used to predict value for missing data.
• Ex:
  – R contains 100 employees
  – 99 have salary information
  – Mean salary of these is $50,000
  – Use $50,000 as value of remaining employee’s salary. Is this a good idea?
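A minimal sketch of filling a missing value with a point estimate. The salary figures are hypothetical and much smaller than the 100-employee example; they are chosen to show one reason the question "Is this a good idea?" is worth asking.

```python
from statistics import mean, median

def impute_missing(values, estimator=mean):
    """Replace None entries with a point estimate computed from the known values."""
    known = [v for v in values if v is not None]
    estimate = estimator(known)
    return [estimate if v is None else v for v in values]

# Hypothetical salaries; None marks the employee with no salary on record.
salaries = [40_000, 45_000, 50_000, 55_000, 310_000, None]

print(impute_missing(salaries)[-1])          # estimate based on the mean
print(impute_missing(salaries, median)[-1])  # estimate based on the median
```

Here a single very large salary pulls the mean well above what most employees earn, so the mean-based estimate and the median-based estimate differ widely. Whether either is appropriate depends on what is known about the missing employee.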

Estimation Error

• Bias: Difference between expected value and actual value: $\mathrm{Bias} = E[\hat{\Theta}] - \Theta$
• Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value: $\mathrm{MSE}(\hat{\Theta}) = E[(\hat{\Theta} - \Theta)^2]$
• Why square?
• Root Mean Square Error (RMSE): $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$
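A short sketch of the three quantities, with the expected value approximated by an average over a list of estimates. The numbers are hypothetical and picked to show one answer to "Why square?": errors of opposite sign cancel in the bias but not in the MSE.

```python
import math

def bias(estimates, actual):
    """Average estimate minus the actual value."""
    return sum(estimates) / len(estimates) - actual

def mse(estimates, actual):
    """Average squared difference between each estimate and the actual value."""
    return sum((e - actual) ** 2 for e in estimates) / len(estimates)

def rmse(estimates, actual):
    """Square root of the MSE, in the same units as the estimates."""
    return math.sqrt(mse(estimates, actual))

estimates = [40, 60]   # hypothetical estimates of a parameter whose true value is 50
print(bias(estimates, 50), mse(estimates, 50), rmse(estimates, 50))
```

Both estimates miss by 10, yet the bias is zero because the errors cancel. Squaring keeps every error positive, and taking the root returns the result to the original units.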

Jackknife Estimate

• Jackknife Estimate: estimate of parameter is obtained by omitting one value from the set of observed values.
• Ex: estimate of mean for $X = \{x_1, \ldots, x_n\}$, omitting $x_i$: $\hat{\mu}_{(i)} = \frac{1}{n-1} \sum_{j \neq i} x_j$
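A minimal sketch of the jackknife for the mean: compute one leave-one-out mean per observation, then average them. The sample values are hypothetical.

```python
def jackknife_means(xs):
    """Leave-one-out means: entry i is the mean of xs with xs[i] omitted."""
    n = len(xs)
    total = sum(xs)
    return [(total - x) / (n - 1) for x in xs]

def jackknife_estimate(xs):
    """Average of the leave-one-out means."""
    loo = jackknife_means(xs)
    return sum(loo) / len(loo)

sample = [1, 4, 4]
print(jackknife_means(sample))
print(jackknife_estimate(sample))
```

For the mean, the average of the leave-one-out values equals the ordinary sample mean. The spread of the leave-one-out values is what shows how much a single observation influences the estimate.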

Maximum Likelihood Estimate (MLE)

• Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.
• The joint probability of observing the sample data is obtained by multiplying the individual probabilities. Likelihood function: $L(\Theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \Theta)$
• Maximize L.
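A small sketch of maximizing L, assuming a coin-toss (Bernoulli) model with unknown heads probability p. The model, the sample, and the grid search are assumptions for illustration; in practice L is usually maximized analytically or with a numerical optimizer.

```python
def likelihood(p, sample):
    """L(p | sample): product of the individual probabilities under a
    Bernoulli model, where 1 = heads (probability p) and 0 = tails."""
    result = 1.0
    for x in sample:
        result *= p if x == 1 else (1.0 - p)
    return result

def mle_by_grid(sample, steps=100):
    """Try p = 0, 1/steps, ..., 1 and return the value with the largest likelihood."""
    candidates = [i / steps for i in range(steps + 1)]
    return max(candidates, key=lambda p: likelihood(p, sample))

tosses = [1, 1, 1, 1, 0]   # hypothetical sample: four heads, one tail
print(mle_by_grid(tosses))
```

For this sample L(p) = p^4 (1 - p), which peaks at p = 0.8, the fraction of heads observed.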

Models Based on Summarization

• Visualization: Frequency distribution, mean, variance, median, mode, etc.
• Box Plot
• Scatter Diagram

Bayes Theorem

• Posterior Probability: $P(h_1 \mid x_i)$
• Prior Probability: $P(h_1)$
• Bayes Theorem: $P(h_1 \mid x_i) = \dfrac{P(x_i \mid h_1)\, P(h_1)}{P(x_i)}$
• Assign probabilities of hypotheses given a data value.
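A minimal sketch of assigning posterior probabilities to hypotheses for one observed data value. The two hypotheses and all probabilities are hypothetical; P(x) is computed as the sum of P(x | h) P(h) over the hypotheses, which assumes they are mutually exclusive and exhaustive.

```python
def posteriors(priors, likelihoods):
    """priors[h] = P(h), likelihoods[h] = P(x | h).
    Returns P(h | x) for every hypothesis h."""
    evidence = sum(priors[h] * likelihoods[h] for h in priors)   # P(x)
    return {h: priors[h] * likelihoods[h] / evidence for h in priors}

# Hypothetical credit example: x = "applicant missed a payment".
priors = {"good risk": 0.8, "poor risk": 0.2}
likelihoods = {"good risk": 0.1, "poor risk": 0.6}

print(posteriors(priors, likelihoods))
```

Although "poor risk" starts with the smaller prior, the observation is much more likely under it, so it ends up with the larger posterior (about 0.6 versus 0.4).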

Hypothesis Testing

• Find model to explain behavior by creating and then testing a hypothesis about the data.
• Exact opposite of usual DM approach.
• $H_0$ – Null hypothesis; hypothesis to be tested.
• $H_1$ – Alternative hypothesis

Chi Squared Statistic

$\chi^2 = \sum \dfrac{(O - E)^2}{E}$

• O – observed value
• E – expected value based on hypothesis.
• Ex:
  – O = {50, 93, 67, 78, 87}
  – E = 75
  – $\chi^2$ = 15.55 and therefore significant
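The example value can be checked with a few lines of Python. The significance threshold quoted in the note after the code is the standard table value for 4 degrees of freedom at the 5% level; the slide itself does not state which level it uses, so that choice is an assumption.

```python
def chi_squared(observed, expected):
    """Sum of (O - E)^2 / E over all observed values, with one common expected value."""
    return sum((o - expected) ** 2 / expected for o in observed)

observed = [50, 93, 67, 78, 87]
statistic = chi_squared(observed, 75)
print(round(statistic, 2))
```

The squared deviations are 625, 324, 64, 9 and 144, which sum to 1166; dividing by 75 gives about 15.55. With five observations there are 4 degrees of freedom, and the 5% critical value is 9.488, so the statistic exceeds it.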

Regression

• Predict future values based on past values
• Linear Regression assumes linear relationship exists.

  $y = c_0 + c_1 x_1 + \ldots + c_n x_n$

• Find values to best fit the data
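A minimal sketch of finding the best-fit values for the simplest case, one predictor (y = c0 + c1 x). "Best fit" is taken here to mean ordinary least squares, which is an assumption; the slide does not name the criterion. The data points are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates (c0, c1) for y = c0 + c1 * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx                 # slope
    c0 = mean_y - c1 * mean_x      # intercept
    return c0, c1

xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]                  # generated from y = 1 + 2x, no noise
c0, c1 = fit_line(xs, ys)
print(c0, c1)
print("prediction for x = 5:", c0 + c1 * 5)
```

Because the points lie exactly on a line, the fit recovers the intercept 1 and slope 2. With noisy data the same formulas give the line that minimizes the sum of squared vertical distances.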

Linear Regression

Correlation

• Examine the degree to which the values for two variables behave similarly.
• Correlation coefficient r:
  – 1 = perfect correlation
  – -1 = perfect but opposite correlation
  – 0 = no correlation
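A minimal sketch of computing r, assuming the Pearson correlation coefficient is meant (the slide does not give the formula). The three data sets are hypothetical and chosen to hit the three reference values.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

print(pearson_r([1, 2, 3], [2, 4, 6]))          # moves together
print(pearson_r([1, 2, 3], [6, 4, 2]))          # moves in opposite directions
print(pearson_r([1, 2, 3, 4], [1, -1, -1, 1]))  # no linear relationship
```

Note that r measures only linear association: in the third data set y still depends on x (it is symmetric around the middle), yet r is 0.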

Distance Measures

• Measure dissimilarity between objects
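Two commonly used distance measures, sketched for objects represented as numeric tuples. The choice of Euclidean and Manhattan distance is an assumption; the slide does not say which measures it covers.

```python
import math

def euclidean(a, b):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0, 0), (3, 4)
print(euclidean(p, q), manhattan(p, q))
```

Both return 0 for identical objects and grow as the objects differ more, which is what makes them measures of dissimilarity rather than similarity.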

Twenty Questions Game

Decision Tree Example

<<<<……Thank U……>>>>

For more details visit my site at http://drsridhar.tripod.com

For your queries, email to me: [email protected]

Reference Book on "Datamining" (ISBN 81-7758-785-4)
