Role of Mathematics / Statistics in DATA MINING
By
Dr. S. Sridhar, Ph.D. (JNUD), RACI (Paris, NICE), RMR (USA), RZFM (Germany)
RIEEEProc., RIETCom., LMISTE, LMCSI
Vice Chancellor-DSI-RAK
Data Mining Definition
• Finding hidden information in a database
• Fit data to a model
• Similar terms
Data Mining Algorithm
• Objective: Fit Data to a Model
– Descriptive
– Predictive
• Preference – Technique to choose the best model
• Search – Technique to search the data
– "Query"
Database Processing vs. Data Mining Processing
• Database processing
– Query: well defined; SQL
– Data: operational data
– Output: precise; subset of database
• Data mining processing
– Query: poorly defined; no precise query language
– Data: not operational data
– Output: fuzzy; not a subset of database
Query Examples
• Database
– Find all credit applicants with last name of KUMAR.
– Identify customers who have purchased more than INR10,000 in the last month.
– Find all customers who have purchased milk.
• Data Mining
– Find all credit applicants who are poor credit risks. (classification)
– Identify customers with similar buying habits. (clustering)
– Find all items which are frequently purchased with milk. (association rules)
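The difference between the two kinds of query can be sketched in a few lines of Python. This is only an illustration: the `baskets` data, the `frequent_with` helper, and the 0.6 threshold are assumptions made for the example and do not come from the slides.

```python
from collections import Counter

# Hypothetical purchase records: customer -> items bought together.
baskets = {
    "asha": {"milk", "bread", "eggs"},
    "ravi": {"milk", "bread"},
    "meena": {"tea", "sugar"},
    "john": {"milk", "bread", "tea"},
}

# Database-style query: well defined, the answer is an exact subset of the data.
milk_buyers = sorted(c for c, items in baskets.items() if "milk" in items)

def frequent_with(item, baskets, min_fraction=0.6):
    """Data-mining-style query: items that co-occur with `item` in at least
    `min_fraction` of the baskets containing it (threshold is an assumption)."""
    with_item = [b for b in baskets.values() if item in b]
    counts = Counter(other for b in with_item for other in b if other != item)
    return {o: n / len(with_item) for o, n in counts.items()
            if n / len(with_item) >= min_fraction}

print(milk_buyers)
print(frequent_with("milk", baskets))
```

The first result is a list of rows already in the data. The second depends on what "frequently" is taken to mean, which is why the answer is not simply a subset of the database.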
Basic Data Mining Tasks
• Classification maps data into predefined groups or classes
– Supervised learning
– Pattern recognition
– Prediction
• Regression is used to map a data item to a real-valued prediction variable.
• Clustering groups similar data together into clusters.
– Unsupervised learning
– Segmentation
– Partitioning
Basic Data Mining Tasks (cont'd)
• Summarization maps data into subsets with associated simple descriptions.
– Characterization
– Generalization
• Link Analysis uncovers relationships among data.
– Affinity Analysis
– Association Rules
– Sequential Analysis determines sequential patterns.
Ex: Time Series Analysis
• Example: Stock Market
• Predict future values
• Determine similar patterns over time
• Classify behavior
Data Mining vs. KDD
• Knowledge Discovery in Databases (KDD): process of finding useful information and patterns in data.
• Data Mining: Use of algorithms to extract the information and patterns derived by the KDD process.
KDD Process
• Selection: Obtain data from various sources.
• Preprocessing: Cleanse data.
• Transformation: Convert to common format. Transform to new format.
• Data Mining: Obtain desired results.
• Interpretation/Evaluation: Present results to user in meaningful manner.
Data Mining Development
• Similarity Measures
• Hierarchical Clustering
• IR Systems
• Imprecise Queries
• Textual Data
• Web Search Engines

• Bayes Theorem
• Regression Analysis
• EM Algorithm
• K-Means Clustering
• Time Series Analysis

• Neural Networks
• Decision Tree Algorithms

• Algorithm Design Techniques
• Algorithm Analysis
• Data Structures

• Relational Data Model
• SQL
• Association Rule Algorithms
• Data Warehousing
• Scalability Techniques
KDD Issues
• Human Interaction
• Overfitting
• Interpretation
• Visualization
• Large Datasets
• High Dimensionality
KDD Issues (cont'd)
• Multimedia Data
• Missing Data
• Irrelevant Data
• Noisy Data
• Changing Data
• Integration
• Application
Data Mining Metrics
• Usefulness
• Return on Investment (ROI)
• Accuracy
• Space/Time
Database Perspective on Data Mining
• Scalability
• Real World Data
• Updates
• Ease of Use
Visualization Techniques
• Graphical
• Geometric
• Icon-based
• Pixel-based
• Hierarchical
DB & OLTP Systems
• Schema
– (ID, Name, Address, Salary, JobNo)
• Data Model
– ER
– Relational
• Transaction
• Query:
    SELECT Name
    FROM T
    WHERE Salary > 100000
• DM: Only imprecise queries
Fuzzy Sets and Logic
• Fuzzy Set: Set membership function is a real-valued function with output in the range [0,1].
• f(x): Probability x is in F.
• 1-f(x): Probability x is not in F.
• EX:
– T = {x | x is a person and x is tall}
– Let f(x) be the probability that x is tall
– Here f is the membership function
• DM: Prediction and classification are fuzzy.
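A membership function for the "tall" example might look like the sketch below. The linear ramp and the 160 cm and 190 cm breakpoints are assumptions chosen only to make the idea concrete; the slides do not specify how f is defined.

```python
def tall_membership(height_cm, low=160.0, high=190.0):
    """Value in [0, 1] expressing how strongly a person of this height
    belongs to T (tall). Breakpoints and linear shape are illustrative."""
    if height_cm <= low:
        return 0.0
    if height_cm >= high:
        return 1.0
    return (height_cm - low) / (high - low)

for h in (150, 175, 195):
    f = tall_membership(h)
    print(h, f, 1 - f)  # f(x) and 1 - f(x) for each height
```

A crisp set would return only 0 or 1. Here a height of 175 cm is partly in T and partly not, which is the sense in which classification becomes fuzzy.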
Classification/Prediction is Fuzzy
[Figure: two panels, Simple and Fuzzy; axis: Loan Amnt; regions: Accept, Reject]
Information Retrieval
• Information Retrieval (IR): retrieving desired information from textual data.
• Library Science
• Digital Libraries
• Web Search Engines
• Traditionally keyword based
• Sample query: Find all documents about "data mining".
• DM: Similarity measures; Mine text/Web data.
IR Query Result Measures and Classification
[Figure: two panels, IR and Classification]
Relational View of Data
ProdID  LocID       Date    Quantity  UnitPrice
123     Dallas      022900  5         25
123     Houston     020100  10        20
150     Dallas      031500  1         100
150     Dallas      031500  5         95
150     Fort Worth  021000  5         80
150     Chicago     012000  20        75
200     Seattle     030100  5         50
300     Rochester   021500  200       5
500     Bradenton   022000  15        20
500     Chicago     012000  10        25
Data Warehousing
• "Subject-oriented, integrated, time-variant, nonvolatile" (William Inmon)
• Operational Data: Data used in day-to-day needs of company.
• Informational Data: Supports other functions such as planning and forecasting.
• Data mining tools often access data warehouses rather than operational data.
• DM: May access data in warehouse.
Statistics
• Simple descriptive models
• Statistical inference: generalizing a model created from a sample of the data to the entire dataset.
• Exploratory Data Analysis:
– Data can actually drive the creation of the model
– Opposite of traditional statistical view.
• Data mining targeted to business user
• DM: Many data mining methods come from statistical techniques.
Pattern Matching (Recognition)
• Pattern Matching: finds occurrences of a predefined pattern in the data.
• Applications include speech recognition, information retrieval, time series analysis.
• DM: Type of classification.
Data Mining Techniques Outline
• Statistical
– Point Estimation
– Models Based on Summarization
– Bayes Theorem
– Hypothesis Testing
– Regression and Correlation
• Similarity Measures
• Decision Trees
• Neural Networks
– Activation Functions
• Genetic Algorithms
Goal: Provide an overview of basic data mining techniques
Point Estimation
• Point Estimate: estimate a population parameter.
• May be made by calculating the parameter for a sample.
• May be used to predict value for missing data.
• Ex:
– R contains 100 employees
– 99 have salary information
– Mean salary of these is $50,000
– Use $50,000 as value of remaining employee's salary.
Is this a good idea?
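The salary example can be sketched as mean imputation. The five-element `salaries` list is a hypothetical stand-in for the 100-employee relation R, with `None` marking the missing value.

```python
from statistics import mean

# Hypothetical stand-in for R: None marks the employee with no salary on file.
salaries = [48000, 52000, 50000, None, 50000]

known = [s for s in salaries if s is not None]
estimate = mean(known)  # point estimate of the mean, computed from the sample

# Use the estimate as the value of the missing salary.
filled = [estimate if s is None else s for s in salaries]
print(estimate, filled)
```

Whether this is a good idea depends on the data: a single mean ignores anything else known about the employee, and filling many gaps this way understates the spread of the column.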
Estimation Error
• Bias: Difference between expected value and actual value.
• Mean Squared Error (MSE): expected value of the squared difference between the estimate and the actual value: $\mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta} - \theta)^2]$
• Why square?
• Root Mean Square Error (RMSE)
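These error measures can be computed empirically when the actual value is known. The sketch below assumes a list of repeated estimates of one parameter; the `error_summary` name and the numbers are made up for illustration.

```python
from math import sqrt
from statistics import mean

def error_summary(estimates, actual):
    """Empirical bias, MSE and RMSE of repeated estimates of one known value."""
    bias = mean(estimates) - actual
    mse = mean([(e - actual) ** 2 for e in estimates])
    return bias, mse, sqrt(mse)

# Hypothetical estimates of a parameter whose actual value is 10.
bias, mse, rmse = error_summary([9.0, 11.0, 12.0, 8.0], actual=10.0)
print(bias, mse, rmse)
```

In this sample the errors are +1, -1, +2 and -2, so they cancel and the bias is zero even though every estimate is off. Squaring stops positive and negative errors from cancelling, and taking the root returns the result to the units of the estimate.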
Jackknife Estimate
• Jackknife Estimate: estimate of parameter is obtained by omitting one value from the set of observed values.
• Ex: estimate of mean for $X = \{x_1, \ldots, x_n\}$
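For the mean, omitting each value in turn gives n leave-one-out estimates. A minimal sketch, with hypothetical observations and an assumed helper name `jackknife_means`:

```python
def jackknife_means(xs):
    """Leave-one-out estimates of the mean: one estimate per omitted value."""
    n, total = len(xs), sum(xs)
    return [(total - x) / (n - 1) for x in xs]

xs = [1.0, 2.0, 3.0, 4.0, 10.0]  # hypothetical observations
loo = jackknife_means(xs)
print(loo)                  # each entry omits one x_i
print(sum(loo) / len(loo))  # average of the leave-one-out estimates
```

The estimate that omits 10.0 stands apart from the others, which shows how the technique exposes the influence of a single observation.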
Maximum Likelihood Estimate (MLE)
• Obtain parameter estimates that maximize the probability that the sample data occurs for the specific model.
• The joint probability of observing the sample data is obtained by multiplying the individual probabilities. Likelihood function: $L(\Theta \mid x_1, \ldots, x_n) = \prod_{i=1}^{n} f(x_i \mid \Theta)$
• Maximize L.
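As an illustration, consider estimating the probability p of heads from a handful of coin tosses. The coin model, the sample, and the brute-force grid search are all assumptions made for this sketch; in practice L is usually maximized analytically or with an optimizer.

```python
from math import prod

def likelihood(p, tosses):
    """L(p): product of the individual probabilities of independent tosses."""
    return prod(p if t == 1 else 1 - p for t in tosses)

tosses = [1, 1, 1, 1, 0]  # hypothetical sample: 1 = heads, 0 = tails

# Maximize L by trying a grid of candidate values for p.
grid = [i / 100 for i in range(101)]
p_hat = max(grid, key=lambda p: likelihood(p, tosses))
print(p_hat, sum(tosses) / len(tosses))
```

For this model the maximizing value coincides with the observed fraction of heads, so the two printed numbers agree.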
Models Based on Summarization
• Visualization: Frequency distribution, mean, variance, median, mode, etc.
• Box Plot: [Figure: box plot]
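The summary values named above can be computed with Python's standard `statistics` module. The sample is hypothetical, and the quartiles use the module's default method, which is only one of several conventions.

```python
import statistics as st

data = [2, 4, 4, 5, 7, 9, 11]  # hypothetical sample

summary = {
    "mean": st.mean(data),
    "variance": st.variance(data),  # sample variance (n - 1 denominator)
    "median": st.median(data),
    "mode": st.mode(data),
}

# A box plot is drawn from the five-number summary.
q1, q2, q3 = st.quantiles(data, n=4)  # quartiles, default "exclusive" method
five_number = (min(data), q1, q2, q3, max(data))
print(summary)
print(five_number)
```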
Bayes Theorem
• Posterior Probability: $P(h_1 \mid x_i)$
• Prior Probability: $P(h_1)$
• Bayes Theorem: $P(h_1 \mid x_i) = \dfrac{P(x_i \mid h_1)\, P(h_1)}{P(x_i)}$
• Assign probabilities of hypotheses given a data value.
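A small sketch of assigning posterior probabilities to a set of hypotheses given one data value. The two hypotheses and all of the numbers are invented for illustration, and the hypotheses are assumed to be mutually exclusive and exhaustive so that the evidence term can be computed by summation.

```python
def posteriors(priors, likelihoods):
    """P(h | x) for every hypothesis h, given P(h) and P(x | h)."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    evidence = sum(joint.values())  # P(x), by the law of total probability
    return {h: j / evidence for h, j in joint.items()}

# Hypothetical numbers: two hypotheses about a credit applicant.
priors = {"good_risk": 0.75, "poor_risk": 0.25}
likelihoods = {"good_risk": 0.2, "poor_risk": 0.6}  # P(x | h) for one observed x
post = posteriors(priors, likelihoods)
print(post)
```

Here the observation is three times as likely under the less common hypothesis, which offsets its smaller prior, so the two posteriors come out about equal.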
Hypothesis Testing
• Find model to explain behavior by creating and then testing a hypothesis about the data.
• Exact opposite of usual DM approach.
• $H_0$ – Null hypothesis; hypothesis to be tested.
• $H_1$ – Alternative hypothesis
Chi Squared Statistic
• $\chi^2 = \sum \dfrac{(O - E)^2}{E}$
• O – observed value
• E – Expected value based on hypothesis.
• Ex:
– O = {50, 93, 67, 78, 87}
– E = 75
– $\chi^2 = 15.55$ and therefore significant
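The example figure can be reproduced by summing (O - E)^2 / E over the observed values. The degrees of freedom and the significance level in the final comparison are assumptions, since the slide does not state them.

```python
def chi_squared(observed, expected):
    """Sum of (O - E)^2 / E over all observed values, with one common E."""
    return sum((o - expected) ** 2 / expected for o in observed)

observed = [50, 93, 67, 78, 87]  # O from the example
chi2 = chi_squared(observed, expected=75)
print(round(chi2, 2))

# Assumption: 4 degrees of freedom (5 values - 1) and a 0.05 significance
# level, for which the tabulated critical value is about 9.488.
print(chi2 > 9.488)
```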
Regression
• Predict future values based on past values
• Linear Regression assumes linear relationship exists.
$y = c_0 + c_1 x_1 + \cdots + c_n x_n$
• Find values of $c_0, \ldots, c_n$ that best fit the data
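For a single predictor, the best-fitting coefficients in the least-squares sense have a closed form. The sketch below assumes that criterion (the slide does not name one) and uses made-up data.

```python
def fit_line(xs, ys):
    """Least-squares estimates of c0 and c1 in y = c0 + c1 * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    c1 = sxy / sxx
    c0 = mean_y - c1 * mean_x
    return c0, c1

# Hypothetical past values that lie exactly on y = 1 + 2x.
xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]
c0, c1 = fit_line(xs, ys)
print(c0, c1)
print(c0 + c1 * 5)  # predicted value for x = 5
```

With more than one predictor the same idea applies, but the coefficients are normally found with a linear algebra routine rather than by hand.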
Correlation
• Examine the degree to which the values for two variables behave similarly.
• Correlation coefficient r:
– 1 = perfect correlation
– -1 = perfect but opposite correlation
– 0 = no linear correlation
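The three reference values of r can be illustrated with the Pearson correlation coefficient, which is assumed here to be the coefficient intended. The data sets are hypothetical and chosen to hit each case.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

xs = [1, 2, 3, 4, 5]
print(pearson_r(xs, [2, 4, 6, 8, 10]))  # y rises with x
print(pearson_r(xs, [10, 8, 6, 4, 2]))  # y falls as x rises
print(pearson_r(xs, [2, 1, 0, 1, 2]))   # V-shaped: related, but not linearly
```

The last case is worth noting: y is fully determined by x, yet r is approximately zero, because r measures only the linear part of a relationship.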
Distance Measures
• Measure dissimilarity between objects
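Two commonly used distance measures are Euclidean and Manhattan distance. They are shown here as illustrative choices, since the slide text does not name specific measures, and the points are hypothetical.

```python
from math import sqrt

def euclidean(a, b):
    """Straight-line distance between two equal-length numeric tuples."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

p, q = (0, 0), (3, 4)
print(euclidean(p, q), manhattan(p, q))
```

The larger the value, the more dissimilar the two objects; identical objects are at distance zero under both measures.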
<<<<……Thank U……>>>>
For more details visit my site at http://drsridhar.tripod.com
For your queries, email to me : [email protected]
Reference Book on "Datamining" (ISBN 81-7758-785-4)