data mining chun-hung chou [email protected]
What is Data Mining?
• Searching for knowledge (interesting patterns) in your data
• A process that uses a variety of data analysis tools to discover patterns and relationships in data.
• Uses tools from Computer Science and Artificial Intelligence as well as Statistics.
Why do we need data mining?
– Large number of records (cases) (10^8-10^12 bytes)
– High dimensional data (variables) (10-10^4 attributes)
– Only a small portion, typically 5% to 10%, of the collected data is ever analyzed.
– Data that may never be explored continues to be collected out of fear that something that may prove important in the future may be missing.
– Magnitude of data precludes most traditional analysis ANOVA/PC/
Goals of Data Mining
• Prediction
  – Use some variables or fields in the data set to predict unknown or future values of other variables of interest
  – Produce a model, expressed as executable code, which can be used to perform classification, prediction, estimation, or other similar tasks
• Description
  – Find patterns describing the data that can be interpreted by humans
  – Gain an understanding of the analyzed system by uncovering patterns and relationships in large data sets
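As a small illustration of the prediction goal, the sketch below fits a straight line by ordinary least squares and uses it to predict an unseen value. Least squares is only one possible model; the function names `fit_line` and `predict` and the sample numbers are assumptions made for this example, not part of the slides.

```python
def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(model, x):
    """Apply a fitted (slope, intercept) model to a new value of x."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical data: one known variable (xs) and one variable of interest (ys).
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 9]
model = fit_line(xs, ys)
print(model)              # (2.0, 1.0)
print(predict(model, 5))  # 11.0
```

The fitted model is itself a piece of executable code: once estimated, it can be applied to records that were not in the original data.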
Procedure of Data Mining
1) State the problem
2) Collect the data
3) Perform preprocessing
4) Estimate the model (mine the data)
5) Interpret the model & draw the conclusions
State the problem
– Domain-specific knowledge and experience are necessary in order to come up with a meaningful problem statement
– Close interaction between the data mining expert and the application expert is required
– This cooperation does not stop in the initial phase; it continues during the entire data mining process
Collect the data
– Designed experiment: the data-generation process is under the control of an expert
– Observational approach: data are generated randomly, outside the expert's control
Preprocessing the data
• Outlier detection
a) Detect and eventually remove outliers as a part of the preprocessing phase
b) Develop robust modeling methods that are insensitive to outliers
• Scaling, encoding, and selecting features
a) Variables with different scales
b) Dimensionality reduction
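The scaling and encoding steps can be sketched as follows. Min-max scaling and one-hot encoding are chosen here as common examples; the slides do not name a specific technique, and the function names and sample columns are hypothetical.

```python
def min_max_scale(values):
    """Rescale a list of numbers linearly onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column has no spread; map every value to 0.0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def one_hot(labels):
    """Encode categorical labels as 0/1 indicator vectors (sorted category order)."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]


# Hypothetical columns measured on very different scales.
income = [20000, 50000, 80000]
age = [20, 35, 50]
print(min_max_scale(income))            # [0.0, 0.5, 1.0]
print(min_max_scale(age))               # [0.0, 0.5, 1.0]
print(one_hot(["red", "blue", "red"]))  # [[0, 1], [1, 0], [0, 1]]
```

After scaling, both columns lie in the same range, so neither dominates a distance-based method simply because of its units.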
Estimate the model
• Selection and implementation of the appropriate data mining technique
Interpret the model & draw conclusions
• Decision making
• Validate the result
Potential Applications
– Fraud Detection
– Manufacturing Processes
– Targeting Markets
– Scientific Data Analysis
– Risk Management
– Web Intelligence
– Bioinformatics
– …
Data Mining Myths
• Data mining tools need no guidance.
• Data mining models explain behavior.
• Data mining requires no data analysis skill.
• Data mining eliminates the need to understand your business and your data.
• Data mining tools are “different” from statistics.
Data Mining Functionalities
• Concept/Class Description
• Association Analysis
• Classification Analysis
• Cluster Analysis
• Outlier Analysis
• Evolution Analysis
Concept Description
Generate descriptions for characterization and comparison of data
• Characterization: summarizes and describes a collection of data, e.g. mean, distribution, percentile, ...
• Comparison: summarizes and distinguishes one collection of data from other collection(s) of data
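A minimal sketch of both ideas, using Python's standard `statistics` module: `characterize` summarizes one collection, and `compare` contrasts two collections by their means. The function names and the two sample groups are assumptions for illustration, and comparing means is only one of many ways to distinguish collections.

```python
import statistics


def characterize(values):
    """Summarize one collection of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        # Three cut points that split the data into four equal-sized groups.
        "quartiles": statistics.quantiles(values, n=4),
        "min": min(values),
        "max": max(values),
    }


def compare(a, b):
    """Distinguish two collections by the difference of their means."""
    return statistics.mean(a) - statistics.mean(b)


# Hypothetical measurements for two groups.
group_a = [10, 12, 14, 16, 18]
group_b = [20, 22, 24, 26, 28]
print(characterize(group_a))
print(compare(group_a, group_b))  # -10
```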
Association Analysis
Goal: find interesting relationships among items in a given data set
Example: Market Basket Analysis, an example of rule-based machine learning
• Customer Analysis
– Market Basket Analysis uses the information about what a customer purchases to give us insight into who they are and why they make certain purchases
• Product Analysis
– Market Basket Analysis gives us insight into the merchandise by telling us which products tend to be purchased together and which are most amenable to purchase
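"Purchased together" is usually quantified with two measures that the slides do not name: support (the fraction of baskets containing an itemset) and confidence (how often the consequent appears in baskets that contain the antecedent). The sketch below computes both; the baskets and function names are made up for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Estimate P(consequent | antecedent) for the rule antecedent -> consequent."""
    ante = support(transactions, antecedent)
    if ante == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / ante


# Hypothetical market baskets.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
print(support(baskets, {"bread", "milk"}))       # 0.5
print(confidence(baskets, {"bread"}, {"milk"}))  # 2 of the 3 bread baskets also hold milk
```

A rule such as bread -> milk is typically reported only when both measures exceed thresholds chosen by the analyst.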
Classification Analysis
Goal: build a model to describe a predetermined set of data classes or concepts and use the model for prediction
Method:
– Decision tree
– Bayesian network
– Bayesian belief network
– Neural network
– k-nearest neighbor
– Case-based reasoning
– Genetic algorithm
– Rough sets
– Fuzzy logic
– SVM/SOM
– …
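To make one of the listed methods concrete, here is a minimal k-nearest-neighbor sketch: a query point receives the majority label among its k closest training points. Euclidean distance, k = 3, and the labeled sample points are assumptions chosen for the example.

```python
import math
from collections import Counter


def knn_predict(train, query, k=3):
    """Classify query by majority vote among its k nearest training points.

    train is a list of (features, label) pairs; distance is Euclidean.
    """
    neighbors = sorted(train, key=lambda pair: math.dist(pair[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical labeled data with two predetermined classes.
train = [
    ((1.0, 1.0), "low"), ((1.5, 2.0), "low"), ((2.0, 1.0), "low"),
    ((8.0, 8.0), "high"), ((9.0, 8.5), "high"), ((8.5, 9.0), "high"),
]
print(knn_predict(train, (1.2, 1.5)))  # low
print(knn_predict(train, (8.7, 8.7)))  # high
```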
Cluster Analysis
Goal: grouping a set of physical or abstract objects into classes of similar objects
• Method:
Partitioning methods: k-means
Hierarchical methods: top-down, bottom-up
Density-based methods: arbitrary shapes
Grid-based methods: cells
Model-based methods: best fit of given model
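The partitioning approach can be sketched with a basic k-means loop: assign each point to its nearest centroid, then move each centroid to the mean of its cluster, and repeat until nothing changes. Initial centroids are passed in explicitly to keep the example deterministic; the points and starting centroids are hypothetical.

```python
import math


def k_means(points, centroids, max_iterations=100):
    """Partition points into len(centroids) clusters with the standard k-means loop."""
    centroids = list(centroids)
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        updated = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                updated.append(tuple(sum(axis) / len(cluster)
                                     for axis in zip(*cluster)))
            else:
                updated.append(old)  # an empty cluster keeps its old centroid
        if updated == centroids:
            break  # centroids stopped moving, so assignments are stable
        centroids = updated
    return centroids, clusters


# Hypothetical 2-D points forming two visible groups.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centroids, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(clusters[0])  # [(1, 1), (1, 2), (2, 1)]
print(clusters[1])  # [(8, 8), (9, 8), (8, 9)]
```

In practice the result depends on the initial centroids, so k-means is often run several times with different starting points.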
Outlier Analysis
Outlier: data that can be considered inconsistent with the rest of a given data set
Goal: find an efficient method to mine the outliers
Method:
- Statistical-Based Outlier Detection
- Distance-Based Outlier Detection
- Deviation-Based Outlier Detection
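As a rough illustration of the distance-based approach, the sketch below flags a point as an outlier when fewer than `min_neighbors` other points lie within `radius` of it. This is one common formulation, not necessarily the one the slides intend; the parameter names, thresholds, and sample points are assumptions.

```python
import math


def distance_based_outliers(points, radius, min_neighbors):
    """Return the points that have fewer than min_neighbors others within radius."""
    outliers = []
    for i, p in enumerate(points):
        neighbors = sum(
            1 for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbors < min_neighbors:
            outliers.append(p)
    return outliers


# Hypothetical data: four nearby points and one far away.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (10, 10)]
print(distance_based_outliers(points, radius=2.0, min_neighbors=2))  # [(10, 10)]
```

This brute-force version compares every pair of points, so its cost grows quadratically with the data size; efficient methods aim to avoid that.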
Evolution Analysis
• Goal:
Describe and model regularities or trends for objects whose behavior changes over time
• Method:
Statistical Method
Trend Analysis
Similarity Search in Time-Series Analysis
Sequential Pattern Mining
Periodicity Analysis
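Trend analysis can be illustrated with a simple moving average, which smooths short-term fluctuation so that the longer-term direction becomes visible. The moving average is just one basic trend technique, chosen here for brevity; the function name and the sales series are hypothetical.

```python
def moving_average(series, window):
    """Return the mean of each consecutive run of `window` values."""
    if window <= 0 or window > len(series):
        raise ValueError("window must be between 1 and len(series)")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


# Hypothetical weekly sales: noisy from week to week, but rising overall.
sales = [10, 12, 11, 15, 14, 18, 17]
trend = moving_average(sales, 3)
print(trend)
# Each smoothed value is larger than the previous one: an upward trend.
print(all(a < b for a, b in zip(trend, trend[1:])))  # True
```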