DATA MINING
Data Mining: extracting or “mining” knowledge from large amounts of data
Data mining is the process of autonomously retrieving useful information or knowledge from large data stores or sets.
Data mining is a technique for searching large-scale databases for patterns, used mainly to find previously unknown correlations between variables.
Data Mining Motivation
Changes in the business environment:
Customers are becoming more demanding
Markets are saturated
Databases today are huge:
More than 1,000,000 entities/records/rows
From 10 to 10,000 fields/attributes/variables
Gigabytes and terabytes
Databases are growing at an unprecedented rate
Decisions must be made rapidly
Decisions must be made with maximum knowledge
Data Mining: Confluence of Multiple Disciplines
Data mining draws on database technology, statistics, machine learning, A.I., algorithms, visualization, and other disciplines.
Visualization
The visual interpretation of complex relationships in multidimensional data. Graphics tools are used to illustrate data relationships.
Statistics
In data mining, statistics is used for classifying and grouping things.
Machine Learning
The ability of a machine to improve its performance based on previous results.
Artificial Intelligence
The branch of computer science that deals with writing computer programs that can solve problems creatively.
Algorithm
A precise rule (or set of rules) specifying how to solve some problem.
Why Not Traditional Data Analysis?
Tremendous amount of data
High complexity of data
KDD (Knowledge Discovery in Databases) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data.
Data mining is a step in the KDD process consisting of particular data mining algorithms.
Knowledge Discovery in Databases
Data Mining (cont.)
Data mining is a step of the Knowledge Discovery in Databases (KDD) process:
Data Warehousing
Data Selection
Data Preprocessing
Data Transformation
Data Mining
Interpretation/Evaluation
Data mining is sometimes referred to as KDD; DM and KDD tend to be used as synonyms.
Steps of a KDD Process
Learning the application domain: relevant prior knowledge and goals of the application
Data selection: creating a target data set
Data cleaning and preprocessing
Data reduction and transformation: find useful features, dimensionality/variable reduction, invariant representation
Choosing the functions of data mining: summarization, classification, regression, association, clustering
Choosing the mining algorithm(s)
Data mining: search for patterns of interest
Pattern evaluation and knowledge presentation: visualization, transformation, removing redundant patterns, etc.
Use of discovered knowledge
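The steps above can be sketched as a small pipeline of functions. This is only an illustration under stated assumptions: the records, the field names (`region`, `amount`), the choice of "average per group" as the mining function, and the 0.5 threshold are all hypothetical and do not come from the slides.

```python
from collections import defaultdict

# Hypothetical raw records; None marks a missing value.
RAW = [
    {"id": 1, "region": "north", "amount": 120.0, "note": "x"},
    {"id": 2, "region": "north", "amount": None, "note": "y"},
    {"id": 3, "region": "south", "amount": 40.0, "note": "z"},
    {"id": 4, "region": "south", "amount": 60.0, "note": "w"},
    {"id": 5, "region": "north", "amount": 80.0, "note": "v"},
]

def select(records, fields):
    """Data selection: keep only the task-relevant fields."""
    return [{f: r[f] for f in fields} for r in records]

def clean(records):
    """Data cleaning: drop records that contain missing values."""
    return [r for r in records if all(v is not None for v in r.values())]

def transform(records, field):
    """Data transformation: rescale a numeric field to the range [0, 1]."""
    values = [r[field] for r in records]
    lo, hi = min(values), max(values)
    span = (hi - lo) or 1.0  # avoid division by zero when all values are equal
    return [{**r, field: (r[field] - lo) / span} for r in records]

def mine(records, group_field, value_field):
    """Data mining (summarization): average value per group."""
    groups = defaultdict(list)
    for r in records:
        groups[r[group_field]].append(r[value_field])
    return {g: sum(v) / len(v) for g, v in groups.items()}

def evaluate(patterns, threshold):
    """Pattern evaluation: keep only groups above an interest threshold."""
    return {g: v for g, v in patterns.items() if v > threshold}

selected = select(RAW, ["region", "amount"])
cleaned = clean(selected)
scaled = transform(cleaned, "amount")
patterns = mine(scaled, "region", "amount")
interesting = evaluate(patterns, threshold=0.5)
print(interesting)
```

Each function stands in for one KDD step, so a real project would swap in its own selection, cleaning, and mining logic while keeping the same order.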
DATA MINING EVALUATION
(Figure: KDD process diagram with the stages Databases, Data Integration, Data Warehouse, Selection, Task-relevant Data, Data Mining, and Pattern Evaluation.)
Data Mining: On What Kind of Data?
Relational databases
Data warehouses
Transactional databases
Advanced DB and information repositories:
Object-oriented and object-relational databases
Spatial databases
Text databases and multimedia databases
WWW
Data Mining Applications:
Banking: loan/credit card approval; predict good customers based on old customers
Targeted marketing: identify likely responders to promotions
Fraud detection: telecommunications, financial transactions; from an online stream of events, identify fraudulent events
Data Mining Applications (cont.):
Medicine: disease outcome, effectiveness of treatments; analyze patient disease history to find relationships between diseases
Molecular/Pharmaceutical: identify new drugs
Scientific data analysis: identify new galaxies by searching for sub clusters
Web site/store design and promotion: find affinity of visitor to pages and
Financial Industry, Banks, Businesses, E-commerce
Stock and investment analysis
Identify loyal customers vs. risky customers
Predict customer spending
Risk management
Sales forecasting
Data Mining in CRM: Customer Life Cycle
Customer life cycle: the stages in the relationship between a customer and a business
Key stages in the customer life cycle:
Prospects: people who are not yet customers but are in the target market
Responders: prospects who show an interest in a product or service
Active customers: people who are currently using the product or service
Former customers: may be “bad” customers who did not pay their bills or who incurred high costs
Data Mining in CRM
DM helps to:
Determine the behavior surrounding a particular lifecycle event
Find other people in similar life stages and determine which customers are following similar behavior patterns
Data Mining in CRM (cont.)
(Figure: diagram linking the data warehouse, data mining, campaign management, customer profile, and customer life cycle information.)
Data Mining Techniques
(Figure: taxonomy of data mining techniques, divided into Descriptive and Predictive, covering Clustering, Association, Classification, Regression, Sequential Analysis, Decision Tree, Rule Induction, Neural Networks, and Nearest Neighbor Classification.)
Predictive DM
Predictive data mining produces a model of the system described by the given data set.
Models are built in order to estimate unknown values of interest.
Example:
Given a customer’s characteristics, a model predicts how much the customer will spend on the next catalog order.
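As a minimal sketch of such a predictive model, the code below fits a straight line by ordinary least squares. The slides do not name a specific model, so the choice of simple linear regression, the single feature (number of past orders), and all data values are assumptions made purely for illustration.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical history: past orders per customer and spend on the next order.
past_orders = [1, 2, 3, 4]
next_spend = [12.0, 22.0, 32.0, 42.0]

slope, intercept = fit_line(past_orders, next_spend)

# Estimate the unknown value for a new customer with 5 past orders.
predicted = slope * 5 + intercept
print(predicted)
```

The model is built from known cases and then used to estimate a value that has not been observed yet, which is the defining trait of predictive data mining.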
Descriptive DM
Descriptive data mining produces new, nontrivial information based on the available data set.
Descriptive DM is used to learn about and understand the data.
Example:
Identify and describe groups of customers with common buying behavior.
Classification
Classification is the process of sub-dividing a data set with regard to a number of specific outcomes.
Example:
Given old data about customers and payments, predict a new applicant’s loan eligibility.
Decision Trees
hair   eyes   class
brown  blue   A
brown  brown  B
red    blue   A
dark   blue   B
dark   blue   B
brown  blue   A
dark   brown  B
brown  brown  B
A decision tree is a tree in which internal nodes are simple decision rules on one or more attributes and leaf nodes are predicted class labels.
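One way to learn such a tree from the hair/eyes table is to split on the attribute with the highest information gain at each node. The slides do not say which learning algorithm was used, so the ID3-style procedure below is an assumption, offered as a minimal sketch rather than the presenter's method.

```python
from collections import Counter
from math import log2

# The eight training rows from the table: (hair, eyes, class).
DATA = [
    ("brown", "blue", "A"), ("brown", "brown", "B"),
    ("red", "blue", "A"), ("dark", "blue", "B"),
    ("dark", "blue", "B"), ("brown", "blue", "A"),
    ("dark", "brown", "B"), ("brown", "brown", "B"),
]
ROWS = [{"hair": h, "eyes": e, "class": c} for h, e, c in DATA]

def entropy(rows):
    """Shannon entropy of the class labels in rows."""
    counts = Counter(r["class"] for r in rows)
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in counts.values())

def split(rows, attr):
    """Group rows by their value of attr."""
    groups = {}
    for r in rows:
        groups.setdefault(r[attr], []).append(r)
    return groups

def info_gain(rows, attr):
    """Entropy reduction obtained by splitting rows on attr."""
    n = len(rows)
    remainder = sum(len(g) / n * entropy(g) for g in split(rows, attr).values())
    return entropy(rows) - remainder

def build_tree(rows, attrs):
    """Return a class label (leaf) or {attr: {value: subtree}} (internal node)."""
    labels = Counter(r["class"] for r in rows)
    if len(labels) == 1 or not attrs:
        return labels.most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(rows, a))
    rest = [a for a in attrs if a != best]
    return {best: {v: build_tree(g, rest) for v, g in split(rows, best).items()}}

def classify(tree, row):
    """Follow decision rules to a leaf; raises KeyError for unseen values."""
    while isinstance(tree, dict):
        attr = next(iter(tree))
        tree = tree[attr][row[attr]]
    return tree

TREE = build_tree(ROWS, ["hair", "eyes"])
print(TREE)
```

On this table the gain for `hair` (about 0.45 bits) exceeds the gain for `eyes` (about 0.35 bits), so `hair` becomes the root and only the `brown` branch needs a second test on `eyes`.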
Decision Trees: Learned Predictive Rules
hair = dark: class B
hair = red: class A
hair = brown and eyes = blue: class A
hair = brown and eyes = brown: class B
Rule induction
In rule induction, the actions are given and we have to discover the rule.
The extraction of useful if-then rules from data based on statistical significance.
Rule induction is an area of machine learning in which formal rules are extracted from a set of observations.
Examples:
Do not give a discount on two items that are frequently bought together; use the discount on one to pull the other.
Send a camcorder offer to VCR purchasers 2-3 months after the VCR purchase.
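A minimal sketch of extracting if-then rules from purchase data follows. It uses support and confidence thresholds as a simple stand-in for the statistical significance test mentioned above; the baskets, the item names, and the threshold values are hypothetical and chosen only for illustration.

```python
from itertools import permutations

# Hypothetical purchase baskets, one set of items per customer.
BASKETS = [
    {"vcr", "camcorder"},
    {"vcr", "camcorder", "tape"},
    {"vcr", "tape"},
    {"tape"},
    {"vcr", "camcorder"},
]

def induce_rules(baskets, min_support, min_confidence):
    """Return rules (a, b, support, confidence) meaning 'if a then b'."""
    items = sorted(set().union(*baskets))
    n = len(baskets)
    rules = []
    for a, b in permutations(items, 2):
        has_a = sum(1 for t in baskets if a in t)
        both = sum(1 for t in baskets if a in t and b in t)
        support = both / n                          # share of baskets with a and b
        confidence = both / has_a if has_a else 0.0  # share of a-baskets that also hold b
        if support >= min_support and confidence >= min_confidence:
            rules.append((a, b, support, confidence))
    return rules

rules = induce_rules(BASKETS, min_support=0.4, min_confidence=0.7)
for a, b, s, c in rules:
    print(f"if {a} then {b} (support={s:.2f}, confidence={c:.2f})")
```

A rule such as "if vcr then camcorder" is the kind of pattern that could motivate the camcorder offer to VCR purchasers, although timing (the 2-3 month delay) is not modeled in this sketch.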
NEURAL NETWORK
A set of nodes connected by directed weighted edges.
Useful for learning complex data like handwriting, speech and image recognition.
Neural networks have broad applicability to real world business problems and have already been successfully applied in many industries. Since neural networks are best at identifying patterns or trends in data, they are well suited to prediction and forecasting needs.
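To make "nodes connected by directed weighted edges" concrete, here is a minimal sketch of a tiny feed-forward network stored literally as an edge table. The node names, weights, and biases are hand-picked assumptions that make the network compute XOR; nothing here is learned from data, and a real network would obtain its weights by training.

```python
# Directed weighted edges: (source node, target node) -> weight.
EDGES = {
    ("x1", "h_or"): 1.0, ("x2", "h_or"): 1.0,
    ("x1", "h_and"): 1.0, ("x2", "h_and"): 1.0,
    ("h_or", "out"): 1.0, ("h_and", "out"): -1.0,
}
# Bias added at each non-input node (hypothetical, hand-picked values).
BIAS = {"h_or": -0.5, "h_and": -1.5, "out": -0.5}
# Nodes listed so that every node comes after the nodes feeding into it.
ORDER = ["h_or", "h_and", "out"]

def step(z):
    """Threshold activation: fire (1.0) when the weighted input is positive."""
    return 1.0 if z > 0 else 0.0

def forward(x1, x2):
    """Propagate the two inputs along the edges and return the output node's value."""
    value = {"x1": float(x1), "x2": float(x2)}
    for node in ORDER:
        total = BIAS[node] + sum(
            w * value[src] for (src, dst), w in EDGES.items() if dst == node
        )
        value[node] = step(total)
    return value["out"]

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(a, b, forward(a, b))
```

The hidden nodes act as OR and AND detectors, and the output fires when OR is on but AND is off, which a single weighted node could not do on its own.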
NEAREST NEIGHBOUR METHOD
The nearest neighbor algorithm in pattern recognition is a method for classifying phenomena based upon observable features.
Define proximity between instances, find neighbors of new instance and assign majority class.
The nearest neighbor algorithm is a heuristic: it is not guaranteed to produce the correct result in every case.
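The three steps named above (define proximity, find neighbors, assign the majority class) can be sketched in a few lines. Euclidean distance as the proximity measure, k = 3, and the training points are assumptions chosen for illustration.

```python
from collections import Counter
from math import dist  # Euclidean distance, available in Python 3.8+

# Hypothetical labeled instances: (feature vector, class label).
TRAIN = [
    ((1.0, 1.0), "A"),
    ((1.5, 2.0), "A"),
    ((2.0, 1.5), "A"),
    ((6.0, 6.0), "B"),
    ((6.5, 7.0), "B"),
    ((7.0, 6.5), "B"),
]

def knn_predict(train, point, k=3):
    """Classify point by majority vote among its k nearest training instances."""
    # Steps 1 and 2: measure proximity and keep the k closest instances.
    neighbors = sorted(train, key=lambda item: dist(item[0], point))[:k]
    # Step 3: assign the class that occurs most often among the neighbors.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_predict(TRAIN, (2.0, 2.0)))
print(knn_predict(TRAIN, (6.0, 5.5)))
```

Because the answer depends entirely on the chosen distance and on which examples happen to be nearby, a different proximity measure or value of k can change the prediction.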
Clustering
The art of finding groups in data.
Objective: gather items from a database into sets according to (unknown) common characteristics.
Example: group existing customers based on time series of payment history such that similar customers are in the same cluster.
Key requirement: a good measure of similarity between instances.
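As a minimal sketch of grouping customers by similarity, the code below runs a basic k-means loop. The slides do not name a clustering algorithm, so k-means, Euclidean distance as the similarity measure, the "first k points" initialization, and the two hypothetical features per customer (for example, two summary numbers derived from payment history) are all assumptions.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

def kmeans(points, k, iterations=10):
    """Basic k-means: returns (cluster label per point, list of centroids)."""
    centroids = list(points[:k])  # simplistic deterministic initialization
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [min(range(k), key=lambda c: dist(p, centroids[c])) for p in points]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Hypothetical customers described by two payment-history summary features.
CUSTOMERS = [
    (1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
    (8.0, 8.0), (9.0, 9.5), (8.5, 9.0),
]

labels, centroids = kmeans(CUSTOMERS, k=2)
print(labels)
print(centroids)
```

The number of groups is fixed in advance here, while the characteristics that define each group emerge from the data, which matches the "(unknown) common characteristics" in the objective.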
Major issues in data mining
Mining different kinds of knowledge in databases.
Expression and visualization of data mining results.
Handling noise and incomplete data.
Pattern evaluation: the interestingness problem.
Efficiency and scalability of data mining algorithms.
Handling relational and complex types of data.