Chapter 1: Introduction to KDD
![Page 1: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/1.jpg)
![Page 2: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/2.jpg)
What is Knowledge Acquisition?
• Also known as: data mining, knowledge discovery, knowledge extraction, information discovery, information harvesting, etc.
• The process of discovering useful information, hidden patterns, or rules in large quantities of data (non-trivial, previously unknown).
• Carried out by automatic or semi-automatic means; finding such patterns manually is not feasible at this scale.
![Page 3: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/3.jpg)
Why Knowledge Acquisition?
![Page 4: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/4.jpg)
Why Knowledge Acquisition?
• Data explosion (a tremendous amount of data is available)
• Data is being warehoused
• Computing power
• Competitive pressure
Hard disks nowadays have capacities of more than 100 GB.
![Page 5: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/5.jpg)
Is Data Mining Appropriate for My Problem?
Four general questions to consider:
• Can we clearly define the problem?
• Does potentially meaningful data exist?
• Does the data contain hidden knowledge, or is the data factual and useful for reporting purposes only?
• Will the cost of processing the data be less than the likely increase in profit from applying any knowledge gained from the data mining project?
![Page 6: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/6.jpg)
Traditional Approaches
Traditional database queries:
• Access a database using a well-defined query written in a language such as SQL.
• The query output consists of data from the database.
• The output is usually a subset of the database.
[Figure: diagram labelled SQL, DBMS, and DB]
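As a minimal sketch of such a traditional query, the following uses Python's built-in `sqlite3` module. The `customers` table, its columns, and its rows are invented purely for illustration. The point is only that the result is a subset of what is already stored.

```python
import sqlite3

# In-memory database with a hypothetical table; all names and values are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (name TEXT, annual_income INTEGER, owns_home INTEGER)"
)
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [("Ali", 32000, 1), ("Mei", 52000, 0), ("Raj", 28000, 1)],
)

# A well-defined query: the output is a subset of the rows already stored.
rows = conn.execute(
    "SELECT name, annual_income FROM customers "
    "WHERE annual_income >= ? ORDER BY name",
    (30000,),
).fetchall()
conn.close()

print(rows)
```

The query returns stored facts only; it does not uncover any pattern that was not explicitly asked for.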
![Page 7: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/7.jpg)
Data Mining or Data Query?
Four general types of knowledge can be defined to help us determine when data mining is appropriate:
• Shallow Knowledge
• Multidimensional Knowledge
• Hidden Knowledge
• Deep Knowledge
![Page 8: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/8.jpg)
Shallow Knowledge
• Factual in nature
• Can be easily stored and manipulated in a database
• Database query languages such as SQL are excellent tools for extracting shallow knowledge from data
![Page 9: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/9.jpg)
Multidimensional Knowledge
• Also factual
• Data are stored in a multidimensional format
• On-line Analytical Processing (OLAP) tools are used on multidimensional data
![Page 10: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/10.jpg)
Hidden Knowledge
• Patterns or regularities in data that cannot be easily found using a database query language such as SQL
• Data mining algorithms can find such patterns with ease.
![Page 11: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/11.jpg)
Deep Knowledge
• Knowledge stored in a database that can only be found if we are given some direction about what we are looking for.
• Current data mining tools are not able to locate deep knowledge.
![Page 12: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/12.jpg)
What can computers learn?
• Four levels of learning can be differentiated (Merrill & Tennyson, 1977):
- Facts: simple statements of truth.
- Concepts: sets of objects, symbols, or events grouped together because they share certain characteristics.
- Procedures: step-by-step courses of action to achieve a goal.
- Principles: the highest level of learning; general truths or laws that are basic to other truths.
![Page 13: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/13.jpg)
What can computers learn?
• Computers are good at learning 'concepts'.
• Concepts are the output of a data mining session.
• There are three (3) common concept views:
a. Classical view
b. Probabilistic view
c. Exemplar view
![Page 14: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/14.jpg)
Three Concept Views
a. Classical View:
• Definite defining properties
• These properties determine if an individual item is an example of a particular concept.
• Crisp and leaves no room for misinterpretation.
• Example: Good Credit Rating
IF Annual Income >= 30,000
& Years at Current Position >= 5
& Owns Home = True
THEN Good Credit Risk = True
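The classical rule above translates directly into a boolean test. The sketch below is one possible rendering; the function and parameter names are made up for illustration.

```python
def is_good_credit_risk(annual_income: float,
                        years_at_position: float,
                        owns_home: bool) -> bool:
    """Classical-view rule from the slide: every condition must hold."""
    return bool(
        annual_income >= 30_000
        and years_at_position >= 5
        and owns_home
    )

# Meets all three conditions.
print(is_good_credit_risk(32_000, 6, True))
# Fails the income and years conditions, so the crisp rule rejects it outright.
print(is_good_credit_risk(27_000, 4, True))
```

Because the rule is crisp, an applicant who narrowly misses one threshold is rejected just as firmly as one who misses all of them.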
![Page 15: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/15.jpg)
Three Concept Views
b. Probabilistic View:
• Concepts are represented by properties that are probable of concept members.
• The assumption is that people store and recall concepts as generalizations created from observations of individual instances.
• Cannot be applied directly to reach an answer, but can be used to help in the decision-making process.
• Associates a probability of membership with a specific classification.
![Page 16: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/16.jpg)
Three Concept Views
b. Probabilistic View:
• Example: Good Credit Rating
- The mean annual income for individuals who consistently make loan payments on time is $30,000.
- Most individuals who are good credit risks have been working for the same company for at least five years.
- The majority of good credit risks own their own home.
A homeowner with an annual income of $27,000 who has been employed in the same position for 4 years might be classified as a good credit risk with a probability of 0.85.
![Page 17: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/17.jpg)
Three Concept Views
c. Exemplar View:
• A given instance is determined to be an example of a particular concept if the instance is similar enough to a set of one or more known examples of the concept.
• The assumption is that people store and recall likely concept exemplars, which are then used to classify new instances.
• Can associate a probability of concept membership with each classification.
![Page 18: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/18.jpg)
Three Concept Views
c. Exemplar View:
• Example:
Exemplar #1: Annual Income = 32,000; Number of years at current position = 6; Homeowner
Exemplar #2: Annual Income = 52,000; Number of years at current position = 16; Renter
Exemplar #3: Annual Income = 28,000; Number of years at current position = 12; Homeowner
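A rough sketch of exemplar-based classification follows. It assumes the three exemplars above are known good credit risks, which the slide does not state explicitly. The distance measure, its scaling constants, and the threshold are hypothetical choices made only to give "similar enough" a concrete meaning.

```python
# Exemplars from the slide, assumed here to be known good credit risks.
EXEMPLARS = [
    {"income": 32_000, "years": 6, "homeowner": True},
    {"income": 52_000, "years": 16, "homeowner": False},
    {"income": 28_000, "years": 12, "homeowner": True},
]

def distance(a, b):
    # Hypothetical scaling: a gap of 10,000 in income, a gap of 5 years,
    # or a home-ownership mismatch each contribute 1.0 to the distance.
    d_income = abs(a["income"] - b["income"]) / 10_000
    d_years = abs(a["years"] - b["years"]) / 5
    d_home = 0.0 if a["homeowner"] == b["homeowner"] else 1.0
    return d_income + d_years + d_home

def matches_concept(instance, exemplars=EXEMPLARS, threshold=1.0):
    """True if the instance is close enough to at least one stored exemplar."""
    return min(distance(instance, e) for e in exemplars) <= threshold

applicant = {"income": 27_000, "years": 4, "homeowner": True}
print(matches_concept(applicant))
```

Unlike the classical rule, this applicant is accepted because the instance lies close to one stored exemplar, even though it misses the crisp income and tenure thresholds.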
![Page 19: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/19.jpg)
What can be mined?
![Page 20: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/20.jpg)
Concepts that can be mined?
a. Classes:
• Stored data is used to locate data in predetermined groups.
• E.g.: A restaurant chain could mine customer purchase data to determine when customers visit and what they typically order.
![Page 21: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/21.jpg)
Concepts that can be mined?
b. Clusters:
• Data items are grouped by logical relationships.
• E.g.: Data can be mined to identify market segments or customer affinities.
![Page 22: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/22.jpg)
Concepts that can be mined?
c. Associations:
• Data can be mined to identify associations.
• E.g.: The beer-diaper example is typical of associative mining.
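To make the beer-diaper example concrete, the sketch below computes support and confidence, the two standard measures used to rank association rules. The transactions are invented for illustration and are not taken from any real dataset.

```python
# Hypothetical market-basket transactions.
transactions = [
    {"beer", "diapers", "chips"},
    {"beer", "diapers"},
    {"diapers", "milk"},
    {"beer", "chips"},
    {"beer", "diapers", "milk"},
]

def count(itemset, transactions):
    """Number of transactions that contain every item in itemset."""
    items = set(itemset)
    return sum(items <= t for t in transactions)

def support(itemset, transactions):
    """Fraction of all transactions that contain the itemset."""
    return count(itemset, transactions) / len(transactions)

def confidence(antecedent, consequent, transactions):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    both = set(antecedent) | set(consequent)
    return count(both, transactions) / count(antecedent, transactions)

print(support({"diapers", "beer"}, transactions))
print(confidence({"diapers"}, {"beer"}, transactions))
```

With this toy data, diapers and beer appear together in 3 of 5 baskets, and 3 of the 4 baskets containing diapers also contain beer.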
![Page 23: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/23.jpg)
Concepts that can be mined?
d. Sequential patterns:
• Data is mined to anticipate behavior patterns and trends.
• E.g.: An outdoor equipment retailer could predict the likelihood of a backpack purchase based on sales of sleeping bags or hiking shoes.
![Page 24: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/24.jpg)
Multidisciplinary
[Figure: diagram relating KDD and Data Mining to Databases, Statistics, Pattern Recognition, Machine Learning, AI, and Neurocomputing]
![Page 25: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/25.jpg)
Disciplines Of Data Mining
[Figure: diagram linking Data Mining to Information Retrieval, Algorithm, Machine Learning, Visualization, Statistics, and Database System]
![Page 26: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/26.jpg)
Data Mining Model & Task
Data Mining
• Predictive: Classification, Regression, Time Series Analysis, Prediction
• Descriptive: Clustering, Summarization, Association Rules, Sequence Discovery
![Page 27: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/27.jpg)
Predictive Model
• Makes predictions about values of data using known results found from different data
• Or based on the use of other historical data
• Examples: credit card fraud, breast cancer early warning, terrorist acts, tsunamis, etc.
![Page 28: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/28.jpg)
Predictive Model
• Performs inference on the current data to make predictions.
• We know what to predict (based on historical data).
• Never 100% accurate
• Concentrates more on the input-output relationship (x, f(x))
• Typical questions:
- Which customers are likely to buy this product in the next four months?
- What kinds of transactions are likely to be fraudulent?
- Who is likely to drop this paper?
![Page 29: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/29.jpg)
Predictive Model
[Figure: scatter plot of Profit (RM) against months; x marks the current data and o marks the future data to be predicted (?)]
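One simple way to produce the "future data" point in such a plot is to fit a straight line to the current data and extrapolate. The sketch below uses ordinary least squares in plain Python. The monthly profit figures are hypothetical, and a linear trend is an assumption, not something the slide specifies.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical current data: profit (RM thousand) for months 1 to 6.
months = [1, 2, 3, 4, 5, 6]
profit = [10.0, 12.0, 14.0, 16.0, 18.0, 20.0]

slope, intercept = fit_line(months, profit)
predicted_month_7 = slope * 7 + intercept  # extrapolate to the next month
print(slope, intercept, predicted_month_7)
```

Real profit data would be noisy, so the prediction would carry an error; this matches the slide's point that predictive models are never 100% accurate.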
![Page 30: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/30.jpg)
Descriptive Model
• Identifies patterns or relationships in data.
• Serves as a way to explore the properties of the data examined, not to predict new properties
• Always requires a domain expert
• Examples:
- Segmenting marketing areas
- Profiling student performance
![Page 31: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/31.jpg)
Descriptive Model
• Discovers new patterns inside the data
• We may not have any idea what the data looks like
• Explores the properties of the data examined
• Patterns at various granularities (e.g. Student: University -> faculty -> program -> major)
• Typical questions:
- What is the data?
- What does it look like?
- What does the data suggest for advertising to a group of customers?
![Page 32: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/32.jpg)
Descriptive Model
[Figure: scatter plot of Results against major; points marked x, o, and y, with groups labelled Group 1, Group 2, and Group 3]
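Groupings like those in the figure can be found without any predefined labels. The sketch below is a small k-means implementation in plain Python. The 2-D points and the choice of starting centroids are hypothetical, and k-means is only one of several clustering techniques that could be used here.

```python
def squared_distance(p, q):
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, then recompute centroids."""
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(
                range(len(centroids)),
                key=lambda i: squared_distance(p, centroids[i]),
            )
            clusters[nearest].append(p)
        centroids = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centroids[i]  # keep the old centroid if a cluster is empty
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Hypothetical data forming three visually separate groups.
points = [
    (1, 1), (1, 2), (2, 1),
    (8, 8), (9, 8), (8, 9),
    (1, 8), (2, 9), (1, 9),
]
initial = [points[0], points[3], points[6]]  # assumed starting centroids
centroids, clusters = kmeans(points, initial)
print([len(c) for c in clusters])
```

The algorithm only proposes groups; deciding what each group means is where the domain expert comes in.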
![Page 33: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/33.jpg)
View Of DM
• Data To Be Mined: data warehouse, WWW, time series, textual, spatial, multimedia, transactional
• Knowledge To Be Mined: classification, prediction, summarization, trend
• Techniques Utilized: database, machine learning, visualization, statistics
• Applications Adapted: marketing, demographic segmentation, stock analysis
![Page 34: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/34.jpg)
DM In Action
• Medical applications (clinical diagnosis, drug analysis)
• Business (marketing segmentation & strategies, insolvency prediction, loan risk assessment)
• Education (online learning)
• Internet (search engines)
• Etc.
![Page 35: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/35.jpg)
Data Mining Methodology
Hypothesis Testing vs Knowledge Discovery
• Hypothesis Testing
- Top-down approach
- Attempts to substantiate or disprove a preconceived idea
• Knowledge Discovery
- Bottom-up approach
- Starts with the data and tries to get it to tell us something we didn't already know
![Page 36: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/36.jpg)
Data Mining Methodology
Hypothesis Testing
• Generate good ideas
• Determine what data allow these hypotheses to be tested
• Locate the data
• Prepare the data for analysis
• Build computer models based on the data
• Evaluate the computer models to confirm or reject the hypotheses
![Page 37: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/37.jpg)
Data Mining Methodology
Knowledge Discovery: Directed
• Identify sources of pre-classified data
• Prepare the data for analysis
• Select appropriate KD techniques based on the data characteristics and the data mining goal
• Divide the data into training, testing, and evaluation sets
• Use the training dataset to build the model
• Tune the model by applying it to the test dataset
• Take action based on the data mining results
• Measure the effect of the action taken
• Restart the DM process, taking advantage of new data generated by the action taken
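The "divide the data" step can be sketched as a shuffled three-way split. The 60/20/20 proportions and the fixed random seed below are assumptions chosen for illustration; the slide does not prescribe any particular ratio.

```python
import random

def split_dataset(records, train=0.6, test=0.2, seed=0):
    """Shuffle the records and split them into training, testing, and evaluation sets."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # seeded so the split is reproducible
    n_train = round(len(shuffled) * train)
    n_test = round(len(shuffled) * test)
    training = shuffled[:n_train]
    testing = shuffled[n_train:n_train + n_test]
    evaluation = shuffled[n_train + n_test:]  # whatever remains
    return training, testing, evaluation

records = list(range(100))  # stand-in for 100 pre-classified records
training, testing, evaluation = split_dataset(records)
print(len(training), len(testing), len(evaluation))
```

Keeping the evaluation set separate from the set used for tuning gives a fairer estimate of how the model will behave on new data.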
![Page 38: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/38.jpg)
Data Mining Methodology
Knowledge Discovery: Undirected
• Identify available data sources
• Prepare the data for analysis
• Select appropriate undirected KD techniques based on the data characteristics and the data mining goal
• Use the selected technique to uncover hidden structure in the data
• Identify potential targets for directed KD
• Generate new hypotheses to test
![Page 39: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/39.jpg)
Question for Group Discussion
![Page 40: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/40.jpg)
Revision: Two Approaches in Data Mining
Data Mining
• Predictive (predict the future value): Classification, Regression, Time Series Analysis, Prediction
• Descriptive (define relationships among data): Clustering, Summarization, Association Rules, Sequence Discovery
![Page 41: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/41.jpg)
Knowledge Discovery Process
![Page 42: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/42.jpg)
Knowledge Discovery Process
1.0 Selection
• The data needed for the data mining process may be obtained from many different and heterogeneous data sources.
• Examples:
- Business transactions
- Scientific data
- Video and pictures
![Page 43: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/43.jpg)
Knowledge Discovery Process
2.0 Pre-processing
• Main idea: to ensure that the data is clean (high-quality data).
• The data to be used by the process may contain incorrect or missing values.
• There may be anomalous data from multiple sources involving different data types and metrics.
• Erroneous data may be corrected or removed, whereas missing data must be supplied or predicted (often using data mining tools).
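A minimal sketch of this cleaning step is shown below. It assumes that values outside a valid range count as erroneous and are removed, and that missing values (represented as `None`) are supplied with the mean of the remaining values. Both the 0 to 100 range and the mean-imputation strategy are illustrative assumptions.

```python
def clean(scores, low=0.0, high=100.0):
    """Remove out-of-range values, then fill missing values with the mean of the rest."""
    # Drop erroneous entries; keep missing ones (None) so they can be filled in.
    kept = [s for s in scores if s is None or low <= s <= high]
    known = [s for s in kept if s is not None]
    mean = sum(known) / len(known)
    return [mean if s is None else s for s in kept]

raw = [70.0, None, 250.0, 90.0, 80.0]  # hypothetical scores; 250.0 is out of range
print(clean(raw))
```

Mean imputation is the simplest option; the slide notes that missing values may instead be predicted, which would replace the mean with the output of a model.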
![Page 44: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/44.jpg)
Knowledge Discovery Process
3.0 Transformation
• Data from different sources must be converted into a common format for processing.
• Some data may be encoded or transformed into more usable formats.
• Examples: data cleaning, data integration, data transformation, data reduction, and data discretization
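Two common transformations are sketched below: rescaling a numeric attribute to the range 0 to 1, and encoding a categorical attribute as integer codes. The helper names and sample values are hypothetical, and other encodings would serve equally well.

```python
def min_max_scale(values):
    """Rescale numeric values linearly so the smallest maps to 0.0 and the largest to 1.0."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # all values identical; avoid division by zero
    return [(v - lo) / (hi - lo) for v in values]

def encode_categories(values):
    """Map each distinct category to an integer code (assigned in sorted order)."""
    codes = {category: i for i, category in enumerate(sorted(set(values)))}
    return [codes[v] for v in values], codes

incomes = [28_000, 32_000, 52_000]
housing = ["Homeowner", "Renter", "Homeowner"]

print(min_max_scale(incomes))
print(encode_categories(housing))
```

Putting attributes on a common numeric scale matters for methods such as neural networks and distance-based algorithms, which would otherwise be dominated by the attribute with the largest raw values.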
![Page 45: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/45.jpg)
Knowledge Discovery Process
4.0 Data Mining
• Main idea: to use intelligent methods to extract patterns and knowledge from the database.
• This step applies algorithms to the transformed data to generate the desired results.
• The heart of the KD process (where unknown patterns are revealed).
• Examples of algorithms: Regression (classification, prediction), Neural Networks (prediction, classification, clustering), Apriori Algorithm (association rules), K-Means (clustering), K-Nearest Neighbor (classification), Decision Tree (classification), Instance Learning (classification).
![Page 46: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/46.jpg)
Knowledge Discovery Process
5.0 Interpretation/Evaluation
• How the data mining results are presented to the users is extremely important, because the usefulness of the results depends on it.
• Examples of presentation techniques: graphical, geometric, icon-based, pixel-based, hierarchical, hybrid
![Page 47: Chapter 1 : Introduction to KDD](https://reader031.vdocuments.us/reader031/viewer/2022013105/54b418514a79597c418b45e7/html5/thumbnails/47.jpg)
Case Study: Predicting FSK Final-Year Student Performance
• Source: student database (contains 30,000 records) covering activities and academics
• Selection: selected records {matric, PMK, grades}; only 2,000 records (contains incomplete records, etc.)
• Pre-processing: clean records {missing values replaced, replicated records removed}
• Transformation: for use with neural networks, values are transformed into numerical form
• Data mining: generated model (pattern for performance prediction): Y = w1 x1 + w2 x2 + b1
• Interpretation & evaluation: testing result 90% correct, so the model is accepted
• Knowledge: apply the model
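The last two stages of the case study can be sketched as follows. The weights, the decision rule, the test records, and the acceptance threshold are all hypothetical; the case study reports only the form of the model, Y = w1 x1 + w2 x2 + b1, and a test result of 90% correct.

```python
# Hypothetical weights and bias; the case study does not report the fitted values.
W1, W2, B1 = 0.6, 0.4, -0.5

def model(x1, x2, w1=W1, w2=W2, b1=B1):
    """Linear model from the slide: Y = w1*x1 + w2*x2 + b1."""
    return w1 * x1 + w2 * x2 + b1

def predicts_good_performance(x1, x2):
    # Assumed decision rule: a non-negative output means "good performance".
    return model(x1, x2) >= 0.0

def accuracy(test_set):
    """Fraction of test records whose predicted label matches the actual label."""
    correct = sum(
        predicts_good_performance(x1, x2) == label
        for (x1, x2), label in test_set
    )
    return correct / len(test_set)

# Hypothetical test records: ((x1, x2), actual outcome), inputs already scaled to 0..1.
test_set = [
    ((0.9, 0.8), True), ((0.2, 0.3), False), ((0.7, 0.6), True),
    ((0.4, 0.2), False), ((0.8, 0.9), True), ((0.3, 0.4), False),
    ((0.6, 0.7), True), ((0.1, 0.5), False), ((0.9, 0.4), True),
    ((0.5, 0.3), True),  # this record is misclassified by the model
]

ACCEPT_THRESHOLD = 0.85  # assumed acceptance criterion
test_accuracy = accuracy(test_set)
accepted = test_accuracy >= ACCEPT_THRESHOLD
print(test_accuracy, accepted)
```

Only once the model passes evaluation on held-out records is it applied to new students, which corresponds to the "Knowledge (apply model)" step.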