Some Issues Concerning Data Mining
Muhammad Ali Yousuf
ITM
(Based on Notes by David Squire, Monash University)
Contents
Data Mining for Business Decision Support
The process of knowledge discovery
Data selection and preprocessing
A Case Study
Data Preparation
Data Modeling
Data Mining for Business Decision Support (From Berry & Linoff 1997)
Identify the business problem
Use data mining techniques to transform the data into actionable information
Act on information
Measure the results
The Process of Knowledge Discovery
Pre-processing
- data selection
- cleaning
- coding
Data Mining
- select a model
- apply the model
Analysis of results and assimilation
Take action and measure the results
The Process of Knowledge Discovery
[Figure: The Knowledge Discovery in Databases (KDD) process (Adriaans/Zantinge). Stages: data selection, cleaning (domain consistency, de-duplication, disambiguation), enrichment, coding, data mining (clustering, segmentation, prediction), reporting. Inputs: operational data and external data. An information requirement leads to action, with feedback.]
Data Selection
Identify the relevant data, both internal and external to the organization
Select the subset of the data appropriate for the particular data mining application
Store the data in a database separate from the operational systems
Data Preprocessing
Cleaning
- Domain consistency: replace certain values with null
- De-duplication: customers are often added to the DB on each purchase transaction
- Disambiguation: highlighting ambiguities for a decision by the user, e.g. if names differed slightly but addresses were the same
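The de-duplication and disambiguation steps can be sketched in a few lines of Python. This is only an illustration under assumptions: the record fields (`name`, `address`), the sample customers and the helper names are invented for the example and do not come from the notes.

```python
def normalise(text):
    """Lower-case and collapse whitespace so trivially different spellings compare equal."""
    return " ".join(text.lower().split())


def deduplicate(records):
    """Keep the first record for each (name, address) pair after normalisation."""
    seen = set()
    unique = []
    for rec in records:
        key = (normalise(rec["name"]), normalise(rec["address"]))
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


def ambiguous_pairs(records):
    """Return name pairs that share an address but differ, for a user to review."""
    pairs = []
    for i, a in enumerate(records):
        for b in records[i + 1:]:
            same_address = normalise(a["address"]) == normalise(b["address"])
            if same_address and normalise(a["name"]) != normalise(b["name"]):
                pairs.append((a["name"], b["name"]))
    return pairs


# Hypothetical customer records: one trivial repeat and one near-duplicate.
customers = [
    {"name": "J. Smith", "address": "12 High St"},
    {"name": "j. smith", "address": "12 High  St"},
    {"name": "John Smith", "address": "12 High St"},
    {"name": "A. Jones", "address": "3 Low Rd"},
]

unique_customers = deduplicate(customers)
to_review = ambiguous_pairs(unique_customers)
```

The sketch only highlights the ambiguous pair; as the slide says, the decision whether "J. Smith" and "John Smith" are the same customer is left to the user.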
Data Preprocessing
Enrichment
- Additional fields are added to records from external sources, which may be vital in establishing relationships.
Coding
- e.g. take addresses and replace them with regional codes
- e.g. transform birth dates into age ranges
- It is often necessary to convert continuous data into range data for categorization purposes.
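As a rough illustration of coding, the following Python sketch turns birth dates into age ranges. The ten-year band width, the fixed reference date and the function names are assumptions chosen for the example.

```python
from datetime import date


def age_on(birth_date, reference_date):
    """Age in whole years on reference_date."""
    years = reference_date.year - birth_date.year
    # Subtract one if the birthday has not yet occurred in the reference year.
    if (reference_date.month, reference_date.day) < (birth_date.month, birth_date.day):
        years -= 1
    return years


def age_range(age, width=10):
    """Code an age as a non-overlapping range label such as '30-39'."""
    low = (age // width) * width
    return f"{low}-{low + width - 1}"


reference = date(2000, 1, 1)  # fixed reference date, chosen only for the example
birth_dates = [date(1965, 6, 30), date(1970, 1, 1), date(1993, 12, 5)]
coded = [age_range(age_on(b, reference)) for b in birth_dates]
```

Using a fixed reference date keeps the coding reproducible; coding against "today" would silently move records between bands over time.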
Data Mining
Preliminary Analysis
- Much interesting information can be found by querying the data set
- May be supported by a visualization of the data set.
Choose one or more modeling approaches
Data Mining
There are two styles of data mining:
- Hypothesis testing
- Knowledge discovery
The styles and approaches are not mutually exclusive
Data Mining Tasks
Various taxonomies exist. Berry & Linoff define 6 tasks:
- Classification
- Estimation
- Prediction
- Affinity Grouping
- Clustering
- Description
Data Mining Tasks
The tasks are also referred to as operations:
- Predictive Modeling
- Database Segmentation
- Link Analysis
- Deviation Detection
Classification
Classification involves considering the features of some object and then assigning it to some pre-defined class, for example:
- Spotting fraudulent insurance claims
- Which phone numbers are fax numbers
- Which customers are high-value
Estimation
Estimation deals with numerically valued outcomes rather than discrete categories as occurs in classification.
- Estimating the number of children in a family
- Estimating family income
Prediction
Essentially the same as classification and estimation but involves future behaviour
Historical data is used to build a model explaining behaviour (outputs) for known inputs
The model developed is then applied to current inputs to predict future outputs
- Predict which customers will respond to a promotion
- Classifying loan applications
Affinity Grouping
Affinity grouping is also referred to as Market Basket Analysis
A common example is which items are bought together at the supermarket. Once this is known, decisions can be made on, for example:
- how to arrange items on the shelves
- which items should be promoted together
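A very small version of market basket analysis is to count how often each pair of items appears in the same basket. The sketch below does only that; the baskets are invented, and real association-detection tools go further (for example, rules with confidence measures), so treat this as an illustration rather than the technique Berry & Linoff describe.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) gives each pair one canonical ordering per basket.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical supermarket baskets.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["beer", "crisps"],
    ["bread", "butter"],
]
counts = pair_counts(baskets)
# Support: the fraction of all baskets that contain both items.
support = {pair: n / len(baskets) for pair, n in counts.items()}
```

Pairs with high support, such as bread and milk here, are the candidates for shelf placement or joint promotion.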
Clustering
Clustering is also sometimes referred to as segmentation (though this has other meanings in other fields)
In clustering there are no pre-defined classes. Self-similarity is used to group records. The user must attach meaning to the clusters formed
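To make "self-similarity is used to group records" concrete, here is a minimal one-dimensional k-means sketch (k-means is one of the techniques the notes list). The spending figures, the starting centres and the function name are assumptions for the example.

```python
def kmeans_1d(values, centres, iterations=10):
    """Tiny k-means on numbers: assign each value to its nearest centre, then re-centre."""
    for _ in range(iterations):
        clusters = [[] for _ in centres]
        for v in values:
            nearest = min(range(len(centres)), key=lambda i: abs(v - centres[i]))
            clusters[nearest].append(v)
        # An empty cluster keeps its previous centre.
        centres = [sum(c) / len(c) if c else centres[i] for i, c in enumerate(clusters)]
    return centres, clusters


# Hypothetical annual spend per customer.
annual_spend = [120, 150, 130, 900, 950, 1000]
centres, clusters = kmeans_1d(annual_spend, centres=[0.0, 500.0])
```

The algorithm returns two groups but no labels; calling them "occasional" and "high-value" customers is the meaning the user has to attach.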
Clustering
Clustering often precedes some other data mining task, for example:
- once customers are separated into clusters, a promotion might be carried out based on market basket analysis of the resulting cluster
Description
A good description of data can provide understanding of behaviour
The description of the behaviour can suggest an explanation for it as well
Statistical measures can be useful in describing data, as can techniques that generate rules
Deviation Detection
Records whose attributes deviate from the norm by significant amounts are also called outliers
Application areas include:
- fraud detection
- quality control
- tracing defects
Visualization techniques and statistical techniques are useful in finding outliers
A cluster which contains only a few records may in fact represent outliers
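One simple statistical technique for finding outliers is to flag values that lie more than a chosen number of standard deviations from the mean. The sketch below assumes a threshold of 2 and uses invented claim amounts; both are illustrative choices, not values from the notes.

```python
from statistics import mean, pstdev


def z_score_outliers(values, threshold=2.0):
    """Return values lying more than `threshold` standard deviations from the mean."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return []  # all values identical, so nothing deviates
    return [v for v in values if abs(v - mu) / sigma > threshold]


# Hypothetical claim amounts; the last one is far from the rest.
claims = [10, 11, 9, 10, 12, 10, 11, 95]
outliers = z_score_outliers(claims)
```

The threshold is a judgment call. An extreme value also inflates the standard deviation it is measured against, so in small samples a more robust measure (for example, one based on the median) may be a better fit.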
Data Mining Techniques
- Query tools
- Decision Trees
- Memory-Based Reasoning
- Artificial Neural Networks
- Genetic Algorithms
- Association and sequence detection
- Statistical Techniques
- Visualization
- Others (Logistic regression, Generalized Additive Models (GAM), Multivariate Adaptive Regression Splines (MARS), K Means Clustering, ...)
Data Mining and the Data Warehouse
Organizations realized that they had large amounts of data stored (especially of transactions) but it was not easily accessible
Data Mining and the Data Warehouse
The data warehouse provides a convenient data source for data mining.
- Some data cleaning has usually occurred.
- It exists independently of the operational systems
- Data is retrieved rather than updated
- Indexed for efficient retrieval
- Data will often cover 5 to 10 years
A data warehouse is not a pre-requisite for data mining
Data Mining and OLAP
Online Analytical Processing (OLAP)
- Tools that allow a powerful and efficient representation of the data
- Makes use of a representation known as a cube
- A cube can be sliced and diced
- OLAP provides reporting with aggregation and summary information but does not reveal patterns, which is the purpose of data mining
A Case Study - Data Preparation (Cabena et al. page 106)
Health Insurance Commission Australia
- 550 GB online; 1300 GB in 5-year history DB
- Aim to prevent fraud and inappropriate practice
- Considered 6.8 million visits requesting up to 20 pathology tests and 17,000 doctors
- Descriptive variables were added to the GP records
- Records were pivoted to create separate records for each pathology test
- Records were then aggregated by provider number (GP)
- An association discovery operation was carried out
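The pivot and aggregation steps in this case study can be illustrated with a toy data set. Everything in the sketch (provider IDs, test codes, field names) is hypothetical; it shows the shape of the transformation, not the Commission's actual processing.

```python
from collections import defaultdict

# Hypothetical visit records: one row per visit, listing the tests requested.
visits = [
    {"provider": "GP1", "tests": ["T1", "T2"]},
    {"provider": "GP1", "tests": ["T1"]},
    {"provider": "GP2", "tests": ["T2", "T3", "T4"]},
]

# Pivot: one record per (provider, test) instead of one record per visit.
test_records = [
    {"provider": v["provider"], "test": t} for v in visits for t in v["tests"]
]

# Aggregate by provider number: how often each provider requested each test.
by_provider = defaultdict(lambda: defaultdict(int))
for rec in test_records:
    by_provider[rec["provider"]][rec["test"]] += 1
```

After these two steps each provider has a profile of requested tests, which is the kind of input an association discovery operation can work on.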
Data Preparation
- Accessing data
- Data characterization
- Data selection
- Useful operations for data clean-up and conversion
- Integration Issues
Data Preparation for Data Mining
Before starting to use a data mining tool, the data has to be transformed into a suitable form for data mining
Many new and powerful data mining tools have become available in recent years, but the law of GIGO still applies:
Garbage In Garbage Out
Good data is a prerequisite for producing effective models of any type
Data Preparation for Data Mining
Data preparation and data modeling can therefore be considered as setting up the proper environment for data mining
Data preparation will involve
- Accessing the data (transfer of data from various sources)
- Integrating different data sets
- Cleaning the data
- Converting the data to a suitable format
Accessing the Data
Before data can be identified and assessed, two major questions must be answered:
- Is the data accessible?
- How does one get it?
Accessing the Data
There are many reasons why data might not be readily accessible, particularly in organizations without a data warehouse:
- legal issues
- departmental access
- political reasons
- data format
- connectivity
- architectural reasons
- timing
Accessing the Data
Transferring data from the original sources may mean obtaining it from:
- high-density tapes
- email attachments
- FTP, as bulk downloads
Accessing the Data
Repository types
- Databases
  - Obtain data as separate tables converted to flat files (most databases have the facility).
- Word processors
  - Text output without any formatting would be the best
Accessing the Data
Repository types (cont.)
- Spreadsheets
  - Small applications/organisations will store data in spreadsheets. Already in row/column format, so easy to access. Most problems due to inconsistent replications
- Machine to Machine
  - Problems due to different computing architectures
Data Characterization
After obtaining all the data streams, the nature of each data stream must be characterized
This is not the same as the data format (i.e. field names and lengths)
Data Characterization
Detail/Aggregation Level (Granularity)
- all variables fall somewhere between detailed (e.g. transaction records) and aggregated (e.g. summaries)
- in general, detailed data is preferred for data mining
- the level of detail available in a data set determines the level of detail that is possible in the output
- usually the level of detail of the input stream must be at least one level below that required of the output stream
Data Characterization
Consistency
- Inconsistency can defeat any modeling technique until it is discovered and corrected
- different things may have the same name in different systems
- the same thing may be represented by different names in different systems
- inconsistent data may be entered in a field in a single system, e.g. auto_type: Merc, Mercedes, M-Benz, Mrcds
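One common way to repair this kind of inconsistency is a lookup table from known variants to a single canonical value. The sketch below assumes such a table for the `auto_type` example; the canonical spelling and the extra misspelt input are invented for illustration.

```python
# Hypothetical lookup table from known variants to one canonical value.
CANONICAL_AUTO_TYPE = {
    "merc": "Mercedes-Benz",
    "mercedes": "Mercedes-Benz",
    "m-benz": "Mercedes-Benz",
    "mrcds": "Mercedes-Benz",
}


def canonical_auto_type(raw):
    """Map a raw auto_type entry to its canonical form; None means 'needs review'."""
    return CANONICAL_AUTO_TYPE.get(raw.strip().lower())


raw_values = ["Merc", "Mercedes", "M-Benz", "Mrcds", "Mercedez"]
cleaned = [canonical_auto_type(v) for v in raw_values]
# Values the table does not know are collected rather than silently guessed.
unmapped = [v for v, c in zip(raw_values, cleaned) if c is None]
```

Returning `None` for unknown spellings keeps the inconsistency visible, in line with the point that it must be discovered before it can be corrected.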
Data Characterization
Pollution
Data pollution can come from many sources. One of the most common is when users attempt to stretch a system beyond its intended functionality, e.g.
- “B” in a gender field, intended to represent “Business”. The field was originally intended to only ever be “M” or “F”.
Data Characterization
Pollution
Other sources include:
copying errors (especially when format incorrectly specified)
human resistance - operators may enter garbage if they can’t see why they should have to type in all this “extra” data
Data Characterization
Objects
- precise nature of object being measured by the data must be understood
- e.g. what is the difference between “consumer spending” and “consumer buying patterns”?
Data Characterization
Domain
- Every variable has a domain: a range of permitted values
- Summary statistics and frequency counts can be used to detect erroneous values outside the domain
- Some variables have conditional domains, violations of which are harder to detect
  - e.g. in a medical database a diagnosis of ovarian cancer is conditional on the gender of the patient being female
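Both kinds of domain check can be sketched briefly. The patient records, field names and the rule table below are invented for the example; a real system would hold many more conditional rules.

```python
from collections import Counter

# Hypothetical patient records; field names and codes are invented for the example.
patients = [
    {"id": 1, "gender": "F", "diagnosis": "ovarian cancer"},
    {"id": 2, "gender": "M", "diagnosis": "asthma"},
    {"id": 3, "gender": "B", "diagnosis": "asthma"},          # outside the domain
    {"id": 4, "gender": "M", "diagnosis": "ovarian cancer"},  # conditional violation
]

GENDER_DOMAIN = {"M", "F"}
# Diagnoses whose validity depends on the value of another field.
CONDITIONAL_DOMAIN = {"ovarian cancer": {"F"}}

# A frequency count makes unexpected values easy to spot.
gender_counts = Counter(p["gender"] for p in patients)

out_of_domain = [p["id"] for p in patients if p["gender"] not in GENDER_DOMAIN]

conditional_violations = [
    p["id"]
    for p in patients
    if p["diagnosis"] in CONDITIONAL_DOMAIN
    and p["gender"] not in CONDITIONAL_DOMAIN[p["diagnosis"]]
]
```

Record 4 passes the simple domain check on every field and is caught only by the conditional rule, which is why such violations are harder to detect.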
Data Characterization
Default values
- if the system has default values for fields, this must be known. Conditional defaults can create apparently significant patterns which in fact represent a lack of data
Integrity
- Checking integrity evaluates the relationships permitted between variables
- e.g. an employee may have multiple cars, but is unlikely to be allowed to have multiple employee numbers
- related to the domain issue
Data Characterization
Duplicate or redundant variables
- redundant data can easily result from the merging of data streams
- occurs when essentially identical data appears in multiple variables, e.g. “date_of_birth”, “age”
- if not actually identical, will still slow building of model
- if actually identical can cause significant numerical computation problems for some models - even causing crashes
Extracting Part of the Available Data
In most cases original data sets would be too large to handle as a single entity. There are two ways of handling this problem:
- Limit the scope of the problem
  - concentrate on particular products, regions, time frames, dollar values etc. OLAP can be used for such limiting
  - if no pre-defined ideas exist, use tools such as Self-Organising Neural Networks to obtain an initial understanding of the structure of the data
- Obtain a representative sample of the data
  - Similar to statistical sampling
Extracting Part of the Available Data
Once an entity of interest is identified via initial analysis, one can follow the lead and request more information (“walking the data”)
Process of Data Access
Some problems one may encounter: copyright, security, limited front-end menu facilities
[Figure: flow diagram of the data access process. Box labels: data source, query data source, obtain sample, temporary repository, apply filters, clustering, refining, data mining tool, request for updates.]
Some Useful Operations During Data Access / Preparation
Capitalization
- convert all text to upper- or lowercase. This helps to avoid problems due to case differences in different occurrences of the same data (e.g. the names of people or organizations)
Some Useful Operations During Data Access / Preparation
Concatenation
- combine data spread across multiple fields, e.g. names, addresses. The aim is to produce a unique representation of the data object
Some Useful Operations During Data Access / Preparation
Representation formats
- some sorts of data come in many formats, e.g. dates: 12/05/93, 05-Dec-93
- transform all to a single, simple format
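As a sketch of transforming dates to a single format, the Python below tries a short list of known input formats and emits ISO 8601 (`YYYY-MM-DD`). Reading `12/05/93` as month/day/year is an assumption; it is the reading under which the two example dates agree. The choice of ISO 8601 as the target is also an assumption.

```python
from datetime import datetime

# Candidate input formats, tried in order. Treating 12/05/93 as month/day/year
# is an assumption; a day/month/year source would need "%d/%m/%y" instead.
FORMATS = ["%m/%d/%y", "%d-%b-%y"]


def to_iso(raw):
    """Return the date as YYYY-MM-DD, or None if no known format matches."""
    # Tolerate stray spaces such as "05 - Dec- 93".
    compact = raw.replace(" ", "")
    for fmt in FORMATS:
        try:
            return datetime.strptime(compact, fmt).date().isoformat()
        except ValueError:
            continue
    return None


normalised = [to_iso(s) for s in ["12/05/93", "05-Dec-93", "not a date"]]
```

Note that `%b` matches month abbreviations in the current locale, and two-digit years are expanded by a fixed rule, so both are worth checking against the real data before relying on a converter like this.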
Some Useful Operations During Data Access / Preparation
Augmentation
- remove extraneous characters e.g. !&%$#@ etc.
Abstraction
- it can sometimes be useful to reduce the information in a field to simple yes/no values: e.g. flag people as having a criminal record rather than having a separate category for each possible crime
Some Useful Operations During Data Access / Preparation
Unit conversion
- choose a standard unit for each field and enforce it: e.g. yards, feet -> metres
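A minimal sketch of enforcing metres as the standard unit follows. The unit codes (`yd`, `ft`, `m`) and the sample measurements are assumptions; the conversion factors are the exact international definitions of the yard and the foot.

```python
# Exact conversion factors to metres (international yard and foot).
TO_METRES = {"m": 1.0, "yd": 0.9144, "ft": 0.3048}


def to_metres(value, unit):
    """Convert a length to metres; unknown units raise an error instead of passing through."""
    try:
        return value * TO_METRES[unit]
    except KeyError:
        raise ValueError(f"unknown unit: {unit!r}") from None


# Hypothetical measurements recorded in mixed units.
measurements = [(100, "yd"), (300, "ft"), (91.44, "m")]
in_metres = [round(to_metres(v, u), 2) for v, u in measurements]
```

Raising an error for an unrecognised unit is one way to "enforce" the standard: a value in an unexpected unit cannot slip into the converted field unnoticed.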
Some Useful Operations During Data Access / Preparation
Exclusion
- data processing takes up valuable computation time, so one should exclude unnecessary or unwanted fields where possible
- fields containing bad, dirty or missing data may also be removed
Some Data Integration Issues
Multi source
- Oracle, FoxPro, Excel, Informix etc.
- ODBC / DW helps
Multiformat
- relational databases, hierarchical structures, XML, HTML, free text, etc.
Multiplatform
- DOS, UNIX, etc.
Multisecurity
- copyright, personal records, government data, etc.
Some Data Integration Issues
Multimedia
- text, images, audio, video, etc.
- Cleaning might be required when inconsistent
Some Data Integration Issues
Multilocation
- LAN, WAN, dial-up connections, etc.
Multiquery
- whether query format is consistent across data sets
  - again, database drivers useful here
- whether multiple extractions are possible, i.e. whether a large number of extractions is possible
  - some systems do not allow batch extractions, have to obtain records individually, etc.
Data Modeling
- Motivation
- Ten Golden Rules
- Object modeling
- Data Abstraction
- Working with Meta Data
Modeling Data for Data Mining
A major reason for preparing data is so that mining can discover models
What is modeling?
- it is assumed that the data set (available or obtainable) contains information that would be of interest if only we could understand what was in it
- Since we don’t understand the information that is in the data just by looking at it, some tool is needed which will turn the information lurking in the data set into an understandable form
Modeling Data for Data Mining
The object is to transform the raw data structure into a format that can be used for mining
Modeling Data for Data Mining
The models created will determine the type of results that can be discovered during the analysis
Modeling Data for Data Mining
With most current data mining tools, the analyst has to have some idea what type of patterns can be identified during the analysis, and model the data to suit these requirements
Modeling Data for Data Mining
If the data is not properly modeled, important patterns may go undetected, thus undermining the likelihood of success
Modeling Data for Data Mining
To make a model is to express the relationships governing how a change in a variable or set of variables (inputs) affects another variable or set of variables (outputs)
we also want information about the reliability of these relationships
the expression of the relationships may have many forms: charts, graphs, equations, computer programs
Ten Golden Rules for Building Models
1. Select clearly defined problems that will yield tangible benefits
2. Specify the required solution
3. Define how the solution is going to be used
4. Understand as much as possible about the problem and the data set (the domain)
5. Let the problem drive the modeling (i.e. tool selection, data preparation, etc.)
Ten Golden Rules for Building Models
6. State any assumptions
7. Refine the model iteratively
8. Make the model as simple as possible - but no simpler (paraphrasing Einstein)
9. Define instability in the model (critical areas where change in output is very large for small changes in inputs)
10. Define uncertainty in the model (critical areas and ranges in the data set where the model produces low confidence predictions/insights)
Object Modeling
The main approach to data modeling assumes an object-oriented framework, where information is represented as objects, their descriptive attributes, and relationships that exist between object classes.
Object Modeling
Examples of object classes
- Credit ratings of customers can be checked
- Contracts can be renewed
- Telephone calls can be billed
Object Modeling
Identifying attributes
- In a medical database system, the class patient may have the attributes height, weight, age, gender, etc.
Data Abstraction
Information can be abstracted such that the analyst can initially get an overall picture of the data and gradually expand in a top-down manner
Will also permit processing of more data
Can be used to identify patterns that can only be seen in grouped data, e.g. group patients into broad age groups (0-10, 10-20, 20-30, etc.)
Clustering can be used to fully or partially automate this process
Working With Metadata
Traditional definition of metadata is “data about data”
Some data miners include “data within data” in the definition
Example: Deriving metadata from dates:
- identifying seasonal sales trends
- identifying pivot points for some activity, e.g. happens on the 2nd Sunday of July
Note: “July 4th, 1976” is potentially: 7th MoY, 4th DoM, 1976, Sunday, 1st DoW, 186 DoY, 1st Qtr FY etc.
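The fields in this note can be derived mechanically from the date. The sketch below uses Python's standard `datetime` module; the function name, the dictionary keys, the Sunday-first week numbering and the July start of the fiscal year are assumptions chosen so that the output lines up with the example.

```python
from datetime import date


def date_metadata(d, fiscal_year_start_month=7):
    """Derive calendar fields from a date.

    fiscal_year_start_month is an assumption: a July start puts July in the
    first fiscal quarter, which matches the slide's example.
    """
    months_into_fy = (d.month - fiscal_year_start_month) % 12
    return {
        "month_of_year": d.month,
        "day_of_month": d.day,
        "year": d.year,
        "day_name": d.strftime("%A"),  # locale-dependent name of the weekday
        # isoweekday() is 1=Monday..7=Sunday; shift so that Sunday=1.
        "day_of_week": d.isoweekday() % 7 + 1,
        "day_of_year": d.timetuple().tm_yday,
        "fiscal_quarter": months_into_fy // 3 + 1,
    }


meta = date_metadata(date(1976, 7, 4))
```

Each derived field can then be used as an ordinary variable, for example grouping sales by `day_of_week` or `fiscal_quarter` to look for seasonal trends.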
Working With Metadata
Metadata can also be derived from
- ID numbers
- passport numbers
- driving licence numbers
- post codes
- etc.
Working With Metadata
Data can be modeled to make use of these
Example: Metadata derived from addresses and names
- identify the general make up of a shop’s clients
- e.g. correlate addresses with map data to determine the distance customers travel to come to the shop
The End