BAS 250 Lecture 2
BAS 250 Lesson 2: Data Preparation
Explain concepts and purpose of Data Preparation
Understand solutions for handling missing and inconsistent data
Utilize data and attribute reduction techniques
Effectively work in RapidMiner to prepare your data.
This Week’s Learning Objectives
The Data Mining Process: CRISP-DM
– Join data sets that are needed for your
analysis
– Reduce data sets to only include pertinent
variables
– Scrub data to remove anomalies, such as outliers or
missing data
– Reformat for consistency and effective use
3. Data Preparation
Ensure robustness of data
o Combine 2 or more data sets to create a “mini-database”
with all variables needed for analysis in one place.
o Merge by a unique identifier common to both data sets,
often called a “Key Identifier”, “Common ID”, “ID Number”, etc.
Example: Social Security Number (links Medical and Insurance data)
Data Preparation Example: Sources of Data
Customer Purchases - “Point of Sale data” - CSV file format
Cost of Products Sold - “Accounting department” - Excel file format
Inventory of Products - “IT Data Warehouse” - XML file format
Merge By Product ID or SKU
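The merge above can be sketched outside RapidMiner in plain Python. This is only an illustration, not the course's method: the three small tables, the field names (`sku`, `units_sold`, `unit_cost`, `on_hand`), and the helper `merge_by_key` are all hypothetical, and the sketch assumes each SKU appears at most once per table.

```python
# Hypothetical records from three sources, each keyed by a product ID (SKU).
purchases = [  # point-of-sale data (would come from a CSV file)
    {"sku": "A100", "units_sold": 3},
    {"sku": "B200", "units_sold": 1},
]
costs = [  # accounting data (would come from an Excel file)
    {"sku": "A100", "unit_cost": 2.50},
    {"sku": "B200", "unit_cost": 7.00},
]
inventory = [  # data warehouse extract (would come from an XML file)
    {"sku": "A100", "on_hand": 40},
    {"sku": "C300", "on_hand": 12},
]


def merge_by_key(key, *tables):
    """Inner join: keep only key values that appear in every table."""
    # Index each table by the key so rows can be looked up directly.
    indexed = [{row[key]: row for row in table} for table in tables]
    common = set(indexed[0]).intersection(*indexed[1:])
    merged = []
    for value in sorted(common):
        combined = {}
        for index in indexed:
            combined.update(index[value])  # add this table's attributes
        merged.append(combined)
    return merged


merged = merge_by_key("sku", purchases, costs, inventory)
print(merged)
```

Only SKU A100 appears in all three sources, so it is the only row in the merged result. Keeping unmatched rows instead (an outer join) is a design choice that depends on the analysis.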
Data Reduction has two parts:
o Observations (rows, records, instances, etc.)
o Attributes (variables, columns, etc.)
Attribute reduction filters out irrelevant or uninteresting
attributes without completely removing them from the original
data set.
Even if a variable isn’t interesting for answering some
questions, it may still be useful in others.
It is recommended to import all attributes first, then filter as necessary
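A minimal Python sketch of this idea, assuming a small hypothetical table and a hypothetical helper `select_attributes`: the filtered view is a new object, so the original rows keep every attribute for later questions.

```python
# Hypothetical data set imported with all of its attributes.
rows = [
    {"id": 1, "age": 34, "gender": "F", "favorite_color": "blue"},
    {"id": 2, "age": 51, "gender": "M", "favorite_color": "red"},
]


def select_attributes(rows, keep):
    """Return new rows containing only the attributes named in `keep`."""
    return [{name: row[name] for name in keep} for row in rows]


# Filter for one analysis; `rows` itself is left untouched.
reduced = select_attributes(rows, ["id", "age"])
print(reduced)
```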
Observation Reduction…
Observation reduction decreases the number of observations to create a smaller
data set.
Some reasons to do so:
o Create a sample set for: Training data, proof of concept analysis, testing theories, sharing data
o Improve analysis speed or process time
o Data scrubbing for outliers, missing values, etc.
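One common way to create a sample set is simple random sampling. The sketch below is an assumption-laden illustration in Python, not a RapidMiner procedure: the 1,000 stand-in rows, the 10% fraction, the seed, and the helper `sample_observations` are all hypothetical.

```python
import random

observations = list(range(1, 1001))  # stand-in for 1,000 rows of data


def sample_observations(rows, fraction, seed=42):
    """Draw a random sample without replacement; the seed makes it repeatable."""
    rng = random.Random(seed)
    k = round(len(rows) * fraction)
    return rng.sample(rows, k)


training_sample = sample_observations(observations, 0.10)
print(len(training_sample))
```

Fixing the seed means the same sample can be recreated later, which helps when sharing data or documenting the process.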
Ensure consistency of data
o Missing information
o Spelling errors, typos
o Multiple responses for an attribute
o Characters in numeric fields and vice versa
KEY: Missing data is data that does not exist in a data set
• Not the same as zero or some other value
• In a data set, it is blank and the value is unknown
• Sometimes referred to as null values
• Depending on your objective and the circumstance, you may choose
to leave missing data as is or replace it with some other value
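To see why missing is not the same as zero, here is a small Python sketch. The income figures are hypothetical, and using `None` to mark a missing value is an assumption of this example.

```python
# Hypothetical incomes; None marks a missing (unknown) value, 0 is a real value.
incomes = [52000, 0, None, 61000, None]

missing_count = sum(1 for v in incomes if v is None)
zero_count = sum(1 for v in incomes if v is not None and v == 0)

# Mean over known values only (missing values excluded).
known = [v for v in incomes if v is not None]
mean_known = sum(known) / len(known)

# Mean if missing values are wrongly treated as zero.
mean_if_zero = sum(v or 0 for v in incomes) / len(incomes)

print(missing_count, zero_count, mean_known, mean_if_zero)
```

Treating the two unknown incomes as zero pulls the average down, which is why the two cases should be kept apart.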
KEY: Inconsistent data is different from missing data
• Occurs when a value exists but is not valid
or meaningful.
• Common examples: “.” or “zero” entered where no valid value applies
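The difference between missing and inconsistent values can be sketched in Python for a numeric column read as text. The sample cells and the helper `classify` are hypothetical. Note that a zero used as a placeholder parses as a valid number, so spotting it takes knowledge of the data rather than a parsing check like this one.

```python
def classify(raw):
    """Label a raw text cell from a column that should be numeric."""
    text = raw.strip()
    if text == "":
        return "missing"        # blank: the value does not exist
    try:
        float(text)
    except ValueError:
        return "inconsistent"   # a value exists but is not a valid number
    return "valid"


raw_ages = ["34", "", ".", "abc", "51"]  # hypothetical cells
labels = [classify(cell) for cell in raw_ages]
print(labels)
```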
Replace or remove missing or inconsistent data
• For numeric data…
• Can be replaced using Measures of Central Tendency
• Mean, Median, and Mode
• Mean - Average value
• Median - Middle value
• Mode - Most frequent or common value
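A short Python sketch of replacing missing numeric values with a measure of central tendency, using the standard `statistics` module. The ages, the use of `None` for missing values, and the helper `impute` are assumptions made for illustration.

```python
import statistics

# Hypothetical ages; None marks a missing value.
ages = [23, 35, None, 35, 60, None]
known = [a for a in ages if a is not None]  # [23, 35, 35, 60]

fill_values = {
    "mean": statistics.mean(known),      # average value
    "median": statistics.median(known),  # middle value
    "mode": statistics.mode(known),      # most frequent value
}


def impute(values, fill):
    """Return a copy of `values` with each missing entry replaced by `fill`."""
    return [fill if v is None else v for v in values]


imputed_with_median = impute(ages, fill_values["median"])
print(fill_values, imputed_with_median)
```

Which measure to use is a judgment call. The median, for example, is less affected by outliers than the mean.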
Replace or remove missing or inconsistent data
• For character data…
• Can be replaced using Best Estimated Value
• “Like Others”
• Ex. All males in the data like bass fishing. If attribute “Fish Type” is blank and
attribute “Gender” equals male, then fill in “Bass”
• “Clustering Techniques”
• “Best Guess”
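The “Like Others” idea can be sketched in Python as filling a blank with the most common value among rows in the same group. This is one possible reading of the technique, not a definitive implementation: the rows, the attribute names, and the helper `fill_like_others` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical survey rows; None marks a blank "Fish Type".
rows = [
    {"gender": "male", "fish_type": "Bass"},
    {"gender": "male", "fish_type": "Bass"},
    {"gender": "female", "fish_type": "Trout"},
    {"gender": "male", "fish_type": None},
    {"gender": "female", "fish_type": None},
]


def fill_like_others(rows, group_attr, target_attr):
    """Fill blanks in `target_attr` with the most common value in the same group."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[target_attr] is not None:
            counts[row[group_attr]][row[target_attr]] += 1
    filled = []
    for row in rows:
        new_row = dict(row)  # copy so the original rows are not changed
        group_counts = counts[new_row[group_attr]]
        if new_row[target_attr] is None and group_counts:
            new_row[target_attr] = group_counts.most_common(1)[0][0]
        filled.append(new_row)
    return filled


filled = fill_like_others(rows, "gender", "fish_type")
print(filled)
```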
• Replacing missing or inconsistent values found in data
should be done:
• With intention, not haphazardly
• Use common sense
• Be transparent
It is recommended to always document your missing or inconsistent data processes.
This course is a practical application course in Data Mining. Learning to use
RapidMiner is required.
If you have not done so yet, please plan to walk through the tutorial examples in
RapidMiner.
To assist you in understanding RapidMiner, I will take screenshots of what I am
doing to get the results we are looking for.
RapidMiner is pretty intuitive. You will get it quickly.
Basics of RapidMiner
Types of files that can be imported into RapidMiner:
o CSV File
o Excel File
o XML File
o Access Database Table
o … and much more
We mainly use CSV files, which contain Comma Separated Values. Be mindful if your data set
contains commas within its values.
o Alternative delimiters can be selected in this case: Tab, Semicolon, Pipe ( | ), etc.
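The comma problem is easy to reproduce outside RapidMiner. The Python sketch below uses the standard `csv` module with hypothetical sample text: a value that contains a comma breaks a naive split, but reads correctly when it is quoted or when a different delimiter such as the pipe is used.

```python
import csv
import io

# Hypothetical file contents: the name value contains a comma.
comma_text = 'name,city\n"Smith, Jane",Raleigh\n'   # comma-delimited, value quoted
pipe_text = "name|city\nSmith, Jane|Raleigh\n"      # pipe-delimited, no quoting needed


def read_rows(text, delimiter=","):
    """Parse delimited text into a list of rows (each row is a list of fields)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


comma_rows = read_rows(comma_text)
pipe_rows = read_rows(pipe_text, delimiter="|")

# A naive split on commas wrongly produces three fields instead of two.
naive_fields = "Smith, Jane,Raleigh".split(",")

print(comma_rows, pipe_rows, naive_fields)
```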
Basics of RapidMiner
Three main areas that contain useful tools in
RapidMiner:
o Operators – Every possible task you can think of
o Repositories – Where you store your data
o Parameters – Task setup details
Basics of RapidMiner
Explain concepts and purpose of Data Preparation
Understand solutions for handling missing and inconsistent data
Utilize data and attribute reduction techniques
Effectively work in RapidMiner to prepare your data.
Summary
“This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s Employment and
Training Administration. The solution was created by the grantee and does not necessarily reflect the official
position of the U.S. Department of Labor. The Department of Labor makes no guarantees, warranties, or
assurances of any kind, express or implied, with respect to such information, including any information on linked
sites and including, but not limited to, accuracy of the information or its completeness, timeliness, usefulness,
adequacy, continued availability, or ownership.”
Except where otherwise stated, this work by Wake Technical Community College Building Capacity in Business
Analytics, a Department of Labor, TAACCCT funded project, is licensed under the Creative Commons Attribution
4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
Copyright Information