bas 250 lecture 2

BAS 250 Lesson 2: Data Preparation

This Week’s Learning Objectives

Explain concepts and purpose of Data Preparation

Understand solutions for handling missing and inconsistent data

Utilize data and attribute reduction techniques

Effectively work in RapidMiner to prepare your data.

The Data Mining Process: CRISP-DM

3. Data Preparation

– Join data sets that are needed for your analysis
– Reduce data sets to only include pertinent variables
– Scrub data to remove anomalies: outliers or missing data
– Reformat for consistency and effective use

Ensure robustness of data

o Combine two or more data sets to create a “mini-database” with all variables needed for analysis in one place.
o Merge by a unique identifier common to both data sets: “Key Identifier”, “Common ID”, “ID Number”, etc.

Example: Social Security Number (links Medical and Insurance)

Data Preparation

Data Preparation Example: Sources of Data

Customer Purchases – “Point of Sale data” – CSV file format
Cost of Products Sold – “Accounting department” – Excel file format
Inventory of Products – “IT Data Warehouse” – XML file format

Merge by Product ID or SKU
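The merge described above can be sketched in a few lines of Python. This is only an illustration, not the RapidMiner workflow: the sample rows, the field names (`product_id`, `units_sold`, `unit_cost`, `on_hand`) and the helper `merge_by_product_id` are all hypothetical, and the three sources are shown as in-memory data rather than real CSV, Excel, and XML files.

```python
# Hypothetical sample data standing in for the three sources.
purchases = [
    {"product_id": "A100", "units_sold": 3},
    {"product_id": "B200", "units_sold": 1},
    {"product_id": "C300", "units_sold": 5},
]
costs = {"A100": 2.50, "B200": 10.00}             # product_id -> unit cost
inventory = {"A100": 40, "B200": 0, "C300": 12}   # product_id -> units on hand


def merge_by_product_id(purchases, costs, inventory):
    """Attach cost and inventory to each purchase row via the shared key."""
    merged = []
    for row in purchases:
        key = row["product_id"]
        merged.append({
            **row,
            # .get() leaves None (a missing value) when a source lacks the key.
            "unit_cost": costs.get(key),
            "on_hand": inventory.get(key),
        })
    return merged


merged = merge_by_product_id(purchases, costs, inventory)
for row in merged:
    print(row)
```

Product C300 has no cost record in this made-up data, so its `unit_cost` comes out as a missing value. Merging often exposes gaps like this that must be handled later in data preparation.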

Data Reduction: two parts

o Observations (rows, records, instances, etc.)
o Attributes (variables, columns, etc.)

Data Preparation

Attribute reduction filters out irrelevant or uninteresting data without completely removing them from the original set.

Even if a variable isn’t interesting for answering some questions, it may still be useful for others.

It is recommended to import all attributes first, then filter as necessary.
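As a rough sketch of "import everything, then filter", the Python below builds a reduced view of a data set while leaving the original untouched. The rows, attribute names, and the helper `select_attributes` are invented for illustration; in RapidMiner this would be done with an operator rather than code.

```python
# Hypothetical data set with all attributes imported.
full_data = [
    {"customer_id": 1, "age": 34, "zip": "27603", "favorite_color": "blue"},
    {"customer_id": 2, "age": 51, "zip": "27513", "favorite_color": "red"},
]


def select_attributes(rows, keep):
    """Return new rows holding only the attributes in `keep`.

    The input rows are not modified, so filtered-out attributes
    remain available for other questions.
    """
    return [{name: row[name] for name in keep} for row in rows]


reduced = select_attributes(full_data, ["customer_id", "age"])
print(reduced)
print(sorted(full_data[0]))  # the original still carries every attribute
```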

Data Preparation

Observation Reduction

Observation reduction reduces the number of observations to create a smaller data set.

Some reasons to do so:

o Create a sample set for training data, proof-of-concept analysis, testing theories, or sharing data
o Improve analysis speed or processing time
o Scrub data for outliers, missing values, etc.
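One simple way to create a sample set is random sampling without replacement. The sketch below assumes a made-up data set of 100 observations and a hypothetical helper `sample_observations`; the fixed seed is an assumption added so the same sample can be reproduced and documented.

```python
import random

# Hypothetical data set of 100 observations.
observations = [{"id": i, "value": i * 10} for i in range(1, 101)]


def sample_observations(rows, n, seed=42):
    """Draw n distinct observations at random; the seed makes it repeatable."""
    rng = random.Random(seed)
    return rng.sample(rows, n)


sample = sample_observations(observations, 10)
print(len(sample), "of", len(observations), "observations kept")
```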

Data Preparation

Ensure consistency of data

o Missing information
o Spelling errors, typos
o Multiple responses for an attribute
o Characters in numeric fields and vice versa

Data Preparation



KEY: Missing data is data that does not exist in a data set

• Not the same as zero or some other value
• In a data set, it is blank and the value is unknown
• Sometimes referred to as null values
• Depending on your objective and the circumstances, you may choose to leave missing data as they are or replace them with some other value
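To make the difference between "missing" and "zero" concrete, here is a small Python sketch. The CSV text, the `income` column, and the helper `parse_income` are hypothetical. Blank cells are kept as `None` (Python's null value) instead of being silently turned into 0.

```python
import csv
import io

# Hypothetical CSV: customer 2 has a blank income, customer 3 has a true zero.
raw = "id,income\n1,52000\n2,\n3,0\n"


def parse_income(text):
    """Blank cell -> None (missing, value unknown); otherwise a number."""
    return None if text.strip() == "" else float(text)


rows = [
    {"id": r["id"], "income": parse_income(r["income"])}
    for r in csv.DictReader(io.StringIO(raw))
]

missing_ids = [r["id"] for r in rows if r["income"] is None]
print("Missing income for ids:", missing_ids)
```

Customer 3 is not reported as missing: a zero is a known value, while a blank is unknown.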


Data Preparation

KEY: Inconsistent data is different from missing data

• Occurs when a value does exist but is not valid or meaningful.
• Common examples: “.” or “zero”
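A minimal sketch of telling missing, inconsistent, and valid entries apart in a field that should be numeric. The function name `classify_numeric` and the three labels are assumptions made for illustration. Whether a numeric 0 is meaningful depends on the attribute, so this check does not try to decide that.

```python
def classify_numeric(text):
    """Label a raw cell from a numeric field.

    'missing'      -> the cell is blank
    'inconsistent' -> something was entered, but it is not a number
    'valid'        -> the cell parses as a number
    """
    cleaned = text.strip()
    if cleaned == "":
        return "missing"
    try:
        # Note: float() also accepts text such as "nan" or "inf",
        # so a stricter check may be needed for real data.
        float(cleaned)
    except ValueError:
        return "inconsistent"
    return "valid"


for cell in ["12.5", "", ".", "zero"]:
    print(repr(cell), "->", classify_numeric(cell))
```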


Data Preparation

Replace or remove missing or inconsistent data

• For numeric data…
• Can be replaced using Measures of Central Tendency: Mean, Median, and Mode
  • Mean - Average value
  • Median - Middle value
  • Mode - Most frequent or common value
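The three measures can be compared side by side with Python's standard `statistics` module. The `ages` list and the helper `impute` are hypothetical; missing values are represented as `None`, and each measure is computed from the known values only.

```python
import statistics

# Hypothetical attribute with two missing values.
ages = [23, 31, None, 31, 45, None, 28]


def impute(values, method="mean"):
    """Replace None with the mean, median, or mode of the known values."""
    known = [v for v in values if v is not None]
    measures = {
        "mean": statistics.mean,
        "median": statistics.median,
        "mode": statistics.mode,
    }
    fill = measures[method](known)
    return [fill if v is None else v for v in values]


for method in ("mean", "median", "mode"):
    print(method, impute(ages, method))
```

The mean is pulled toward extreme values, so the median is often the safer replacement when outliers are present.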


Data Preparation

Replace or remove missing or inconsistent data

• For character data…
• Can be replaced using Best Estimated Value
  • “Like Others”
    Ex. All males in the data like bass fishing. If attribute “Fish Type” is blank and attribute “Gender” equals male, then fill in “Bass”.
  • “Clustering Techniques”
  • “Best Guess”
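The “Like Others” idea from the bass fishing example can be sketched as: fill a blank with the most common value among observations that share another attribute. The rows and the helper `fill_like_others` below are made up for illustration, and this is only one possible reading of the technique.

```python
from collections import Counter, defaultdict

# Hypothetical survey rows; None marks a blank "fish_type".
people = [
    {"gender": "male", "fish_type": "Bass"},
    {"gender": "male", "fish_type": "Bass"},
    {"gender": "male", "fish_type": None},
    {"gender": "female", "fish_type": "Trout"},
    {"gender": "female", "fish_type": None},
]


def fill_like_others(rows, group_key, target_key):
    """Fill blanks in target_key with the most common value in the same group."""
    counts = defaultdict(Counter)
    for row in rows:
        if row[target_key] is not None:
            counts[row[group_key]][row[target_key]] += 1

    filled = []
    for row in rows:
        new_row = dict(row)  # copy, so the original data stays unchanged
        group_counts = counts[new_row[group_key]]
        if new_row[target_key] is None and group_counts:
            new_row[target_key] = group_counts.most_common(1)[0][0]
        filled.append(new_row)
    return filled


filled = fill_like_others(people, "gender", "fish_type")
for row in filled:
    print(row)
```

If a group has no known values at all, the blank is left as is instead of guessing.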


Data Preparation

• Replacing missing or inconsistent values found in data should be done:
  • With intention, not haphazardly
  • Using common sense
  • Transparently

It is recommended to always document your missing or inconsistent data processes.

This course is a practical application course in Data Mining. Learning to use RapidMiner is required.

If you have not done so yet, please plan to walk through the tutorial examples in RapidMiner.

To assist you in understanding RapidMiner, I will take screenshots of what I am doing to get the results we are looking for.

RapidMiner is pretty intuitive. You will get it quickly.

Basics of RapidMiner

Types of files that can be imported into RapidMiner:

o CSV File

o Excel File

o XML File

o Access Database Table

o … and much more

We mainly use CSV files, which contain comma-separated values. Be mindful if your data set itself contains commas.

o Alternative delimiters can be selected in this case: Tab, Semicolon, Pipe ( | ), etc.
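To see why commas inside the data matter, the Python sketch below reads the same made-up record two ways using the standard `csv` module: once as a comma-delimited file where the value containing a comma is quoted, and once with a pipe delimiter. The sample text and the helper `read_rows` are hypothetical.

```python
import csv
import io

# The name "Smith, John" contains a comma.
comma_text = 'name,city\n"Smith, John",Raleigh\n'   # comma file: value is quoted
pipe_text = "name|city\nSmith, John|Raleigh\n"      # pipe file: no quoting needed


def read_rows(text, delimiter=","):
    """Parse delimited text into a list of rows (each row is a list of cells)."""
    return list(csv.reader(io.StringIO(text), delimiter=delimiter))


comma_rows = read_rows(comma_text)
pipe_rows = read_rows(pipe_text, delimiter="|")
print(comma_rows)
print(pipe_rows)
```

Both versions yield two cells per row. Without the quotes or the alternative delimiter, the name would be split into two separate cells.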


Three main areas that contain useful tools in RapidMiner:

o Operators – Every possible task you can think of
o Repositories – Where you store your data
o Parameters – Task set up details


Summary

Explain concepts and purpose of Data Preparation

Understand solutions for handling missing and inconsistent data

Utilize data and attribute reduction techniques

Effectively work in RapidMiner to prepare your data.

Copyright Information

“This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s Employment and Training Administration. The solution was created by the grantee and does not necessarily reflect the official position of the U.S. Department of Labor. The Department of Labor makes no guarantees, warranties, or assurances of any kind, express or implied, with respect to such information, including any information on linked sites and including, but not limited to, accuracy of the information or its completeness, timeliness, usefulness, adequacy, continued availability, or ownership.”

Except where otherwise stated, this work by Wake Technical Community College Building Capacity in Business Analytics, a Department of Labor, TAACCCT funded project, is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/
