BAS 250 Lesson 2: Data Preparation

Upload: wake-tech-bas

Post on 14-Apr-2017


Page 1: BAS 250 Lecture 2

BAS 250 Lesson 2: Data Preparation

Page 2: BAS 250 Lecture 2

Explain concepts and purpose of Data Preparation

Understand solutions for handling missing and inconsistent data

Utilize data and attribute reduction techniques

Effectively work in RapidMiner to prepare your data.

This Week’s Learning Objectives

Page 3: BAS 250 Lecture 2

The Data Mining Process: CRISP-DM

Page 4: BAS 250 Lecture 2

– Join the data sets needed for your analysis

– Reduce data sets to include only pertinent variables

– Scrub data to remove anomalies such as outliers or missing data

– Reformat for consistency and effective use

3. Data Preparation

Page 5: BAS 250 Lecture 2

Ensure robustness of data

o Combine two or more data sets to create a “mini-database” with all variables needed for analysis in one place.

o Merge by a unique identifier common to both data sets: “Key Identifier”, “Common ID”, “ID Number”, etc.

Example: Social Security Number (links Medical and Insurance data)

Data Preparation

Page 6: BAS 250 Lecture 2

Data Preparation Example: Sources of Data

Customer Purchases - “Point of Sale data” - CSV file format

Cost of Products Sold - “Accounting department” - Excel file format

Inventory of Products - “IT Data Warehouse” - XML file format

Merge By Product ID or SKU
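
To illustrate the merge outside of RapidMiner, here is a minimal Python sketch that uses only the standard library. The column names (`product_id`, `units_sold`, `unit_cost`, `on_hand`) and all sample values are hypothetical; only the idea of joining the sources on Product ID comes from the slide. The Excel and XML sources are assumed to be already loaded into dictionaries.

```python
import csv
import io

# Hypothetical point-of-sale extract (CSV), keyed by product_id.
SALES_CSV = """product_id,units_sold
A100,12
B200,5
C300,7
"""

# Hypothetical lookups standing in for the Excel and XML sources,
# assumed to be already loaded into dictionaries keyed by product_id.
COSTS = {"A100": 2.50, "B200": 10.00}
INVENTORY = {"A100": 40, "B200": 3, "C300": 15}


def merge_by_product_id(sales_csv, costs, inventory):
    """Inner-join the three sources on product_id."""
    merged = []
    for row in csv.DictReader(io.StringIO(sales_csv)):
        key = row["product_id"]
        # Keep only products present in every source (inner join).
        if key in costs and key in inventory:
            merged.append({
                "product_id": key,
                "units_sold": int(row["units_sold"]),
                "unit_cost": costs[key],
                "on_hand": inventory[key],
            })
    return merged


merged_rows = merge_by_product_id(SALES_CSV, COSTS, INVENTORY)
for record in merged_rows:
    print(record)
```

In this sketch product C300 is dropped because it has no cost record. Keeping it with a missing cost instead would be an outer join, which is a design choice to make deliberately.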

Page 7: BAS 250 Lecture 2

Data Reduction: two parts

o Observations (rows, records, instances, etc.)

o Attributes (variables, columns, etc.)

Data Preparation

Page 8: BAS 250 Lecture 2

Attribute reduction filters out irrelevant or uninteresting data without completely removing it from the original set.

Even if a variable isn’t interesting for answering some questions, it may still be useful for others.

It is recommended to import all attributes first, then filter as necessary.

Data Preparation
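
The idea of filtering attributes without removing them from the original set can be sketched in a few lines of Python. The data set and attribute names below are hypothetical, and this is only one possible way to do it, not how RapidMiner implements it.

```python
# Hypothetical data set: each dict is one observation (row).
ORIGINAL = [
    {"id": 1, "age": 34, "zip": "27603", "favorite_color": "blue"},
    {"id": 2, "age": 51, "zip": "27511", "favorite_color": "green"},
]


def select_attributes(rows, keep):
    """Return new rows containing only the attributes named in `keep`."""
    return [{name: row[name] for name in keep} for row in rows]


# Filter to the attributes needed for this question; ORIGINAL is untouched.
reduced = select_attributes(ORIGINAL, ["id", "age"])
print(reduced)
print(sorted(ORIGINAL[0]))
```

Because `select_attributes` builds new rows, the filtered-out attributes remain available in `ORIGINAL` for a later question.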

Page 9: BAS 250 Lecture 2

Observation Reduction…

Observation reduction reduces the number of observations to create a smaller data set.

Some reasons to do so:

o Create a sample set for training data, proof-of-concept analysis, testing theories, or sharing data

o Improve analysis speed or process time

o Data scrubbing for outliers, missing values, etc.

Data Preparation
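
One common way to reduce observations is a simple random sample. The sketch below assumes a hypothetical data set of 1,000 rows and an arbitrary seed; it is an illustration, not a recommended sample size.

```python
import random

# Hypothetical full data set of 1,000 observations.
observations = [{"id": i, "value": i * 2} for i in range(1000)]

# A seeded generator makes the sample reproducible (the seed is arbitrary).
rng = random.Random(42)

# Draw 100 observations without replacement.
sample = rng.sample(observations, k=100)

print(len(observations), len(sample))
```

Sampling without replacement means no observation appears twice in the smaller set, and the full data set is left unchanged.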

Page 10: BAS 250 Lecture 2

Ensure consistency of data

o Missing information

o Spelling errors, typos

o Multiple responses for an attribute

o Characters in numeric fields and vice-versa

Data Preparation

Page 11: BAS 250 Lecture 2

Ensure consistency of data

Data Preparation

KEY: Missing data is data that does not exist in a data set

• Not the same as zero or some other value

• In a dataset, it is blank and the value is unknown

• Sometimes referred to as null values

• Depending on your objective and the circumstances, you may choose to leave missing data as it is or replace it with some other value
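
A small Python sketch shows why missing is not the same as zero. The values are hypothetical, and `None` is used here as the marker for a missing value.

```python
from statistics import mean

# Hypothetical survey answers; None marks a missing (unknown) value.
purchases = [4, 0, None, 6, None]

# Exclude missing values: the mean of the known answers only.
known = [v for v in purchases if v is not None]
mean_known = mean(known)

# Treating missing as zero changes the answer.
as_zero = [0 if v is None else v for v in purchases]
mean_as_zero = mean(as_zero)

print(mean_known, mean_as_zero)
```

The genuine zero in the list is a real answer and stays in both calculations; only the unknown values are handled differently.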

Page 12: BAS 250 Lecture 2

Ensure consistency of data

Data Preparation

KEY: Inconsistent data is different from missing data

• Occurs when a value does exist but is not valid or meaningful.

• Common examples: “.” or “zero”
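
The difference between missing and inconsistent values can be sketched as a simple check on a numeric field. The `classify` function and the sample values are hypothetical, and the rule used (anything that cannot be parsed as a number is inconsistent) is an assumption for illustration.

```python
def classify(value):
    """Label a raw text value taken from a numeric field."""
    if value is None or value.strip() == "":
        return "missing"
    try:
        float(value)
    except ValueError:
        # A value exists but is not a usable number, e.g. "." or "fifty".
        return "inconsistent"
    return "valid"


# Hypothetical raw values for an "age" column read from a text file.
raw_ages = ["34", "", ".", "fifty", "51", None]
labels = [classify(v) for v in raw_ages]
print(labels)
```

Whether a zero counts as inconsistent depends on the attribute (an age of zero is suspicious, a purchase count of zero may be fine), so that rule would have to be added per field.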

Page 13: BAS 250 Lecture 2

Ensure consistency of data

Data Preparation

Replace or remove missing or inconsistent data

• For numeric data…

• Can be replaced using Measures of Central Tendency: Mean, Median, and Mode

• Mean - Average value

• Median - Middle value

• Mode - Most frequent or common value
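
As a minimal sketch of replacement with a measure of central tendency, the Python below computes all three measures from the known values and fills the gap with the median. The data is hypothetical, and choosing the median here is just an example.

```python
from statistics import mean, median, mode

# Hypothetical numeric attribute; None marks a missing value.
ages = [22, 25, 25, None, 30, 48]

known = [v for v in ages if v is not None]
fill_values = {
    "mean": mean(known),      # average value
    "median": median(known),  # middle value
    "mode": mode(known),      # most frequent value
}


def impute(values, fill):
    """Replace each missing value with `fill`, leaving the input unchanged."""
    return [fill if v is None else v for v in values]


ages_median = impute(ages, fill_values["median"])
print(fill_values)
print(ages_median)
```

The mean is pulled upward by the one large value (48), which is one reason the median is often preferred when the data contains outliers.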

Page 14: BAS 250 Lecture 2

Ensure consistency of data

Data Preparation

Replace or remove missing or inconsistent data

• For character data…

• Can be replaced using Best Estimated Value

• “Like Others”: fill in the value that similar records share. Ex. All males in the data like bass fishing, so if attribute “Fish Type” is blank and attribute “Gender” equals male, then use “Bass”.

• “Clustering Techniques”

• “Best Guess”
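
The “Like Others” approach can be sketched as filling a blank with the most common value among records in the same group. The records and attribute names below are hypothetical, and grouping by a single attribute is a simplifying assumption.

```python
from collections import Counter, defaultdict

# Hypothetical records; None marks a blank "fish_type".
records = [
    {"gender": "male", "fish_type": "Bass"},
    {"gender": "male", "fish_type": "Bass"},
    {"gender": "female", "fish_type": "Trout"},
    {"gender": "male", "fish_type": None},
]

# Count the known fish types within each gender group.
counts = defaultdict(Counter)
for rec in records:
    if rec["fish_type"] is not None:
        counts[rec["gender"]][rec["fish_type"]] += 1

# Fill each blank with the most common value among similar records.
filled = []
for rec in records:
    value = rec["fish_type"]
    if value is None and counts[rec["gender"]]:
        value = counts[rec["gender"]].most_common(1)[0][0]
    filled.append({**rec, "fish_type": value})

print(filled[-1])
```

If a group has no known values at all, the blank is left as it is rather than guessed.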

Page 15: BAS 250 Lecture 2

Ensure consistency of data

Data Preparation

• Replacing missing or inconsistent values found in data should be done:

• With intention, not haphazardly

• Use common sense

• Be transparent

It is recommended to always document your missing or inconsistent data processes.
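
One way to stay transparent is to record every replacement as it is made. The sketch below is a hypothetical example: the attribute name, the values, and the log format are all assumptions, and the median is used only for illustration.

```python
from statistics import median

# Hypothetical numeric attribute; None marks a missing value.
incomes = [52000, None, 61000, 48000, None]

known = [v for v in incomes if v is not None]
fill = median(known)

cleaned = []
change_log = []  # one entry per replaced value
for index, value in enumerate(incomes):
    if value is None:
        cleaned.append(fill)
        change_log.append({
            "row": index,
            "attribute": "income",
            "old": None,
            "new": fill,
            "method": "median of known values",
        })
    else:
        cleaned.append(value)

for entry in change_log:
    print(entry)
```

Keeping the log alongside the cleaned data lets anyone reviewing the analysis see which values were filled in and how.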

Page 16: BAS 250 Lecture 2

This course is a practical application course in Data Mining. Learning to use RapidMiner is required.

If you have not done so yet, please plan to walk through the tutorial examples in RapidMiner.

To assist you in understanding RapidMiner, I will take screenshots of what I am doing to get the results we are looking for.

RapidMiner is pretty intuitive. You will get it quickly.

Basics of RapidMiner

Page 17: BAS 250 Lecture 2

Types of files that can be imported into RapidMiner:

o CSV File

o Excel File

o XML File

o Access Database Table

o … and much more

We mainly use CSV files, which contain Comma Separated Values. Be mindful if your data itself contains commas.

o Alternative delimiters can be selected in this case: Tab, Semicolon, Pipe ( | ), etc.

Basics of RapidMiner
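
The comma problem and the alternative-delimiter fix can be seen in a short Python sketch using the standard `csv` module. The rows are hypothetical; this shows the file format issue in general, not RapidMiner's import wizard.

```python
import csv
import io

# Hypothetical rows; the city field contains a comma.
rows = [["id", "city"], ["1", "Raleigh, NC"], ["2", "Cary, NC"]]

# Option 1: keep the comma delimiter; csv.writer quotes fields as needed.
comma_buffer = io.StringIO()
csv.writer(comma_buffer).writerows(rows)

# Option 2: use a pipe delimiter so embedded commas need no quoting.
pipe_buffer = io.StringIO()
csv.writer(pipe_buffer, delimiter="|").writerows(rows)

# Reading back with the matching delimiter recovers the original fields.
comma_rows = list(csv.reader(io.StringIO(comma_buffer.getvalue())))
pipe_rows = list(csv.reader(io.StringIO(pipe_buffer.getvalue()), delimiter="|"))

print(comma_buffer.getvalue())
print(pipe_buffer.getvalue())
```

Either option works as long as the reader uses the same delimiter and quoting as the writer. Problems arise when a field with an unquoted comma is read as two separate columns.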

Page 18: BAS 250 Lecture 2

Three main areas that contain useful tools in RapidMiner:

o Operators – Every possible task you can think of

o Repositories – Where you store your data

o Parameters – Task set up details

Basics of RapidMiner

Page 19: BAS 250 Lecture 2

Basics of RapidMiner

Page 26: BAS 250 Lecture 2

Explain concepts and purpose of Data Preparation

Understand solutions for handling missing and inconsistent data

Utilize data and attribute reduction techniques

Effectively work in RapidMiner to prepare your data.

Summary

Page 27: BAS 250 Lecture 2

“This workforce solution was funded by a grant awarded by the U.S. Department of Labor’s Employment and Training Administration. The solution was created by the grantee and does not necessarily reflect the official position of the U.S. Department of Labor. The Department of Labor makes no guarantees, warranties, or assurances of any kind, express or implied, with respect to such information, including any information on linked sites and including, but not limited to, accuracy of the information or its completeness, timeliness, usefulness, adequacy, continued availability, or ownership.”

Except where otherwise stated, this work by Wake Technical Community College Building Capacity in Business Analytics, a Department of Labor, TAACCCT funded project, is licensed under the Creative Commons Attribution 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by/4.0/

Copyright Information