Why the Information Explosion Can Be Bad for Data Mining, and How Data Fusion Provides a Way Out

Written By: Putten, Kok, Gupta
Presented By: Ernesto Ochandio
DSCI 5240
November Dec 7, 2005


Page 1: Why the Information Explosion Can Be Bad for Data Mining, and How Data Fusion Provides a Way Out


Why the Information Explosion Can Be Bad for Data Mining, and How Data Fusion Provides a Way Out

Written By: Putten, Kok, Gupta

Presented By: Ernesto Ochandio
DSCI 5240
November Dec 7, 2005

Page 2

Problem Definition

• Exponential growth in data capture leads to data fragmentation.
  – POS customer tracking
  – Corporate Data Warehouse
  – Advanced Analytics

• Increased popularity of personalized messages.

• Prohibitive attitudinal data costs.

Page 3

Data Fusion Overview

• Data Fusion is the combination of information from different sources.

• Also known as: Micro Data Set Merging, Statistical Record Linkage, and Multi-Source Imputation.

• Example:
  – Demographic and psychographic data aggregated at geographical level.
  – Same characteristics for people in the same region.

• Motivation:
  – Algorithms can create generalized fusions, providing richer data sets for use in applications or future data mining projects.
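The geographic example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the region names, the profile attributes, and the `fuse_by_region` helper are hypothetical and do not come from the presentation. It shows why every person in the same region ends up with the same characteristics.

```python
# Hypothetical regional profiles: one row of aggregated attributes per region.
REGION_PROFILES = {
    "north": {"avg_household_size": 2.8, "lifestyle_segment": "suburban families"},
    "south": {"avg_household_size": 1.9, "lifestyle_segment": "urban singles"},
}


def fuse_by_region(customers, region_profiles):
    """Attach each region's aggregated profile to every customer living there."""
    fused = []
    for customer in customers:
        # Customers from an unknown region simply get no extra attributes.
        profile = region_profiles.get(customer["region"], {})
        # The customer's own fields take precedence on a name clash.
        fused.append({**profile, **customer})
    return fused


# Hypothetical customer records that only carry an id and a region code.
customers = [
    {"id": 1, "region": "north"},
    {"id": 2, "region": "north"},
    {"id": 3, "region": "south"},
]
fused_customers = fuse_by_region(customers, REGION_PROFILES)
```

Customers 1 and 2 receive identical regional attributes, which is the limitation of fusing at the geographical level: the enrichment cannot distinguish between individuals within a region.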

Page 4

Data Fusion Terminology

• Recipient, Donor, Fused Variables, Common Variables, Critical Common Variables

Recipient + Donor = Fused Dataset

[Diagram labels: Common Variables, Fused Variables]

Page 5

Data Fusion Algorithm

• Find the Donor elements that best match each Recipient element.
• Ensure an exact match on the Critical Common Variables.
• Limit how often each Donor element is used.
• Use averages from the matched Donor elements to estimate the Fused Variables for the Recipient set.

Recipient + Donor = Fused Dataset

Recipient:
C1 to C18, each with X, Y, Z

Donor:
X, Y, Z 10 10 10 10
X, Y, Z 20 20 20 20
X, Y, Z 10 10 10 10
X, Y, Z 20 20 20 20
X, Y, Z 30 30 30 30

Fused Dataset:
C1 X, Y, Z 15 15 15 15
C2 to C9 X, Y, Z (values not shown)
C10 X, Y, Z 20 20 20 20
C11 to C18 X, Y, Z (values not shown)
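The four steps on this slide can be sketched as a small nearest-neighbour routine. This is a minimal sketch, not the authors' implementation: the squared Euclidean distance, the hard `max_uses` cap, the parameter names, and the sample records are all assumptions made for illustration.

```python
from collections import Counter


def fuse(recipients, donors, common, critical, fused_vars, k=2, max_uses=3):
    """Estimate fused variables for each recipient from its k nearest donors."""
    uses = Counter()  # how many times each donor (by index) has been used
    result = []
    for r in recipients:
        # Keep donors that match exactly on the critical variables
        # and have not yet reached the usage cap.
        candidates = [
            (i, d) for i, d in enumerate(donors)
            if all(d[c] == r[c] for c in critical) and uses[i] < max_uses
        ]
        # Rank by squared Euclidean distance on the common variables (an assumption).
        candidates.sort(key=lambda item: sum((item[1][v] - r[v]) ** 2 for v in common))
        nearest = candidates[:k]
        for i, _ in nearest:
            uses[i] += 1
        row = dict(r)
        for v in fused_vars:
            # Average the selected donors' values; None if no donor qualified.
            row[v] = sum(d[v] for _, d in nearest) / len(nearest) if nearest else None
        result.append(row)
    return result


# Hypothetical data: gender is the critical variable, age and income are common,
# and score is the fused variable that only donors carry.
donors = [
    {"gender": "F", "age": 30, "income": 40, "score": 10},
    {"gender": "F", "age": 34, "income": 42, "score": 20},
    {"gender": "M", "age": 31, "income": 41, "score": 30},
]
recipients = [{"id": "C1", "gender": "F", "age": 32, "income": 41}]

fused = fuse(recipients, donors, common=["age", "income"],
             critical=["gender"], fused_vars=["score"])
print(fused[0]["score"])  # 15.0, the average of the two matching donors (10 and 20)
```

The male donor is excluded by the exact match on gender even though it is close on age and income, and lowering `max_uses` forces later recipients to draw on other donors, which is one simple way to avoid over-representing a donor.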

Page 6

Conclusion

• Data Fusion increases the value of Data Mining by creating more data to mine while reducing costs. It ensures the best possible matches without over-representing elements in the Donor set.