Chapter 1: Introduction to Data Mining, Warehousing, and Visualization


Page 1: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


Modern Data Warehousing, Mining & Visualization, 2003, George Marakas

Chapter 1: Introduction to Data Mining, Warehousing, and Visualization

Modern Data Warehousing, Mining, and Visualization: Core Concepts

by George M. Marakas

Spring 2012

Page 2: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


Objectives

What is the purpose and motivation for developing a Data Warehouse (DW)?

Position of DW within IT infrastructure

Relationship between DW and business data mart

What can a DW do?

Foundations for Data Mining

Steps in a typical Data Mining project

What is a “Correlation”? [KEY CONCEPT]

History of Data Visualization vis-à-vis DW

Page 3: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


1-1: The Modern Data Warehouse

A data warehouse is a copy of transaction data specifically structured for querying, analysis and reporting

Note that the data warehouse contains a copy of the transactions. These are not updated or changed later by the transaction system.

Also note that this data is specially structured, and may have been transformed when it was placed in the warehouse

Page 4: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


1-2: Data Warehouse Roles and Structures

The DW has the following primary functions:

It is a direct reflection of the business rules of the enterprise.

It is the collection point for strategic information.

It is the historical store of strategic information.

It is the source of information later delivered to data marts.

It is the source of stable data regardless of how the business processes may change.

Page 5: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


Elements of a DW

Extract, Transform, Store [ETS]
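
The extract, transform, store sequence can be sketched in a few lines. This is a minimal illustration, not a production pipeline: the table names (`orders`, `sales_fact`), the column names and the sample rows are all hypothetical, and in-memory SQLite databases stand in for both the transaction system and the warehouse.

```python
import sqlite3


def extract(source):
    """Extract: read raw transaction rows from the operational system."""
    return source.execute(
        "SELECT order_id, order_date, amount_cents FROM orders"
    ).fetchall()


def transform(rows):
    """Transform: restructure rows for analysis (month key, amount in dollars)."""
    return [
        (order_id, order_date[:7], amount_cents / 100.0)
        for order_id, order_date, amount_cents in rows
    ]


def store(warehouse, rows):
    """Store: append the transformed copy to the warehouse table."""
    warehouse.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact "
        "(order_id INTEGER, month TEXT, amount REAL)"
    )
    warehouse.executemany("INSERT INTO sales_fact VALUES (?, ?, ?)", rows)
    warehouse.commit()


# Hypothetical operational data, invented for illustration only.
source = sqlite3.connect(":memory:")
source.execute(
    "CREATE TABLE orders (order_id INTEGER, order_date TEXT, amount_cents INTEGER)"
)
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, "2003-01-15", 1250), (2, "2003-01-20", 4000), (3, "2003-02-02", 999)],
)

warehouse = sqlite3.connect(":memory:")
store(warehouse, transform(extract(source)))

# The warehouse copy is now structured for querying and reporting.
monthly = warehouse.execute(
    "SELECT month, SUM(amount) FROM sales_fact GROUP BY month ORDER BY month"
).fetchall()
print(monthly)
```

Keeping the three stages as separate functions mirrors the slide: the warehouse only ever receives a transformed copy, and the source table is never modified.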

Page 6: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


Position of the Data Warehouse Within the Organization – Figure 1-2

Page 7: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


Data Mining Example: Service Quality vs. Training

Courtesy: MicroStrategy (2005)

Page 8: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


Table 1-1: Examples of Common DW Applications

Sales Analysis
Determine real-time product sales to make vital pricing and distribution decisions.
Analyze historical product sales to determine success or failure attributes.
Evaluate successful products and determine key success factors.
Use corporate data to understand the margin as well as the revenue implications of a decision.
Rapidly identify preferred customer segments based on revenue and margin.
Quickly isolate past preferred customers who no longer buy.
Identify daily what product is in the manufacturing and distribution pipeline.
Instantly determine which salespeople are performing, on both a revenue and margin basis, and which are behind.

Financial Analysis
Compare actual to budgets on an annual, monthly and month-to-date basis.
Review past cash flow trends and forecast future needs.
Identify and analyze key expense generators.
Instantly generate a current set of key financial ratios and indicators.
Receive near-real-time, interactive financial statements.

Human Resource Analysis
Evaluate trends in benefit program use.
Identify the wage and benefits costs to determine company-wide variation.
Review compliance levels for EEOC and other regulated activities.

Other Areas
Warehouses have also been applied to areas such as logistics, inventory, purchasing, detailed transaction analysis and load balancing.
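
One of the sales-analysis items in Table 1-1, isolating past preferred customers who no longer buy, can be expressed as a single warehouse query. The sketch below is illustrative only: the `purchases` table, the customer names and amounts, the revenue threshold that defines "preferred" and the cutoff date that defines "no longer buy" are all assumptions, not part of the original slides.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE purchases (customer TEXT, purchase_date TEXT, revenue REAL)"
)
# Hypothetical history, invented for illustration only.
db.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [
        ("Acme", "2002-03-01", 9000.0),
        ("Acme", "2002-06-10", 7000.0),
        ("Bolt", "2002-04-05", 8000.0),
        ("Bolt", "2003-02-11", 6000.0),
        ("Cray", "2002-05-20", 300.0),
    ],
)

# Assumed definitions: "preferred" means total revenue of at least 10,000;
# "no longer buy" means no purchase on or after the cutoff date.
# ISO-formatted date strings compare correctly as plain text.
lapsed = db.execute(
    """
    SELECT customer
    FROM purchases
    GROUP BY customer
    HAVING SUM(revenue) >= ? AND MAX(purchase_date) < ?
    ORDER BY customer
    """,
    (10000.0, "2003-01-01"),
).fetchall()

print([name for (name,) in lapsed])
```

The query depends on the warehouse holding the full purchase history; an operational system that keeps only current orders could not answer it.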

Page 9: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


Table 1-2: Comparison of Typical DW Costs and Benefits

Costs
Hardware, software, development personnel and consultant costs.
Operational costs like ongoing systems maintenance.

Benefits: Added Revenue
Will the new (business objective) process generate new customers (what is the estimated value?)
Will the new (business objective) process increase the buying propensity of existing customers (by how much?)
Is the new process necessary to ensure that the competition doesn't offer a demanded service that you can't match?

Benefits: Reduced Costs
What costs of current systems will be eliminated?
Is the new process intended to make some operation more efficient? If so, how and what is the dollar value?

Page 10: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


1-4: The Cost of DW

Expenditures can be categorized as one-time initial costs or as recurring, ongoing costs.

The initial costs can be further divided into hardware costs and software costs.

Expenditures can also be categorized as capital costs (associated with acquisition of the warehouse) or as operational costs (associated with running and maintaining the warehouse)

Cost of a Data Warehouse: Rule of Thumb: $1 million per 1 Terabyte of data

Page 11: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


Table 1-3: Expenditures Associated with Building a DW

Capital, Recurring Costs
Hardware maintenance
Software maintenance
Terminal analysis
Middleware

Capital, One-Time Costs (Hardware and Software)
Disk
DBMS
CPU
Terminal analysis
Network
Middleware
Log utility
Processing
Metadata infrastructure

Operational, Recurring Costs
Ongoing refreshment
Integration transformation
Data model maintenance
Record identification maintenance
Metadata infrastructure maintenance
Archival of data
Data aging within the DW

Operational, One-Time Costs
Integration/transformation processing specification
Metadata infrastructure population
System of record definition
Data dictionary language definition
Network transfer definition
CASE/Repository interface
Initial data warehouse population
Data model definition
Database design definition

Page 12: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


1-5: Data Mining: Farmers and Explorers

Every corporation has two types of DW users.

Farmers [Traditional Statistical Hypothesis testing] know what they want before they set out to find it. They submit small queries and retrieve small nuggets of information.

Explorers [Data Mining] are quite unpredictable. They often submit large queries. Sometimes they find nothing, sometimes they find priceless “golden” nuggets.

Cost justification for the DW is usually done on the basis of the results obtained by farmers since explorers are unpredictable.

Page 13: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


1-6: Foundations of Data Mining

Data mining is the process of using raw data to infer important business relationships.

Despite a consensus on the value of data mining, a great deal of confusion exists about what it is.

It is a collection of powerful techniques intended for analyzing large datasets.

There is no single data mining approach, but rather a set of techniques that can be used in combination with each other.

Page 14: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


1-6 & 1-7: The Foundations of Data Mining

Data mining has roots in practice dating back over 30 years using standard statistics [e.g., bio-statistics]

In the 1960s and 1970s, data mining was called statistical analysis, and the pioneers were statistical software companies such as SAS and SPSS.

By the 1980s, the traditional techniques had been augmented by new methods such as fuzzy logic, heuristics and neural networks.

Also, DSS tools such as Lotus 1-2-3 and Excel came into popular use in the 1980s.

Page 15: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


Data Mining – A General Approach

Although all data mining endeavors are unique, they possess a common set of process steps:

1. Infrastructure preparation – choice of hardware platform, the database system and one or more mining tools

2. Exploration – looking at summary data, sampling and applying intuition [Data visualization useful here]

3. Analysis – each discovered pattern is analyzed for significance and trends

Page 16: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


A General Approach (continued)

4. Interpretation – Once patterns have been discovered and analyzed, the next step is to interpret them. Considerations include business cycles, seasonality and the population the pattern applies to.

5. Exploitation – this is both a business and a technical activity. One way to exploit a pattern is to use it for prediction. Others are to package, price or advertise the product in a different way.
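
Steps 2, 3 and 5 can be sketched on a small scale, using the service quality vs. training theme of the earlier example and the correlation concept from the objectives. Everything specific here is an assumption made for illustration: the figures are invented, and Pearson correlation with a least-squares line is only one of many techniques a mining tool might apply.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def fit_line(xs, ys):
    """Least-squares slope and intercept, used here for simple prediction."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    return slope, my - slope * mx


# Hypothetical figures invented for illustration: training hours per
# employee and a service-quality score for six branches.
training_hours = [2, 4, 6, 8, 10, 12]
service_score = [55, 60, 68, 71, 80, 84]

# Step 2, exploration: look at summary data first.
print("mean hours:", mean(training_hours), "mean score:", mean(service_score))

# Step 3, analysis: how strong is the discovered pattern?
r = pearson_r(training_hours, service_score)
print("correlation:", round(r, 3))

# Step 5, exploitation: use the pattern for prediction.
slope, intercept = fit_line(training_hours, service_score)
print("predicted score at 14 hours:", round(slope * 14 + intercept, 1))
```

Step 4, interpretation, stays with the analyst: a coefficient close to 1 shows that the two measures move together, not that training causes better service.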

Page 17: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


The Data Warehouse and Data Mining

Data mining does not require the use of a data warehouse (DW); however, DWs are designed with data mining in mind.

The data in the DW is integrated and stable (non-volatile)

Data changes continuously in an operational database.

If multiple analyses are run in sequence, the data need to be held constant (as in a DW).
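
The need to hold data constant can be shown with a small sketch. It assumes a hypothetical `operational_sales` table that keeps changing and a `dw_sales` table loaded from it once; both names and all values are invented, and a single in-memory SQLite database stands in for two separate systems.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE operational_sales (region TEXT, amount REAL)")
db.executemany(
    "INSERT INTO operational_sales VALUES (?, ?)",
    [("East", 100.0), ("West", 250.0)],
)

# Load step: copy the operational rows into the warehouse table once.
db.execute("CREATE TABLE dw_sales AS SELECT * FROM operational_sales")


def total(table):
    # Table names are fixed strings in this sketch, never user input.
    return db.execute(f"SELECT SUM(amount) FROM {table}").fetchone()[0]


first_run = total("dw_sales")

# The operational system keeps changing between analyses.
db.execute("INSERT INTO operational_sales VALUES ('North', 75.0)")
db.execute("UPDATE operational_sales SET amount = 120.0 WHERE region = 'East'")

second_run = total("dw_sales")

print("warehouse totals match:", first_run == second_run)
print("operational total now:", total("operational_sales"))
```

Because the second analysis reads the same warehouse rows as the first, any difference between two results comes from the analysis itself and not from data that moved underneath it.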

Page 18: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


Volumes of Data – The Biggest Challenge

The largest challenge a “data miner” may face is the sheer volume of data in the warehouse.

It is quite important, then, that summary data also be available to get the analysis started.

A major problem is that this sheer volume may mask the important relationships the analyst is interested in.

The ability to overcome the volume and visualize the data becomes quite important.
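
Summary data of the kind mentioned above is typically produced by rolling detail rows up to a coarser grain. The following sketch assumes hypothetical detail rows of region, quarter and amount; the values are invented, and a real warehouse would normally compute such summaries inside the database and not in application code.

```python
from collections import defaultdict

# Hypothetical detail rows (region, quarter, amount), for illustration only.
detail_rows = [
    ("East", "Q1", 20.0), ("East", "Q1", 5.0), ("East", "Q2", 27.0),
    ("West", "Q1", 30.0), ("West", "Q2", 38.0), ("West", "Q2", 1.0),
    ("North", "Q1", 45.0), ("North", "Q2", 46.0),
]


def summarize(rows):
    """Roll detail rows up to one total per (region, quarter)."""
    totals = defaultdict(float)
    for region, quarter, amount in rows:
        totals[(region, quarter)] += amount
    return dict(totals)


summary = summarize(detail_rows)
for (region, quarter), amount in sorted(summary.items()):
    print(region, quarter, amount)
```

A summary at this grain is also the natural input for a simple chart such as sales by region and quarter.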

Page 19: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


1-9: Foundations of Data Visualization [DV]

One of the earliest known examples of data visualization was in London during the 1854 cholera epidemic. A map (next slide) helped to identify the source of the disease.

Modern visualization techniques grew from the twin technologies of computer graphics and high performance computing in the 1970s and 1980s.

Page 20: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


Dr. John Snow used a map to show that the source of cholera was a water pump, thus proving the disease was waterborne.

Map label: Broad Street Pump

Page 21: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


DV: Opportunity and Timing

Alternative input devices (light pen, sketch pad and mouse) began to appear in the 1960s.

In the 1970s, flight simulators became much more realistic when graphics replaced film.

In the same decade, special effects computers became entrenched in the entertainment industry.

In the 1980s, visualization grew more dynamic with applications like the animation of weather patterns.

Page 22: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


Data Visualization – Sales by Region: Typical Spreadsheet Graphic

[Chart: values for East, West and North by quarter (1st Qtr to 4th Qtr); vertical axis from 0 to 90]

Page 23: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


Data Visualization – Total Precipitation

Page 24: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


DV & DM: Future Success Drivers

In the 1990s, rapid advances in chip technology, both at the CPU and the graphics processor, put data visualization everywhere.

Ongoing reduction in the cost of computing: each new generation brings a 10X-100X performance-cost improvement, approximately every 18 months [Moore’s Law].

Web-based e-commerce: business-to-consumer commerce [B to C; and C:C] generates billions and even trillions of characters per reporting period.

Page 25: Chapter 1:  Introduction to Data Mining, Warehousing, and Visualization


The End