dw 2.0 book


Upload: himanshu-agarwal

Post on 17-Feb-2018

Page 1: DW 2.0 book

7/23/2019 DW 2.0 book

http://slidepdf.com/reader/full/dw-20-book 1/10

DW 2.0

The Architecture for the Next Generation of

Data Warehousing

W H Inmon

Forest  R im  Technology

Derek Strauss

Gavroshe

Genia Neushloss

Gavroshe

AMSTERDAM • BOSTON • HEIDELBERG • LONDON

NEW YORK • OXFORD • PARIS • SAN DIEGO

SAN FRANCISCO • SINGAPORE • SYDNEY • TOKYO

Morgan Kaufmann Publishers is an imprint of Elsevier.

MORGAN KAUFMANN PUBLISHERS

Contents

Preface

Acknowledgments

About the Authors

CHAPTER 1 A brief history of data warehousing and first-generation data warehouses

Database management systems

Online applications

Personal computers and 4GL technology

The spider web environment

Evolution from the business perspective

The data warehouse environment

What is a data warehouse?

Integrating data: a painful experience

Volumes of data

A different development approach

Evolution to the DW 2.0 environment

The business impact of the data warehouse

Various components of the data warehouse environment

ETL: extract/transform/load

ODS: operational data store

Data mart

Exploration warehouse

The evolution of data warehousing from the business perspective

Other notions about a data warehouse

The active data warehouse

The federated data warehouse approach

The star schema approach

The data mart data warehouse

Building a real data warehouse

Summary

CHAPTER 2 An introduction to DW 2.0

DW 2.0: a new paradigm

DW 2.0: from the business perspective

The life cycle of data

Reasons for the different sectors

Metadata

Access of data

Structured data/unstructured data

Textual analytics

Blather

The issue of terminology

Specific text/general text

Metadata: a major component

Local metadata

A foundation of technology

Changing business requirements

The flow of data within DW 2.0

Volumes of data

Useful applications

DW 2.0 and referential integrity

Reporting in DW 2.0

Summary

CHAPTER 3 DW 2.0 components: about the different sectors

The Interactive Sector

The Integrated Sector

The Near Line Sector

The Archival Sector

Unstructured processing

From the business perspective

Summary

CHAPTER 4 Metadata in DW 2.0

Reusability of data and analysis

Metadata in DW 2.0

Active repository/passive repository

The active repository

Enterprise metadata

Metadata and the system of record

Taxonomy

Internal taxonomies/external taxonomies

Metadata in the Archival Sector

Maintaining metadata

Using metadata: an example

From the end-user perspective

Summary

CHAPTER 5 Fluidity of the DW 2.0 technology infrastructure

The technology infrastructure

Rapid business changes

The treadmill of change

Getting off the treadmill

Reducing the length of time for IT to respond

Semantically temporal, semantically static data

Semantically temporal data

Semantically stable data

Mixing semantically stable and unstable data

Separating semantically stable and unstable data

Mitigating business change

Creating snapshots of data

A historical record

Dividing data

From the end-user perspective

Summary

CHAPTER 6 Methodology and approach for DW 2.0

Spiral methodology: a summary of key features

The seven streams approach: an overview

Enterprise reference model stream

Enterprise knowledge coordination stream

Information factory development stream

Data profiling and mapping stream

Data correction stream

Infrastructure stream

Total information quality management stream

Summary

CHAPTER 7 Statistical processing and DW 2.0

Two types of transactions

Using statistical analysis

The integrity of the comparison

Heuristic analysis

Freezing data

Exploration processing

The frequency of analysis

The exploration facility

The sources for exploration processing

Refreshing exploration data

Project-based data

Data marts and the exploration facility

A backflow of data

Using exploration data internally

From the perspective of the business analyst

Summary

CHAPTER 8 Data models and DW 2.0

An intellectual road map

The data model and business

The scope of integration

Making the distinction between granular and summarized data

Levels of the data model

Data models and the Interactive Sector

The corporate data model

A transformation of models

Data models and unstructured data

From the perspective of the business user

Summary

CHAPTER 9 Monitoring the DW 2.0 environment

Monitoring the DW 2.0 environment

The transaction monitor

Monitoring data quality

A data warehouse monitor

The transaction monitor: response time

Peak-period processing

The ETL data quality monitor

The data warehouse monitor

Dormant data

From the perspective of the business user

Summary

CHAPTER 10 DW 2.0 and security

Protecting access to data

Encryption

Drawbacks

The firewall

Moving data offline

Limiting encryption

A direct dump

The data warehouse monitor

Sensing an attack

Security for near line data

From the perspective of the business user

Summary

CHAPTER 11 Time variant data

All data in DW 2.0: relative to time

Time relativity in the Interactive Sector

Data relativity elsewhere in DW 2.0

Transactions in the Integrated Sector

Discrete data

Continuous time span data

A sequence of records

Nonoverlapping records

Beginning and ending a sequence of records

Continuity of data

Time-collapsed data

Time variance in the Archival Sector

From the perspective of the end user

Summary

CHAPTER 12 The flow of data in DW 2.0

The flow of data throughout the architecture

Entering the Interactive Sector

The role of ETL

Data flow into the Integrated Sector

Data flow into the Near Line Sector

Data flow into the Archival Sector

The falling probability of data access

Exception-based flow of data

From the perspective of the business user

Summary

CHAPTER 13 ETL processing and DW 2.0

Changing states of data

Where ETL fits

From application data to corporate data

ETL in online mode

ETL in batch mode

Source and target

An ETL mapping

Changing states: an example

More complex transformations

ETL and throughput

ETL and metadata

ETL and an audit trail

ETL and data quality

Creating ETL

Code creation or parametrically driven ETL

ETL and rejects

Changed data capture

ELT

From the perspective of the business user

Summary

CHAPTER 14 DW 2.0 and the granularity manager

The granularity manager

Raising the level of granularity

Filtering data

The functions of the granularity manager

Home-grown versus third-party granularity managers

Parallelizing the granularity manager

Metadata as a by-product

From the perspective of the business user

Summary

CHAPTER 15 DW 2.0 and performance

Good performance: a cornerstone for DW 2.0

Online response time

Analytical response time

The flow of data

Queues

Heuristic processing

Analytical productivity and response time

Many facets to performance

Indexing

Removing dormant data

End-user education

Monitoring the environment

Capacity planning

Metadata

Batch parallelization

Parallelization for transaction processing

Workload management

Data marts

Exploration facilities

Separation of transactions into classes

Service level agreements

Protecting the Interactive Sector

Partitioning data

Choosing the proper hardware

Separating farmers and explorers

Physically group data together

Check automatically generated code

From the perspective of the business user

Summary

CHAPTER 16 Migration

Houses and cities

Migration in a perfect world

The perfect world almost never happens

Adding components incrementally

Adding the Archival Sector

Creating enterprise metadata

Building the metadata infrastructure

Swallowing source systems

ETL as a shock absorber

Migration to the unstructured environment

From the perspective of the business user

Summary

CHAPTER 17 Cost justification and DW 2.0

Is DW 2.0 worth it?

Macro-level justification

A micro-level cost justification

Company B has DW 2.0

Creating new analysis

Executing the steps

So how much does all of this cost?

Consider company B

Factoring the cost of DW 2.0

Reality of information

The real economics of DW 2.0

The time value of information

The value of integration

Historical information

First-generation DW and DW 2.0: the economics

From the perspective of the business user

Summary

CHAPTER 18 Data quality in DW 2.0

The DW 2.0 data quality tool set

Data profiling tools and the reverse-engineered data model

Data model types

Data profiling inconsistencies challenge top-down modeling

Summary

CHAPTER 19 DW 2.0 and unstructured data

DW 2.0 and unstructured data

Reading text

Where to do textual analytical processing

Integrating text

Simple editing

Stop words

Synonym replacement

Synonym concatenation

Homographic resolution

Creating themes

External glossaries/taxonomies

Stemming

Alternate spellings

Text across languages

Direct searches

Indirect searches

Terminology

Semistructured data/VALUE = NAME data

The technology needed to prepare the data

The relational database

Structured/unstructured linkage

From the perspective of the business user

Summary

CHAPTER 20 DW 2.0 and the system of record

Other systems of record

From the perspective of the business user

Summary

CHAPTER 21 Miscellaneous topics

Data marts

The convenience of a data mart

Transforming data mart data

Monitoring DW 2.0

Moving data from one data mart to another

Bad data

A balancing entry

Resetting a value

Making corrections

The speed of movement of data

Data warehouse utilities

Summary

CHAPTER 22 Processing in the DW 2.0 environment

Summary

CHAPTER 23 Administering the DW 2.0 environment

The data model

Architectural administration

Defining the moment when an Archival Sector will be needed

Determining whether the Near Line Sector is needed

Metadata administration

Database administration

Stewardship

Systems and technology administration

Management administration of the DW 2.0 environment

Prioritization and prioritization conflicts

Budget

Scheduling and determination of milestones

Allocation of resources

Managing consultants

Summary

Index