netezza’s deep dive: getting your data warehouse up and running in 24 hours chi-chung hui...

37
Netezza’s Deep Dive: Getting Your Data Warehouse Up and Running in 24 hours Chi-Chung Hui Consulting I/T Specialist Information Management Software, IBM HK

Upload: barrie-shaw

Post on 18-Dec-2015

215 views

Category:

Documents


2 download

TRANSCRIPT

Netezza’s Deep Dive: Getting Your Data Warehouse Up and Running in 24 hours

Chi-Chung HuiConsulting I/T SpecialistInformation Management Software, IBM HK

© 2010 IBM Corporation

Information Management

2

Simplicity, Flexibility, ChoiceIBM Data Warehouse & Analytics Solutions

IBM InfoSphere Warehouse

IBM Smart Analytics System

Netezza

Information Management Portfolio(Information Server, MDM, Streams, etc)

Warehouse Accelerators

Flexible Integrated SystemTrue Appliance Custom Solution

FlexibilitySimplicity The right mix of simplicity and flexibility

© 2010 IBM Corporation

Information Management

Netezza Value Proposition

Speed: Price/performance leader using hardware-based data streaming

Simplicity: Black-box appliance with no tuning or storage administration provides low TCO and fast time to value

Scalability: True MPP enables customers to conduct rapid queries and analytics on petabyte sized data warehouses

Smart: Built-in advanced analytics pushed deep into database delivers analytics to the masses

© 2010 IBM Corporation

Information Management

Netezza and ISAS

Choose Netezza:

If the best price/performance is required

If customer cannot afford too much tuning and administration

If customer need the fastest time to value

If customer does not want to pay for a separate database software license

Choose ISAS:

If AIX is preferred

If SAN and remote mirroring are required

If customer requires the warehouse to conform to the data center infrastructure standard

If customer likes a more customized warehouse

If customer need very specific/deep tuning techniques

© 2010 IBM Corporation

Information Management

Agenda Netezza Solution Highlight

– What is it?– Why it is good?

Netezza Tour– Specialized hardware and architecture– How Netezza operates in running database queries?

Netezza Simplicity

Netezza Performance

© 2010 IBM Corporation

Information Management

Netezza – Solution Highlight Summary

True Appliance– Hardware, software and storage pre-

built for data warehouse– With specially designed hardware

designed for high-performance advanced analytics operations

– Hardware compression based on table columns

Very fast– Usually 10x to 100x faster than

traditional database

Minimal administration and tuning

Low TCO

© 2010 IBM Corporation

Information Management

Why Netezza is Good?

© 2010 IBM Corporation

Information Management

Legacy DWH Architectures:Moving large amounts of data becomes Bottleneck!!

Results

Query

StorageServerRDBMS SW

Data

Large amounts of data moved from

disk, causing bottleneck

Large amounts of data moved from

disk, causing bottleneck

Data is moved to memory, then

SQL processed

Data is moved to memory, then

SQL processed

© 2010 IBM Corporation

Information Management

Results

Netezza Performance Server™

Netezza Performance ServerWe are better in EDW operations with complex BI queries!

SMP Host(2-4 CPU)

Query

Network traffic:1% of existing systems

CPU :2% of existing systems

Data processed as streams from disk, before moved to

memory

Data processed as streams from disk, before moved to

memory

© 2010 IBM Corporation

Information Management

Agenda Netezza Solution Highlight

– What is it?– Why it is good?

Netezza Tour– Specialized hardware and architecture– How Netezza operates in running database queries?

Netezza Simplicity

Netezza Performance

© 2010 IBM Corporation

Information Management

The IBM Netezza TwinFin™ Appliance

Page 11

High-performance databaseengine streaming joins,aggregations, sorts, etc.

SQL CompilerQuery PlanOptimizeAdmin

Processor &streaming DB logic

Slice of User DataSwap and Mirror partitionsHigh speed data streaming

SMP Hosts

Snippet Blades™

(S-Blades™)

Disk Enclosures

© 2010 IBM Corporation

Information Management

The S-Blade™: CPU Blade + FPGA sidecar

Page 12

© 2011 IBM Corporation13

S-Blade™ Components

Intel Quad-Core

Dual-Core FPGADRAM

IBM BladeCenter Server Netezza DB Accelerator

SAS Expander

Module

SAS Expander

Module

© 2010 IBM Corporation

Information Management

The IBM Netezza AMPP™ Architecture

Advanced AnalyticsAdvanced Analytics

LoaderLoader

ETLETL

BIBI

Applications

FPGA

Memory

CPU

FPGA

Memory

CPU

FPGA

Memory

CPU

HostsHost

Disk Enclosures S-Blades™

NetworkFabric

Netezza Appliance

ODBC/JDBC

ODBC/JDBC

© 2010 IBM Corporation

Information Management

How Netezza Operates in Database Queries?

© 2011 IBM Corporation16

Our Secret Sauce

FPGA Core CPU Core

Uncompress ProjectRestrict,Visibility

Complex ∑Joins, Aggs, etc.

select DISTRICT,

PRODUCTGRP,

sum(NRX)

from MTHLY_RX_TERR_DATA

where MONTH = '20091201'

and MARKET = 509123

and SPECIALTY = 'GASTRO'

Slice of table

MTHLY_RX_TERR_DATA

(compressed)

Slice of table

MTHLY_RX_TERR_DATA

(compressed)

where MONTH = '20091201'

and MARKET = 509123

and SPECIALTY = 'GASTRO'

where MONTH = '20091201'

and MARKET = 509123

and SPECIALTY = 'GASTRO'

sum(NRX)sum(NRX)

select DISTRICT,

PRODUCTGRP,

sum(NRX)

select DISTRICT,

PRODUCTGRP,

sum(NRX)

© 2011 IBM Corporation17

Netezza Eliminates the I/O BottleneckMove the SQL to the hardware… to where the data lives

“Just send the Answer,

not Raw Data”

© 2010 IBM Corporation

Information Management

Agenda Netezza Solution Highlight

– What is it?– Why it is good?

Netezza Tour– Specialized hardware and architecture– How Netezza operates in running database queries?

Netezza Simplicity

Netezza Performance

© 2010 IBM Corporation

Information Management

19

Why traditional database systems are not enough: Endless tuning

businessperson

Query performance

is slow

Query performance

is slow

© 2010 IBM Corporation

Information Management

20

Why traditional database systems are not enough: Endless tuning

businessperson

technicalperson

I’ll add an indexI’ll add an index

© 2010 IBM Corporation

Information Management

21

Why traditional database systems are not enough: Endless tuning

businessperson

Load performance is slow. When can

I access my data?

Load performance is slow. When can

I access my data?

© 2010 IBM Corporation

Information Management

22

Why traditional database systems are not enough: Endless tuning

businessperson

technicalperson

I’ll investigate and get back to you …I’ll investigate and get back to you …

© 2010 IBM Corporation

Information Management

23

Why traditional database systems are not enough: Endless tuning

businessperson

technicalperson

Okay… I will add an aggregate table to pre-calculate so

that the report will run faster.

Okay… I will add an aggregate table to pre-calculate so

that the report will run faster.

© 2010 IBM Corporation

Information Management

24

Why traditional database systems are not enough: Endless tuning

businessperson

I want my report to be refreshed every 1 hour.

I want my report to be refreshed every 1 hour.

© 2010 IBM Corporation

Information Management

25

Why traditional database systems are not enough: Endless tuning

businessperson

technicalperson

Oh… that is impossible… The

report will be updated once

everyday after night batch…

Oh… that is impossible… The

report will be updated once

everyday after night batch…

© 2010 IBM Corporation

Information Management

26

Why traditional database systems are not enough: Wasted effort

Task Description Transform InspectNon-value Process

Value-adding Process

move data from sources 120

reconcile data 20

sort and prep 30

drop indices 5

drop constraints 1

drop aggregates 2

drop materialized views 2

load data 30

create constraints 180

create indices 90

create materialized views 60

create aggregates 120

gather statistics 300

© 2010 IBM Corporation

Information Management

27

Solving the data load and query performance problem

We act out the market every day to capitalize on opportunities. Complex merchandize reports that had taken days to process on the old platform now take five minutes on the new one. Simpler queries are even faster.

-- Chief Information Officer at a large US retailer

Data loads jobs  Oracle  Netezza

1 + 5 hours 2 mins 53 secs

2 1 hour 12 mins 7 secs 3 mins 29 secs

3 1 hour 25 mins 56 secs 4 mins 20 secs

4 1hour 30 mins 00 secs 5 mins 42 secs

© 2010 IBM Corporation

Information Management

Netezza Loads Data at 2.5TB per Hour

0

500

1000

1500

2000

2500

3000

1 ta

sk

2 ta

sks

3 ta

sks

4 ta

sks

5 ta

sks

6 ta

sks

7 ta

sks

8 ta

sks

9 ta

sks

10 ta

sks

11 ta

sks

12 ta

sks

13 ta

sks

14 ta

sks

15 ta

sks

Throughput (GB/Hr)

© 2010 IBM Corporation

Information Management

Operations• Simply load and go .… it’s an appliance

Minimal DBA Tuning• No configuration or physical modeling• No indexes– out of the box performance

ETL Developers• No aggregate tables needed

->Less ETL logic• Faster load and transformation times

Business Analysts• Train of thought analysis – 10 to 100x faster• True ad hoc queries – no tuning, no indexes• Ask complex queries against large datasets

Page 29

Netezza is Simple to Deploy Since it is so Fast

© 2011 IBM Corporation30

Traditional Complexity … Netezza Simplicity

0. CREATE DATABASE TEST LOGFILE 'E:\OraData\TEST\LOG1TEST.ORA' SIZE 2M, 'E:\OraData\TEST\LOG2TEST.ORA' SIZE 2M, 'E:\OraData\TEST\LOG3TEST.ORA' SIZE 2M, 'E:\OraData\

TEST\LOG4TEST.ORA' SIZE 2M, 'E:\OraData\TEST\LOG5TEST.ORA' SIZE 2M EXTENT MANAGEMENT LOCAL MAXDATAFILES 100 DATAFILE 'E:\OraData\TEST\SYS1TEST.ORA'

SIZE 50 M DEFAULT TEMPORARY TABLESPACE temp TEMPFILE 'E:\OraData\TEST\TEMP.ORA' SIZE 50 M

UNDO TABLESPACE undo DATAFILE 'E:\OraData\TEST\UNDO.ORA' SIZE 50 M NOARCHIVELOG CHARACTER SET WE8ISO8859P1;

1. Oracle* table and indexes  2. Oracle tablespace    3. Oracle datafile      4. Veritas file        5. Veritas file system           6. Veritas striped logical volume              7. Veritas mirror/plex                8. Veritas sub-disk                   9. SunOS raw device                     10. Brocade SAN switch                       11. EMC Symmetrix volume                          12. EMC Symmetrix striped meta-volume                            13. EMC Symmetrix hyper-volume                                14. EMC Symmetrix remote volume (replication)                                 15. Days/weeks of planning meetings

Netezza: Low (ZERO) Touch:

CREATE DATABASE my_db;

© 2010 IBM Corporation

Information Management

31

Netezza Delivers Simplicity

Up and running 6 months before being trained

200X faster than Oracle system

ROI in less than 3 months

“ “

Allowing the business users access to the Netezza box was what sold it.

-- Steve Taff, Executive Dir. of IT Services

© 2010 IBM Corporation

Information Management

Agenda Netezza Solution Highlight

– What is it?– Why it is good?

Netezza Tour– Specialized hardware and architecture– How Netezza operates in running database queries?

Netezza Simplicity

Netezza Performance

© 2010 IBM Corporation

Information Management

POC - A Telco Company

Environment– Netezza TwinFin 12 full rack

Raw Data volume– Call Level Detail : 3TB (9 billion rows)– Financial Bill : 600GB (5.4 billion rows)– Customer Info : 60GB (91.1 million rows)

© 2010 IBM Corporation

Information Management

POC: Data Maintenance

Testing scenarios Raw Data involved Elapsed Time

Insert data to an empty table from a table with 750 million records

270GB 3m47s

Concurrently insert data to 5 empty tables from 5 tables with 550 million records each respectively

990GB = 198GB x 5 11m35s

Update all rows of a table with 1.5 billion records

270GB 9m40s

Concurrently update all rows for 5 tables with 1.1 billion records each

1,350GB = 270GB x 5 44m44s

Delete 183 million records from a table with 550 million records

198GB 1m32s

Concurrently delete 183 million records from 5 tables with 550 million records each

990GB = 198GB x 5 9m43s

© 2010 IBM Corporation

Information Management

POC: Enquiries (With NO Indexes)

Testing scenarios Table involved Elapsed Time for Single Task

Elapsed Time for Concurrent Tasks

Query on single fact table Call Level Detail (3TB, 9 billion records)

1m16s 2m49s (10 tasks)

Query on joining 2 fact tables Call Level Detail (3TB, 9 Billion records) Customer Info (60GB, 91.1 million records)

22s 58s (5 tasks)

Query to get the top 100 call duration time in different groups for a particular month

Call Level Detail (3TB, 9 Billion records) Customer Info (60GB, 91.1 million records)

1m53s 4m10s (5 tasks)

Query on joining a fact table to multiple dimension tables

Customer Info (60GB, 91.1 million records) 5 dimension tables

5s 14s (5 tasks)

Query on joining a fact table with subquery involve joining another two fact tables

Call Level Detail (3TB, 9 Billion records) Customer Info (60GB, 91.1 million records) Financial Bill (600GB, 5.4 billion records)

21s 1m55s (5 tasks)

© 2010 IBM Corporation

Information Management

36

Catalina Marketing: Building loyalty one customer at a time

No targeting Basic targeting e.g., offer dog food

coupon to customer buying dog food

Using predictive models to find latent

correlations

Coupon redemption rate

1% 6-10% 25%

Marketing to a segment of one – 195 million US loyalty program members

– Every coupon printed is unique to the individual customer

– Customized based on three years' worth of purchase history

Increased staff productivity – from 50 to 600 new models per year

Increased efficiency – from 4 hours to score a model to 60 seconds

Netezza’s Deep Dive: Getting Your Data Warehouse Up and Running in 24 hours

Chi-Chung HuiConsulting I/T SpecialistInformation Management Software, IBM HK