netezza’s deep dive: getting your data warehouse up and running in 24 hours chi-chung hui...
TRANSCRIPT
Netezza’s Deep Dive: Getting Your Data Warehouse Up and Running in 24 hours
Chi-Chung HuiConsulting I/T SpecialistInformation Management Software, IBM HK
© 2010 IBM Corporation
Information Management
2
Simplicity, Flexibility, ChoiceIBM Data Warehouse & Analytics Solutions
IBM InfoSphere Warehouse
IBM Smart Analytics System
Netezza
Information Management Portfolio(Information Server, MDM, Streams, etc)
Warehouse Accelerators
Flexible Integrated SystemTrue Appliance Custom Solution
FlexibilitySimplicity The right mix of simplicity and flexibility
© 2010 IBM Corporation
Information Management
Netezza Value Proposition
Speed: Price/performance leader using hardware-based data streaming
Simplicity: Black-box appliance with no tuning or storage administration provides low TCO and fast time to value
Scalability: True MPP enables customers to conduct rapid queries and analytics on petabyte sized data warehouses
Smart: Built-in advanced analytics pushed deep into database delivers analytics to the masses
© 2010 IBM Corporation
Information Management
Netezza and ISAS
Choose Netezza:
If the best price/performance is required
If customer cannot afford too much tuning and administration
If customer need the fastest time to value
If customer does not want to pay for a separate database software license
Choose ISAS:
If AIX is preferred
If SAN and remote mirroring are required
If customer requires the warehouse to conform to the data center infrastructure standard
If customer likes a more customized warehouse
If customer need very specific/deep tuning techniques
© 2010 IBM Corporation
Information Management
Agenda Netezza Solution Highlight
– What is it?– Why it is good?
Netezza Tour– Specialized hardware and architecture– How Netezza operates in running database queries?
Netezza Simplicity
Netezza Performance
© 2010 IBM Corporation
Information Management
Netezza – Solution Highlight Summary
True Appliance– Hardware, software and storage pre-
built for data warehouse– With specially designed hardware
designed for high-performance advanced analytics operations
– Hardware compression based on table columns
Very fast– Usually 10x to 100x faster than
traditional database
Minimal administration and tuning
Low TCO
© 2010 IBM Corporation
Information Management
Legacy DWH Architectures:Moving large amounts of data becomes Bottleneck!!
Results
Query
StorageServerRDBMS SW
Data
Large amounts of data moved from
disk, causing bottleneck
Large amounts of data moved from
disk, causing bottleneck
Data is moved to memory, then
SQL processed
Data is moved to memory, then
SQL processed
© 2010 IBM Corporation
Information Management
Results
Netezza Performance Server™
Netezza Performance ServerWe are better in EDW operations with complex BI queries!
SMP Host(2-4 CPU)
Query
Network traffic:1% of existing systems
CPU :2% of existing systems
Data processed as streams from disk, before moved to
memory
Data processed as streams from disk, before moved to
memory
© 2010 IBM Corporation
Information Management
Agenda Netezza Solution Highlight
– What is it?– Why it is good?
Netezza Tour– Specialized hardware and architecture– How Netezza operates in running database queries?
Netezza Simplicity
Netezza Performance
© 2010 IBM Corporation
Information Management
The IBM Netezza TwinFin™ Appliance
Page 11
High-performance databaseengine streaming joins,aggregations, sorts, etc.
SQL CompilerQuery PlanOptimizeAdmin
Processor &streaming DB logic
Slice of User DataSwap and Mirror partitionsHigh speed data streaming
SMP Hosts
Snippet Blades™
(S-Blades™)
Disk Enclosures
© 2011 IBM Corporation13
S-Blade™ Components
Intel Quad-Core
Dual-Core FPGADRAM
IBM BladeCenter Server Netezza DB Accelerator
SAS Expander
Module
SAS Expander
Module
© 2010 IBM Corporation
Information Management
The IBM Netezza AMPP™ Architecture
Advanced AnalyticsAdvanced Analytics
LoaderLoader
ETLETL
BIBI
Applications
FPGA
Memory
CPU
FPGA
Memory
CPU
FPGA
Memory
CPU
HostsHost
Disk Enclosures S-Blades™
NetworkFabric
Netezza Appliance
ODBC/JDBC
ODBC/JDBC
© 2011 IBM Corporation16
Our Secret Sauce
FPGA Core CPU Core
Uncompress ProjectRestrict,Visibility
Complex ∑Joins, Aggs, etc.
select DISTRICT,
PRODUCTGRP,
sum(NRX)
from MTHLY_RX_TERR_DATA
where MONTH = '20091201'
and MARKET = 509123
and SPECIALTY = 'GASTRO'
Slice of table
MTHLY_RX_TERR_DATA
(compressed)
Slice of table
MTHLY_RX_TERR_DATA
(compressed)
where MONTH = '20091201'
and MARKET = 509123
and SPECIALTY = 'GASTRO'
where MONTH = '20091201'
and MARKET = 509123
and SPECIALTY = 'GASTRO'
sum(NRX)sum(NRX)
select DISTRICT,
PRODUCTGRP,
sum(NRX)
select DISTRICT,
PRODUCTGRP,
sum(NRX)
© 2011 IBM Corporation17
Netezza Eliminates the I/O BottleneckMove the SQL to the hardware… to where the data lives
“Just send the Answer,
not Raw Data”
© 2010 IBM Corporation
Information Management
Agenda Netezza Solution Highlight
– What is it?– Why it is good?
Netezza Tour– Specialized hardware and architecture– How Netezza operates in running database queries?
Netezza Simplicity
Netezza Performance
© 2010 IBM Corporation
Information Management
19
Why traditional database systems are not enough: Endless tuning
businessperson
Query performance
is slow
Query performance
is slow
© 2010 IBM Corporation
Information Management
20
Why traditional database systems are not enough: Endless tuning
businessperson
technicalperson
I’ll add an indexI’ll add an index
© 2010 IBM Corporation
Information Management
21
Why traditional database systems are not enough: Endless tuning
businessperson
Load performance is slow. When can
I access my data?
Load performance is slow. When can
I access my data?
© 2010 IBM Corporation
Information Management
22
Why traditional database systems are not enough: Endless tuning
businessperson
technicalperson
I’ll investigate and get back to you …I’ll investigate and get back to you …
© 2010 IBM Corporation
Information Management
23
Why traditional database systems are not enough: Endless tuning
businessperson
technicalperson
Okay… I will add an aggregate table to pre-calculate so
that the report will run faster.
Okay… I will add an aggregate table to pre-calculate so
that the report will run faster.
© 2010 IBM Corporation
Information Management
24
Why traditional database systems are not enough: Endless tuning
businessperson
I want my report to be refreshed every 1 hour.
I want my report to be refreshed every 1 hour.
© 2010 IBM Corporation
Information Management
25
Why traditional database systems are not enough: Endless tuning
businessperson
technicalperson
Oh… that is impossible… The
report will be updated once
everyday after night batch…
Oh… that is impossible… The
report will be updated once
everyday after night batch…
© 2010 IBM Corporation
Information Management
26
Why traditional database systems are not enough: Wasted effort
Task Description Transform InspectNon-value Process
Value-adding Process
move data from sources 120
reconcile data 20
sort and prep 30
drop indices 5
drop constraints 1
drop aggregates 2
drop materialized views 2
load data 30
create constraints 180
create indices 90
create materialized views 60
create aggregates 120
gather statistics 300
© 2010 IBM Corporation
Information Management
27
Solving the data load and query performance problem
“
We act out the market every day to capitalize on opportunities. Complex merchandize reports that had taken days to process on the old platform now take five minutes on the new one. Simpler queries are even faster.
-- Chief Information Officer at a large US retailer
“
Data loads jobs Oracle Netezza
1 + 5 hours 2 mins 53 secs
2 1 hour 12 mins 7 secs 3 mins 29 secs
3 1 hour 25 mins 56 secs 4 mins 20 secs
4 1hour 30 mins 00 secs 5 mins 42 secs
© 2010 IBM Corporation
Information Management
Netezza Loads Data at 2.5TB per Hour
0
500
1000
1500
2000
2500
3000
1 ta
sk
2 ta
sks
3 ta
sks
4 ta
sks
5 ta
sks
6 ta
sks
7 ta
sks
8 ta
sks
9 ta
sks
10 ta
sks
11 ta
sks
12 ta
sks
13 ta
sks
14 ta
sks
15 ta
sks
Throughput (GB/Hr)
© 2010 IBM Corporation
Information Management
Operations• Simply load and go .… it’s an appliance
Minimal DBA Tuning• No configuration or physical modeling• No indexes– out of the box performance
ETL Developers• No aggregate tables needed
->Less ETL logic• Faster load and transformation times
Business Analysts• Train of thought analysis – 10 to 100x faster• True ad hoc queries – no tuning, no indexes• Ask complex queries against large datasets
Page 29
Netezza is Simple to Deploy Since it is so Fast
© 2011 IBM Corporation30
Traditional Complexity … Netezza Simplicity
0. CREATE DATABASE TEST LOGFILE 'E:\OraData\TEST\LOG1TEST.ORA' SIZE 2M, 'E:\OraData\TEST\LOG2TEST.ORA' SIZE 2M, 'E:\OraData\TEST\LOG3TEST.ORA' SIZE 2M, 'E:\OraData\
TEST\LOG4TEST.ORA' SIZE 2M, 'E:\OraData\TEST\LOG5TEST.ORA' SIZE 2M EXTENT MANAGEMENT LOCAL MAXDATAFILES 100 DATAFILE 'E:\OraData\TEST\SYS1TEST.ORA'
SIZE 50 M DEFAULT TEMPORARY TABLESPACE temp TEMPFILE 'E:\OraData\TEST\TEMP.ORA' SIZE 50 M
UNDO TABLESPACE undo DATAFILE 'E:\OraData\TEST\UNDO.ORA' SIZE 50 M NOARCHIVELOG CHARACTER SET WE8ISO8859P1;
1. Oracle* table and indexes 2. Oracle tablespace 3. Oracle datafile 4. Veritas file 5. Veritas file system 6. Veritas striped logical volume 7. Veritas mirror/plex 8. Veritas sub-disk 9. SunOS raw device 10. Brocade SAN switch 11. EMC Symmetrix volume 12. EMC Symmetrix striped meta-volume 13. EMC Symmetrix hyper-volume 14. EMC Symmetrix remote volume (replication) 15. Days/weeks of planning meetings
Netezza: Low (ZERO) Touch:
CREATE DATABASE my_db;
© 2010 IBM Corporation
Information Management
31
Netezza Delivers Simplicity
Up and running 6 months before being trained
200X faster than Oracle system
ROI in less than 3 months
“ “
Allowing the business users access to the Netezza box was what sold it.
-- Steve Taff, Executive Dir. of IT Services
© 2010 IBM Corporation
Information Management
Agenda Netezza Solution Highlight
– What is it?– Why it is good?
Netezza Tour– Specialized hardware and architecture– How Netezza operates in running database queries?
Netezza Simplicity
Netezza Performance
© 2010 IBM Corporation
Information Management
POC - A Telco Company
Environment– Netezza TwinFin 12 full rack
Raw Data volume– Call Level Detail : 3TB (9 billion rows)– Financial Bill : 600GB (5.4 billion rows)– Customer Info : 60GB (91.1 million rows)
© 2010 IBM Corporation
Information Management
POC: Data Maintenance
Testing scenarios Raw Data involved Elapsed Time
Insert data to an empty table from a table with 750 million records
270GB 3m47s
Concurrently insert data to 5 empty tables from 5 tables with 550 million records each respectively
990GB = 198GB x 5 11m35s
Update all rows of a table with 1.5 billion records
270GB 9m40s
Concurrently update all rows for 5 tables with 1.1 billion records each
1,350GB = 270GB x 5 44m44s
Delete 183 million records from a table with 550 million records
198GB 1m32s
Concurrently delete 183 million records from 5 tables with 550 million records each
990GB = 198GB x 5 9m43s
© 2010 IBM Corporation
Information Management
POC: Enquiries (With NO Indexes)
Testing scenarios Table involved Elapsed Time for Single Task
Elapsed Time for Concurrent Tasks
Query on single fact table Call Level Detail (3TB, 9 billion records)
1m16s 2m49s (10 tasks)
Query on joining 2 fact tables Call Level Detail (3TB, 9 Billion records) Customer Info (60GB, 91.1 million records)
22s 58s (5 tasks)
Query to get the top 100 call duration time in different groups for a particular month
Call Level Detail (3TB, 9 Billion records) Customer Info (60GB, 91.1 million records)
1m53s 4m10s (5 tasks)
Query on joining a fact table to multiple dimension tables
Customer Info (60GB, 91.1 million records) 5 dimension tables
5s 14s (5 tasks)
Query on joining a fact table with subquery involve joining another two fact tables
Call Level Detail (3TB, 9 Billion records) Customer Info (60GB, 91.1 million records) Financial Bill (600GB, 5.4 billion records)
21s 1m55s (5 tasks)
© 2010 IBM Corporation
Information Management
36
Catalina Marketing: Building loyalty one customer at a time
No targeting Basic targeting e.g., offer dog food
coupon to customer buying dog food
Using predictive models to find latent
correlations
Coupon redemption rate
1% 6-10% 25%
Marketing to a segment of one – 195 million US loyalty program members
– Every coupon printed is unique to the individual customer
– Customized based on three years' worth of purchase history
Increased staff productivity – from 50 to 600 new models per year
Increased efficiency – from 4 hours to score a model to 60 seconds