nenug apr14 talk - data modeling for netezza
DESCRIPTION
TRANSCRIPT
Data Modeling and Netezza
Biju Nair NENUG Talk
30-‐Apr-‐2014
Mo@va@on and Goal
• Mo@va@on – Performance degrada@on with data volume
• Goal – Highlight considera@ons while modeling for NZ
2
Data Modeling
• Logical Data Modeling – Business domain data representa@on – Independent of DBMS technology – NZ is an appliance for analy@cs
• Set based processing vs row based processing • De-‐normaliza@on • Snow flake/Star schema
• Physical Data Modeling – Takes in the DBMS features and constraints – Need to understand the DBMS architecture
3
NZ Architecture
Host
Snippet Processors
Snippet Processors
Snippet Processors
Data Data Data
-‐ Parse, Op+mize and Compile query -‐ Schedule snippets -‐ Distribute data
Executes snippets
-‐ Shared nothing MPP -‐ Custom IP backbone
-‐ Appliance efficiency is maximized when -‐ snippet processors can run independently i.e. data independence -‐ All snippet processors are u@lized uniformly 4
Snippet Processors
Data
Accelerator (FPGA) CPU
Compute
Host
Reads compressed
data
Un-‐compress Remove columns
Restrict rows (where)
Perform computa+on
Send data to host
-‐ Disk reads are incredibly slow rela@ve to other components especially seek @me -‐ While the CPU overhead is reduced, volume of data read will impact performance
5
Data Storage
Data
Snippet Processors
Host
Extend
Page
Extend
Extend
…
Meta-‐Data
-‐ Meta data iden@fies extends/pages to read or skip 6
Modeling Priori@es
• U@liza@on of all snippet processors – Need to be able to u@lize uniformly
• Maximize MPP capability of snippet processors – Ideally snippet processors should be independent
• Minimize data read from disk – Minimize data stored
• Improve computa@on in snippet processor – Compounded with data volume will help performance
7
Snippet Processor U@liza@on Data Distribu+on
Host
Snippet Processors
Snippet Processors
Snippet Processors
1,MA,1212,… 3,MA,0414,…
2,CA,0113,…
1,MA,1212,… 2,CA,0113,… 3,MA,0414,…
Distribute by state
-‐ NZ will pick one of the columns to distribute if none specified in table defini@on -‐ First column in the table
Data Skew
8
Snippet Processor U@liza@on Data Distribu+on
Host
Snippet Processors
Snippet Processors
Snippet Processors
1,MA,1212,… 2,CA,0113,… 3,MA,0414,…
1,MA,1212,… 2,CA,0113,… 3,MA,0414,…
Distribute by mo-‐yr
-‐ Snippet processors are u@lized uniformly -‐ What if most of the query is on for the current month?
-‐ Processing skew 9
Snippet Processor U@liza@on Data Distribu+on
Host
Snippet Processors
Snippet Processors
Snippet Processors
1,MA,1212,… 4,CA,0414,…
2,CA,0113,… 3,MA,0414,…
1,MA,1212,… 2,CA,0113,… 3,MA,0414,… 4,CA,0414,…
Distribute random
-‐ Snippet processors are u@lized uniformly -‐ Helps prevent processing skew
10
3,ORD1,ITEM1,… 4,ORD1,ITEM1,… 2,ORD1,ITEM1,…
1,ORD1,ITEM1,…
Snippet Processor U@liza@on Data Distribu+on and Table Joins
Host
Snippet Processors
Snippet Processors
Snippet Processors
4,CA,0414,… 1,MA,1212,… 3,MA,0414,…
2,CA,0113,…
1,ORD1,ITEM1,… 2,ORD1,ITEM1,… 3,ORD1,ITEM1,… 4,ORD1,ITEM1,…
Distribute random
-‐ Snippet processors are u@lized uniformly -‐ Makes snippet processors dependent on others impac@ng MPP maximiza@on
Need to redistribute data from both tables
11
3,ORD1,ITEM1,… 2,ORD1,ITEM1,… 1,ORD1,ITEM1,… 4,ORD1,ITEM1,…
Snippet Processor U@liza@on Data Distribu+on and Table Joins
Host
Snippet Processors
Snippet Processors
Snippet Processors
4,CA,0414,… 1,MA,1212,… 3,MA,0414,…
2,CA,0113,…
1,ORD1,ITEM1,… 2,ORD1,ITEM1,… 3,ORD1,ITEM1,… 4,ORD1,ITEM1,…
Distribute on Join column -‐ cid
-‐ Snippet processors are u@lized uniformly -‐ Makes snippet processors dependent on others impac@ng MPP maximiza@on
-‐ Becer than the previous scenario
Need to redistribute data from one table
12
3,ORD1,ITEM1,… 2,ORD1,ITEM1,… 1,ORD1,ITEM1,… 4,ORD1,ITEM1,…
Snippet Processor U@liza@on Data Distribu+on and Table Joins
Host
Snippet Processors
Snippet Processors
Snippet Processors
1,MA,1212,… 4,CA,0414,…
2,CA,0113,… 3,MA,0414,…
Distribute both tables on Join column -‐ cid
-‐ Snippet processors are u@lized uniformly -‐ Makes snippet processors independent maximizing MPP
13
Snippet Processor U@liza@on Data Distribu+on
• Iden@fy keys to distribute data uniformly – Avoid data and processing skew – Try using join columns as the distribu@on keys – Choose same data types for join columns
• If table size is small random distribu@on is fine – If one of the join table is small, NZ will broadcast
• Redistribu@on may not be an overkill for small data – For e.g., selec@ng a small number if columns
14
Snippet Processor U@liza@on Distribu+on and Query Time
0
0.5
1
1.5
2
2.5
3
3.5
Random 1 Correct Distribu@on 2 Correct Distribu@on
Query Time For Different Distribu+ons
Time (min)
15
Snippet Processor U@liza@on Join Column Type and Query Time
0
0.5
1
1.5
2
2.5
Incorrect Data Types Correct Data Types
Join query +me -‐ same and diff data types
Time (min)
16
Minimize Data Read From Disk Zone Maps
• Data types which supports Zone Maps – All integer data types
• int1 • int2 • int4 • int8
– Date – Timestamp
17
Refer to the product manual for the version of NZ used for the complete list of zone map able data types
Data
Snippet Processors
Host
Extend 1
Page
Extend 2
Extend 3
…
Meta-‐Data
-‐ May end up reading all data from disk
Minimize Data Read From Disk Zone Maps
Table column (cid) is numeric(10,0)
No zone map for cid
18
Data
Snippet Processors
Host
Extend
Page
Extend
Extend
…
Meta-‐Data
-‐ Zone maps can be used to minimize data read from disk
Minimize Data Read From Disk Zone Maps
Table column (cid) is bigint
19
Minimize Data Read From Disk Zone Maps and Query Time
1.3
1.35
1.4
1.45
1.5
1.55
1.6
Incorrect Data Type Correct Data Type
Query +me with and without zone map
Time (min)
20
Minimize Data Read From Disk Clustered Base Tables
• NZ stores data with same organize keys closely • Addi@onal data types are zone map able
– char – varchar – nchar – nvarchar – float – double – bool – @me – Interval
• Helps improve performance of mul@ table join
21
Extend
Extend
Extend
Minimize Data Read From Disk Clustered Base Tables
Table distributed on cid
1,MA,Boston,…
5,CA,LA,…
3,FL,Tampa,,…
1,MA,Salem,…
5,CA,SF,…
3,FL,Orlando,…
1,MA,Lowell,…
5,CA,Pasadena,… 3,FL,Miami,…
Extend
Extend
Extend
Table distributed on cid organize on state
1,MA,Boston,…
1,MA,Salem,…
1,MA,Lowell,…
3,FL,Tampa,,…
3,FL,Orlando,…
3,FL,Miami,…
5,CA,LA,…
5,CA,SF,…
5,CA,Pasadena,…
Extend
1,AL,Alabama,…
5,CA,California,… 10,FL,Florida,…
State
22,MA,Mass,…
22
Minimize Data Read From Disk Clustered Base Table
0
0.5
1
1.5
2
2.5
No Org+Correct Dist Org+Correct Dist
Query Time with and without org
Time (min)
23
Minimize Data Read From Disk Materialized Views
• View with frequently used columns of base table – Unlike views, materialized view stores data – Reduced data read from disk – Addi@onal storage required – Need to be refreshed if base table data changes
• Can be used as an index against base table – Stores loca@on of base table data loc in a column
24
Extend
Extend
Extend
Minimize Data Read From Disk Materialized Views
T_EMP distributed on cid
1,MA,Mike,…
5,CA,Fally,…
3,FL,Chris,…
4,MA,Robert,…
7,CA,Mary,…
2,FL,Jus+n,…
6,MA,Harini,…
8,CA,Mike,…
9,FL,Martha,…
SELECT ID, NAME FROM T_EMP;
Extend
MV on T_EMP order by state, name
5,CA,Fally
7,CA,Mary
8,CA,Mike
3,FL,Chris
2,FL,Jus+n
9,FL,Martha
6,MA,Harini
1,MA,Mike
4,MA,Robert
SELECT * FROM T_EMP WHERE STATE = ‘CA’
25
Minimize Data Read From Disk Materialized Views
0
0.2
0.4
0.6
0.8
1
1.2
1.4
No MV With MV
Query +me on single table -‐MV vs No MV
Time (min)
26
Minimize Data Read From Disk Materialized Views
0
0.2
0.4
0.6
0.8
1
1.2
1.4
1.6
1.8
No MV With MV
Join Query Time -‐ With and without MV
Time (min)
27
Minimize Data Read From Disk Minimize Data Stored
• Choose storage efficient data types – Difference between bigint and int is 4 bytes
• Use char instead of varchar if the data length is fixed – varchar has a 2 byte overhead
• Define columns as “not null” where possible
• Store only the required data in table columns
• Encode duplicate data stored in rows 28
Improve Computa@on in Snippet Processor NZ Object Defini+ons
• Define columns as “not null” where possible – Removes logic to check nulls
• Define table keys and rela@onships – Helps NZ query op@mizer to generate efficient code
29