System Architecture: Big Iron (NUMA)
About Joe Chang
SQL Server Execution Plan Cost Model
True cost structure by system architecture
Decoding statblob (distribution statistics)
SQL Clone – statistics-only database
ToolsExecStats – cross-reference index use by SQL-execution plan
Performance Monitoring,
Profiler/Trace aggregation
Scaling SQL on NUMA Topics
OLTP – Thomas Kejser session: “Designing High Scale OLTP Systems”
Data Warehouse
Ongoing Database Development
Bulk Load – SQL CAT paper + TK session
“The Data Loading Performance Guide”
Other sessions with common coverage:
Monitoring and Tuning Parallel Query Execution II, R Meyyappan (SQLBits 6)
Inside the SQL Server Query Optimizer, Conor Cunningham
Notes from the field: High Performance Storage, John Langford
SQL Server Storage – 1000GB Level, Brent Ozar
Symmetric Multi-Processing
[Diagram: SMP system – CPUs on a shared system bus to the MCH (memory controller hub), with ICH and PXH I/O hubs]
SMP: processors are not dedicated to specific tasks (unlike ASMP); single OS image; each processor can access all memory
SMP makes no reference to memory architecture
Not to be confused with Simultaneous Multi-Threading (SMT); Intel calls SMT Hyper-Threading (HT), which is not to be confused with AMD HyperTransport (also HT)
Non-Uniform Memory Access
[Diagram: NUMA system – four nodes, each with a memory controller and four CPUs; node controllers linked by a shared bus or crossbar]
NUMA architecture – path to memory is not uniform:
1) Node: processors and memory; memory + node controllers separate or combined
2) Nodes connected by shared bus, crossbar, or ring
Traditionally, 8-way+ systems
Local memory latency ~150ns, remote node memory ~300–400ns; this can cause erratic behavior if the OS/code is not NUMA-aware
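The impact of the local/remote mix can be sketched with a quick calculation using the approximate figures above (the 350ns remote value is an assumed midpoint of the 300–400ns range; the 25% local figure is what falls out of interleaving memory uniformly across 4 nodes):

```python
# Rough effective memory latency under NUMA, using the approximate
# slide figures: ~150ns local, ~300-400ns remote (no coherency effects).
def effective_latency_ns(local_fraction, local_ns=150, remote_ns=350):
    """Weighted average of local and remote access latency."""
    return local_fraction * local_ns + (1 - local_fraction) * remote_ns

# Memory interleaved across 4 nodes: only 25% of accesses are local.
interleaved = effective_latency_ns(0.25)   # 300.0 ns
# NUMA-aware placement keeping 90% of accesses local.
numa_aware = effective_latency_ns(0.90)    # 170.0 ns
print(interleaved, numa_aware)
```

The sketch ignores coherency traffic, which (as later slides note) makes real access times more complicated.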
AMD Opteron
[Diagram: 4-way Opteron – processors linked by HyperTransport, with HT2100/HT1100 I/O hubs]
Local memory latency ~50ns, 1 hop ~100ns, 2 hops ~150ns? Actual behavior is more complicated because of snooping (cache-coherency traffic)
Technically, Opteron is NUMA, but remote-node memory latency is low, so there is no negative impact or erratic behavior. For practical purposes it behaves like an SMP system
8-way Opteron Sys Architecture
An Opteron processor (prior to Magny-Cours) has 3 HyperTransport links. In the 8-way diagram, the top and bottom-right processors use 2 HT links to connect to other processors and the 3rd for I/O; CPU 1 and CPU 7 require 3 hops to reach each other
[Diagram: 8-way Opteron topology – CPU0–CPU7 linked by HyperTransport]
Nehalem System Architecture
Intel Nehalem-generation processors have QuickPath Interconnect (QPI): Xeon 5500/5600 series have 2 QPI links, Xeon 7500 series have 4, so glue-less 8-way is possible
NUMA Local and Remote Memory
Local memory is closer than remote, so physical access time is shorter
But what is the actual access time, once the cache-coherency requirement is included?
Source Snoop Coherency
From HP PREMA Architecture whitepaper:
All reads result in snoops to all other caches, … Memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line
DL980G7
From HP PREMA Architecture whitepaper: Each node controller stores information about* all data in the processor caches, minimizes inter-processor coherency communication, reduces latency to local memory (*only cache tags, not cache data)
HP ProLiant DL980 Architecture
Node controllers reduce effective memory latency
Superdome 2 – Itanium, sx3000
Agent – Remote Ownership Tag + L4 cache tags
64MB eDRAM L4 cache data
OS Memory Models
SUMA: Sufficiently Uniform Memory Access – memory interleaved across nodes
NUMA: memory is first interleaved within a node; the memory stripe is then spanned across nodes
[Diagram: memory line numbering – SUMA interleaves consecutive lines across nodes; NUMA keeps each node's memory stripe contiguous]
Windows OS NUMA Support
Memory models: SUMA – Sufficiently Uniform Memory Access; NUMA – separate memory pools by node
[Diagram: memory layout by node under the SUMA and NUMA models, 4 nodes]
Memory is striped across NUMA nodes
Memory Model Example: 4 Nodes
SUMA memory model: memory access uniformly distributed – 25% of memory accesses local, 75% remote (on 4 nodes)
NUMA memory model: goal is better than 25% local-node access
True local access time also needs to be faster; cache coherency may increase local access time
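The two models can be sketched as address-to-node mappings (a simplified sketch: real interleaving happens at cache-line or page granularity, and real working sets are not uniformly random):

```python
# Simplified sketch of the two OS memory models for 4 nodes.
# SUMA: consecutive memory lines are interleaved round-robin across nodes.
# NUMA: memory is divided into contiguous per-node pools.
NODES = 4

def suma_node(line, nodes=NODES):
    """Node owning a line under round-robin interleave."""
    return line % nodes

def numa_node(line, total_lines, nodes=NODES):
    """Node owning a line under contiguous per-node pools."""
    return line * nodes // total_lines

total = 32
# Fraction of accesses that are local for a thread on node 0,
# touching lines uniformly at random across the whole range.
local_suma = sum(suma_node(i) == 0 for i in range(total)) / total
# Same thread, but its working set was allocated in node 0's pool.
local_numa = sum(numa_node(i, total) == 0 for i in range(8)) / 8
print(local_suma)  # 0.25 -> only 25% local under SUMA
print(local_numa)  # 1.0  -> all local when the working set stays in node 0's pool
```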
Architecting for NUMA
Web tier determines port for each user by group (but should not be by geography!)
Affinitize each port to a NUMA node
Each node accesses localized data (partition?)
OS may allocate a substantial chunk from Node 0?
End to End Affinity

App Server      TCP Port   CPU/Memory   Cores       Table
North East      1440       Node 0       0-0, 0-1    NE
Mid Atlantic    1441       Node 1       1-0, 1-1    MidA
South East      1442       Node 2       2-0, 2-1    SE
Central         1443       Node 3       3-0, 3-1    Cen
Texas           1444       Node 4       4-0, 4-1    Tex
Mountain        1445       Node 5       5-0, 5-1    Mnt
California      1446       Node 6       6-0, 6-1    Cal
Pacific NW      1447       Node 7       7-0, 7-1    PNW
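The port-to-node mapping above is configurable in SQL Server: in Configuration Manager's TCP/IP properties, a NUMA node affinity mask can be appended to the port number in square brackets. A sketch following the example table (node 0 = mask 0x1, node 1 = 0x2, etc.; verify the syntax against the documentation for your SQL Server version):

```
TCP Port: 1440[0x1],1441[0x2],1442[0x4],1443[0x8],1444[0x10],1445[0x20],1446[0x40],1447[0x80]
```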
HP-UX LORA
HP-UX – not Microsoft Windows
Locality-Optimized Resource Alignment (LORA)
12.5% interleaved memory
87.5% NUMA-node local memory
System Tech Specs
Memory pricing:
8GB ~$400 ea: 18 x 8GB = 144GB ($7,200); 64 x 8GB = 512GB (~$26K)
16GB ~$1,100 ea: 12 x 16GB = 192GB (~$13K); 64 x 16GB = 1TB (~$70K)

Processors        Cores  DIMMs  PCI-E G2      Max memory  Total cores  Base
2 x Xeon X56x0    6      18     5 x8+, 1 x4   192G*       12           $7K
4 x Opteron 6100  12     32     5 x8, 1 x4    512G        48           $14K
4 x Xeon X7560    8      64     4 x8, 6 x4†   1TB         32           $30K
8 x Xeon X7560    8      128    9 x8, 5 x4‡   2TB         64           $100K

* Max memory for 2-way Xeon 5600 is 12 x 16GB = 192GB
† Dell R910 and HP DL580G7 have different PCI-E
‡ ProLiant DL980G7 can have 3 IOHs for additional PCI-E slots
Operating System
Windows Server 2003 RTM, SP1: network limitations (default); Scalable Networking Pack (KB 912222)
Windows Server 2008
Windows Server 2008 R2 (64-bit only): breaks the 64-logical-processor limit; NUMA I/O enhancements? (impacts OLTP – search: MSI-X)
Do not bother trying to do DW on a 32-bit OS or 32-bit SQL Server, and don’t try to do DW on SQL Server 2000
SQL Server version
SQL Server 2000: serious disk I/O limitations (1GB/sec?); problematic parallel execution plans
SQL Server 2005 (fixed most SQL 2000 problems): 64-bit on x64 (Opteron and Xeon); SP2 – performance improvement ~10%(?)
SQL Server 2008 & R2: compression, filtered indexes, etc.; star join, parallel query to partitioned tables
Configuration
SQL Server startup parameters: -E; trace flags 834, 836, 2301
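In Configuration Manager these would typically appear as semicolon-separated startup parameters (an illustrative fragment, not a recommendation; evaluate each trace flag for your workload):

```
-E;-T834;-T836;-T2301
```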
Auto_Date_Correlation: Order date < A, Ship date > A
Implied: Order date > A−C, Ship date < A+C
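Auto date correlation corresponds to the DATE_CORRELATION_OPTIMIZATION database option; a minimal T-SQL sketch (the database name is hypothetical, and the option requires foreign-key-related tables with datetime columns to be useful):

```sql
ALTER DATABASE Sales
SET DATE_CORRELATION_OPTIMIZATION ON;
```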
Port affinity – mostly OLTP
Dedicated processor for log writer?