system architecture: big iron (numa) joe chang [email protected]

33
System Architecture: Big Iron (NUMA) Joe Chang [email protected] www.qdpma.com

Upload: ferdinand-booth

Post on 24-Dec-2015

215 views

Category:

Documents


0 download

TRANSCRIPT

System Architecture: Big Iron (NUMA)

Joe [email protected]

About Joe ChangAbout Joe Chang

SQL Server Execution Plan Cost Model

True cost structure by system architecture

Decoding statblob (distribution statistics)

SQL Clone – statistics-only database

ToolsExecStats – cross-reference index use by SQL-execution plan

Performance Monitoring,

Profiler/Trace aggregation

Scaling SQL on NUMA TopicsScaling SQL on NUMA Topics

OLTP – Thomas Kejser session“Designing High Scale OLTP Systems”

Data Warehouse

Ongoing Database DevelopmentOngoing Database Development

Bulk Load – SQL CAT paper + TK session

“The Data Loading Performance Guide”Other Sessions with common coverage:Monitoring and Tuning Parallel Query Execution II, R Meyyappan(SQLBits 6) Inside the SQL Server Query Optimizer, Conor CunninghamNotes from the field: High Performance Storage, John LangfordSQL Server Storage – 1000GB Level, Brent Ozar

Server Systems and ArchitectureServer Systems and Architecture

Symmetric Multi-ProcessingSymmetric Multi-Processing

CPU CPU

System Bus

CPU CPU

MCH

ICHPXH PXH

SMP, processors are not dedicated to specific tasks (ASMP), single OS image, each processor can acess all memory

SMP makes no reference to memory architecture?

Not to be confused to Simultaneous Multi-Threading (SMT)Intel calls SMT Hyper-Threading (HT), which is not to be confused with AMD Hyper-Transport (also HT)

Non-Uniform Memory AccessNon-Uniform Memory Access

MemoryController

CPU CPU CPU CPU

Shared Bus or X Bar

MemoryController

CPU CPU CPU CPU

MemoryController

CPU CPU CPU CPU

MemoryController

CPU CPU CPU CPU

Node Controller Node Controller Node Controller Node Controller

NUMA Architecture - Path to memory is not uniform1) Node: Processors, Memory, Separate or combined

Memory + Node Controllers 2) Nodes connected by shared bus, cross-bar, ring

Traditionally, 8-way+ systems

Local memory latency ~150ns, remote node memory ~300-400ns, can cause erratic behavior if OS/code is not NUMA aware

AMD OpteronAMD Opteron

Opteron

Opteron

HT2100

Opteron

Opteron

HT1100HT2100

Local memory latency ~50ns, 1 hop ~100ns, two hop 150ns?Actual: more complicated because of snooping (cache coherency traffic)

Technically, Opteron is NUMA, but remote node memory latency is low, no negative impact or erratic behavior!For practical purposes: behave like SMP system

8-way Opteron Sys Architecture8-way Opteron Sys Architecture

Opteron processor (prior to Magny-Cours) has 3 Hyper-Transport links. Note 8-way top and bottom right processors use 2 HT to connect to other processors, 3rd HT for IO, CPU 1 & 7 require 3 hops to each other

CPU0

CPU2

CPU4

CPU6

CPU1

CPU3

CPU5

CPU7

http://www.techpowerup.com/img/09-08-26/17d.jpg

Nehalem System ArchitectureNehalem System Architecture

Intel Nehalem generation processors have Quick Path Interconnect (QPI) Xeon 5500/5600 series have 2, Xeon 7500 series have 4 QPI 8-way Glue-less is possible

NUMA Local and Remote MemoryNUMA Local and Remote Memory

Local memory is closer than remote

Physical access time is shorter

What is actual access time?With cache coherency requirement!

HT Assist – Probe FilterHT Assist – Probe Filterpart of L3 cache used as directory cache

ZDNET

Source Snoop CoherencySource Snoop Coherency

From HP PREMA Architecture whitepaper:

All reads result in snoops to all other caches, … Memory controller cannot return the data until it has collected all the snoop responses and is sure that no cache provided a more recent copy of the memory line

DL980G7DL980G7

From HP PREAM Architecture whitepaper:Each node controller stores information about* all data in the processor caches, minimizes inter-processor coherency communication, reduces latency to local memory(*only cache tags, not cache data)

HP ProLiant DL980 ArchitectureHP ProLiant DL980 Architecture

Node Controllers reduces effective memory latency

Superdome 2 – Itanium, sx3000Superdome 2 – Itanium, sx3000

Agent – Remote Ownership Tag + L4 cache tags

64M eDRAM L4 cache data

IBM x3850 X5 (Glue-less)IBM x3850 X5 (Glue-less)

Connect two 4-socket Nodes to make 8-way system

Fujitsu R900Fujitsu R900

4 IOH14 x8 PCI-E slots, 2 x4, 1x8 internal

OS Memory ModelsOS Memory Models

SUMA: Sufficiently Uniform Memory AccessMemory interleaved across nodes

NUMA: first interleaved within a node, then spanned across nodes

Memory stripe is then spanned across nodes

1

2

1

2

Node 0

4933171

4832160

5135193

5034182

Node 0

5337215

5236204

5539237

5438226

Node 0

5741259

5640248

59432711

58422610

Node 0

61452913

60442812

63473115

62463014

Node 0

13951

12840

151173

141062

Node 0

29252117

28242016

31272319

30262218

Node 0

45413733

44403632

47433935

46423834

Node 0

61575349

60565248

63595551

62585450

OS Memory ModelsOS Memory Models

251791

Node 0

241680

2719113

Node 1

2618102

2921135

Node 2

2820124

3123157

Node 3

3022146

7531

Node 0

6420

1513119

Node 1

1412108

23211917

Node 2

22201816

31292725

Node 3

30282624

SUMA: Sufficiently Uniform Memory AccessMemory interleaved across nodes

NUMA: first interleaved within a node, then spanned across nodes

Memory stripe is then spanned across nodes

1

2

1

2

Windows OS NUMA SupportWindows OS NUMA Support

Memory modelsSUMA – Sufficiently Uniform Memory Access

NUMA – separate memory pools by Node

Node 0

0

24168

1

25179

Node 1

2

261810

3

271911

Node 2

4

282012

5

292113

Node 3

6

302214

7

312315

Node 0

0

642

1

753

Node 1

8

141210

9

151311

Node 2

16

222018

17

232119

Node 3

24

302826

25

312927

Memory is striped across NUMA nodes

Memory Model Example: 4 NodesMemory Model Example: 4 Nodes

SUMA Memory Modelmemory access uniformly distributed

25% of memory accesses local, 75% remote

NUMA Memory ModelGoal is better than 25% local node access

True local access time also needs to be faster

Cache Coherency may increase local access

Architecting for NUMAArchitecting for NUMA

Web determines port for each user by group (but should not be by geography!)

Affinitize port to NUMA node

Each node access localized data (partition?)

OS may allocate substantial chunk from Node 0?

End to End Affinity

North East

Mid Atlantic

South East

Central

Texas

Mountain

California

Pacific NW

1440

1441

1442

1443

1444

1445

1446

1447

Node 0

Node 1

Node 2

Node 3

Node 4

Node 5

Node 6

Node 7

0-0

0-1

1-0

1-1

2-0

2-1

3-0

3-1

4-0

4-1

5-0

5-1

6-0

6-1

7-0

7-1

NE

MidA

SE

Cen

Tex

Mnt

Cal

PNW

App Server TCP Port CPU Memory Table

Architecting for NUMAArchitecting for NUMA

Web determines port for each user by group (but should not be by geography!)

Affinitize port to NUMA node

Each node access localized data (partition?)

OS may allocate substantial chunk from Node 0?

End to End Affinity

North East

Mid Atlantic

South East

Central

Texas

Mountain

California

Pacific NW

1440

1441

1442

1443

1444

1445

1446

1447

Node 0

Node 1

Node 2

Node 3

Node 4

Node 5

Node 6

Node 7

0-0

0-1

1-0

1-1

2-0

2-1

3-0

3-1

4-0

4-1

5-0

5-1

6-0

6-1

7-0

7-1

NE

MidA

SE

Cen

Tex

Mnt

Cal

PNW

App Server TCP Port CPU Memory Table

HP-UX LORAHP-UX LORA

HP-UX – Not Microsoft Windows

Locality-Optimizer Resource Alignment

12.5% Interleaved Memory

87.5% NUMA node Local Memory

System Tech SpecsSystem Tech Specs

8GB $400 ea 18 x 8G = 144GB, $7200, 64 x 8G = 512GB - $26K

16GB $1100 ea 12 x16G =192GB, $13K, 64 x 16G = 1TB – $70K

Processors

2 x Xeon X56x0

4 x Opteron 6100

4 x Xeon X7560

8 x Xeon X7560

Cores DIMM PCI-E G2

6 18 5 x8+,1 x4

12 32 5 x8, 1 x4

8 64 4 x8, 6 x4†

8 128 9 x8, 5 x4‡

Max memory

192G*

512G

1TB

2TB

Total Cores

12

48

32

64

Base

$7K

$14K

$30K

$100K

Max memory for 2-way Xeon 5600 is 12 x 16 = 192GB,† Dell R910 and HP DL580G7 have different PCI-E ‡ ProLiant DL980G7 can have 3 IOH for additional PCI-E slots

Software StackSoftware Stack

Operating SystemOperating System

Windows Server 2003 RTM, SP1Network limitations (default)

Scalable Networking Pack (912222)

Windows Server 2008

Windows Server 2008 R2 (64-bit only)

Breaks 64 logical processor limit

NUMA IO enhancements?Do not bother trying to do DW on 32-bit OS or 32-bit SQL ServerDon’t try to do DW on SQL Server 2000

Impacts OLTP

Search: MSI-X

SQL Server versionSQL Server version

SQL Server 2000Serious disk IO limitations (1GB/sec ?)

Problematic parallel execution plans

SQL Server 2005 (fixed most S2K problems)

64-bit on X64 (Opteron and Xeon)

SP2 – performance improvement 10%(?)

SQL Server 2008 & R2Compression, Filtered Indexes, etc

Star join, Parallel query to partitioned table

ConfigurationConfiguration

SQL Server Startup Parameter: ETrace Flags 834, 836, 2301

Auto_Date_CorrelationOrder date < A, Ship date > A

Implied: Order date > A-C, Ship date < A+C

Port Affinity – mostly OLTP

Dedicated processor ?for log writer ?