Database Architectures for New Hardware
a tutorial byAnastassia Ailamaki
Database GroupCarnegie Mellon University
http://www.cs.cmu.edu/~natassa
@Carnegie MellonDatabases
2©2004 Anastassia Ailamaki
Focus of this tutorialDB workload execution on a modern computer
Processor
0%
20%
40%
60%
80%
100%
Ideal seq.scan
indexscan
DSS OLTP
exec
utio
n tim
e
BUSY IDLE
DBMS can run MUCH faster if they use new hardware efficiently
@Carnegie MellonDatabases
3©2004 Anastassia Ailamaki
Trends in processor performanceScaling # of transistors, innovative microarchitectureHigher performance, despite technological hurdles!
Processor speed doubles every 18 months
@Carnegie MellonDatabases
4©2004 Anastassia Ailamaki
Trends in Memory (DRAM) PerformanceMemory capacity increases exponentially
DRAM Fabrication primarily targets densitySpeed increases linearly
Larger but not as much faster memories
16MB4MB
1MB64KB 256KB
64MB
4GB
512MB
0.1
1
10
100
1000
10000
1980 1983 1986 1989 1992 1995 2000 2005
DRAM sizeD R A M S P E E D T R E N D S
YEAR OF INTRODUCTION1980 1982 1984 1986 1988 1990 1992 1994
SPEE
D(n s )
0
50
100
150
200
250SLOWEST RAS (ns)
FASTEST RAS (ns)CAS (ns)
CYCLE TIME (ns)
1 Mbit
64 Mbit
16 Mbit
4 Mbit
256 Kbit
64 Kbit
AC
CES
S TI
ME
(µs)
@Carnegie MellonDatabases
5©2004 Anastassia Ailamaki
The Memory/Processor Speed Gap
Trip to memory = thousands of instructions!
0.25
10
0.0625
80
6
0.01
0.1
1
10
100
1000pr
oces
sor
cycl
es /
inst
ruct
ion
0.01
0.1
1
10
100
1000
cycl
es /
acce
ss to
DR
AM
CPUMemory
VAX/1980 PPro/1996 2010+
@Carnegie MellonDatabases
6©2004 Anastassia Ailamaki
100G
New Hardware
Caches trade off capacity for speedExploit instruction/data localityDemand fetch/wait for data
[ADH99]:Running top 4 database systemsAt most 50% CPU utilization
But wait a minute…Isn’t I/O the bottleneck??? MemoryMemory
CCPPUU
1000
clk
100
clk
1 cl
k10
clk
L2 2M
L1 64K
4GBto
1TB
L3 32M
@Carnegie MellonDatabases
7©2004 Anastassia Ailamaki
Modern storage managers
Several decades work to hide I/OAsynchronous I/O + Prefetch & Postwrite
Overlap I/O latency by useful computationParallel data access
Partition data on modern disk array [PAT88]Smart data placement / clustering
Improve data localityMaximize parallelismExploit hardware characteristics
…and larger main memories fit more data1MB in the 80’s, 10GB today, TBs coming soon
DB storage mgrs efficiently hide I/O latencies
@Carnegie MellonDatabases
8©2004 Anastassia Ailamaki
Why should we (databasers) care?
0.330.8
1.4
DB
4
DB
Theoreticalminimum
Desktop/Engineering
(SPECInt)
DecisionSupport (TPC-H)
OnlineTransactionProcessing
(TPC-C)
Cyc
les
per i
nstr
uctio
n
Database workloads under-utilize hardwareNew bottleneck: Processor-memory delays
@Carnegie MellonDatabases
9©2004 Anastassia Ailamaki
Breaking the Memory Wall
Wish for a Database Architecture:that uses hardware intelligentlythat won’t fall apart when new computers arrivethat will adapt to alternate configurations
Efforts from multiple research communities Cache-conscious data placement and algorithmsInstruction stream optimizationsNovel database software architecturesNovel hardware designs (covered briefly)
@Carnegie MellonDatabases
10©2004 Anastassia Ailamaki
Detailed OutlineIntroduction and OverviewNew Hardware
Execution PipelinesCache memories
Where Does Time Go?Measuring Time (Tools and Benchmarks)Analyzing DBs: Experimental Results
Bridging the Processor/Memory Speed GapData PlacementAccess MethodsQuery Processing AlorithmsInstruction Stream OptimizationsStaged Database Systems
Newer HardwareHip and Trendy
Query co-processingDatabases on MEMStore
Directions for Future Research
@Carnegie MellonDatabases
11©2004 Anastassia Ailamaki
Outline
Introduction and OverviewNew HardwareWhere Does Time Go?Bridging the Processor/Memory Speed GapHip and TrendyDirections for Future Research
@Carnegie MellonDatabases
12©2004 Anastassia Ailamaki
This Section’s Goals
Understand how a program is executedHow new hardware parallelizes executionWhat are the pitfalls
Understand why database programs do not take advantage of microarchitectural advancesUnderstand memory hierarchies
How they workWhat are the parameters that affect program behaviorWhy they are important to database performance
@Carnegie MellonDatabases
13©2004 Anastassia Ailamaki
Outline
Introduction and OverviewNew Hardware
Execution PipelinesCache memories
Where Does Time Go?Bridging the Processor/Memory Speed GapHip and TrendyDirections for Future Research
@Carnegie MellonDatabases
14©2004 Anastassia Ailamaki
Sequential Program Execution
Sequential Code
Precedences: overspecificationsSufficient, NOT necessary for correctness
i1: xxxx
i2: xxxx
i3: xxxx
i1
i2
i3
Instruction-level Parallelism (ILP)
i1 i2 i3
pipeliningsuperscalar execution
Modern processors do both!
@Carnegie MellonDatabases
15©2004 Anastassia Ailamaki
fetch decode execute memory write
Pipelined Program Execution
Ft0
DF
t1ED
t2
F
ME
t3
D
WM
t4
E
t5
WM W
Tpipeline = Tbase / 5
Inst2Inst1
Inst3
FETCH
Inst
ruct
ion
stre
am
EXECUTE RETIRE
Write results
@Carnegie MellonDatabases
16©2004 Anastassia Ailamaki
Pipeline Stalls (delays)Reason: dependencies between instructions E.g., Inst1: r1 ← r2 + r3
Inst2: r4 ← r1 + r2
F D E M WF D E M W
t0 t1 t2 t3 t4 t5Inst1Inst2
F D E M W
Read-after-write (RAW)
DB programs: frequent data dependencies
F D E M WF D E M W
t0 t1 t2 t3 t4 t5Inst1Inst2 E Stall
F E MD Stall D
peak ILP = d
Peak instruction-per-cycle (IPC) = CPI = 1Inst3
@Carnegie MellonDatabases
17©2004 Anastassia Ailamaki
Higher ILP: Superscalar Out-of-Order
F D E M Wt0 t1 t2 t3 t4 t5
Inst1…n
Peak instruction-per-cycle (IPC)=n (CPI=1/n)
F D E M WF D E M W
Inst(n+1)…2n
Inst(2n+1)…3n
at most n
peak ILP = d*n
Out-of-order (as opposed to “inorder”) execution:Shuffle execution of independent instructionsRetire instruction results using a reorder bufferDB: 1.5x faster than inorder [KPH98,RGA98]
Limited ILP opportunity
@Carnegie MellonDatabases
18©2004 Anastassia Ailamaki
true:
fetc
h B
Even Higher ILP: Branch PredictionWhich instruction block to fetch?
Evaluating a branch condition causes pipeline stall
C?IDEA: Speculate branch while evaluating C!
Record branch history in a buffer, predict A or BIf correct, saved a (long) delay!If incorrect, misprediction penalty=Flush pipeline, fetch correct
instruction streamExcellent predictors (97% accuracy!)Mispredictions costlier in OOO
1 lost cycle = >1 missed instructions!
false
: fet
ch A
xxxxif C goto B
A: xxxxxxxxxxxxxxxx
B: xxxxxxxxxxxxxxxxxxxxxxxxDB programs: long code paths => mispredictions
@Carnegie MellonDatabases
19©2004 Anastassia Ailamaki
Outline
Introduction and OverviewNew Hardware
Execution PipelinesCache memories
Where Does Time Go?Bridging the Processor/Memory Speed GapHip and TrendyDirections for Future Research
@Carnegie MellonDatabases
20©2004 Anastassia Ailamaki
Larger
Faster
Memory HierarchyMake common case fast
common: temporal & spatial localityfast: smaller, more expensive memory
Keep recently accessed blocks (temporal locality)Group data into blocks (spatial locality)
Registers
Caches
Memory
DisksDB programs: >50% load/store instructions
@Carnegie MellonDatabases
21©2004 Anastassia Ailamaki
Cache Contents
Keep recently accessed block in “cache line”
address state data
On memory readif incoming address = a stored address tag then
HIT: return dataelse
MISS: choose & displace a line in usefetch new (referenced) block from memory into linereturn data
Important parameters:cache size, cache line size, cache associativity
@Carnegie MellonDatabases
22©2004 Anastassia Ailamaki
Cache Associativitymeans # of lines a block can be in (set size)Replacement: LRU or random, within set
Fully-associativea block goes in
any frame
Direct-mappeda block goes in
exactly one frame
Set-associativea block goes in any frame in
exactly one set
0
1
2
3
01234567
01010101
01234567
Line Set/Line Set
lower associativity ⇒ faster lookup
@Carnegie MellonDatabases
23©2004 Anastassia Ailamaki
Miss Classification (3+1 C’s)
compulsory (cold)“cold miss” on first access to a block
— defined as: miss in infinite cache
capacity misses occur because cache not large enough— defined as: miss in fully-associative cache
conflictmisses occur because of restrictive mapping strategyonly in set-associative or direct-mapped cache— defined as: not attributable to compulsory or capacity
coherencemisses occur because of sharing among multiprocessors
Cold misses are unavoidableCapacity, conflict, and coherence misses can be reduced
@Carnegie MellonDatabases
24©2004 Anastassia Ailamaki
Lookups in Memory Hierarchy
referencesmissesratemiss
## =EXECUTION PIPELINE
MAIN MEMORY
L1 I-CACHE L1 D-CACHE
L2 CACHE
Trips to memory are most expensive
$$$
L1: Split, 16-64K each.As fast as processor (1 cycle)L2: Unified, 512K-8MOrder of magnitude slower than L1
Memory: Unified, 512M-8GB~400 cycles (Pentium4)
(there may be more cache levels)
@Carnegie MellonDatabases
25©2004 Anastassia Ailamaki
Miss penalty
L1D: low miss penalty, if L2 hit (partly overlapped with OOO execution)
enalty)avg(miss pmiss rate*ttavg hitaccess +=)(
EXECUTION PIPELINE
MAIN MEMORY
L1 I-CACHE L1 D-CACHE
L2 CACHE
DB: long code paths, large data footprints
$$$L2: High penalty (trip to memory)
means the time to fetch and deliver block
L1I: In critical execution path.Cannot be overlapped with OOO execution.
Modern caches: non-blocking
@Carnegie MellonDatabases
26©2004 Anastassia Ailamaki
Typical processor microarchitecture
I-Unit E-Unit
L1 I-Cache L1 D-Cache
L2 Cache (SRAM on-chip)
D-TLBI-TLB
Regs
Main Memory (DRAM)
Processor
L3 Cache (SRAM off-chip)
Will assume a 2-level cache in this talk
TLB: Translation Lookaside Buffer (page table cache)
@Carnegie MellonDatabases
27©2004 Anastassia Ailamaki
Summary: New Hardware
Fundamental goal in processor design: max ILPPipelined, superscalar, speculative executionOut-of-order executionNon-blocking cachesDependencies in instruction stream lower ILP
Deep memory hierarchiesCaches important for database performanceLevel 1 instruction cache in critical execution pathTrips to memory most expensive
DB workloads on new hardwareToo many load/store instructionsTight dependencies in instruction streamAlgorithms not optimized for cache hierarchiesLong code pathsLarge instruction and data footprints
@Carnegie MellonDatabases
28©2004 Anastassia Ailamaki
Outline
Introduction and OverviewNew HardwareWhere Does Time Go?Bridging the Processor/Memory Speed GapHip and TrendyDirections for Future Research
@Carnegie MellonDatabases
29©2004 Anastassia Ailamaki
This Section’s Goals
Hardware takes time: how do we measure time?Understand how to efficiently analyze microarchitectural behavior of database workloads
Should we use simulators? When? Why?How do we use processor counters?Which tools are available for analysis?Which database systems/benchmarks to use?
Survey experimental results on workload characterization
Discover what matters for database performance
@Carnegie MellonDatabases
30©2004 Anastassia Ailamaki
Outline
Introduction and OverviewNew HardwareWhere Does Time Go?
Measuring Time (Tools and Benchmarks)Analyzing DBs: Experimental Results
Bridging the Processor/Memory Speed GapHip and TrendyDirections for Future Research
@Carnegie MellonDatabases
31©2004 Anastassia Ailamaki
Simulator vs. Real MachineReal machine
Limited to available hardware counters/eventsLimited to (real) hardware configurations Fast (real-life) execution
Enables testing real: large & more realistic workloads
Sometimes not repeatable
Tool: performance counters
Real-machine experiments to locate problemsSimulation to evaluate solutions
SimulatorCan measure any event
Vary hardware configurations
(Too) Slow executionOften forces use of scaled-down/simplified workloads
Always repeatable
Virtutech Simics, SimOS, SimpleScalar, etc.
@Carnegie MellonDatabases
32©2004 Anastassia Ailamaki
Hardware Performance Counters
What are they?Special purpose registers that keep track of programmable eventsNon-intrusive counts “accurately” measure processor eventsSoftware API’s handle event programming/overflowGUI interfaces built on top of API’s to provide higher-level analysis
What can they count?Instructions, branch mispredictions, cache misses, etc.No standard set exists
Issues that may complicate lifeProvides only hard counts, analysis must be done by user or toolsMade specifically for each processor
even processor families may have different interfacesVendors don’t like to support because is not profit contributor
@Carnegie MellonDatabases
33©2004 Anastassia Ailamaki
Evaluating Behavior using HW CountersStall time (cycle) counters
very useful for time breakdowns(e.g., instruction-related stall time)
Event countersuseful to compute ratios(e.g., # misses in L1-Data cache)
Need to understand counters before using themOften not easy from documentationBest way: microbenchmark (run programs with pre-computed events)
E.g., strided accesses to an array
@Carnegie MellonDatabases
34©2004 Anastassia Ailamaki
Example: Intel PPRO/PIII
RESOURCE_STALLSResource stallsPARTIAL_RAT_STALLSDependence stalls
IFU_IFETCH_MISSL1 Instruction missesITLB_MISSTLB misses
BR_MISS_PRED_RETIREDBranch mispredictionsBR_INST_DECODEDBranches
IFU_MEM_STALLInstruction-related stallsL2_LINES_INL2 Misses
DCU_LINES_INL1 Data (L1D) missesDATA_MEM_REFSL1 Data (L1D) accesses
INST_RETIREDInstructionsCPU_CLK_UNHALTEDCycles
“time”
Lots more detail, measurable events, statisticsOften >1 ways to measure the same thing
@Carnegie MellonDatabases
35©2004 Anastassia Ailamaki
Producing time breakdowns
Determine benchmark/methodology (more later)Devise formulae to derive useful statisticsDetermine (and test!) software
E.g., Intel Vtune (GUI, sampling), or emonPublicly available & universal (e.g., PAPI [DMM04])
Determine time components T1….TnDetermine how to measure each using the countersCompute execution time as the sum
Verify model correctnessMeasure execution time (in #cycles)Ensure measured time = computed time (or almost)Validate computations using redundant formulae
@Carnegie MellonDatabases
36©2004 Anastassia Ailamaki
Computation
Memory
BranchMispredictions
HardwareResources
Overlap opportunity:Load AD=B+CLoad E
Execution Time = Computation + StallsExecution Time = Computation + Stalls - Overlap
Execution Time Breakdown Formula
Stalls
[ADH99]
@Carnegie MellonDatabases
37©2004 Anastassia Ailamaki
Computation
Memory
BranchMispredictions
HardwareResources
Memory Stalls = Σn(stalls at cache level n)
Where Does Time Go (memory)?
Instruction lookup missed in L1I, hit in L2L1I
Data lookup missed in L1D, hit in L2L1D
Instruction or data lookup missed in L1, missed in L2, hit in memory
L2
[ADH99]
@Carnegie MellonDatabases
38©2004 Anastassia Ailamaki
What to measure?Decision Support System (DSS:TPC-H)
Complex queries, low-concurrencyRead-only (with rare batch updates)Sequential access dominatesRepeatable (unit of work = query)
On-Line Transaction Processing (OLTP:TPCC, ODB)Transactions with simple queries, high-concurrencyUpdate-intensiveRandom access frequentNot repeatable (unit of work = 5s of execution after rampup)
Often too complex to provide useful insight
@Carnegie MellonDatabases
39©2004 Anastassia Ailamaki
Microbenchmarks
What matters is basic execution loopsIsolate three basic operations:
Sequential scan (no index)Random access on records (non-clustered index)Join (access on two tables)
Vary parameters:selectivity, projectivity, # of attributes in predicatejoin algorithm, isolate phasestable size, record size, # of fields, type of fields
Determine behavior and trendsMicrobenchmarks can efficiently mimic TPC microarchitecturalbehavior! Widely used to analyze query execution
Excellent for microarchitectural analysis
[KPH98,ADH99,KP00,SAF04]
@Carnegie MellonDatabases
40©2004 Anastassia Ailamaki
Shore: YES!
On which DBMS to measure?Commercial DBMS are most realistic
Difficult to setup, may need help from companies
Prototypes can evaluate techniquesShore [ADH01] (for PAX), PostgreSQL[TLZ97] (eval)
Tricky: similar behavior to commercial DBMS?
Execution time breakdown
0%
20%
40%
60%
80%
100%
A B C D ShoreDBMS
% e
xecu
tion
time
Computation Memory Branch mispr. Resource
Memory stall time breakdown
0%
20%
40%
60%
80%
100%
A B C D ShoreDBMS
Mem
ory
stal
l tim
e (%
)
L1 Data L2 Data L1 Instruction L2 Instruction
[ADH02]
@Carnegie MellonDatabases
41©2004 Anastassia Ailamaki
Outline
Introduction and OverviewNew HardwareWhere Does Time Go?
Measuring Time (Tools and Benchmarks)Analyzing DBs: Experimental Results
Bridging the Processor/Memory Speed GapHip and TrendyDirections for Future Research
@Carnegie MellonDatabases
42©2004 Anastassia Ailamaki
DB Execution Time Breakdown
0%
20%
40%
60%
80%
100%
seq. scan TPC-D index scan TPC-C
exec
utio
n tim
e
Computation Memory Branch mispred. Resource[ADH99,BGB98,BGN00,KPH98]
PII XeonNT 4.0Four DBMS: A, B, C, D
At least 50% cycles on stalls Memory is major bottleneck
Branch mispredictions increase cache misses!
@Carnegie MellonDatabases
43©2004 Anastassia Ailamaki
Join (no index)
0%
20%
40%
60%
80%
100%
A B C DDBMS
table scan
0%
20%
40%
60%
80%
100%
A B C DDBMS
Mem
ory
stal
l tim
e (%
)
clustered index scan
0%
20%
40%
60%
80%
100%
A B C DDBMS
L1 Data L2 Data L1 Instruction L2 Instruction
PII Xeon running NT 4.0, used performance countersFour commercial Database Systems: A, B, C, D
DSS/OLTP basics: Cache Behavior
non-clustered index
0%
20%
40%
60%
80%
100%
B C DDBMS
[ADH99,ADH01]
Bottlenecks: data in L2, instructions in L1Random access (OLTP): L1I-bound
@Carnegie MellonDatabases
44©2004 Anastassia Ailamaki
Why Not Increase L1I Size?
L1I: in critical execution pathslower L1I: slower clock
Trends:L1-I cache
Max on-chipL2/L3 cache
‘96 ‘00 ‘04‘98 ‘02Year Introduced
10 KB
100 KB
1 MB
10 MB
Cac
he s
ize
Problem: a larger cache is typically a slower cacheNot a big problem for L2
L1I size is stableL2 size increase: Effect on performance?
[HA04]
@Carnegie MellonDatabases
45©2004 Anastassia Ailamaki
Increasing L2 Cache SizeDSS: Performance improves as L2 cache growsNot as clear a win for OLTP on multiprocessors
Reduce cache size ⇒ more capacity/conflict missesIncrease cache size ⇒ more coherence misses
0%
5%
10%
15%
20%
25%
1P 2P 4P# of processors
% o
f L2
cach
e m
isse
s to
dirt
y da
ta
in a
noth
er p
roce
ssor
's c
ache
256KB 512KB 1MB
Larger L2: trade-off for OLTP
[BGB98,KPH98]
@Carnegie MellonDatabases
46©2004 Anastassia Ailamaki
Summary: Where Does Time Go?Goal: discover bottlenecks
Hardware performance counters ⇒ time breakdownTools available for access and analysis (+simulators)Run commercial DBMS and equivalent prototypesMicrobenchmarks offer valuable insight
Database workloads: more than 50% stallsMostly due to memory delaysCannot always reduce stalls by increasing cache size
Crucial bottlenecksData accesses to L2 cache (esp. for DSS)Instruction accesses to L1 cache (esp. for OLTP)
@Carnegie MellonDatabases
47©2004 Anastassia Ailamaki
Memory
BBranchMispredictions
RHardwareResources
D-cache D
I-cache I
DBMS
DBMS + Compiler
Compiler + Hardware
Hardware
How to Address Bottlenecks
Next: Optimizing cache accesses
@Carnegie MellonDatabases
48©2004 Anastassia Ailamaki
Outline
Introduction and OverviewNew HardwareWhere Does Time Go?Bridging the Processor/Memory Speed GapHip and TrendyDirections for Future Research
@Carnegie MellonDatabases
49©2004 Anastassia Ailamaki
This Section’s Goals
Survey techniques to improve localityRelational dataAccess methods
Survey new query processing algorithmsPresent a new database system architectureBriefly explain Instruction Stream Optimizations
Show how much good understanding of the platform can achieve
@Carnegie MellonDatabases
50©2004 Anastassia Ailamaki
Outline
Introduction and OverviewNew HardwareWhere Does Time Go?Bridging the Processor/Memory Speed Gap
Data PlacementAccess MethodsQuery ProcessingInstruction Stream OptimizationsStaged Database Systems
Newer HardwareHip and TrendyDirections for Future Research
@Carnegie MellonDatabases
51©2004 Anastassia Ailamaki
Current Database Storage Managers
Same layout on disk/memoryMulti-level storage hierarchy
different devices at each leveldifferent “optimal” access on each device
Variable workloads and access patterns
OLTP: Full-record accessDSS: Partial-record accessno optimal “universal” layout
non-volatile storage
main memory
CPU cache
Goal: Reduce data traffic in memory hierarchy
@Carnegie MellonDatabases
52©2004 Anastassia Ailamaki
PAGE HEADER 1237RH1
30Jane RH2 4322 John
45
•••
RH4
7658 Susan 52
•
RH3 Jim 201563
37Dan87916
43Leon25345
52Susan76584
20Jim15633
45John43222
30Jane12371
AgeNameSSNRID
R
NSM (n-ary Storage Model, or Slotted Pages)
Records are stored sequentiallyAttributes of a record are stored together
“Classic” Data Layout on Disk Pages
@Carnegie MellonDatabases
53©2004 Anastassia Ailamaki
NSM in Memory Hierarchy
DISK
PAGE HEADER
7658 Susan 52
1237 Jane
Jim 20
4322 John 4530 1563
CPU CACHEMAIN MEMORY
PAGE HEADER
7658 Susan 52
1237 Jane
Jim 20
4322 John 4530 1563
4322 Jo30 Block 1
hn 45 1563 Block 27658Jim 20 Block 3
Susan 52 Block 4
Optimized for full-record accessSlow partial-record access
Wastes I/O bandwidth (fixed page layout)Low spatial locality at CPU cache
Query accesses all attributes (full-record access)Query evaluates attribute “age” (partial-record access)
BEST select namefrom Rwhere age > 50
@Carnegie MellonDatabases
54©2004 Anastassia Ailamaki
Decomposition Storage Model (DSM)
37Dan8791
43Leon2534
52Susan7658
20Jim1563
45John4322
30Jane1237
AgeNameEID
Partition original table into n 1-attribute sub-tables
[CK85]
@Carnegie MellonDatabases
55©2004 Anastassia Ailamaki
12371PAGE HEADER
4322 1563 76583 42
John Jim Suzan
Jane1PAGE HEADER
3 42
30
45 20 52
1PAGE HEADER
3 4
2
376
435
524
203
452
301
AgeRID
R387916
25345
76584
15633
43222
12371
EIDRID
R1
Dan6
Leon5
Suzan4
Jim3
John2
Jane1
NameRID
R2 8KB
8KB
8KB
Partition original table into n 1-attribute sub-tablesEach sub-table stored separately in NSM pages
Decomposition Storage Model (DSM)[CK85]
@Carnegie MellonDatabases
56©2004 Anastassia Ailamaki
DSM in Memory Hierarchy
John Jim Suzan
Jane1PAGE HEADER
3 4
2
DISK
301PAGE HEADER 2 45
20 523 4 5 43
12371PAGE HEADER
4322 1563 76583 42
MAIN MEMORY
301PAGE HEADER 2 45
20 523 4 5 43CPU CACHE
block 1301 2 45
block 220 523 4 5
John Jim Suzan
Jane1PAGE HEADER
3 4
2
12371PAGE HEADER
4322 1563 76583 42
Query accesses all attributes (full-record access)Query accesses attribute “age” (partial-record access)
CostlyCostly
Optimized for partial-record accessSlow full-record access
Reconstructing full record may incur random I/O
BEST
select namefrom Rwhere age > 50
@Carnegie MellonDatabases
57©2004 Anastassia Ailamaki
Repairing NSM’s cache performance
We need a data placement that…Eliminates unnecessary memory accessesImproves inter-record localityKeeps a record’s fields togetherDoes not affect NSM’s I/O performance
and, most importantly, is…
low-implementation-cost, high-impact
@Carnegie MellonDatabases
58©2004 Anastassia Ailamaki
1237RH1PAGE HEADER
30Jane RH2 4322 John
45
1563
RH3 Jim 20
•••
RH4
7658 Susan 52
•
PAGE HEADER 1237 4322
1563
7658
Jane John Jim Susan
30 45 2052
• •••
NSM PAGE PAX PAGE
Idea: Partition data within page for spatial locality
Partition Attributes Across (PAX)
minipage
[ADH01]
@Carnegie MellonDatabases
59©2004 Anastassia Ailamaki
PAX in Memory Hierarchy
1563PAGE HEADER 1237 4322
7658
Jane John Jim Susan
30 45 2052
DISK
1563PAGE HEADER 1237 4322
7658
Jane John Jim Susan
30 45 2052
MAIN MEMORY CPU CACHE
block 152 45 2030
Optimizes CPU cache-to-memory communicationRetains NSM’s I/O (page contents do not change)
cheap
Partial-record access in memoryFull-record access on diskBEST
BEST select namefrom Rwhere age > 50
[ADH01]
@Carnegie MellonDatabases
60©2004 Anastassia Ailamaki
PAX Performance Results (Shore)Cache data stalls
0
20
40
60
80
100
120
140
160
NSM PAXpage layout
stal
l cyc
les
/ rec
ord
L1 Data stallsL2 Data stalls
Execution time breakdown
0
300
600
900
1200
1500
1800
NSM PAX
page layout
cloc
k cy
cles
per
reco
rd
HardwareResource
BranchMispredict
Memory
ComputationQuery:select avg (ai) from R where aj >= Lo
and aj <= Hi
PII Xeon Windows NT4 16KB L1-I&D, 512 KB L2, 512 MB RAM
Validation with microbenchmarks:70% less data stall time (only compulsory misses left)Better use of processor’s superscalar capability
TPC-H performance: 15%-2x speedup in queriesExperiments with/without I/O, on three different processors
[ADH01]
PAX eliminates unnecessary trips to memory
@Carnegie MellonDatabases
61©2004 Anastassia Ailamaki
Dynamic PAX: Data Morphing
PAX random access: more cache misses in recordStore attributes accessed together contiguouslyDynamic partition updates with changing workloads
Optimize total cost based on cache missesPartition algorithms: naïve & hill-climbing algorithms
Fewer cache missesBetter projectivity and scalability for index scan queriesUp to 45% faster than NSM & 25% faster than PAX
Same I/O performance as PAX and NSMFuture work: how to handle conflicts?
[HP03]
@Carnegie MellonDatabases
62©2004 Anastassia Ailamaki
Alternatively: Repair DSM’s I/O behavior
We like DSM for partial record accessWe like NSM for full-record access
Solution: Fractured Mirrors [RDS02]
1. Get data placement right
Sparse B-Tree on ID
4 A4 A51 A1 A2 A3
3
2. Faster record reconstructionLineitem (TPCH) 1GB
020406080
100120140160180
1 2 3 4 6 8 10 12 14No. of Attributes
Seco
nds
NSMPage-at-a-timeChunk-Merge
Instead of record- or page-at-a-time…Chunk-based merge algorithm!
1. Read in segments of M pages ( a “chunk”)2. Merge segments in memory3. Requires (N*K)/M disk seeks4. For a memory budget of B pages, each
partition gets B/N pages in a chunk
@Carnegie MellonDatabases
63©2004 Anastassia Ailamaki
Fractured Mirrors3. Smart mirroring
Achieves 2-3x speedups on TPC-HNeeds 2 copies of the databaseFuture work:
A new optimizerSmart buffer pool managementUpdates
Disk 1 Disk 2
NSM Copy DSM Copy
Disk 1 Disk 2
NSM Copy DSM Copy
@Carnegie MellonDatabases
64©2004 Anastassia Ailamaki
Summary (no replication)
PAXPAX
fullfull--record record accessaccess
MemoryMemory--disk Performancedisk PerformanceCacheCache--memory Performancememory Performance
DSMDSM
NSMNSM
partial record partial record accessaccess
partial record partial record accessaccess
fullfull--record record accessaccess
Page layoutPage layout
☺☺ ☺☺
☺☺ ☺☺☺☺ ☺☺ ☺☺
Need new placement method: Efficient full- and partial-record accessesMaximize utilization at all levels of memory hierarchy
Difficult!!! Different devices/access methodsDifferent workloads on the same database
@Carnegie MellonDatabases
65©2004 Anastassia Ailamaki
Clotho
Operators
Main-memory Manager
LachesisStorage Manager
LachesisStorage Manager
AtroposLV ManagerAtropos
LV Manager
disk array
disk 0 disk 1
Bufferpool
page hdr
The Fates Storage ManagerIDEA: Decouple layout!
non-volatile storage
main memory
CPU cache
[SAG03,SSS04,SSS04a]
data directly placed via scatter/gather I/O
Memory does not need to store full NSM pages
@Carnegie MellonDatabases
66©2004 Anastassia Ailamaki
Clotho: memory stores PAX minipagesDISK select EID from R where AGE>30
1563
PAGE HEADER
1237 4322 7658
30 45 2052
Jane John Jim Susan
MAIN MEMORY2534
PAGE HEADER
2865 1015 8791
Tom Jerry Jean Kate
31 54 3325
PAGE HEADER (EID &AGE)
15631237 4322 7658
30 52 45 20
25342865 1015 8791
31 25 54 33
In-memory “skeleton”(Tailored to query)
On-disk page:PAX-like layoutBlock boundary aligned
Just the data you needQuery-specific pages!Great cache performance
Decoupled layoutFits different hardware
Low reconstruction costDone at I/O levelGuaranteed by Lachesis and Atropos [SSS04]
[SSS04]
New buffer pool manager handles sharing
@Carnegie MellonDatabases
67©2004 Anastassia Ailamaki
CSM: best-case performance of DSM and NSMTable scan time
0
50
100
150
200
250
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15Query payload [# of attributes]
Run
time
[s]
NSMDSMPAXCSM
Table: a1 … a15 (float)
Query: select a1, …from R where a1 < Hi
[SSS04]
TPC-H: Outperform DSM by 20% to 2xTPC-C: Comparable to NSM (6% lower throughput)
CSMCSM
fullfull--record record accessaccess
MemoryMemory--disk Performancedisk PerformanceCacheCache--memory Performancememory Performance
partial record partial record accessaccess
partial record partial record accessaccess
fullfull--record record accessaccess
Page layoutPage layout
☺☺ ☺☺ ☺☺ ☺☺
@Carnegie MellonDatabases
68©2004 Anastassia Ailamaki
Data Placement: Summary
Smart data placement increases spatial localityResearch targets table (relation) dataGoal: Reduce number of non-cold cache misses
Techniques focus grouping attributes into cache lines for quick access
PAX, Data morphing: Cache optimization techniquesFractured Mirrors: Cache-and-disk optimizationFates DB Storage Manager: Independent data layout support across the entire memory hierarchy
@Carnegie MellonDatabases
69©2004 Anastassia Ailamaki
Outline
Introduction and OverviewNew HardwareWhere Does Time Go?Bridging the Processor/Memory Speed Gap
Data PlacementAccess MethodsQuery ProcessingInstruction Stream OptimizationsStaged Database Systems
Newer HardwareHip and TrendyDirections for Future Research
@Carnegie MellonDatabases
70©2004 Anastassia Ailamaki
Main-Memory Tree Indexes
T Trees: proposed in mid-80s for MMDBs [LC86]Aim: balance space overhead with searching time Uniform memory access assumption (no caches)
Main-memory B+ Trees: better cache performance [RR99]Node width = cache line size (32-128b)
Minimize number of cache misses for searchMuch higher than traditionaldisk-based B-Trees
So now trees are too deep
How to make trees shallower?
@Carnegie MellonDatabases
71©2004 Anastassia Ailamaki
Reducing Pointers for Larger Fanout
Cache Sensitive B+ Trees (CSB+ Trees)Layout child nodes contiguouslyEliminate all but one child pointers
Double fanout of nonleaf node
B+ Trees CSB+ TreesK1 K2
K3 K4 K5 K6 K7 K8
K1 K3K2 K4
K1 K3K2 K4 K1 K3K2 K4 K1 K3K2 K4 K1 K3K2 K4 K1 K3K2 K4
35% faster tree lookupsUpdate performance is 30% worse (splits)
[RR00]
@Carnegie MellonDatabases
72©2004 Anastassia Ailamaki
What do we do with cold misses?
Answer: hide latencies using prefetchingPrefetching enabled by
Non-blocking cache technologyPrefetch assembly instructions
SGI R10000, Alpha 21264, Intel Pentium4
Main MemoryCPU L2/L3CacheL1
Cache
pref 0(r2)pref 4(r7)pref 0(r3)pref 8(r9)
Prefetching hides cold cache miss latencyEfficiently used in pointer-chasing lookups!
@Carnegie MellonDatabases
73©2004 Anastassia Ailamaki
Prefetching B+ Trees(pB+ Trees) Idea: Larger nodesNode size = multiple cache lines (e.g. 8 lines)
Later corroborated by [HP03a]Prefetch all lines of a node before searching it
Cost to access a node only increases slightlyMuch shallower trees, no changes required
Time
Cache miss
[CGM01]
>2x better search AND update performanceApproach complementary to CSB+ Trees!
@Carnegie MellonDatabases
74©2004 Anastassia Ailamaki
Prefetching B+ Trees
Goal: faster range scan
Leaf parent nodes contain addresses of all leavesLink leaf parent nodes togetherUse this structure for prefetching leaf nodes
Leaf parent nodes
pB+ Trees: 8X speedup over B+ Trees
[CGM01]
@Carnegie MellonDatabases
75©2004 Anastassia Ailamaki
Fractal Prefetching B+ TreesWhat if B+-tree does not fit in memory?(fpB+ Trees) Idea: Combine memory & disk trees
Embed cache-optimized trees in disk tree nodesfpB+ Trees optimize both cache AND diskKey compression to increase fanout [BMR01]
Compared to disk-based B+ Trees, 80% faster in-memory searches with similar disk performance
[CGM02]
@Carnegie MellonDatabases
76©2004 Anastassia Ailamaki
Bulk lookups: Buffer Index AccessesOptimize data cache performanceSimilar technique in [PMH02]
Idea: increase temporal locality by delaying (buffering) node probes until a group is formedExample: NLJ probe stream: (r1, 10) (r2, 80) (r3, 15)
r110keyRID
(r1, 10)
buffer
root
B C
D E
(r1,10) is bufferedbefore accessing B
r110keyRID
(r2, 80)
r2 80
B C
(r2,80) is bufferedbefore accessing C
r110keyRID
(r3, 15)
r2 80
B Cr315
B is accessed,buffer entries are
divided among children
[ZR03a]
3x speedup with enough concurrency
@Carnegie MellonDatabases
77©2004 Anastassia Ailamaki
Access Methods: Summary
Optimize B+ Tree pointer-chasing cache behaviorReduce node size to few cache linesReduce pointers for larger fanout (CSB+)“Next” pointers to lowest non-leaf level for easy prefetching (pB+)Simultaneously optimize cache and disk (fpB+)Bulk searches: Buffer index accesses
Additional work:Cache-oblivious B-Trees [BDF00]
Optimal bound in number of memory transfersRegardless of # of memory levels, block size, or level speed
Survey of techniques for B-Tree cache performance [GL01]Existing heretofore-folkloric knowledgeKey normalization/compression, alignment, separating keys/pointers
Lots more to be done in area – consider interference and scarce resources
@Carnegie MellonDatabases
78©2004 Anastassia Ailamaki
Outline
Introduction and OverviewNew HardwareWhere Does Time Go?Bridging the Processor/Memory Speed Gap
Data PlacementAccess MethodsQuery ProcessingStaged Database SystemsInstruction Stream Optimizations
Newer HardwareHip and TrendyDirections for Future Research
@Carnegie MellonDatabases
79©2004 Anastassia Ailamaki
Query Processing Algorithms
Idea: Adapt query processing algorithms to cachesRelated work includes:
Improving data cache performanceSortingJoin
Improving instruction cache performanceDSS applications
@Carnegie MellonDatabases
80©2004 Anastassia Ailamaki
Sorting
In-memory sorting / generating runsAlphaSort
Use quick sort rather than replacement selectionSequential vs. random accessNo cache misses after sub-arrays fit in cache
Sort (key-prefix, pointer) pairs rather than records3x cpu speedup for the Datamation benchmark
L2cache
L1
Quick SortReplacement-selection
[NBC94]
@Carnegie MellonDatabases
81©2004 Anastassia Ailamaki
Hash Join
Random accesses to hash tableBoth when building AND when probing!!!
Poor cache performance≥ 73% of user time is CPU cache stalls [CAG04]
Approaches to improving cache performanceCache partitioning – maximizes localityPrefetching – hides latencies
Build Relation
Probe Relation
Hash Table
@Carnegie MellonDatabases
82©2004 Anastassia Ailamaki
Reducing non-cold misses
Idea: Cache partitioning (similar to I/O partitioning)Divide relations into cache-sized partitionsFit build partition and hash table into cacheAvoid cache misses for hash table visits
1/3 fewer cache misses, 9.3% speedup>50% misses due to partitioning overhead
Build Probe
cache
[SKN94]
@Carnegie MellonDatabases
83©2004 Anastassia Ailamaki
Hash Joins in Monet
Monet main-memory database system [B02]Vertically partitioned tuples (DSM)
Join two vertically partitioned relationsJoin two join-attribute arrays [BMK99,MBK00]Extract other fields for output relation [MBN04]
Build Probe
Output
[B02]
@Carnegie MellonDatabases
84©2004 Anastassia Ailamaki
Monet: Reducing Partition CostJoin two arrays of simple fields (8 byte tuples)
Original cache partitioning is single passTLB thrashing if # partitions > # TLB entriesCache thrashing if # partitions > # lines in cache
Solution: multiple passes# partitions per pass is smallRadix-cluster [BMK99,MBK00]
Use different bits of hashed keys fordifferent passesE.g. In figure, use 2 bits of hashed keys for each pass
Plus CPU optimizationsXOR instead of %Simple assignments instead of memcpy
2-pass partitionUp to 2.7X speedup on an Origin 2000
Results most significant for small tuples
[BMK99,MBK00]
@Carnegie MellonDatabases
85©2004 Anastassia Ailamaki
Monet: Extracting Payload
Two ways to extract payload:Pre-projection: copy fields during cache partitioningPost-projection: generate join index, then extract fields
Monet: post-projectionRadix-decluster algorithm for good cache performance
Post-projection good for DSMUp to 2X speedup compared to pre-projection
Post-projection is not recommended for NSMCopying fields during cache partitioning is better
[MBN04]
Paper presented in this conference!
@Carnegie MellonDatabases
86©2004 Anastassia Ailamaki
Optimizing non-DSM hash joins
foreach probe tuple{
(0)compute bucket number;(1)visit header;(2)visit cell array;(3)visit matching build tuple;
}
HashBucket
Headers
Hash Cell (hash code, build tuple ptr)
BuildPartition
0123 0123
time
Cache miss latency
Idea: Exploit inter-tuple parallelism
[CAG04]
Simplified probing algorithm
@Carnegie MellonDatabases
87©2004 Anastassia Ailamaki
Group Prefetching
0123
0123
0123 0
123
0123
0123
a group
foreach group of probe tuples {foreach tuple in group {(0)compute bucket number;
prefetch header;}foreach tuple in group {(1)visit header;
prefetch cell array;}foreach tuple in group {(2)visit cell array;
prefetch build tuple;}foreach tuple in group {(3)visit matching build tuple;
}}
[CAG04]
@Carnegie MellonDatabases
88©2004 Anastassia Ailamaki
Software PipeliningPrologue;for j=0 to N-4 do {
tuple j+3:(0)compute bucket number;
prefetch header;tuple j+2:(1)visit header;
prefetch cell array;tuple j+1:(2)visit cell array;
prefetch build tuple;tuple j:(3)visit matching build tuple;
}Epilogue;
prologue
epilogue
j
j+3
0123
0123
0123
0123
0123
0123
0123
[CAG04]
@Carnegie MellonDatabases
89©2004 Anastassia Ailamaki
Prefetching: Performance Results
Techniques exhibit similar performanceGroup prefetching easier to implementCompared to cache partitioning:
Cache partitioning costly when tuples are large (>20b)Prefetching about 50% faster than cache partitioning
0
1000
2000
3000
4000
5000
6000
150 cycles 1000 cycles
BaselineGroup PrefSP Pref
processor to memory latency
exec
utio
n tim
e (M
cyc
les)
9X speedups over baseline at 1000 cycles
Absolute numbers do not change!
[CAG04]
@Carnegie MellonDatabases
90©2004 Anastassia Ailamaki
DSS: Reducing I-missesDemand-pull execution model: one tuple at a time
ABABABABABABABABAB…If A + B > L1 instruction cache sizePoor instruction cache utilization!
Solution: multiple tuples at an operatorABBBBBAAAAABBBBB…
Modify operators to support block of tuples [PMA01]Insert “buffer” operators between A and B [ZR04]
“buffer” calls B multiple timesStores intermediate tuple pointers to serve A’s requestNo need to change original operators
A
BQuery Plan
12% speedup for simple TPC-H queries
[PMA01,ZR04]
@Carnegie MellonDatabases
91©2004 Anastassia Ailamaki
Concurrency ControlMultiple CPUs share a treeLock coupling: too much cost
Latching a node means writingTrue even for readers !!!Coherence cache misses due to writes from different CPUs
Solution: Optimistic approach for readersUpdaters still latch nodesUpdaters also set node versionsReaders check version to ensure correctness
Search throughput: 5x (=no locking case)Update throughput: 4x
[CHK01]
@Carnegie MellonDatabases
92©2004 Anastassia Ailamaki
Query processing: summary
Alphasort: use quicksort and key prefix-pointerMonet: MM-DBMS uses aggressive DSM
Optimize partitioning with hierarchical radix-clusteringOptimize post-projection with radix-declusteringMany other optimizations
Traditional hash joins: aggressive prefetchingEfficiently hides data cache missesRobust performance with future long latencies
DSS I-misses: group computation (new operator)B-tree concurrency control: reduce readers’ latching
@Carnegie MellonDatabases
93©2004 Anastassia Ailamaki
Outline
Introduction and OverviewNew HardwareWhere Does Time Go?Bridging the Processor/Memory Speed Gap
Data PlacementAccess MethodsQuery ProcessingInstruction Stream OptimizationsStaged Database Systems
Newer HardwareHip and TrendyDirections for Future Research
@Carnegie MellonDatabases
94©2004 Anastassia Ailamaki
Instruction-Related Stalls
25-40% of execution time [KPH98, HA04]Recall importance of instruction cache: In the critical execution path!
EXECUTION PIPELINE
L1 I-CACHE L1 D-CACHE
L2 CACHE
Impossible to overlap I-cache delays
@Carnegie MellonDatabases
95©2004 Anastassia Ailamaki
Goal: improve DSS I-cache performanceIdea: Predict next function call using small cache
Example: create_recalways calls find_ , lock_ , update_ , and unlock_ page in same order
Experiments: Shore on SimpleScalar SimulatorRunning Wisconsin Benchmark
Call graph prefetching for DB apps[APD03]
Beneficial for predictable DSS streams
@Carnegie MellonDatabases
96©2004 Anastassia Ailamaki
SIMD: Single – Instruction – Multiple – Data In modern CPUs, target multimedia apps
Example: Pentium 4, 128-bit SIMD register holds four 32-bit values
Assume data stored columnwise as contiguous array of fixed-length numeric values (e.g., PAX)Scan example:
X3 X2 X1 X0
Y3 Y2 Y1 Y0
OP OP OP OP
X3 OP Y3 X2 OP Y2 X1 OP Y1 X0 OP Y0
if x[n] > 10result[pos++] = x[n]
x[n+3] x[n+2] x[n+1] x[n]
10 10 10 10
> > > >
0 1 0 0
8 12 6 5
original scan code
SIMD 1st phase:produce bitmapvector with 4comparison resultsin parallel
[ZR02]DB operators using SIMD
@Carnegie MellonDatabases
97©2004 Anastassia Ailamaki
DB operators using SIMDScan example (cont’d)
Parallel comparisons, fewer branches ⇒ fewer mispredictions
0 1 0 0
SIMD 2nd phase:if bit_vector == 0, continueelse copy all 4 results, increase pos when bit==1
keep this result
[ZR02]
Superlinear speedup to # of parallelismNeed to rewrite code to use SIMD
Aggregation operation (1M records w/ 20% selectivity)
0
5
10
15
20
25
SUM SIMDSUM
COUNT SIMDCOUNT
MAX SIMDMAX
MIN SIMDMIN
Elap
sed
tim
e [m
s] Branch mispred. PenaltyOther cost
@Carnegie MellonDatabases
98©2004 Anastassia Ailamaki
STEPS: Cache-Resident OLTP
Targets instruction-cache performance for OLTPExploits high transaction concurrencySynchronized Transactions through Explicit Processor Scheduling: Multiplex concurrent transactions to exploit common code paths
[HA04]
code fits inI-cache
context-switchpoint
CPU00101100100010110111010011011000110
thread A
CPU executes code
CPU performs context-switch
00101100100010110111010011011000110
thread B
instructioncache
capacitywindow
CPU00101100100010110111010011011000110
thread A
00101100100010110111010011011000110
thread B
before after
All capacity/conflict I-cache misses gone!
@Carnegie MellonDatabases
99©2004 Anastassia Ailamaki
STEPS implementation runs full OLTP workloads (TPC-C)Groups threads per DB operator, then uses fast context-switch to reuse instructions in the cache
Full-system TPC-C implementation:65% fewer L1-I misses, 40% speedup
STEPS: Cache-Resident OLTP[HA04]
STEPS minimizes L1-I cache misses without increasing cache size
@Carnegie MellonDatabases
100©2004 Anastassia Ailamaki
Outline
Introduction and OverviewNew HardwareWhere Does Time Go?Bridging the Processor/Memory Speed Gap
Data PlacementAccess MethodsQuery ProcessingInstruction Stream OptimizationsStaged Database Systems
Newer HardwareHip and TrendyDirections for Future Research
@Carnegie MellonDatabases
101©2004 Anastassia Ailamaki
Context loaded multiple times for each queryNo means to exploit overlapping work
Thread-based concurrency pitfalls
Q1Q2Q3
context-switch points TIME
CPU
: component loading time
Current
[HA03]
@Carnegie MellonDatabases
102©2004 Anastassia Ailamaki
Thread-based concurrency pitfalls
Q1Q2Q3
context-switch points TIME
CPU
: component loading time
Current
Q1Q2Q3
CPU
Desired
[HA03]
Context-switch at module boundaryContext-switch at module boundary
Load context once for all queries
Load context once for all queries
Load context once for all queries
@Carnegie MellonDatabases
103©2004 Anastassia Ailamaki
Staged software design allows forCohort scheduling of queries to amortize loading timeSuspend at module boundaries to maintain context
Break DBMS into stagesStages act as independent serversQueries exist in the form of “packets”
Proposed query scheduling algorithms to address locality/wait time tradeoffs [HA02]
Staged Database Systems[HA03]
@Carnegie MellonDatabases
104©2004 Anastassia Ailamaki
Staged Database Systems[HA03]
Optimize instruction/data cache localityNaturally enable multi-query processing
Highly scalable, fault-tolerant, trustworthy
INOUT
connect parser optimizer sendresults
FSCAN
JOIN
SORT
AGGRISCAN
L1
L2
MEMORY
L1
L2
MEMORY
…
@Carnegie MellonDatabases
105©2004 Anastassia Ailamaki
Summary: Bridging the Gap
Cache-aware data placementEliminates unnecessary trips to memoryMinimizes conflict/capacity missesFates: decouple memory from storage layout
What about compulsory (cold) misses?Can’t avoid, but can hide latency with prefetchingTechniques for B-trees, hash joins
Staged Database Systems: a scalable futureAddressing instruction stalls
DSS: Call Graph Prefetching, SIMD, group operatorOLTP: STEPS, a promising direction for any platform
@Carnegie MellonDatabases
106©2004 Anastassia Ailamaki
Outline
Introduction and OverviewNew HardwareWhere Does Time Go?Bridging the Processor/Memory Speed GapNewer HardwareHip and TrendyDirections for Future Research
@Carnegie MellonDatabases
107©2004 Anastassia Ailamaki
Current/Near-future Multiprocessors
Typical platforms:1. Chips with multiple cores2. Servers with multiple chips3. Memory shared across
Memory access:Traverse multiple hierarchiesLarge non-uniform latencies
PP
PP
Mem
Multiprocessor Server
PP
PPP
PP
P
MemMem
Programmer/Software must Hide/Tolerate Latency
@Carnegie MellonDatabases
108©2004 Anastassia Ailamaki
Chip Multi-Processors (CMP)
Highly variable memory latencySpeedup: OLTP 3x, DSS 2.3x on Piranha [BGM00]
Two cores
Shared L2
Example: IBM Power4, Power5
@Carnegie MellonDatabases
109©2004 Anastassia Ailamaki
Simultaneous Multi-Threading (SMT)
Speedup: OLTP 3x, DSS 0.5x (simulated) [LBE98]
Implements threads in a superscalar processorKeeps hardware state for multiple threadsE.g.: Intel Pentium 4 (SMT), IBM Power5 (SMT&CMP)
2 cores * 2 threads
@Carnegie MellonDatabases
110©2004 Anastassia Ailamaki
Outline
Introduction and OverviewNew HardwareWhere Does Time Go?Bridging the Processor/Memory Speed GapHip and Trendy
Query co-processingDatabases on MEMStore
Directions for Future Research
@Carnegie MellonDatabases
111©2004 Anastassia Ailamaki
Oprimizing Spatial Operations
Spatial operation is computation intensiveIntersection, distance computationNumber of vertices per object↑, cost↑
Use graphics card to increase speedIdea: use color blending to detect intersection
Draw each polygon with grayIntersected area is black because of color mixing effectAlgorithms cleverly use hardware features
Intersection selection: up to 64% improvement using graphics card
[SAA03]
@Carnegie MellonDatabases
112©2004 Anastassia Ailamaki
Fast Computation of DB Operations Using Graphics Processors
Exploit graphics features for database operationsPredicate, Boolean operations, Aggregates
Examples:Predicate: attribute > constant
Graphics: test a set of pixels against a reference valuepixel = attribute value, reference value = constant
Aggregations: COUNTGraphics: count number of pixels passing a test
Good performance: e.g. over 2X improvement for predicate evaluations
[GLW04]
Promising! Peak performance of graphics processor increases 2.5-3 times a year
@Carnegie MellonDatabases
113©2004 Anastassia Ailamaki
Outline
Introduction and OverviewNew HardwareWhere Does Time Go?Bridging the Processor/Memory Speed GapHip and Trendy
Query co-processingDatabases on MEMStore
Directions for Future Research
@Carnegie MellonDatabases
114©2004 Anastassia Ailamaki
MEMStore (MEMS*-based storage)
On-chip mechanical storage - using MEMS for media positioning
Read/writetips
Read/writetips
Recordingmedia (sled)Recording
media (sled)
ActuatorsActuators
* microelectromechanical systems
@Carnegie MellonDatabases
115©2004 Anastassia Ailamaki
MEMStore (MEMS*-based storage)
60 - 200 GB capacity4 – 40 GB portable
100 cm3 volume10’s MB/s bandwidth< 10 ms latency
10 – 15 ms portable
2 - 10 GB capacity
< 1 cm3 volume~100 MB/s bandwidth< 1 ms latency
Many parallelheads
Many parallelheads
Singleread/write
head
Singleread/write
head
So how can MEMS help improve DB performance?
* microelectromechanical systems
@Carnegie MellonDatabases
116©2004 Anastassia Ailamaki
Two-dimensional database accessR
ecor
ds
Attributes33
30
27
34
31
28
35
32
29
3
6
12
9
21
24
0
15
18
4
7
13
10
22
25
1
16
19
5
8
14
11
2
17
20
23
26
54
57
60
55
58
61
56
59
62
36 69
39 66
42 63
37 70
40 67
43 64
38 71
41 68
44 65
51 72
48 75
45 78
52 73
49 76
46 79
53 74
50 77
47 80
Exploit inherent parallelism
[SSA03,YAA03,YAA04]
@Carnegie MellonDatabases
117©2004 Anastassia Ailamaki
Two-dimensional database access
0
20
40
60
80
100
NSM - Row order MEMStore - Roworder
NSM - Attributeorder
MEMStore - Attributeorder
Sca
n tim
e (s
)
all a1 a2 a3 a4[SSA03]
Peak performance along both dimensions
@Carnegie MellonDatabases
118©2004 Anastassia Ailamaki
Outline
Introduction and OverviewNew HardwareWhere Does Time Go?Bridging the Processor/Memory Speed GapHip and TrendyDirections for Future Research
@Carnegie MellonDatabases
119©2004 Anastassia Ailamaki
Future research directions
Rethink Query Optimization – with increasing complexity, cost-based optimization not idealMultiprocessors and really new modular software architectures to fit new computers
Current research in DB workloads only scratches surfaceOptimize execution on multiple-core chipsExploit multithreaded processors
Power-aware database systems On embeded processors, laptops, etc.
Automatic data placement and memory layer optimization –one level should not need to know what others do
Auto-everything
Aggressive use of hybrid processors
ACKNOWLEDGEMENTS
@Carnegie MellonDatabases
121©2004 Anastassia Ailamaki
Special thanks go to…
Shimin Chen, Minglong Shao, Stavros Harizopoulos, and Nikos Hardavellas for invaluable contributions to this talkSteve Schlosser (MEMStore)Ravi Ramamurthy (fractured mirrors)Babak Falsafi and Chris Colohan (h/w architecture)
REFERENCES(used in presentation)
@Carnegie MellonDatabases
123©2004 Anastassia Ailamaki
ReferencesWhere Does Time Go? (simulation only)
[ADS02] Branch Behavior of a Commercial OLTP Workload on Intel IA32 Processors. M. Annavaram, T. Diep, J. Shen. International Conference on Computer Design: VLSI in Computers and Processors (ICCD), Freiburg, Germany, September 2002.
[SBG02] A Detailed Comparison of Two Transaction Processing Workloads. R. Stets, L.A. Barroso, and K. Gharachorloo. IEEE Annual Workshop on Workload Characterization (WWC), Austin,Texas, November 2002.
[BGN00] Impact of Chip-Level Integration on Performance of OLTP Workloads. L.A. Barroso, K. Gharachorloo, A. Nowatzyk, and B. Verghese. IEEE International Symposium on High-Performance Computer Architecture (HPCA), Toulouse, France, January 2000.
[RGA98] Performance of Database Workloads on Shared Memory Systems with Out-of-Order Processors. P. Ranganathan, K. Gharachorloo, S. Adve, and L.A. Barroso. International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), San Jose, California, October 1998.
[LBE98] An Analysis of Database Workload Performance on Simultaneous Multithreaded Processors. J. Lo, L.A. Barroso, S. Eggers, K. Gharachorloo, H. Levy, and S. Parekh. ACM International Symposium on Computer Architecture (ISCA), Barcelona, Spain, June 1998.
[EJL96] Evaluation of Multithreaded Uniprocessors for Commercial Application Environments. R.J. Eickemeyer, R.E. Johnson, S.R. Kunkel, M.S. Squillante, and S. Liu. ACM International Symposium on Computer Architecture (ISCA), Philadelphia, Pennsylvania, May 1996.
@Carnegie MellonDatabases
124©2004 Anastassia Ailamaki
ReferencesWhere Does Time Go? (real-machine/simulation)
[RAD02] Comparing and Contrasting a Commercial OLTP Workload with CPU2000. J. Rupley II, M. Annavaram, J. DeVale, T. Diep and B. Black (Intel). IEEE Annual Workshop on Workload Characterization (WWC), Austin, Texas, November 2002.
[CTT99] Detailed Characterization of a Quad Pentium Pro Server Running TPC-D. Q. Cao, J. Torrellas, P. Trancoso, J. Larriba-Pey, B. Knighten, Y. Won. International Conference on Computer Design (ICCD), Austin, Texas, October 1999.
[ADH99] DBMSs on a Modern Processor: Experimental Results A. Ailamaki, D. J. DeWitt, M. D. Hill, D.A. Wood. International Conference on Very Large Data Bases (VLDB), Edinburgh, UK, September 1999.
[KPH98] Performance Characterization of a Quad Pentium Pro SMP using OLTP Workloads. K. Keeton, D.A. Patterson, Y.Q. He, R.C. Raphael, W.E. Baker. ACM International Symposium on Computer Architecture (ISCA), Barcelona, Spain, June 1998.
[BGB98] Memory System Characterization of Commercial Workloads. L.A. Barroso, K. Gharachorloo, and E. Bugnion. ACM International Symposium on Computer Architecture (ISCA), Barcelona, Spain, June 1998.
[TLZ97] The Memory Performance of DSS Commercial Workloads in Shared-Memory Multiprocessors. P. Trancoso, J. Larriba-Pey, Z. Zhang, J. Torrellas. IEEE International Symposium on High-Performance Computer Architecture (HPCA), San Antonio, Texas, February 1997.
@Carnegie MellonDatabases
125©2004 Anastassia Ailamaki
ReferencesArchitecture-Conscious Data Placement
[SSS04] Clotho: Decoupling memory page layout from storage organization. M. Shao, J. Schindler, S.W. Schlosser, A. Ailamaki, G.R. Ganger. International Conference on Very Large Data Bases (VLDB), Toronto, Canada, September 2004.
[SSS04a] Atropos: A Disk Array Volume Manager for Orchestrated Use of Disks. J. Schindler, S.W. Schlosser, M. Shao, A. Ailamaki, G.R. Ganger. USENIX Conference on File and Storage Technologies (FAST), San Francisco, California, March 2004.
[YAA04] Declustering Two Dimensional
@Carnegie MellonDatabases
126©2004 Anastassia Ailamaki
ReferencesArchitecture-Conscious Access Methods
[ZR03a] Buffering Accesses to Memory-Resident Index Structures. J. Zhou and K.A. Ross. International Conference on Very Large Data Bases (VLDB), Berlin, Germany, September 2003.
[HP03a] Effect of node size on the performance of cache-conscious B+ Trees. R.A. Hankins and J.M. Patel. ACM International conference on Measurement and Modeling of Computer Systems (SIGMETRICS), San Diego, California, June 2003.
[CGM02] Fractal Prefetching B+ Trees: Optimizing Both Cache and Disk Performance. S. Chen, P.B. Gibbons, T.C. Mowry, and G. Valentin. ACM International Conference on Management of Data (SIGMOD), Madison, Wisconsin, June 2002.
[GL01] B-Tree Indexes and CPU Caches. G. Graefe and P. Larson. International Conference on Data Engineering (ICDE), Heidelberg, Germany, April 2001.
[CGM01] Improving Index Performance through Prefetching. S. Chen, P.B. Gibbons, and T.C. Mowry. ACM International Conference on Management of Data (SIGMOD), Santa Barbara, California, May 2001.
[BMR01] Main-memory index structures with fixed-size partial keys. P. Bohannon, P. Mcllroy, and R. Rastogi. ACM International Conference on Management of Data (SIGMOD), Santa Barbara, California, May 2001.
[BDF00] Cache-Oblivious B-Trees. M.A. Bender, E.D. Demaine, and M. Farach-Colton. Symposium on Foundations of Computer Science (FOCS), Redondo Beach, California, November 2000.
[RR00] Making B+ Trees Cache Conscious in Main Memory. J. Rao and K.A. Ross. ACM International Conference on Management of Data (SIGMOD), Dallas, Texas, May 2000.
[RR99] Cache Conscious Indexing for Decision-Support in Main Memory. J. Rao and K.A. Ross. International Conference on Very Large Data Bases (VLDB), Edinburgh, the United Kingdom, September 1999.
[LC86] Query Processing in main-memory database management systems. T. J. Lehman and M. J. Carey. ACM International Conference on Management of Data (SIGMOD), 1986.
@Carnegie MellonDatabases
127©2004 Anastassia Ailamaki
ReferencesArchitecture-Conscious Query Processing
[MBN04] Cache-Conscious Radix-Decluster Projections. Stefan Manegold, Peter A. Boncz, Niels Nes, Martin L. Kersten. In Proceedings of the International Conference on Very Large Data Bases (VLDB), Toronto, Canada, September 2004.
[GLW04] Fast Computation of Database Operations using Graphics Processors. N.K. Govindaraju, B. Lloyd, W. Wang, M. Lin, D. Manocha. ACM International Conference on Management of Data (SIGMOD), Paris, France, June 2004.
[CAG04] Improving Hash Join Performance through Prefetching. S. Chen, A. Ailamaki, P. B. Gibbons, and T.C. Mowry. International Conference on Data Engineering (ICDE), Boston, Massachusetts, March 2004.
[ZR04] Buffering Database Operations for Enhanced Instruction Cache Performance. J. Zhou, K. A. Ross. ACMInternational Conference on Management of Data (SIGMOD), Paris, France, June 2004.
[SAA03] Hardware Acceleration for Spatial Selections and Joins. C. Sun, D. Agrawal, A.E. Abbadi. ACM International conference on Management of Data (SIGMOD), San Diego, California, June,2003.
[CHK01] Cache-Conscious Concurrency Control of Main-Memory Indexes on Shared-Memory Multiprocessor Systems. S. K. Cha, S. Hwang, K. Kim, and K. Kwon. International Conference on Very Large Data Bases (VLDB), Rome, Italy, September 2001.
[PMA01] Block Oriented Processing of Relational Database Operations in Modern Computer Architectures. S. Padmanabhan, T. Malkemus, R.C. Agarwal, A. Jhingran. International Conference on Data Engineering (ICDE), Heidelberg, Germany, April 2001.
[MBK00] What Happens During a Join? Dissecting CPU and Memory Optimization Effects. S. Manegold, P.A. Boncz, and M.L.. Kersten. International Conference on Very Large Data Bases (VLDB), Cairo, Egypt, September 2000.
[SKN94] Cache Conscious Algorithms for Relational Query Processing. A. Shatdal, C. Kant, and J.F. Naughton. International Conference on Very Large Data Bases (VLDB), Santiago de Chile, Chile, September 1994.
[NBC94] AlphaSort: A RISC Machine Sort. C. Nyberg, T. Barclay, Z. Cvetanovic, J. Gray, and D.B. Lomet. ACM International Conference on Management of Data (SIGMOD), Minneapolis, Minnesota, May 1994.
@Carnegie MellonDatabases
128©2004 Anastassia Ailamaki
ReferencesInstrustion Stream Optimizations and
DBMS Architectures
[HA04] STEPS towards Cache-resident Transaction Processing. S. Harizopoulos and A. Ailamaki. International Conference on Very Large Data Bases (VLDB), Toronto, Canada, September 2004.
[APD03] Call Graph Prefetching for Database Applications. M. Annavaram, J.M. Patel, and E.S. Davidson. ACM Transactions on Computer Systems, 21(4):412-444, November 2003.
[SAG03] Lachesis: Robust Database Storage Management Based on Device-specific Performance Characteristics. J. Schindler, A. Ailamaki, and G. R. Ganger. International Conference on Very Large Data Bases (VLDB), Berlin, Germany, September 2003.
[HA02] Affinity Scheduling in Staged Server Architectures. S. Harizopoulos and A. Ailamaki. Carnegie Mellon University, Technical Report CMU-CS-02-113, March, 2002.
[HA03] A Case for Staged Database Systems. S. Harizopoulos and A. Ailamaki. Conference on Innovative Data Systems Research (CIDR), Asilomar, CA, January 2003.
[B02] Monet: A Next-Generation DBMS Kernel For Query-Intensive Applications. P. A. Boncz. Ph.D. Thesis, Universiteit van Amsterdam, Amsterdam, The Netherlands, May 2002.
[PMH02] Computation Regrouping: Restructuring Programs for Temporal Data Cache Locality. V.K. Pingali, S.A. McKee, W.C. Hseih, and J.B. Carter. International Conference on Supercomputing (ICS), New York, New York, June 2002.
[ZR02] Implementing Database Operations Using SIMD Instructions. J. Zhou and K.A. Ross. ACM International Conference on Management of Data (SIGMOD), Madison, Wisconsin, June 2002.
@Carnegie MellonDatabases
129©2004 Anastassia Ailamaki
ReferencesNewer Hardware
[BWS03] Improving the Performance of OLTP Workloads on SMP Computer Systems by Limiting Modified Cache Lines. J.E. Black, D.F. Wright, and E.M. Salgueiro. IEEE Annual Workshop on Workload Characterization (WWC), Austin, Texas, October 2003.
[GH03] Technological impact of magnetic hard disk drives on storage systems. E. Grochowskiand R. D. Halem IBM Systems Journal 42(2), 2003.
[DJN02] Shared Cache Architectures for Decision Support Systems. M. Dubois, J. Jeong , A. Nanda, Performance Evaluation 49(1), September 2002 .
[G02] Put Everything in Future (Disk) Controllers. Jim Gray, talk at the USENIX Conference on File and Storage Technologies (FAST), Monterey, California, January 2002.
[BGM00] Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing. L.A. Barroso, K. Gharachorloo, R. McNamara, A. Nowatzyk, S. Qadeer, B. Sano, S. Smith, R. Stets, and B. Verghese. International Symposium on Computer Architecture (ISCA). Vancouver, Canada, June 2000.
[AUS98] Active disks: Programming model, algorithms and evaluation. A. Acharya, M. Uysal, and J. Saltz. International Conference on Architecture Support for Programming Languages and Operating Systems (ASPLOS), San Jose, California, October 1998.
[KPH98] A Case for Intelligent Disks (IDISKs). K. Keeton, D. A. Patterson, J. Hellerstein. SIGMOD Record, 27(3):42--52, September 1998.
[PGK88] A Case for Redundant Arrays of Inexpensive Disks (RAID). D. A. Patterson, G. A. Gibson, and R. H. Katz. ACM International Conference on Management of Data (SIGMOD), June 1988.
@Carnegie MellonDatabases
130©2004 Anastassia Ailamaki
ReferencesMethodologies and Benchmarks
[DMM04] Accurate Cache and TLB Characterization Using hardware Counters. J. Dongarra, S. Moore, P. Mucci, K. Seymour, H. You. International Conference on Computational Science (ICCS), Krakow, Poland, June 2004.
[SAF04] DBmbench: Fast and Accurate Database Workload Representation on Modern Microarchitecture. M. Shao, A. Ailamaki, and B. Falsafi. Carnegie Mellon University Technical Report CMU-CS-03-161, 2004 .
[KP00] Towards a Simplified Database Workload for Computer Architecture Evaluations. K. Keeton and D. Patterson. IEEE Annual Workshop on Workload Characterization, Austin, Texas, October 1999.
@Carnegie MellonDatabases
131©2004 Anastassia Ailamaki
Useful Links
Info on Intel Pentium4 Performance Counters: ftp://download.intel.com/design/Pentium4/manuals/25366814.pdf
AMD hardware performance countershttp://www.amd.com/us-en/Processors/DevelopWithAMD/
PAPI Performance Libraryhttp://icl.cs.utk.edu/papi/Intel® VTune™ Performance Analyzers http://developer.intel.com/software/products/vtune/