smt verification of the power5 and power6 high-performance processors
Post on 11-May-2015
IBM Power Systems
© 2008 IBM Corporation
SMT Verification of the POWER5 and POWER6 High-Performance Processors
John Ludden, Senior Technical Staff Member, Hardware Verification, IBM Systems & Technology Group
IBM System p
Introduction to Simultaneous Multi-Threading (SMT)
1. What is a multi-threaded processor?
• Essentially a processor core that executes multiple instruction streams simultaneously
• Each thread appears to software as a “virtual” processor core
2. What are the advantages of SMT?
• More efficient utilization of silicon real estate and power: small die size increase compared to adding another core
• Increased system throughput by utilizing processor resources that would otherwise be idle
3. What are the disadvantages of SMT?
• Increased complexity -> Makes verification state space MUCH larger
• SMT verification much harder than SMP
• Possibly degrades performance of some applications
Examples of SMT microprocessors
1. Video Game Systems
• Sony PlayStation 3: IBM Cell processor
• Xbox 360: IBM Xenon processor
2. Personal Computers
• Intel Pentium 4 Hyper-Threading (HT) processors
3. Servers
• Sun UltraSPARC Systems: T1 (4 threads) and T2 (8 threads)
• HP Superdome Systems: Intel Itanium 2
• IBM Power Systems: POWER5 and POWER6 processors
Overview
1. Context : POWER5 vs. POWER6 Microarchitecture Comparison
2. Verification methodology: In the beginning…
3. The times they are a changing: SMT arrives in POWER5
4. POWER6: An in-order design should be simpler, but…
5. Future directions?
Consistent predictable delivery: IBM POWER systems
[Figure: processor roadmap timeline. POWER4 (2001), POWER4+ (2003), POWER5 (2004), POWER5+ (2006), POWER6 (2007)]
POWER5 Chip
[Figure: chip block diagram. Two “High Freq” POWER5 SMT2 cores; ~2 MB L2; L3 controller with 36 MB L3 chip; SMP interconnect fabric; memory controller; buffer chips]
POWER6 Chip
[Figure: chip block diagram. Two “Ultra Freq” POWER6 SMT2 cores, each with 4 MB L2; L3 controller with 32 MB L3 chip(s); SMP interconnect fabric; two memory controllers; buffer chips]
POWER5 Pipeline
[Figure: pipeline stage diagram. Instruction Fetch (IF, IC, BP) feeds Group Formation and Instruction Decode (D0, D1, D2, D3, Xfer, GD), followed by Out-of-Order Processing in the BR, LD/ST, FX, and FP pipelines. Stage rows as labeled: MP ISS RF EX WB Xfer; MP ISS RF EA DC Fmt WB Xfer; MP ISS RF EX WB Xfer; MP ISS RF F6 WB Xfer; all ending at CP. Feedback paths: Branch Redirects, Interrupts & Flushes.]
[Figure: SMT resource diagram. Legend: shared by two threads; resource used by thread 0; resource used by thread 1. Blocks: Program Counter; Alternate; Branch Prediction (Branch History Tables, Return Stack, Target Cache); Instruction Cache; Instruction Translation; Instruction Buffer 0; Instruction Buffer 1; Thread Priority; Group Formation, Instruction Decode, Dispatch; Shared Register Mappers; Dynamic Instruction Selection; Shared Issue Queues; Read Shared Register Files; Shared Execution Units (LSU0, FXU0, LSU1, FXU1, FPU0, FPU1, BXU, CRL); Write Shared Register Files; Group Completion; Store Queue; Data Cache; Data Translation; L2 Cache.]
High-end server: New POWER6 microprocessor
Topology
– Two cores on chip, a 2-way SMP
– Core private L1s (64KB I, 64KB D)
– Superscalar, SMT cores
– Chip private 8 MB L2 cache
– L3 32 MB off chip
– Two-tier SMP fabric
Technology
– 65 nm SOI
– 341 mm² die size
– 10 layers of metal
– 790 million transistors on chip
– Frequency: 3.5, 4.2, 4.7, 5.0 GHz
Custom & semi-custom design style
– High frequency constraints
3.3 M lines of VHDL
POWER6 core pipeline
[Figure: POWER6 core pipeline diagram. Pipelines shown: instruction fetch pipeline; instruction dispatch pipeline; BR/FX/Load pipeline (BR/CR, FX, LOAD); floating point pipeline; check point recovery pipeline. Stage labels: P1, P2, P3, P4, IFAR, IC0, IC1, BHT, ROT, IB0, IB1, PD, DISP, ISS, AG, RF, EX, EX1 to EX7, DC0, DC1, FMT, ECC. Legend: pre-decode stage; ifetch/branch stage; delayed/transmit stage; instruction decode stage; instruction dispatch/issue stage; operand access/execution stage; cache access stage; write back stage; completion stage; check point stage; FX result bypass; load result bypass; float result bypass.]
POWER6 core
POWER6 processor is ~2X frequency of POWER5 (4 – 5 GHz)
POWER6 instruction pipeline depth equivalent to POWER5
– Minimize power
– Scale performance with frequency
[Figure: pipeline comparison across Instruction Fetch, Instruction Buffer/Decode, Instruction Dispatch/Issue, and Data Fetch/Execute, showing FXU dependent execution and load dependent execution; labels ~6ns/instr and ~3ns/instr]
POWER6 extends functionality of POWER5 core
– 64K I cache, 64K D cache, 2 FXU, 2 Binary FPU, 1 branch execution unit
– Two-way SMT with 7-instruction dispatch from 2 threads (maximum of 5 instructions per thread)
– Decimal Floating Point Unit
– VMX Unit (PowerPC’s SIMD ISA)
– Recovery Unit
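The dispatch limits quoted on this slide (7 instructions per cycle in total, at most 5 from one thread) are the kind of rule a low level checker can assert every cycle. The sketch below is only an illustration of that idea: the function name is_legal_dispatch is hypothetical, and the real hardware has further grouping rules that are not modeled here.

```python
MAX_TOTAL_DISPATCH = 7   # from the slide: 7-instruction dispatch from 2 threads
MAX_PER_THREAD = 5       # from the slide: maximum of 5 instructions per thread


def is_legal_dispatch(n_t0, n_t1):
    """Return True if dispatching n_t0 and n_t1 instructions in the same
    cycle respects both limits. Hypothetical checker, not an IBM API."""
    if n_t0 < 0 or n_t1 < 0:
        return False
    if n_t0 > MAX_PER_THREAD or n_t1 > MAX_PER_THREAD:
        return False
    return n_t0 + n_t1 <= MAX_TOTAL_DISPATCH


# A 5 + 2 split is legal; 5 + 3 exceeds the total; 6 + 0 exceeds the per-thread cap.
print(is_legal_dispatch(5, 2), is_legal_dispatch(5, 3), is_legal_dispatch(6, 0))
```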
Bullet-proof computing
System reliability with recovery unit
– Every measure possible taken to preserve application execution
– Retry soft errors
– Change hardware for hard errors
[Figure: recovery flow chart]
– Processor architected state checkpointed every cycle
– ECC & non-ECC protected circuitry checked every cycle
– No error found: execution continues
– Error found (soft error case): processor restarts from last saved checkpoint
– Error found (hard error case): processor workload moved to another CPU
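The recovery flow on this slide can be sketched as a toy model. Everything here is an assumption made for illustration: the Core class and the step and run functions are hypothetical names, architected state is reduced to a single counter, and every injected error is assumed to be detected in the cycle in which it occurs.

```python
from dataclasses import dataclass


@dataclass
class Core:
    name: str
    state: int = 0        # stand-in for the architected state
    checkpoint: int = 0   # last known-good architected state


def step(core, inject_error=False):
    """Advance one cycle. Returns True if the checkers saw no error."""
    core.state += 1
    if inject_error:
        core.state ^= 0x80            # corrupt the state (assumed detectable)
        return False
    core.checkpoint = core.state      # checkpoint every error-free cycle
    return True


def run(cycles, soft_errors=(), hard_error_at=None):
    active, spare = Core("cpu0"), Core("cpu1")
    pending = set(soft_errors)
    log = []
    while active.state < cycles:
        cycle = active.state
        if cycle == hard_error_at and active is not spare:
            # Hard error case: move the checkpointed workload to another CPU.
            spare.state = spare.checkpoint = active.checkpoint
            active = spare
            log.append(("moved", cycle))
            continue
        inject = cycle in pending
        pending.discard(cycle)
        if not step(active, inject_error=inject):
            # Soft error case: restart from the last saved checkpoint.
            active.state = active.checkpoint
            log.append(("retry", cycle))
    return active.name, active.state, log


print(run(5, soft_errors={3}))
print(run(5, hard_error_at=2))
```

In both runs the workload still reaches cycle 5, which is the point of the slide: the application keeps executing whether the error is retried in place or the work is moved.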
Overview
1. Context : POWER5 vs. POWER6 microarchitecture comparison
2. Verification methodology: In the beginning…
3. The times they are a changing: SMT arrives in POWER5
4. POWER6: An in-order design should be simpler, but…
5. Future directions?
POWER4/5/6 RTL verification technology
[Figure: verification flow diagram with the following components]
– RTL (VHDL, Verilog)
– Physical VLSI Design Tools / Custom Design
– Formal Verification: Boolean Equivalence Check (Verity)
– Language Compile / Model Build
– Cycle-based Model
– Software Simulator (MESA)
– Hardware Accelerator (Awan)
– Test Program Generator (GPRO, X-Gen)
– C++ Testbench: Driver/Checker, Constraint Random, Unit Testbench
– Assertions: PSL et al.
– (Semi) Formal Verification (SixthSense, RuleBase)
Single threaded uniprocessor verification for POWER4
Unit level: methodology inherited from POWER4
– Driven by a combination of instruction level test cases (AVPs) created by the Genesys-Pro (GPRO) pseudo-random test generator and random C++ driven irritation
– Instruction-By-Instruction (IBI) checking against AVP results
– Low level microarchitecture checkers written in C++
Processor core (aka “core”) level
– Mixture of GPRO pseudo-random and directed random instruction level test cases
– IBI checking against AVP results
– Low level microarchitecture checkers written in C++
– Irritation from random C++ drivers
– Highly deterministic and architected state easily verifiable against test
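Instruction-by-instruction checking can be pictured as comparing, after each completed instruction, the register updates seen in simulation against the results predicted in the test case. The sketch below is a simplified illustration under stated assumptions: the trace format (one dict of register updates per instruction) and the name ibi_check are hypothetical and are not the format used by GPRO or the IBM testbench.

```python
def ibi_check(expected_trace, simulated_trace):
    """Compare two traces instruction by instruction.

    Each trace is a list with one dict per completed instruction, mapping
    a register name to the value written. Returns the index of the first
    mismatching instruction, or None if the traces agree.
    """
    for index, (expected, simulated) in enumerate(zip(expected_trace, simulated_trace)):
        if expected != simulated:
            return index
    if len(expected_trace) != len(simulated_trace):
        # One trace ended early: flag the first missing instruction.
        return min(len(expected_trace), len(simulated_trace))
    return None


expected = [{"r3": 5}, {"r4": 9}, {"cr0": 2}]
simulated = [{"r3": 5}, {"r4": 8}, {"cr0": 2}]
print(ibi_check(expected, simulated))  # index of the first bad instruction
```

Stopping at the first mismatch is what makes this style of checking useful for debug: the failure is reported at the instruction that went wrong rather than at the end of the test.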
Symmetric multi-processor (SMP) verification for POWER4
Chip (dual-core) level
– Test generation similar to uniprocessor via GPRO for false-sharing or non-sharing tests
• IBI checking against AVP results for two independent instruction streams contained within single test
• Low level microarchitecture checkers written in C++
• L1/L2 interactions primary focus
– True-sharing scenarios, lock testing and storage access (“weak”) ordering checked
• GPRO employed but…
– IBI checking of these accesses is limited or not possible:
› Non-unique or non-deterministic results
› CML (architecture level coherency monitor) employed to detect the “right answer” as a post-simulation rule check
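When two instruction streams truly share a location, a load may legally return more than one value, so a single predicted result cannot be checked instruction by instruction. A post-simulation rule check instead asks whether each observed value is among the legal ones. The sketch below shows only the simplest such rule and is not the CML monitor: the function name, the tuple format, and the rule itself (a load must return the initial value or a value some thread stored to that address) are assumptions made for illustration, and real coherency checking also covers ordering.

```python
def illegal_loads(initial, stores, loads):
    """Return the loads whose value could not have come from anywhere.

    initial: dict mapping address -> initial value
    stores:  list of (thread, address, value) performed during the test
    loads:   list of (thread, address, value) observed in simulation
    """
    legal = {address: {value} for address, value in initial.items()}
    for _thread, address, value in stores:
        legal.setdefault(address, set()).add(value)
    return [load for load in loads if load[2] not in legal.get(load[1], set())]


initial = {0x100: 0}
stores = [(0, 0x100, 1), (1, 0x100, 2)]
loads = [(0, 0x100, 2), (1, 0x100, 0), (1, 0x100, 7)]
print(illegal_loads(initial, stores, loads))  # only the load of 7 is flagged
```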
Overview
1. Context : POWER5 vs. POWER6 microarchitecture comparison
2. Verification methodology: In the beginning…
3. The times they are a changing: SMT arrives in POWER5
4. POWER6: An in-order design should be simpler, but…
5. Future directions?
POWER5 SMT verification methodology
Evolutionary: based on single thread uniprocessor and SMP approaches
– Traditional SMP scenarios now self-contained in a single core simulation model
• Downward migration of dual-core methodology to single core model
New SMT verification scenario categories
– Shared resource and priority conflicts:
• SMT resource types:
– Equally shared between threads: Queue full conditions easier to hit
– Dynamically shared / tagged: Either thread can consume most/all of the resource
– Replicated: Not shared…same as single thread
– Dynamic thread mode switching: SMT->ST; ST->SMT
• Some applications attain better performance in ST mode
• Shared resources re-allocated on each mode switch
Traditional SMP approach applied to SMT verification
[Figure: test generation flow. SMP.def (test template) -> Test Generation -> SMT.tst (output test case)]
SMT.tst contents:
– Random t0 and Random t1 instruction streams
– t0 Registers and t1 Registers
– Core Level Registers common to both threads
– Real memory is common to both threads with test generator managing some potential overlap
Shared resource and priority conflicts
Approach was similar to SMP verification
– Testing largely consisted of “symmetric” instruction streams on each thread
• A particular resource targeted (e.g., GPR rename registers)
– 100 load instructions on each thread
– Coverage and lab feedback validated this approach
• Good enough: “Got the job done”
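A “symmetric” test of this kind can be sketched as a generator that gives both threads the same shape of instruction stream, here 100 loads each. This is a hypothetical stand-in for a test template, not GPRO syntax: the function name, the register and offset ranges, and the use of r1 as the base register are all assumptions.

```python
import random


def symmetric_load_test(count=100, seed=0):
    """Build one stream of `count` load instructions for each thread."""
    rng = random.Random(seed)

    def stream():
        # ld RT, DS(RA): 64-bit load with an 8-byte-aligned displacement.
        return [
            f"ld r{rng.randrange(2, 32)}, {8 * rng.randrange(64)}(r1)"
            for _ in range(count)
        ]

    return {"t0": stream(), "t1": stream()}


test = symmetric_load_test()
print(len(test["t0"]), len(test["t1"]))
print(test["t0"][0])
```

Both threads stress the same resource with the same kind of traffic, which is what makes the streams symmetric even though the individual operands differ.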
POWER5 dynamic thread mode switching
[Figure: flow of a thread mode switching test, with columns for Thread 0, Thread 1, and the Sim Driver]
– Initial State: all architected states initialized, thread enabled (both threads)
– Run State: random instructions on both threads
– Thread 0 terminates itself: save architected state, thread kills itself, shared resources reallocated
– Restart thread 0: interrupt (from the other thread or the sim driver), wake up thread, partition resources, restore architected state, random instructions
– Final State: normal finish, thread enabled (both threads)
POWER5 shared resource re-allocation on mode switch
[Figure: four bar charts comparing per-thread limits in SMT mode and ST mode]
– Rename Registers per thread (GPR, FPR): SMT mode vs. max ST mode; axis 0 to 200
– Load Miss Queue entries per thread: split in half in SMT mode; axis 0 to 10
– Branch Queue (BIQ) entries per thread: split in half in SMT mode; axis 0 to 20
– Max LRQ/SRQ entries per thread: dynamically shared; SMT mode vs. max ST mode; axis 0 to 40
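The resource categories in this part of the talk (split in half, dynamically shared, replicated) can be expressed as a small function that gives the most entries one thread may hold in each mode. This is a sketch under stated assumptions: the function name and policy strings are hypothetical, and the 8-entry size in the usage lines is made up, since the actual queue sizes are not recoverable from the charts.

```python
def per_thread_limit(entries, policy, mode):
    """Maximum entries a single thread may occupy.

    entries: size of the structure (for "replicated", the size of one copy)
    policy:  "split", "dynamic", or "replicated"
    mode:    "ST" or "SMT"
    """
    if mode not in ("ST", "SMT"):
        raise ValueError(f"unknown mode: {mode}")
    if policy not in ("split", "dynamic", "replicated"):
        raise ValueError(f"unknown policy: {policy}")
    if policy == "split" and mode == "SMT":
        return entries // 2   # equally shared: each thread gets half
    # Dynamic: either thread may consume up to the whole structure.
    # Replicated: each thread always has its own full copy.
    # ST mode: the single running thread gets everything.
    return entries


# Hypothetical 8-entry queue: only the split policy changes on a mode switch.
for policy in ("split", "dynamic", "replicated"):
    print(policy, per_thread_limit(8, policy, "SMT"), per_thread_limit(8, policy, "ST"))
```

The split case is the one that makes mode switching interesting to verify, because entries must be handed over or reclaimed whenever the mode changes.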
Overview
1. Context : POWER5 vs. POWER6 microarchitecture comparison
2. Verification methodology: In the beginning…
3. The times they are a changing: SMT arrives in POWER5
4. POWER6: An in-order design should be simpler, but…
5. Future directions?
POWER5: centralized complexity
POWER5
– Out-of-order design: Even in single thread mode, complex events naturally occur simultaneously
– Started from POWER4+: Known working design that was modified incrementally
– 23 FO4 design: Isolated complexity in Instruction Sequencing Unit (ISU):
• Every unit communicated back to ISU
• ISU resolved all exceptions and out-of-order conflicts
– ST and SMT modes both supported:
• Alternating dispatch cycles per thread
• Resources re-allocated on mode switch
[Figure: block diagram of the FXU, FPU, LSU, IFU, and ISU]
POWER6 distributed complexity
POWER6
– From-scratch mostly in-order design
• Normally, design is well behaved
• Cross-thread interaction necessary for “tough bugs”
– 13 FO4 design: Distributed complexity needed to achieve high performance goals
– Recovery unit (RU):
• Must resolve out-of-order FP with in-order pipelines
• Checkpoints machine state
• Recovers processor from soft errors
– Design is inherently in SMT mode all the time (almost)
• Dispatch to both threads in same cycle
• Most resources dynamically shared / tagged
• No resource reallocation on mode switch
[Figure: block diagram of the IFU, IDU, FPU, LSU, RU, and FXU]
POWER6 verification process
The different verification engines have different strengths related to the verification tasks
Software simulation
– Slow, but low penalty for highly intrusive checking of model internals. Total model visibility.
– Hundreds of AIX workstations running 24x7x365
– New enhancements helped keep pace with design complexity
– 2x number of simulation cycles of POWER5 design
Hardware-accelerated simulation
– 10x to 1,000x faster than SW sim, but needs less intrusive driving/checking to avoid slowing down the hardware box
– New usage: Mainline function verification
– Yields additional 3x simulation cycle advantage over POWER5 (5x cycle advantage overall)
(Semi)-formal verification
– (High to) exhaustive coverage, but higher skill needed to drive. Scaling problems with model size.
– Extensively used: Proved extremely valuable for complex SMT bugs
Hardware bring-up
– Ideal speed, very limited visibility/controllability
Software simulation enhancements
Random command driven unit simulation for most core units– Yielded >1 Million lines of C++ code
– More control over generation for low level events
– More efficient test generation
Irritator threads at “core model” level– “Symmetric” instruction stream approach employed on POWER5 proved inadequate
“S” in SMT is for “Simultaneous”, not “Symmetric”
– Target cross-thread interactions at the microarchitecture level
– ~2x test generation efficiency
– Ensures both threads running the same length (self adjusting)
Irritator thread example
[Figure: test generation flow. SMT_Irritator.def (test template) -> Test Generation -> SMT_Irritator.tst (output test case)]
SMT_Irritator.tst contents:
– Long Random t0 and Short Irritator t1 instruction streams
– t0 Registers and t1 Registers
– Core Level Registers common to both threads
– Real memory with test generator managing some potential overlap
Irritator thread restrictions
• Cannot cause unexpected exceptions
• Cannot modify memory read by random thread
• Cannot modify registers shared with other threads
• Architected results may be undefined
Irritator thread example
Long Random Thread:
SEQUENCE
  REPEAT 100
    SELECT Group_All
  stw nop, A      (Kill Irritator Thread)
Generated Instr: 101
Simulated Instr: 101
Irritator Thread:
SEQUENCE
  LB0: fdiv
  A: b to LB0
Generated Instr: 2
Simulated Instr: Infinite
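One reading of this example is that the irritator loops on two generated instructions until the long random thread finishes and stores a nop over the branch at label A, at which point the irritator falls through and ends. The toy model below illustrates that reading only. It is not the generator or the simulator: the names are hypothetical, instruction memory is a dict keyed by label, and each thread is assumed to execute one instruction per step.

```python
def run_irritator_test(main_length=100):
    """Return (main instructions executed, irritator instructions executed)."""
    # Irritator code as generated: two instructions forming an endless loop.
    memory = {"LB0": "fdiv", "A": "b LB0"}
    # Long random thread: main_length random instructions, then the kill store.
    main = ["random"] * main_length + ["stw nop, A"]

    main_pc = 0
    irritator_pc = "LB0"      # None once the irritator has fallen through
    irritator_executed = 0

    while main_pc < len(main) or irritator_pc is not None:
        if main_pc < len(main):
            if main[main_pc] == "stw nop, A":
                memory["A"] = "nop"       # overwrite the branch: kill the loop
            main_pc += 1
        if irritator_pc is not None:
            opcode = memory[irritator_pc]
            irritator_executed += 1
            if irritator_pc == "LB0":
                irritator_pc = "A"        # fdiv, then fall into A
            elif opcode == "b LB0":
                irritator_pc = "LB0"      # branch back: keep irritating
            else:
                irritator_pc = None       # nop at A: irritator ends
    return main_pc, irritator_executed


print(run_irritator_test())
```

Two generated instructions turn into as many simulated instructions as the main thread needs, which is why the approach keeps both threads busy for the same length of time without generating a second long stream.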
Simulation acceleration usage on POWER6
Extensively used on POWER6
– Run lab exercisers prior to tape-out
• Found additional bugs missed by software simulation
• Debug new exerciser functionality prior to lab
• Error injection and recovery testing
• Reproducibility of lab bugs in “simulation-like” environment for rapid debug of root cause
• Rapid testing of bug fixes and collateral damage testing
– Linux boot prior to tape-out
– Not employed on POWER5 for “mainline” functional verification
(Semi) Formal methods
Formal methods are a vital complement to simulation flow
– Lab bring-up bug re-creation
• Often faster reproduction than simulation based approaches
• Aids in root cause analysis
• High-coverage / proof of side-effect-free fixes
Error detection and soft error recovery
Biggest challenge on POWER6
– Why so hard?
• Myriad injection points coupled with large SMT state space
– Often needed multiple “rare” combinations of “asymmetric” events on both threads while specific error was injected
• End-to-end recovery testing difficult at unit level
– Really a “core” effort
– Verification strategy:
• Error injection and recovery on hardware accelerated simulation platform
• Dynamic on-the-fly error injection combined with “irritator threads” needed to cover large SMT recovery state space
Summary
1. SMT verification has four key pieces
– Traditional SMP-like effort
– Thread starvation and priority
– Starting and stopping threads
– Asymmetric “irritator thread” approach to verify often unforeseen cross-thread interactions at the microarchitecture level
2. “From-scratch in-order” SMT design was more difficult to verify than the “out-of-order retrofitted” SMT design
– Complex events only occurred due to cross thread interaction
– Even though team had experience
– Required more “weapons” in the arsenal
3. High frequency design drove distributed complexity
– Makes verification job harder
– Increased dependency on formal verification for difficult bugs
4. “Mainframe”-like RAS on POWER6 drove a huge amount of work that was difficult to attack at the unit level
Overview
1. Context : POWER5 vs. POWER6 microarchitecture comparison
2. Verification methodology: In the beginning…
3. The times they are a changing: SMT arrives in POWER5
4. POWER6: An in-order design should be simpler, but…
5. Future directions?
Future directions
Predictions
– RAS features will be an increasingly important feature of server systems
• POWER6 design has set the “bar” to a new high standard to which future processors will have to measure up
- Power Systems Revenue up 29% in 2Q08 (from 2Q07)
• Verification methods employed on POWER6 to attack nearly infinite state space created by the combination of SMT and processor recovery features will become standard practice
– A migration of “pre-silicon” verification techniques into “post-silicon” hardware lab verification effort
• Hardware is the fastest “simulator” available and the state space is getting bigger with SMT