computer architecture lab at 1 protoflex: status update and design experiences eric s. chung,...
Post on 21-Dec-2015
215 views
TRANSCRIPT
![Page 1: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/1.jpg)
Computer Architecture Lab at
1
ProtoFlex: Status Update and Design Experiences
Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi,James C. Hoe, Babak Falsafi, Ken Mai
{echung, enurvita, jhoe, babak, kenmai}@ece.cmu.edu
PROTOFLEX
Our work in this area has been supported in part by NSF, IBM, Intel, and Xilinx.
![Page 2: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/2.jpg)
222
Full-system Functional Simulation• Effective substitute for real (or non-existent) HW
– Can boot OS, run commercial apps
– Important in SW research & computer architecture
• But too slow for large-scale MP studies– Multicore won’t help existing tools
– Is serious challenge for large-MP (1000-way) simulation
REVIEW
![Page 3: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/3.jpg)
333
Alternative: FPGA-based simulation• Only 10x slower in clock freq than custom HW
• But FPGAs harder to use than software– Simulating large-MP (100- to 1000-way) can’t be done trivially
– Simulating full-system support need devices + entire ISA
The “build-all” strategy in FPGAs = significant effort + resources
Memory
PCI Bus
Ethernetcontroller
Graphics card
I/O MMUcontroller
DiskDisk
DMAcontroller
IRQ controller
Terminal
SCSIcontroller
CPU CPUFPGAs
![Page 4: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/4.jpg)
444
Reducing complexity w/ virtualization
Hybrid Full-System SimulationVirtualized MP Simulation
Only frequent behaviors hosted in FPGA. Relegate infrequent to SW.
Target full-system behaviors
FPGA Software
frequent infrequent
CPU CPU CPU CPU CPU
Logical CPUs multiplexed onto fewer physical CPUs.
Host resources
1 FPGA CPU
Host resources
Making multiple physical resources appear as a single logical resource
Making a single physical resource appear as multiple logical resources
21
![Page 5: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/5.jpg)
555
Outline
• Hybrid Full-System Simulation
• Virtualized Multiprocessor Simulation
• BlueSPARC Implementation
• Design Experiences
• Future Work
![Page 6: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/6.jpg)
666
3CPU
Hybrid Full-System Simulation
• 3 ways to map target component to hybrid simulation hostFPGA-only Simulation-only Transplantable
• CPUs can fallback to SW by “transplanting” between hosts– Only common-case instructions/behaviors implemented in FPGA
– Remaining behavs relegated to SW (turns out many of complex ones)
1 2 3
CPU CPU
Memory
MMU Fibre
Graphics NIC PCI
Terminal
SCSI
Software full-system simulator host
Hybrid Simulation
FPGA host
12
I/O instr
CPUCPU
transplant
Transplants reduce full-system design effort
CPUCPU CPU
Memory
MMU Fibre
Graphics NIC PCI
Terminal
SCSI
Software full-system simulator host
CPU
Software-only simulation
![Page 7: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/7.jpg)
777
Outline
• Hybrid Full-System Simulation
• Virtualized Multiprocessor Simulation
• BlueSPARC Implementation
• Design Experiences
• Future Work
![Page 8: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/8.jpg)
8
Virtualized Multiprocessor Simulation• Problem: large-scale simulation configurations challenging to
implement in FPGAs using structurally-accurate approaches
# processors in target model
Structural-accuracy1-to-1 mapping
between target and host CPUs
# host processors implemented in FPGA
Pros: fastest possible solution, only 10x slower than real HW
Cons: difficult to build for large-scale configs (e.g., >100-way)
10x slower than real HW
1-to-1
![Page 9: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/9.jpg)
999
Virtualized Multiprocessor Simulation
Advantages:
• Decouple logical target system size from FPGA host size
• Scale FPGA host as-needed to deliver required performance
• High target-to-host ratio (TH) simplifies/consolidates HW (e.g., fewer # nodes in cache coherence, interconnect)
# processors in target model
HostInterleavingMultiplex target
processors onto fewer # FPGA-hosted processors
# host “engines” implemented in FPGA
40x slower than real HW
4-to-1
![Page 10: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/10.jpg)
101010
What’s inside an FPGA host processor?
• An “engine” that architecturally executes multiple contexts– Existing multithreaded designs are good candidates
– Choice is influenced by TH ratio (target-to-host ratio)
• We propose an interleaved pipeline (e.g., TERA-style)– Best suited for high TH ratio
– Switch in new CPU context on each cycle
– Simple, efficient design w/ no stalling or forwarding
– Long-latency tolerance (e.g., cache miss, transplants)
– Coherence is “free” between CPUs mapped onto same engine
CPU CPU CPU
HOSTCPU
![Page 11: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/11.jpg)
111111
Outline
• Hybrid Full-System Simulation
• Virtualized Multiprocessor Simulation
• BlueSPARC Implementation
• Design Experiences
• Future Work
![Page 12: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/12.jpg)
1212
Implementation: BlueSPARC simulator
16-CPU Shared-memory UltraSPARC III Server
(SunFire 3800) Memory
MMU DMA
Graphics NIC SCSI
Terminal
PCI
CPUCPU CPU..
BEE2 Platform Simics (PC)Xilinx XCV2P70
DDR2MemDDR2Mem
InterleavedPipeline
CPUcontextCPU
context16xCPU
PowerPC
SimulatedI/O devices
![Page 13: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/13.jpg)
1313
BlueSPARC Simulator (continued)Processing Nodes 16 64-bit UltraSPARC III contexts
14-stage instruction-interleaved pipeline
L1 caches Split I/D, 64KB, 64B, direct-mapped, writebackNon-blocking loads/stores16-entry MSHR, 4-entry store buffer
Clock frequency 90MHz on Xilinx V2P70
Main memory 4GB total
Resources (Xilinx V2P70)
33,508 LUTs (50%), 222 BRAMs (67%) w/o stats+debug43,206 LUTs (65%), 238 BRAMs (72%)
Instrumentation All internal state fully traceableAttachable to FPGA-based CMP cache simulator*
EDA tools Xilinx EDK 9.2i, Bluespec System Verilog
Statistics 25K lines Bluespec, 511 rules, 89 module types
Checkpointing Fully compatible with Simics checkpointsCan load AND generate checkpoints
![Page 14: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/14.jpg)
1414
BlueSPARC host microarchitecture
TransplantUnit
TransplantUnit
1 2, 3 4,5 6 87 9,10,11 12,13 14
PowerPC405 (transplant service processor)
64KBI-cache64KB
I-cache
I-TLB16-entry(direct-
mapped)x16
I-TLB16-entry(direct-
mapped)x16
I-TLB128-entry
(2-way)x16
I-TLB128-entry
(2-way)x16
ALU1ALU1ALU2ALU2
64KBD-cache64KB
D-cache
TrapUnitTrapUnit
WritebackUnit
WritebackUnit
D-TLB512-entry
(2-way)x16
D-TLB512-entry
(2-way)x16
D-TLB16-entry
(fully-assoc)
x16
D-TLB16-entry
(fully-assoc)
x16
RegFileRegFile
DecodeDecodePC, statex16
PC, statex16
ContextSelectorContextSelector
AssistUnit
AssistUnit
Normal pipeline stageNormal pipeline stage Multi-context stateMulti-context state Transplant support unitTransplant support unit
64-bit ISA, SW-visible MMU, complex memory high # of pipeline stages
![Page 15: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/15.jpg)
1515
Hybrid host partitioning choices
BlueSPARC (FPGA) Micro-transplant (on-chip simulation)• add/sub/shift/logical• multiply/divide• register windows• 38/103 SPARC ASIs• interprocessor x-calls• device interrupts• I-/D-MMU + tlb miss• Loads/stores/atomics• VIS block memory
• 65/103 SPARC ASIs• VIS I/II multimedia• FP add/sub/mul/div + traps• FP/INT conversion• trap on integer arithmetic• alignment• fixed-point arithmetic• tlb/cache diagnostics• tlb demap
Transplant (off-chip simulation)•PCI bus•ISP2200 Fibre Channel•I21152 PCI bridge•IRQ bus•Fibre Channel SCSI disk/cdrom
•Text Console•SBBC PCI device•Serengeti I/O PROM•Cheerio-hme NIC•SCSI bus
BlueSPARC Micro-transplants(PowerPC405)
ON-CHIP FPGATransplants
(Simics on PC)
OFF-CHIP
![Page 16: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/16.jpg)
1616
Performance
Perf comparable to Simics-fast39x speedup on average over Simics-trace
010203040506070
orac
lebz
ip2
craft
y
gcc
gzip
pars
ervo
rtex
aver
age
MIP
S
BlueSPARC (90mhz)Simics-fast (2.0GHz C2Duo)Simics-trace
1.18
![Page 17: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/17.jpg)
171717
Outline
• Hybrid Full-System Simulation
• Virtualized Multiprocessor Simulation
• BlueSPARC Implementation
• Design Experiences
• Future Work
![Page 18: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/18.jpg)
18
Design experiences
2007 TimelineJanuary-February
Initial virtualization ideasAnalysis + simulation of interleavingISA profiling of apps for hybrid partitioningInitial specifications for host pipeline
March Simics API wrappers + software experimentsApril-November
BlueSPARC RTL developmentValidation tools
November-December
Host performance instrumentation and writeup*
* To appear in FPGA’08
![Page 19: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/19.jpg)
19
Design experiences (cont)
• What was important:– Developing effective validation strategies (more on next slide)
– Existing reference model (Simics) to study and compare against
– Efficient mapping of state to FPGA resources (e.g., 16 PCs 16-bit LUT-based distributed RAM)
– Coping with long Xilinx builds by easing up on timing constraints
– “Judicious” Bluespec
• What was NOT important:– Meeting 100MHz timing for every Xilinx build (i.e., deep pipelining)
– Implementing every functionality as efficiently/fast as possible
![Page 20: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/20.jpg)
20
Validation
• THE most challenging aspect of this project
• Strategies used– Auto-generated torture tests + hand-written test cases
– Auto-port test-cases from OpenSPARC T1 framework to UltraSPARC III
– Validated single-threaded + multithreaded ISA execution against Simics (both in Verilog Simulations and in FPGA)
– Flight data recorder for non-deterministic interleaving of CPUs
– Batched Verilog simulations w/ varying parameters
– Validate non-blocking memory system with “shadow” flat memories during Verilog simulation caught self-modifying code bugs
– > 200 synthesizable assertions to Chipscope
– Built-in deadlock/error detectors
![Page 21: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/21.jpg)
21
In retrospect…
• What I would have done differently to begin with– Write entire USIII functional model myself in software first
– Take more advantage of Verilog PLI for validation (interface to C)
– Don’t over-engineer HDL
– Don’t upgrade tools unless necessary (e.g., trial license runs out)
– Validation infrastructure w/ batching capabilities (do earlier!)
– Automated “binary search” tool for bug hunting
– Re-write DDR2 Async FIFOs without BRAMs
– Fast memory checkpoint loader (3GB images per run = 25m)
– Simple, correct >> Fast, buggy
![Page 22: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/22.jpg)
22
Future Work
• Scalability– Burden-of-proof for 1000-way simulation?
– Investigate cache-coherence/interconnect mechanisms for combining multiple interleaved pipelines
• Virtualization design spaces– On-chip storage virtualization (e.g., architectural state)
– Memory + disk capacity (e.g., HW-based demand paging?)
– Virtualizing instrumentation (e.g., paging functional cache tags)
• Fast instrumentation tools– Understanding systems at multiple levels of abstraction (beyond ISA)
– Validation+analysis: beyond ISA, how to sanity-check app+sys behavior?
![Page 23: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/23.jpg)
23
BlueSPARC Demo on BEE2
23
• Demo application– On-Line Transaction
Processing benchmark (TPC-C) in Oracle
– Runs in Solaris 8 (unmodified binary)
– FPGA + Memory directly loaded from Simics checkpoint
4 DDR2 Controllers + 4 GB memory
Ethernet (to Simics
on PC)
Virtex-II Pro 70 (PowerPC & BlueSPARC) RS232 (Debugging)
BEE2 Platform
![Page 24: Computer Architecture Lab at 1 ProtoFlex: Status Update and Design Experiences Eric S. Chung, Michael Papamichael, Eriko Nurvitadhi, James C. Hoe, Babak](https://reader030.vdocuments.us/reader030/viewer/2022032704/56649d6d5503460f94a4d421/html5/thumbnails/24.jpg)
242424
Conclusion
• “Build-all” simulation approach in FPGAs is challenging
• Two virtualization techniques for reducing complexity
– Hybrid: attain full-system by deferring rare behavs to SW
– Virtualized MP: decouples target system size from host size
• BlueSPARC proof-of-concept
– Models 16-cpu UltraSPARC III server
– Comparable perf to Simics-fast, 39x on avg faster than Simics-trace
• Thanks! Questions? [email protected]• PROTOFLEX (http://www.ece.cmu.edu/~simflex/protoflex.html)