Usability Challenges for RAMP2
Eric Chung, James C. Hoe
Computer Architecture Lab
Post on 21-Dec-2015
2
ProtoFlex in a nutshell
• FPGA-accelerated full-system simulation by virtualization
– Hybrid Full-System Simulation
– Multiprocessor Host Interleaving
[Figure, single-slide review: CPU, memory, and devices, with common-case behaviors separated from uncommon behaviors; 4-way processors (P) sharing memory]
4
Then came . . .
Simulated target console window
Host command-line for breakpoints, introspection, modification
Inspect/modify registers
Create checkpoint
Undo the last instruction!
5
Now “they” want . . .
• Using SW simulator, takes 5 lines of Python
– Body of callback runs arbitrary instrumentation code
Execute ‘exc_callback’ every time a CPU hits an exception
Print out exception name and triggering PC
How would you do this in FPGAs?
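The slide's code screenshot did not survive in this transcript. As a rough illustration of the kind of script it describes, here is a minimal sketch that registers a callback and prints the exception name and triggering PC. `MockSimulator`, `MockCPU`, `add_exception_callback`, `raise_exception`, and the exception name are all hypothetical stand-ins, not the Simics or ProtoFlex API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class MockCPU:
    """Hypothetical CPU object exposing only a name and a program counter."""
    name: str
    pc: int = 0


ExcCallback = Callable[[MockCPU, str], None]


@dataclass
class MockSimulator:
    """Hypothetical stand-in for a software simulator's scripting interface."""
    callbacks: List[ExcCallback] = field(default_factory=list)

    def add_exception_callback(self, fn: ExcCallback) -> None:
        # Remember the callback so it runs on every simulated exception.
        self.callbacks.append(fn)

    def raise_exception(self, cpu: MockCPU, exc_name: str) -> None:
        # Deliver the exception event to every registered callback.
        for fn in self.callbacks:
            fn(cpu, exc_name)


log: List[str] = []


def exc_callback(cpu: MockCPU, exc_name: str) -> None:
    # The callback body can run arbitrary instrumentation code.
    line = f"{cpu.name}: {exc_name} at PC=0x{cpu.pc:x}"
    log.append(line)
    print(line)


sim = MockSimulator()
sim.add_exception_callback(exc_callback)
# Illustrative exception name and PC; in a real run the simulator raises these.
sim.raise_exception(MockCPU("cpu0", pc=0x1000), "Data_Access_MMU_Miss")
```

In a software simulator, the registration and the callback body are the whole job, which is why the slide counts it in single-digit lines.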
6
What else could “they” want
• Interaction with virtual display of target system
• Fully deterministic and controllable execution
• Command-line control and scripting capabilities
• API for state inspection/modification
• Modularity features for adding/changing components
• Checkpoint save/restore
• Host-target communication (e.g., for bootstrapping)
• Full-system I/O capabilities (e.g., OS)
• Target resource virtualization
7
Outline
• Introduction
• Practical Feature Development
• Case Study: ProtoFlex Monitoring
• Closing Thoughts
8
Practical Feature Development
• Porting simulator features to FPGAs is not easy
– RTL modification almost always required
– Unlike SW, state in FPGA not easy to inspect/modify (but required in most cases)
• Goal: make feature porting easier!
– With minimum FPGA expertise
9
Example
• Using SW simulator, takes 5 lines of Python
– Body of callback runs arbitrary instrumentation code
Execute ‘exc_callback’ every time a CPU hits an exception
Print out exception name and triggering PC
How would we implement in FPGAs?
10
How to implement in FPGA?
• Necessary steps
– Modify RTL of FPGA soft core to monitor exceptions (add bits to pipeline stages, modify decoder)
– Collect PC register during exceptions into trace buffer
– Simulate, debug, synthesize, place + route
– Collect/compress traces from multiple CPU cores (possibly across multiple FPGAs)
– Decompress/post-process traces and print
Can we reduce effort for RAMP developers?
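Only the last two steps above run in host software. A minimal sketch of that part follows, assuming a made-up fixed-size record of (CPU id, exception number, PC) and using `zlib` as a stand-in for whatever compression the hardware path would use. The record layout and the function names are assumptions for illustration only.

```python
import struct
import zlib
from typing import Iterable, List, Tuple

# Hypothetical trace record: cpu id (u8), exception number (u16), PC (u64),
# big-endian with no padding, so each record is 11 bytes.
RECORD = struct.Struct(">BHQ")
TraceRecord = Tuple[int, int, int]


def pack_trace(records: Iterable[TraceRecord]) -> bytes:
    """Serialize and compress records, as a trace collector might."""
    raw = b"".join(RECORD.pack(*rec) for rec in records)
    return zlib.compress(raw)


def unpack_trace(blob: bytes) -> List[TraceRecord]:
    """Decompress a blob and split it back into fixed-size records."""
    raw = zlib.decompress(blob)
    return [RECORD.unpack_from(raw, off)
            for off in range(0, len(raw), RECORD.size)]


def format_record(rec: TraceRecord) -> str:
    """Render one record in a human-readable form."""
    cpu, exc, pc = rec
    return f"cpu{cpu}: exception {exc:#x} at PC={pc:#x}"


trace = [(0, 0x68, 0x10004000), (3, 0x64, 0x10008abc)]
for rec in unpack_trace(pack_trace(trace)):
    print(format_record(rec))
```

The RTL changes, simulation, synthesis, and place and route in the earlier steps have no software shortcut, which is where most of the effort goes.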
11
Justifying the hardware
• For some efforts, RTL change unavoidable
– Ex: redesign memory subsystem, change # cores
• But for other things, can we do better?
– Instrumentation example from earlier? (print the PC during exceptions)
– Testing a new instruction?
– Inspecting a few CPU registers?
12
Observation
• Only frequent uses of a given hardware modification benefit from FPGA speedup
• Can we relegate infrequent events to software?
• Examples
– Instrumenting rare events (e.g. exceptions)
– Monitoring/analyzing subset of instruction traces
– Periodic sampling of counters
– Monitor range of ‘watched’ memory addresses
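A back-of-the-envelope model makes the observation concrete. The sketch below assumes each monitored event stalls the FPGA engine for a fixed software handling cost, measured in the time the engine would otherwise need to execute that many instructions. The function name and the example rates and costs are illustrative assumptions, not measurements.

```python
import math


def monitoring_slowdown(events_per_million_instr: float,
                        handler_cost_instr: float) -> float:
    """Relative runtime (1.0 = no overhead) when events are handled in software.

    handler_cost_instr is the stall per event, expressed as the number of
    instructions the engine could have executed in that time.
    """
    base = 1_000_000.0
    overhead = events_per_million_instr * handler_cost_instr
    return (base + overhead) / base


# Rare event (assumed): 10 per million instructions, 10,000-instruction handler.
rare = monitoring_slowdown(10, 10_000)
# Frequent event (assumed): 10,000 per million instructions, same handler.
frequent = monitoring_slowdown(10_000, 10_000)

print(f"rare events:     {rare:.2f}x runtime")
print(f"frequent events: {frequent:.2f}x runtime")
```

Under these assumed numbers the rare event costs about 10% while the frequent one erases the FPGA speedup, which is the case where a hardware implementation is still justified.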
13
Outline
• Usability Challenges
• Practical Feature Development
• Case Study: ProtoFlex Monitoring
• Closing Thoughts
14
Case Study: ProtoFlex Monitoring
• Our objective:
– Diagnose an ‘anomaly’ while running commercial apps in BlueSPARC simulator*
• Requirements:
– At runtime, get names of processes running on CPUs
– Extract/verify user- and kernel-level stack traces
– WITHOUT modifications to target workload or OS
*BlueSPARC is our 16-CPU full-system FPGA-based simulator
15
Technique Used: Whitebox Tool
• Whitebox Profiling
– Input: real-time traces from full-system simulation
– Output: human-readable stack traces and visualization
– Simics tool authored by Mike Ferdman & Brian Gold
– Fewer than 300 lines of ‘Simics’ Python
[Chart: "DB2 2-CPU-1CL (CPU 0 - Server)"; x-axis: 20 intervals spanning 0 to 2B cycles; y-axis ticks from 0 to 120000; label "% user"; legend: unknown, tpcc, sh, sched, nscd, fsflush, db2sysc, db2set, db2fmp, db2fmd, db2fmcd, db2fm, db2bp, automountd]
16
Supporting Whitebox in ProtoFlex
• Basic whitebox technique:
– Simulation runtime is periodically halted
– Various registers are first checked
– Virtual-to-physical translations used to locate key data structures
– Physical memory reads are used to extract kernel state
• Naïve solution
– Add state machine to FPGA soft core to perform the steps
– Works but inflexible; may require significant HW changes
Is there an easier way?
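The basic whitebox steps above can be sketched in software against a mock target. Everything here is an assumption made for illustration: the `MockTarget` class, the 8 KB page size, the register `g7` holding a pointer to the current process structure, and the offset of the name field. None of it is the actual kernel layout or the ProtoFlex interface.

```python
from typing import Dict

PAGE_BYTES = 0x2000  # assumed 8 KB pages


class MockTarget:
    """Hypothetical halted simulation exposing registers, MMU, and memory."""

    def __init__(self, regs: Dict[str, int], page_table: Dict[int, int],
                 phys_mem: Dict[int, int]) -> None:
        self.regs = regs
        self.page_table = page_table  # virtual page number -> physical page number
        self.phys_mem = phys_mem      # physical byte address -> byte value

    def read_register(self, name: str) -> int:
        return self.regs[name]

    def translate(self, vaddr: int) -> int:
        vpn, offset = divmod(vaddr, PAGE_BYTES)
        return self.page_table[vpn] * PAGE_BYTES + offset

    def read_memory(self, paddr: int, size: int) -> bytes:
        return bytes(self.phys_mem.get(paddr + i, 0) for i in range(size))


def current_process_name(target: MockTarget, name_offset: int = 0x10,
                         name_len: int = 16) -> str:
    # 1) Check a register (assumed to point at the current process structure).
    proc_vaddr = target.read_register("g7")
    # 2) Translate the virtual address of the name field to a physical one.
    #    Simplification: the field is assumed not to cross a page boundary.
    paddr = target.translate(proc_vaddr + name_offset)
    # 3) Read physical memory and decode the NUL-terminated name.
    raw = target.read_memory(paddr, name_len)
    return raw.split(b"\0", 1)[0].decode("ascii")


name_bytes = b"db2sysc\0"
target = MockTarget(
    regs={"g7": 0x4000_0100},
    page_table={0x4000_0100 // PAGE_BYTES: 0x5},
    phys_mem={0xA110 + i: b for i, b in enumerate(name_bytes)},
)
print(current_process_name(target))
```

The three calls are all the inspection code needs from the platform, which is why exposing them to software is attractive compared with a dedicated state machine.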
17
Solution: Hybrid Simulation for Monitoring
• ProtoFlex hybrid simulation
– Recall in ProtoFlex: CPU pipeline implements only subset of instructions; nearby hard core simulates ISA remainder
[Figure: Virtex II Pro 70 with a 16-way pipeline and a PowerPC (hard core) on a processor bus, plus Ethernet and an interface to a 2nd FPGA (memory)]
PowerPC simulates unimplemented SPARC instructions. Operation is called a ‘Transplant’
Transplants can be used for monitoring!
18
Transplants for Monitoring
[Figure: Virtex II Pro 70 with a 16-way pipeline and a PowerPC (hard core) on a processor bus, plus Ethernet and an interface to a 2nd FPGA (memory)]
1) ‘Simulation’ engine periodically requests transplant to PowerPC
2) PowerPC performs ‘inspection’ by requesting register/memory state from engine
transplant() {
  …
  read_register(…)
  translate(…)
  read_memory(…)
  …
}
Inspection/monitoring code written in C language
SEE DEMO!
19
Tradeoffs
• Advantages
– SW approach to flexibly monitor events of interest
– For rare events, performance impact negligible
– Validate instrumentation idea before building in HW
• Disadvantages
– If events occur too frequently, must accelerate in HW
– How to know which HW interfaces to provide?
– How to know which events to monitor?
– How to scale to multiple engines?
20
Designing the HW/SW Interface
• In our design, we needed new interfaces between ProtoFlex engine & PowerPC
– Engine can issue memory requests on behalf of PowerPC
– Engine can issue TLB translations on behalf of PowerPC
Still required HW modification!
• For general-purpose monitoring, what interfaces needed?
21
Designing the HW/SW Interface
• Existing simulators are a good place to look
– E.g., Simics provides a library of more than 100 API calls used for inspection/modification/monitoring
• Example API methods:
– read_register(), write_register(), translate(), etc.
• Also over 50 unique ‘event’ types in Simics:
– Used to trigger monitoring callback functions
– Ex: exceptions, watched memory locations, etc.
Build these APIs into RAMP?
22
Addressing Scalability Challenges
• In BlueSPARCv1.0, only 1 centrally-located core
– How to scale monitoring up to tens or hundreds of cores?
• Can we disable ½ of the host cores?
– To monitor the other half
– And provide general-purpose instrumentation / monitoring?
– Compile Simics API calls into distributed kernels that run on cores in monitoring mode
[Figure: row of eight host CPUs]
23
Outline
• Usability Challenges
• Practical Feature Development
• Case Study: ProtoFlex Monitoring
• Closing Thoughts
24
Closing Thoughts
• Attention to user and developer usability is critical for practical RAMP adoption
– Goal: minimize FPGA expertise required when possible
• For users, provide familiar SW-simulation interface
• For developers, provide general-purpose monitoring that is programmable, comprehensive, and scalable
25
Ongoing Work at CMU
• BlueSPARC simulator (ProtoFlex)
– Currently supports subset of Simics user interface
– Supports general-purpose software programmable monitoring
– Virtual console/GFX supported via hybrid simulation
• Still many challenges left
– Not all Simics commands map easily to FPGA
– Execution is non-deterministic
– Checkpoint generation/loading works (but very slow)
– No ‘Undo-ing’ instructions
– Fine-grained ‘stepping’ for large-scale configurations
– Minor monitoring changes still require re-synthesizing
Release planned for 2009
26
Thanks! Any questions?
echung@ece.cmu.edu
http://www.ece.cmu.edu/~protoflex
Acknowledgements
We would like to thank our colleagues in the RAMP and TRUSS projects.
COME SEE OUR DEMO!
28
Typical Simulator ‘Must-Haves’
• Features commonly available in simulators today:
– Interaction with virtual display of target system
– Fully deterministic and controllable execution
– Command-line control and scripting capabilities
– API for state inspection/modification
– Modularity features for adding/changing components
– Checkpoint save/restore
– Host-target communication (e.g., for bootstrapping)
– Full-system I/O capabilities (e.g., OS)
– Target resource virtualization
29
Software Usage Example
Simulated target console window
Host command-line for breakpoints, introspection, modification
Inspect/modify registers
Create checkpoint
Undo the last instruction!
30
Bringing SW features to RAMP
• Can users with no knowledge of FPGAs use RAMP out-of-the-box?
• The litmus test
– A user working through a simulator front-end cannot tell whether the back-end is FPGAs or software
31
Closing Thought: Unification
• Common UI to ‘unify’ simulators and FPGAs
– Ex: use ‘Simics’ front-end, back-end is either FPGAs or SW (ProtoFlex has limited form of this)
– Avoid reinventing API/interface; users already familiar
• Benefits of interoperability:
– Gentle transition of whole generation of full-system simulation users to RAMP
– Support legacy scripts, workloads, configurations
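One way to picture the unification idea is a single inspection interface with interchangeable back-ends. The sketch below is an assumption-laden illustration: `Backend`, `SoftwareBackend`, `MockFpgaBackend`, and `legacy_script` are hypothetical names, and the "FPGA" back-end only logs the requests it would send to hardware.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple

RegKey = Tuple[int, str]


class Backend(ABC):
    """Minimal inspection interface a front-end script could target."""

    @abstractmethod
    def read_register(self, cpu: int, name: str) -> int: ...

    @abstractmethod
    def write_register(self, cpu: int, name: str, value: int) -> None: ...


class SoftwareBackend(Backend):
    """State lives in host memory, so access is a dictionary lookup."""

    def __init__(self) -> None:
        self.regs: Dict[RegKey, int] = {}

    def read_register(self, cpu: int, name: str) -> int:
        return self.regs.get((cpu, name), 0)

    def write_register(self, cpu: int, name: str, value: int) -> None:
        self.regs[(cpu, name)] = value


class MockFpgaBackend(Backend):
    """Pretends state lives on an FPGA: every access becomes a logged request."""

    def __init__(self) -> None:
        self.hw_state: Dict[RegKey, int] = {}
        self.requests: List[str] = []

    def _send(self, op: str, cpu: int, name: str,
              value: Optional[int] = None) -> int:
        # A real back-end would talk to hardware here; this one only records.
        self.requests.append(f"{op} cpu{cpu}.{name}")
        if op == "write" and value is not None:
            self.hw_state[(cpu, name)] = value
            return 0
        return self.hw_state.get((cpu, name), 0)

    def read_register(self, cpu: int, name: str) -> int:
        return self._send("read", cpu, name)

    def write_register(self, cpu: int, name: str, value: int) -> None:
        self._send("write", cpu, name, value)


def legacy_script(backend: Backend) -> int:
    """A script written once against the interface, unaware of the back-end."""
    backend.write_register(0, "pc", 0x1000)
    return backend.read_register(0, "pc") + 4


print(hex(legacy_script(SoftwareBackend())))
print(hex(legacy_script(MockFpgaBackend())))
```

The script produces the same result on either back-end, which is the property the litmus test asks for.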
32
Other Simulation Features
• How to provide full-system checkpoints?
– Must save/restore CPU/memory/device states
– But can’t just quickly dump/load 64GB of memory!
• Supporting ‘pause’ and ‘rewind’ in HW
• Deterministic/controllable execution
• ‘Instantaneous’ inspection/modification of distributed CPU/Mem/Device state
33
Case Study: ProtoFlex WhiteBox
• Our goals
– Profile IBM DB2/TPCC
– Identify which processes are executing on each CPU at fine-grained intervals (1000s of instructions)
• Technique
– Periodically suspend simulator then access kernel data structures (in known physical memory locations)
– Extract process information from kernel
34
Tools for Visualization/Monitoring
• How to build tools that can make sense out of the behavior of 1000 concurrent threads?
• Dataflow visualization
– E.g., Data flow tomography [Sherwood08]
• Performance monitoring
– E.g., Estimate multi-core cache miss rates
• Black-box program profiling
– E.g., Invisible kernel introspection
35
How to instrument in a practical way?
• E.g., adding new counter to CPU
• Can we have our cake and eat it too?
– SW-like programming abstraction
– Without resynthesizing, while keeping FPGA speeds
36
Example 1: Real-time Cache Models
• Generate cache model performance in real time
• Applications:
– Generating cache state checkpoints
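As a rough idea of what a software cache model consuming an address stream might look like, here is a minimal set-associative LRU sketch. The class name, geometry, and address sequence are assumptions for illustration. The slide does not describe how the ProtoFlex cache models are built.

```python
from collections import OrderedDict
from typing import List


class LRUCacheModel:
    """Set-associative cache model with LRU replacement (tags only, no data)."""

    def __init__(self, num_sets: int = 4, ways: int = 2,
                 line_bytes: int = 64) -> None:
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # One ordered dict per set; insertion order tracks recency.
        self.sets: List[OrderedDict] = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, paddr: int) -> bool:
        """Record one access; return True on a hit, False on a miss."""
        line = paddr // self.line_bytes
        cache_set = self.sets[line % self.num_sets]
        tag = line // self.num_sets
        if tag in cache_set:
            cache_set.move_to_end(tag)  # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(cache_set) >= self.ways:
            cache_set.popitem(last=False)  # evict least recently used
        cache_set[tag] = True
        return False

    def miss_rate(self) -> float:
        total = self.hits + self.misses
        return self.misses / total if total else 0.0


model = LRUCacheModel(num_sets=1, ways=2, line_bytes=64)
results = [model.access(addr) for addr in (0, 64, 0, 128, 64)]
print(results, model.miss_rate())
```

The tag state left in `model.sets` after a run is the kind of information a cache state checkpoint would need to capture.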
37
Example 2: Black-Box Profiling
• Used in ProtoFlex to profile black-box commercial workloads (e.g., IBM DB2, Oracle)
38
Life-cycle of simulation (approximately)
Great Idea → Design → Implement + Instrument → Simulate → Measure → Publish
40
Back-of-the-envelope calculation
• Let’s calculate opportunity cost of HW-simulation
• Assumptions
– Only goal is to measure given metric (e.g., IPC)
– Don’t care about prototyping
• 12 hours to design, simulate, P&R
– 12 hours = 12 × 3600 s × 1 KIPS ≈ 43M instructions
– On a cluster of 100, can simulate ~4B instructions using detailed timing models in 24 hours
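The arithmetic can be checked with a few lines. The sketch assumes every node sustains exactly 1 KIPS, which is the rate the first figure uses. Under that assumption a 100-node cluster reaches about 4.3B instructions in 12 hours and about 8.6B in 24 hours, so the slide's ~4B in 24 hours may reflect a lower per-node rate for detailed timing models. That reading is an assumption; the slide does not say.

```python
KIPS = 1_000  # instructions per second, the rate the slide assumes


def instructions_simulated(hours: float, ips: int = KIPS, nodes: int = 1) -> int:
    """Instructions simulated in `hours` by `nodes` machines at `ips` each."""
    return int(hours * 3600 * ips * nodes)


single_12h = instructions_simulated(12)              # one software simulator
cluster_12h = instructions_simulated(12, nodes=100)  # 100-node cluster, 12 hours
cluster_24h = instructions_simulated(24, nodes=100)  # 100-node cluster, 24 hours

print(f"1 node,   12 h: {single_12h / 1e6:.1f}M instructions")
print(f"100 nodes, 12 h: {cluster_12h / 1e9:.2f}B instructions")
print(f"100 nodes, 24 h: {cluster_24h / 1e9:.2f}B instructions")
```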
41
The FPGA Usability Challenge
• Despite impressive proof-of-concepts, FPGAs still not widely adopted in arch community
• FPGAs are not user-friendly
– Simulators easier to modify/use
• Lack of instant gratification slows productivity
– How long to build and run ‘Hello World’ on FPGA?
42
Usability of FPGAs
User Class | Usage Description | Required FPGA expertise
Casual User | Parallel programming on new architectures | Low/None?
Serious User | Use predefined target machine; want to tweak HW parameters; requires inspection/changes to system state | Low/Med?
Casual Developer | Large changes to architecture or components; monitoring tools to inspect low-level info | Med/High?
Serious Developer | Build new components or special-purpose processing elements from scratch | High?
Required expertise should be minimized when possible
43
The FPGA Usability Challenge
• Challenges for Users
– Typical simulation features missing or hard-to-build
– Low runtime visibility into FPGA HW
• Challenges for Developers
– Even mundane tasks require RTL design/debugging
– Long synthesis turnaround times (up to hours/days)
– Must learn new languages and (buggy) tools
How to improve usability with RAMP2?
44
Closing the Usability Gap
• Ideally want to provide:
– Fast ‘SW’ simulation interface for casual/serious users
– Fast programming abstractions for casual developers without re-synthesizing designs
• Without sacrificing (too much) FPGA performance