RISC-V fault tolerant processor implementation · 2019-09-25
Post on 16-Jul-2020
RISC-V fault tolerant processor
implementation
Alfonso Sánchez-Macián
ARIES Research Center
2
Agenda
• Project and ISA.
• Error sources and error protection.
• Characterizing and protecting the ISA.
• Implementations. Fault Tolerance.
• Characterizing implementations.
• Example of TMR protection.
• Example of module protection: TLB.
• Example of protection at RTL level: Register Set.
3
RISC-V
• RISC-V is an Instruction Set Architecture (ISA).
• Open source and free.
• Originally developed at UC Berkeley.
• Supported by a Foundation with more than 200 members.
• The Foundation defines the evolution of the specifications and the HW/SW environment.
4
RISC-V Specifications
• User-level ISA.
• Modular design.
• Privileged ISA.
• Draft.
• Debug.
5
User-level ISA
• Base: RV32I/RV64I/RV128I. RV32E.
• Extensions:
• M: Integer multiplication and division.
• A: Atomic instructions.
• F: Single-Precision Floating-Point.
• D: Double-Precision Floating-Point.
• Q: Quad-Precision Floating-Point.
• L: Decimal Floating-Point.
• C: Compressed Instructions.
• B: Bit Manipulation.
• J: Dynamically Translated Languages.
• T: Transactional Memory.
• P: Packed-SIMD Instructions.
• V: Vector Operations.
• N: User-Level Interrupts.
6
User-level ISA
• Each implementation should state which modules are
supported:
• Example: RV32IMAC
• RV32G = RV32IMAFD
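The naming convention above can be illustrated with a small parser. This is a minimal sketch, not part of the talk: the function name `parse_isa` is hypothetical, and it handles only the base width plus the single-letter extensions listed on the previous slide, expanding `G` to `IMAFD`.

```python
import re

# Single-letter bases/extensions named on the slides; "G" is shorthand for IMAFD.
KNOWN_LETTERS = "IEMAFDQLCBJTPVN"


def parse_isa(name):
    """Split a string such as 'RV32IMAC' into (xlen, [letters])."""
    match = re.fullmatch(r"RV(32|64|128)([A-Z]+)", name.upper())
    if match is None:
        raise ValueError(f"not a recognised ISA string: {name!r}")
    xlen = int(match.group(1))
    # Expand the general-purpose shorthand before validating the letters.
    letters = match.group(2).replace("G", "IMAFD")
    unknown = [c for c in letters if c not in KNOWN_LETTERS]
    if unknown:
        raise ValueError(f"unknown extension letters: {unknown}")
    return xlen, list(letters)
```

For example, `parse_isa("RV32G")` and `parse_isa("RV32IMAFD")` return the same pair, which is the equivalence stated on the slide.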
7
Agenda
• Project and ISA.
• Error sources and error protection.
• Characterizing and protecting the ISA.
• Implementations. Fault Tolerance.
• Characterizing implementations.
• Example of TMR protection.
• Example of module protection: TLB.
• Example of protection at RTL level: Register Set.
8
Error sources
• Processors may suffer from:
• Hard errors (permanent errors)
• Defects during the manufacturing process.
• Processor wear-out.
• Soft errors (temporary errors)
• In memories, the main cause is radiation.
• In logic, there is also:
• Power supply variations → wrong gate behavior, crosstalk.
• Temperature → delay.
9
Fault tolerance
• Reducing the error probability or the consequences.
• Example: interleaving, scrubbing.
• Error detection.
• Example: parity.
• Error correction.
• Example: error correction codes (ECC).
10
Fault tolerance II
• Radiation Hardening By Software:
• Software techniques. For instance: instruction duplication and verification, invariant monitoring.
• Application based detection. For instance: task duplication.
• Radiation Hardening By Design:
• ISA oriented.
• Hardware oriented:
• HW architecture (For example: TMR, ECCs)
• HW modules: ad-hoc approaches.
• RTL level.
• Radiation Hardening By Process
11
Fault tolerance III
• Spatial redundancy.
• N-Modular Redundancy (NMR).
• Diverse NMR.
• Reduced Precision Redundancy (RPR).
• Temporal redundancy.
• Information redundancy.
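As an illustration of spatial redundancy, the simplest NMR case (N = 3, i.e. TMR) can be sketched as a bitwise majority voter. This is a generic sketch, not code from the talk; the name `vote` is hypothetical.

```python
def vote(a, b, c):
    """Bitwise 2-out-of-3 majority vote over three redundant words."""
    return (a & b) | (a & c) | (b & c)


# One of the three copies suffers a single bit flip; the voter masks it.
good = 0b1011_0010
faulty = good ^ (1 << 5)
assert vote(good, faulty, good) == good
```

The voter masks any error confined to one copy per bit position. It does not protect against the same bit failing in two copies, nor against a fault in the voter itself.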
12
Failure modes: ASICs vs. FPGA
• ASICs:
• Voltage spike at a node of a circuit (Single Event Transient - SET) or bit flips
in stored information (Single Event Upset - SEU).
• Solutions are simulated or a prototype is created using FPGAs before
producing the actual circuit.
• Reconfigurable FPGAs:
• Configuration memory cross-section. Upsets change the circuit and its behavior.
• They can also suffer from errors in the user memory (SEUs) and other resources (Single Event Functional Interrupt).
• Other types of errors, such as Multiple Cell/Bit Upsets (MCU/MBU), Adjacent Bit Upsets…
13
Agenda
• Project and ISA.
• Error sources and error protection.
• Characterizing and protecting the ISA.
• Implementations. Fault Tolerance.
• Characterizing implementations.
• Example of TMR protection.
• Example of module protection: TLB.
• Example of protection at RTL level: Register Set.
14
Errors and program execution
• Errors are classified by their effect on the execution of a program.
• What is the effect of an SEU on an instruction of a program running on a microprocessor?
• No error (error is masked).
• Hard fault /Memory Access Exception.
• Program hangs.
• Silent Data Corruption (SDC).
15
Characterizing the ISA
• What happens when there is a bit flip in the binary representation of the instruction?
J. A. Martínez, J. A. Maestro and P. Reviriego, "Evaluating the Impact of the Instruction Set on Microprocessor Reliability to Soft Errors," in IEEE Transactions on Device and Materials Reliability, vol. 18, no. 1, pp. 70-79, March 2018.
16
Characterizing the ISA
• What happens when there is a bit flip in the binary representation of the instruction?
• The instruction turns into a different one (or into an invalid opcode for the ISA).
• If the effect is a program fault (hard fault, hang), it is possible to
detect the error. An SDC is the worst outcome.
• Example: changing the instruction into a Load or Store may
produce a memory access exception.
• Instruction encoding differs among different ISAs. It is usually
optimized for its HW implementation.
17
Characterizing the ISA
18
Intrinsic protection
Example with ARM Cortex M0
19
Characterizing the ISA
• Are there bit positions with a lower probability of producing an SDC when a bit flip occurs?
• Analysis of SDC rates as a function of the flipped bit.
• RV32G: bit 3 is the one that produces the fewest SDCs.
SDC rates as a function of the flipped bit
20
Protecting the ISA
• Parity can be added to detect the error.
• But it requires adding a bit to all the structures that store instructions.
[Figure: the 32 bits of an RV32G instruction are XORed (parity) to generate 1 extra bit, which is stored next to the 32-bit instruction.]
J. A. Martínez, J. A. Maestro and P. Reviriego, "A Scheme to Improve the Intrinsic Error Detection of the Instruction Set Architecture," in IEEE Computer Architecture Letters, vol. 16, no. 2, pp. 103-106, 1 July-Dec. 2017.
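The conventional scheme on this slide (one extra parity bit per stored instruction) can be sketched as follows. This is an illustrative sketch only; the names `parity32`, `store` and `check` are hypothetical, and the 33-bit storage is modelled as a tuple.

```python
def parity32(word):
    """XOR of the 32 bits of an instruction word (1 if the popcount is odd)."""
    return bin(word & 0xFFFFFFFF).count("1") & 1


def store(instr):
    """Model of a 33-bit storage cell: the instruction plus its parity bit."""
    return instr, parity32(instr)


def check(instr, stored_parity):
    """True when the word read back is consistent with its parity bit."""
    return parity32(instr) == stored_parity


# addi a0, zero, 10 is encoded as 0x00A00513 in RV32I.
instr, p = store(0x00A00513)
assert check(instr, p)                 # fault-free read
assert not check(instr ^ (1 << 7), p)  # any single bit flip is detected
```

As with any single parity bit, an even number of flips in the same word goes undetected.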
21
Increasing the intrinsic protection
• Alternative: encode the parity into the bit that produces the fewest SDCs.
[Figure: the 32 bits of the RV32G instruction are XORed (parity) and the result is stored in bit b3 of the 32-bit word.]
• The original instruction is recovered by applying the same operation.
• If a bit flip occurs, the error is propagated to the bit where the parity is encoded → lower probability of SDC.
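One way to read this slide is that bit 3 of the stored word is replaced by the XOR of all 32 instruction bits, so no extra storage bit is needed. The sketch below follows that reading; it is an interpretation of the slide rather than the authors' implementation, and the name `fold_parity` is hypothetical.

```python
MASK32 = 0xFFFFFFFF


def fold_parity(word, pos=3):
    """Replace bit `pos` of a 32-bit word by the XOR of all 32 bits.

    Applying the function twice returns the original word, so the same
    operation is used to encode before storing and to decode after reading.
    """
    p = bin(word & MASK32).count("1") & 1
    return (word & MASK32 & ~(1 << pos)) | (p << pos)


instr = 0x00A00513             # addi a0, zero, 10
stored = fold_parity(instr)    # what is kept in memory
assert fold_parity(stored) == instr

# A flip in any other stored bit also flips bit 3 after decoding.
decoded = fold_parity(stored ^ (1 << 20))
assert decoded ^ instr == (1 << 20) | (1 << 3)
```

Under this reading, a single upset in the stored word always reaches the decoder with bit 3 changed, which is the position the slides report as least likely to produce an SDC for RV32G.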
22
Increasing the intrinsic protection
[Figure: SDC rates per ISA when applying the proposed technique, including averages.]
23
Increasing the intrinsic protection
SDC rates as a function of the flipped bit
24
Agenda
• Project and ISA.
• Error sources and error protection.
• Characterizing and protecting the ISA.
• Implementations. Fault Tolerance.
• Characterizing implementations.
• Example of TMR protection.
• Example of module protection: TLB.
• Example of protection at RTL level: Register Set.
25
Implementations
• Open vs. proprietary
• Free vs. paying a license fee (IP).
• IoT, mobile devices, workstations, servers, AI, big data…
• Chisel, Verilog, VHDL …
• Cores, SoC platforms, SoCs.
• ISA variants: RV32I, RV64GC, RV32IMC, RV32EC…
26
Implementations
• And many others. Which one to choose?
• Rocket Chip
• LowRISC
• PULPino
• BOOM
• ORCA
• Ariane
• SCR1
• GAP8
https://github.com/riscv/riscv-cores-list
27
Implementations
• Some of them are Fault Tolerant:
• SHAKTI-F: SEC-DED for memories + DMR for ALU.
• Technolution (master thesis - Delft). RV32I. ECC+TMR
• Other academic proposals.
• Create a new one?
28
Implementations
• Choose an existing one considering:
• ISA extensions implemented by the core.
• License.
• Community / Support.
• Complexity/ learning curve (e.g. Chisel).
• Debugging environment and other available software.
29
Agenda
• Project and ISA.
• Error sources and error protection.
• Characterizing and protecting the ISA.
• Implementations. Fault Tolerance.
• Characterizing implementations.
• Example of TMR protection.
• Example of module protection: TLB.
• Example of protection at RTL level: Register Set.
30
Architectural Vulnerability Factor
• Architectural Vulnerability Factor (AVF):
• Probability that a failure in a specific processor structure affects the final output.
• ACE bits: the processor state bits that may affect the Architecturally Correct Execution of the program.
• AVF for a structure: percentage of time during which ACE bits are stored in the structure.
• Depends on the benchmark being executed.
S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt and T. Austin, "A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor," Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36., San Diego, CA, USA, 2003, pp. 29-40.
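The AVF definition above can be turned into a small calculation: the average number of ACE bits resident in a structure per cycle, divided by the size of the structure. The sketch below assumes a per-cycle trace of ACE-bit counts is already available; the function name `avf` and the trace values are made up for illustration.

```python
def avf(ace_bits_per_cycle, structure_bits):
    """Fraction of the structure's bit-cycles that hold ACE bits."""
    cycles = len(ace_bits_per_cycle)
    if cycles == 0 or structure_bits <= 0:
        raise ValueError("need at least one cycle and a positive structure size")
    return sum(ace_bits_per_cycle) / (structure_bits * cycles)


# Hypothetical 64-bit structure observed for four cycles.
trace = [32, 16, 0, 16]
print(avf(trace, 64))  # 0.25
```

Because the trace depends on which instructions are in flight, the same structure yields a different AVF for each benchmark.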
31
Instruction Vulnerability Factor
• Instruction Vulnerability Factor (IVF):
• Probability that an error in an instruction affects the final result.
• It also depends on the Benchmark.
A. Azarpeyvand, M. E. Salehi, F. Firouzi, A. Yazdanbakhsh and S. M. Fakhraie, "Instruction reliability analysis for embedded processors," 13th IEEE Symposium on Design and Diagnostics of Electronic Circuits and Systems, Vienna, 2010, pp. 20-23.
32
Characterizing implementations
• Which benchmarks? Which input data should be selected?
• Generate the “Golden” copy (without errors). Output, processor state…
• Errors in Hard processors.
• Simulate.
• Prototype using an FPGA and emulate user logic and memory errors.
• Radiate the ASIC.
• Errors in Soft processors.
• Simulate.
• Implement into the FPGA and simulate errors in configuration memory (e.g.
using SEM IP) and user logic and memory.
• Radiate the ASIC.
• Compare results with the “Golden” copy.
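The final comparison step can be sketched as a classifier that maps each faulty run to one of the outcome classes introduced earlier (masked, hard fault, hang, SDC). This is a minimal sketch; the `Run` record and its fields are hypothetical and stand in for whatever the injection framework actually logs.

```python
from dataclasses import dataclass
from enum import Enum


class Outcome(Enum):
    MASKED = "no error (masked)"
    HARD_FAULT = "hard fault / memory access exception"
    HANG = "program hangs"
    SDC = "silent data corruption"


@dataclass
class Run:
    output: bytes            # program output captured from the run
    trapped: bool = False    # the processor raised an exception
    timed_out: bool = False  # the run exceeded its watchdog limit


def classify(run, golden):
    """Compare one fault-injection run against the golden (error-free) run."""
    if run.trapped:
        return Outcome.HARD_FAULT
    if run.timed_out:
        return Outcome.HANG
    if run.output != golden.output:
        return Outcome.SDC
    return Outcome.MASKED
```

Detectable failures are checked first, so a run is labelled as an SDC only when it finished normally and still produced a wrong output.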
33
Characterization. Example.
• LowRISC – RV64G. Version 0.2.
• FPGA: Xilinx Nexys 4 DDR
• Error injection in configuration memory with SEM IP.
• Failure model: Single event upsets.
• Statistical fault injection campaign (99.8% confidence level with a 1.5% error margin).
• Classification of results: correct, hard fault, hang, application output mismatch, architectural state mismatch (output matches).
• Benchmarks: Quicksort, Towers of Hanoi, matrix multiplication, Dijkstra, mergesort and FFT.
A. Ramos, J.A. Maestro, P. Reviriego, "Characterizing a RISC-V SRAM-based FPGA Implementation against Single Event Upsets Using Fault Injection", Microelectronics Reliability, Elsevier, Vol. 78, November 2017, pp. 205-211.
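To give a feel for what a 99.8% confidence level and a 1.5% error margin imply, the sketch below uses the finite-population sample-size formula that is commonly applied to statistical fault injection. The slides do not state which formula the authors used, so this is an assumption; the function name, the population size and the worst-case `p = 0.5` are also illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size(population, confidence=0.998, margin=0.015, p=0.5):
    """Number of injections needed for a given confidence level and margin.

    population: number of possible fault locations (e.g. configuration bits)
    p:          estimated failure proportion; 0.5 is the conservative choice
    """
    # Two-sided critical value of the standard normal distribution.
    t = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    n = population / (1 + margin**2 * (population - 1) / (t**2 * p * (1 - p)))
    return ceil(n)


# Hypothetical population of 10 million configuration bits.
print(sample_size(10_000_000))
```

For large populations the result saturates at roughly ten thousand injections, which is why a statistical campaign is far cheaper than exhaustive injection.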
34
Characterization. Example (cont.)
35
Characterization. Example (cont.)
36
Agenda
• Project and ISA.
• Error sources and error protection.
• Characterizing and protecting the ISA.
• Implementations. Fault Tolerance.
• Characterizing implementations.
• Example of TMR protection.
• Example of module protection: TLB.
• Example of protection at RTL level: Register Set.
37
Selective TMR
• For soft processors (using LowRISC, same settings as in the previous slides).
• Each program uses different resources with different frequencies.
• Reduce the use of resources and the power consumption by using TMR only in the most used resources.
• Create a set of different configurations and reconfigure the FPGA
depending on the subset of programs to be run.
A. Ramos, R. G. Toral, P. Reviriego and J. A. Maestro, "An ALU protection methodology for soft processors on SRAM-based FPGAs," in IEEE Transactions on Computers, 2019 (in press).
38
Selective TMR. Example: ALU
39
Selective TMR. Example: ALU
40
Selective TMR. Example: ALU
41
Agenda
• Project and ISA.
• Error sources and error protection.
• Characterizing and protecting the ISA.
• Implementations. Fault Tolerance.
• Characterizing implementations.
• Example of TMR protection.
• Example of module protection: TLB.
• Example of protection at RTL level: Register Set.
42
Translation Lookaside Buffer
• TLBs based on a CAM (content addressable memory) + RAM
approach.
• Cache for virtual to physical page translation. There might be
several levels of cache.
• Querying and retrieving information from the TLB has to be as
fast as possible.
• Parity is used at some TLB levels. ECC codes have a higher encoding/decoding overhead.
43
First solution: Shortened Hamming
• Shorten the Hamming code matrix so that:
• One of the parity bits only applies to the VPN bits.
• Correction is only executed when an error is detected.
• The other parity bits protect the VPN and PPN together.
• LowRISC:
A. Sánchez-Macián, P. Reviriego and J. A. Maestro, "Combined Modular Key and Data Error Protection for Content-Addressable Memories," in IEEE Transactions on Computers, vol. 66, no. 6, pp. 1085-1090, 1 June 2017.
44
First solution: Shortened Hamming
45
First solution: Shortened Hamming
46
Second solution: MSB for parity
• Parity is stored into the Most Significant Bit (MSB).
• If an SEU occurs, the error is propagated to the MSB, generating
a remote VPN.
• Intrinsic protection increases (against false positives) due to the
spatial locality of the programs.
• Remote VPNs have a lower probability of being accessed before the erroneous entry is evicted.
• If the TLB already has a parity bit, it is also possible to provide protection for double-adjacent errors.
A. Sánchez-Macián, L. A. Aranda, P. Reviriego, V. Kiani and J. A. Maestro, "Enhancing Instruction TLB Resilience to Soft Errors," in IEEE Transactions on Computers, vol. 68, no. 2, pp. 214-224, 1 Feb. 2019.
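The "remote VPN" effect can be illustrated with a small model in which the MSB of each stored tag is replaced by the parity of the VPN, and lookup keys are encoded the same way before the CAM comparison. This is a sketch of one reading of the slide, not the published design; the 27-bit VPN width (as in Sv39) and both function names are assumptions.

```python
VPN_BITS = 27  # assumption: Sv39-style virtual page number width


def fold_parity_msb(value, width=VPN_BITS):
    """Replace the MSB of `value` by the XOR of all `width` bits."""
    mask = (1 << width) - 1
    msb = 1 << (width - 1)
    p = bin(value & mask).count("1") & 1
    return (value & mask & ~msb) | (p * msb)


def aliased_vpn(vpn, flipped_bit, width=VPN_BITS):
    """VPN that falsely hits the entry for `vpn` after one stored bit flips."""
    corrupted_tag = fold_parity_msb(vpn, width) ^ (1 << flipped_bit)
    return fold_parity_msb(corrupted_tag, width)


vpn = 0x0012345
alias = aliased_vpn(vpn, flipped_bit=2)
# The false hit differs from the original VPN in bit 2 and in the MSB.
assert alias ^ vpn == (1 << 2) | (1 << (VPN_BITS - 1))
```

In this model a flip in a low-order tag bit no longer aliases to a neighbouring page: the only VPN that can falsely match is about 2^26 pages away, which a program with good spatial locality is unlikely to touch before the entry is evicted.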
47
Second solution: MSB for parity
48
Second solution: MSB for parity
49
Second solution: MSB for parity
50
Second solution: MSB for parity
51
Agenda
• Project and ISA.
• Error sources and error protection.
• Characterizing and protecting the ISA.
• Implementations. Fault Tolerance.
• Characterizing implementations.
• Example of TMR protection.
• Example of module protection: TLB.
• Example of protection at RTL level: Register Set.
52
Register Transfer Level Protection
• Take advantage of the resources already used by the system to provide protection.
• E.g. LowRISC integer register set in Xilinx FPGAs.
• It is dual-port: operations requiring two operands need both to be read in the same cycle.
A. Ramos, A. Ullah, P. Reviriego and J. A. Maestro, "Efficient Protection of the Register File in Soft-Processors Implemented on Xilinx FPGAs," in IEEE Transactions on Computers, vol. 67, no. 2, pp. 299-304, 1 Feb. 2018
53
Register Transfer Level Protection
• Vivado implements the dual-port memories by duplicating RAM32M primitives.
• One copy is used for the first operand and the other for the second.
54
Register Transfer Level Protection
• Parity is stored next to Block 11.
• It is checked during the read operation.
• If an operand parity does not match, reading is done from the other copy.
• If both operands use the same register, they are both read from the same copy.
• Use both clock edges.
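The read-with-fallback behaviour described on this slide can be modelled at a functional level as follows. This is a behavioural sketch only, assuming two identical copies that each store a parity bit per register; it does not model RAM32M primitives, the same-register case or the use of both clock edges, and all names are hypothetical.

```python
def parity(value):
    """XOR of all bits of a non-negative integer."""
    return bin(value).count("1") & 1


class ProtectedRegisterFile:
    """Two copies of the register set, each entry stored with a parity bit."""

    def __init__(self, registers=32):
        self.copies = [[(0, 0)] * registers for _ in range(2)]

    def write(self, index, value):
        entry = (value, parity(value))
        for copy in self.copies:
            copy[index] = entry

    def _read_one(self, index, preferred):
        value, stored_parity = self.copies[preferred][index]
        if parity(value) != stored_parity:
            # Parity mismatch: fall back to the other copy.
            value, _ = self.copies[1 - preferred][index]
        return value

    def read(self, rs1, rs2):
        """First operand normally comes from copy 0, second from copy 1."""
        return self._read_one(rs1, 0), self._read_one(rs2, 1)

    def inject_flip(self, copy, index, bit):
        """Test helper: flip one data bit in one copy."""
        value, stored_parity = self.copies[copy][index]
        self.copies[copy][index] = (value ^ (1 << bit), stored_parity)
```

A single upset in either copy is detected by the parity check and corrected by reading the duplicate that already exists for the second read port, so the only added storage is the parity bit.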
55
Register Transfer Level Protection
56
• Thank you!
• Questions?