RISC-V fault tolerant processor implementation · 2019-09-25


RISC-V fault tolerant processor implementation

Alfonso Sánchez-Macián

ARIES Research Center


Agenda

• Project and ISA.

• Error sources and error protection.

• Characterizing and protecting the ISA.

• Implementations. Fault Tolerance.

• Characterizing implementations.

• Example of TMR protection.

• Example of module protection: TLB.

• Example of protection at RTL level: Register Set.


RISC-V

• RISC-V is an Instruction Set Architecture (ISA).

• Open source and free.

• Originally developed at UC Berkeley.

• Supported by a Foundation with more than 200 members.

• The Foundation defines the evolution of the specifications and the HW/SW environment.


RISC-V Specifications

• User-level ISA.

• Modular design.

• Privileged ISA.

• Draft.

• Debug.


User-level ISA

• Base: RV32I/RV64I/RV128I. RV32E.

• Extensions:

• M: Integer multiplication and division.

• A: Atomic instructions.

• F: Single-Precision Floating-Point.

• D: Double-Precision Floating-Point.

• Q: Quad-Precision Floating-Point.

• L: Decimal Floating-Point.

• C: Compressed Instructions.

• B: Bit Manipulation.

• J: Dynamically Translated Languages.

• T: Transactional Memory.

• P: Packed-SIMD Instructions.

• V: Vector Operations.

• N: User-Level Interrupts.


User-level ISA

• Each implementation should state which modules are supported:

• Example: RV32IMAC

• RV32G = RV32IMAFD
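The naming convention above can be sketched as a small parser. This is only an illustration: `parse_isa` and the `EXTENSIONS` table are hypothetical names, not part of any RISC-V tool, and the sketch only handles the single-letter extensions listed on the previous slide.

```python
import re

# Single-letter extensions listed on the "User-level ISA" slide.
EXTENSIONS = {
    "M": "Integer multiplication and division",
    "A": "Atomic instructions",
    "F": "Single-precision floating-point",
    "D": "Double-precision floating-point",
    "Q": "Quad-precision floating-point",
    "L": "Decimal floating-point",
    "C": "Compressed instructions",
    "B": "Bit manipulation",
    "J": "Dynamically translated languages",
    "T": "Transactional memory",
    "P": "Packed-SIMD instructions",
    "V": "Vector operations",
    "N": "User-level interrupts",
}

def parse_isa(name):
    """Split a string such as 'RV32IMAC' into (xlen, base, extension letters)."""
    # G is shorthand for the base I plus M, A, F and D.
    expanded = name.upper().replace("G", "IMAFD")
    match = re.fullmatch(r"RV(32|64|128)([IE])([A-Z]*)", expanded)
    if match is None:
        raise ValueError(f"not a recognised ISA string: {name}")
    xlen, base, letters = match.groups()
    unknown = [c for c in letters if c not in EXTENSIONS]
    if unknown:
        raise ValueError(f"unknown extension letters: {unknown}")
    return int(xlen), base, list(letters)

print(parse_isa("RV32IMAC"))
print(parse_isa("RV64GC"))
```

A core advertised as RV64GC therefore expands to the RV64I base plus M, A, F, D and C.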


Error sources

• Processors may suffer from:

• Hard errors (permanent errors)

• Defects during the manufacturing process.

• Processor wear-out.

• Soft errors (temporary errors)

• In memories, the main cause is radiation.

• In logic, there is also:

• Power supply variations → wrong gate behavior, crosstalk.

• Temperature → delay.


Fault tolerance

• Reducing the error probability or the consequences.

• Example: interleaving, scrubbing.

• Error detection.

• Example: parity.

• Error correction.

• Example: error correction codes (ECC).
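The difference between detection and correction can be sketched with a parity bit and a Hamming(7,4) code. The Hamming code is chosen here only because it is the smallest textbook single-error-correcting code; the slides do not say which ECC is meant, and all function names are illustrative.

```python
def parity(word):
    """Even-parity bit: 1 if the word has an odd number of ones."""
    return bin(word).count("1") & 1

def hamming74_encode(nibble):
    """Encode 4 data bits into a 7-bit codeword (positions 1..7)."""
    code = [0] * 8  # index 0 unused so indices match bit positions
    code[3], code[5], code[6], code[7] = [(nibble >> i) & 1 for i in range(4)]
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    return code[1:]

def hamming74_decode(bits):
    """Return (corrected nibble, syndrome); syndrome 0 means no error seen."""
    code = [0] + list(bits)
    syndrome = 0
    for position in range(1, 8):
        if code[position]:
            syndrome ^= position
    if syndrome:
        code[syndrome] ^= 1  # the syndrome is the position of the flipped bit
    data = [code[3], code[5], code[6], code[7]]
    return sum(bit << i for i, bit in enumerate(data)), syndrome

# Parity detects a single bit flip but cannot locate it.
word = 0b1011
print(parity(word), parity(word ^ 0b0100))

# The Hamming code locates and corrects it.
codeword = hamming74_encode(0b1011)
codeword[4] ^= 1  # flip position 5
print(hamming74_decode(codeword))
```

Parity costs one extra bit per word and only flags the error; the Hamming code costs three extra bits per four data bits here and restores the original value.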


Fault tolerance II

• Radiation Hardening By Software:

• Software techniques. For instance: instruction duplication and verification, invariant monitoring.

• Application-based detection. For instance: task duplication.

• Radiation Hardening By Design:

• ISA oriented.

• Hardware oriented:

• HW architecture (for example: TMR, ECCs).

• HW modules: ad-hoc approaches.

• RTL level.

• Radiation Hardening By Process


Fault tolerance III

• Spatial redundancy.

• N-Modular Redundancy (NMR).

• Diverse NMR.

• Reduced Precision Redundancy (RPR).

• Temporal redundancy.

• Information redundancy.
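For N = 3, spatial redundancy is Triple Modular Redundancy (TMR): three copies compute the same value and a voter picks the majority. A minimal sketch of a bitwise voter follows; the function name and the sample values are illustrative only.

```python
def majority_vote(a, b, c):
    """Bitwise 2-out-of-3 vote: each output bit follows at least two inputs."""
    return (a & b) | (a & c) | (b & c)

correct = 0b1100_1010
faulty = correct ^ 0b0001_0000  # one replica suffers a single bit flip

# Whichever replica is hit, the voter output is still the correct value.
print(majority_vote(faulty, correct, correct) == correct)
print(majority_vote(correct, faulty, correct) == correct)
print(majority_vote(correct, correct, faulty) == correct)
```

The voter masks any error confined to one replica; two replicas failing in the same bit position would defeat it, which is one motivation for diverse NMR.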


Failure modes: ASICs vs. FPGA

• ASICs:

• Voltage spike at a node of a circuit (Single Event Transient, SET) or bit flips in stored information (Single Event Upset, SEU).

• Solutions are simulated, or a prototype is created using FPGAs, before producing the actual circuit.

• Reconfigurable FPGAs:

• Configuration memory cross-section: upsets change the circuit and its behavior.

• They can also suffer from errors in the user memory (SEUs) and other resources (Single Event Functional Interrupt, SEFI).

• Other types of errors, such as Multiple Cell/Bit Upsets (MCU/MBU), Adjacent Bit Upsets…


Errors and program execution

• Errors can be classified by their effect on the execution of a program.

• What is the effect of an SEU on an instruction of a program running on a microprocessor?

• No error (error is masked).

• Hard fault / memory access exception.

• Program hangs.

• Silent Data Corruption (SDC).


Characterizing the ISA

• What happens when there is a bit flip in the binary representation of the instruction?

J. A. Martínez, J. A. Maestro and P. Reviriego, "Evaluating the Impact of the Instruction Set on Microprocessor Reliability to Soft Errors," in IEEE Transactions on Device and Materials Reliability, vol. 18, no. 1, pp. 70-79, March 2018.


Characterizing the ISA

• What happens when there is a bit flip in the binary representation of the instruction?

• The instruction turns into a different one (or into an invalid opcode).

• If the effect is a program fault (hard fault, hang), it is possible to detect the error. An SDC is the worst outcome.

• Example: changing the instruction into a Load or Store may produce a memory access exception.

• Instruction encoding differs among ISAs. It is usually optimized for its HW implementation.
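To make the bit-flip effect concrete, the sketch below flips each bit of one RV32I instruction and reports which major opcode the result decodes to. The opcode table is a subset of the RV32I base opcodes, `flip_outcomes` is a hypothetical helper, and the sketch only looks at the 7-bit opcode field, not at whether the full instruction is valid.

```python
# Subset of the RV32I major opcodes (bits 6..0 of the instruction).
MAJOR_OPCODES = {
    0b0000011: "LOAD",
    0b0100011: "STORE",
    0b0010011: "OP-IMM",
    0b0110011: "OP",
    0b1100011: "BRANCH",
    0b1101111: "JAL",
    0b1100111: "JALR",
    0b0110111: "LUI",
    0b0010111: "AUIPC",
    0b1110011: "SYSTEM",
    0b0001111: "MISC-MEM",
}

def major_opcode(instruction):
    return MAJOR_OPCODES.get(instruction & 0x7F, "not in table")

def flip_outcomes(instruction):
    """Map each flipped bit position to the major opcode of the result."""
    return {bit: major_opcode(instruction ^ (1 << bit)) for bit in range(32)}

ADDI_X1_X0_5 = 0x00500093  # addi x1, x0, 5

for bit, name in flip_outcomes(ADDI_X1_X0_5).items():
    if bit < 7:
        print(f"bit {bit}: {major_opcode(ADDI_X1_X0_5)} -> {name}")
```

Flipping bit 4 turns this `addi` into a load, the case mentioned in the example above, while flips in bits 7 to 31 keep the opcode and silently change a register index or the immediate.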


Intrinsic protection: example with ARM Cortex M0 (figure not reproduced in this transcript)


Characterizing the ISA

• Are there bit positions that are less likely to produce an SDC when a bit flip occurs?

• Analysis of SDC rates as a function of the flipped bit.

• RV32G: bit 3 is the one that produces the fewest SDCs.

SDC rates as a function of the flipped bit


Protecting the ISA

• Parity can be added to detect the error.

• But it requires adding a bit to all the structures that store instructions.

Diagram: 32 bits (RV32G) → XOR (parity) → 1 parity bit + 32 bits (RV32G)

J. A. Martínez, J. A. Maestro and P. Reviriego, "A Scheme to Improve the Intrinsic Error Detection of the Instruction Set Architecture," in IEEE Computer Architecture Letters, vol. 16, no. 2, pp. 103-106, 1 July-Dec. 2017.


Increasing the intrinsic protection

• Alternative: encode the parity into the bit that produces the fewest SDCs.

Diagram: 32 bits (RV32G) → XOR (parity) → 32 bits (RV32G) with the parity stored in b3

• The original instruction is recovered by applying the same operation.

• If a bit flip occurs, the error is propagated to the bit where the parity is encoded, which lowers the probability of an SDC.
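One way to read this scheme is sketched below: bit 3 of the stored word is replaced by the XOR of all 32 instruction bits, and running the same function on the stored word restores the original bit 3. The function names are illustrative and the sketch is a reading of the slide, not the authors' implementation.

```python
PARITY_BIT = 3  # bit reported on the slides as producing the fewest SDCs in RV32G

def parity32(word):
    return bin(word & 0xFFFFFFFF).count("1") & 1

def fold_parity(word):
    """Replace bit 3 by the parity of the whole word (encode and decode are the same)."""
    cleared = word & ~(1 << PARITY_BIT) & 0xFFFFFFFF
    return cleared | (parity32(word) << PARITY_BIT)

instruction = 0x00500093          # addi x1, x0, 5
stored = fold_parity(instruction)

# Without errors, applying the operation again recovers the instruction.
print(hex(fold_parity(stored)))

# A single bit flip in storage always shows up as a flip of bit 3 after decoding.
for bit in (0, 3, 12, 31):
    decoded = fold_parity(stored ^ (1 << bit))
    print(bit, hex(decoded ^ instruction))
```

A flip in any bit other than bit 3 comes out as that flip plus a flip of bit 3; a flip of bit 3 itself comes out as a flip of bit 3 only. Either way the decoded instruction differs from the original in the bit where RV32G is reported to be least SDC-prone.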


Increasing the intrinsic protection

Figure: SDC rates per ISA (with averages) when applying the proposed technique


Increasing the intrinsic protection

SDC rates as a function of the flipped bit


Implementations

• Open vs. proprietary

• Free vs. paying a license fee (IP).

• IoT, mobile devices, workstations, servers, AI, big data…

• Chisel, Verilog, VHDL …

• Cores, SoC platforms, SoCs.

• ISA variants: RV32I, RV64GC, RV32IMC, RV32EC…


Implementations

• Rocket Chip, LowRISC, PULPino, BOOM, ORCA, Ariane, SCR1, GAP8, and many others. Which one to choose?

https://github.com/riscv/riscv-cores-list


Implementations

• Some of them are Fault Tolerant:

• SHAKTI-F: SEC-DED for memories + DMR for ALU.

• Technolution (master thesis - Delft). RV32I. ECC+TMR

• Other academic proposals.

• Create a new one?


Implementations

• Choose an existing one considering:

• ISA extensions implemented by the core.

• License.

• Community / support.

• Complexity / learning curve (e.g. Chisel).

• Debugging environment and other available software.


Architectural Vulnerability Factor

• Architectural Vulnerability Factor (AVF):

• Probability that a fault in a specific processor structure affects the final output.

• ACE bits: the processor state bits that may affect the Architecturally Correct Execution of the program.

• AVF for a structure: average fraction of the structure's bits that hold ACE bits over the execution time.

• Depends on the benchmark being executed.

S. S. Mukherjee, C. Weaver, J. Emer, S. K. Reinhardt and T. Austin, "A systematic methodology to compute the architectural vulnerability factors for a high-performance microprocessor," Proceedings. 36th Annual IEEE/ACM International Symposium on Microarchitecture, 2003. MICRO-36., San Diego, CA, USA, 2003, pp. 29-40.
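As a worked example of the AVF definition, the sketch below divides the ACE bit-cycles of a structure by its total bit-cycles. The per-cycle ACE counts are made-up numbers; in practice they come from an ACE analysis of a specific benchmark run.

```python
def avf(ace_bits_per_cycle, structure_bits):
    """AVF = (sum of ACE bits resident in each cycle) / (structure bits * cycles)."""
    cycles = len(ace_bits_per_cycle)
    return sum(ace_bits_per_cycle) / (structure_bits * cycles)

# Hypothetical 64-bit structure observed for 4 cycles.
print(avf([64, 32, 0, 32], structure_bits=64))
```

Here the structure holds ACE bits in half of its bit-cycles, so a fault landing in it at a random time and position is expected to matter about half of the time.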


Instruction Vulnerability Factor

• Instruction Vulnerability Factor (IVF):

• Probability that an error in an instruction affects the final result.

• It also depends on the benchmark.

A. Azarpeyvand, M. E. Salehi, F. Firouzi, A. Yazdanbakhsh and S. M. Fakhraie, "Instruction reliability analysis for embedded processors," 13th IEEE Symposium on Design and Diagnostics of Electronic Circuits and Systems, Vienna, 2010, pp. 20-23.


Characterizing implementations

• Which benchmarks? Which input data should be selected?

• Generate the “Golden” copy (without errors): output, processor state…

• Errors in hard processors.

• Simulate.

• Prototype using an FPGA and emulate user logic and memory errors.

• Irradiate the ASIC.

• Errors in soft processors.

• Simulate.

• Implement on the FPGA and emulate errors in configuration memory (e.g. using SEM IP), user logic and memory.

• Irradiate the FPGA.

• Compare results with the “Golden” copy.
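The golden-copy flow can be sketched in software with a toy program and a toy memory image. Everything here is illustrative: the memory layout, the values and the step budget are made up, an `IndexError` stands in for a memory access exception, and a real campaign would inject into a simulator or into the FPGA instead.

```python
from collections import Counter

# Toy memory image (made-up values):
# word 0 = element count, word 1 = base pointer, word 2 = unused, words 3..6 = data
GOLDEN_MEM = [4, 3, 0, 10, 20, 30, 40]
MAX_STEPS = 1000  # step budget used to detect hangs

def run(mem):
    """Toy program: sum `count` elements read through the base pointer."""
    count, base = mem[0], mem[1]
    total, i = 0, 0
    while i != count:
        if i >= MAX_STEPS:
            raise TimeoutError("step budget exceeded")
        total += mem[base + (i % 4)]  # IndexError models a memory access exception
        i += 1
    return total

GOLDEN_OUTPUT = run(GOLDEN_MEM)  # the error-free reference result

def classify(word, bit):
    """Flip one bit of one word, rerun, and compare with the golden output."""
    mem = list(GOLDEN_MEM)
    mem[word] ^= 1 << bit
    try:
        output = run(mem)
    except IndexError:
        return "hard fault"
    except TimeoutError:
        return "hang"
    return "masked" if output == GOLDEN_OUTPUT else "SDC"

# Exhaustive campaign over every bit of every 32-bit word.
campaign = Counter(classify(w, b) for w in range(len(GOLDEN_MEM)) for b in range(32))
print(campaign)
```

Even this toy shows the four outcomes from the "Errors and program execution" slide: flips in the unused word are masked, flips in the pointer tend to fault, large counts hang, and flips in the data corrupt the result silently.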


Characterization. Example.

• LowRISC – RV64G. Version 0.2.

• FPGA board: Digilent Nexys 4 DDR (Xilinx Artix-7 FPGA).

• Error injection in configuration memory with SEM IP.

• Failure model: single event upsets.

• Statistical fault injection campaign (99.8% confidence level with a 1.5% error margin).

• Classification of results: correct, hard fault, hang, application output mismatch, architectural state mismatch (output matches).

• Benchmarks: Quicksort, Towers of Hanoi, matrix multiplication, Dijkstra, mergesort and FFT.

A. Ramos, J.A. Maestro, P. Reviriego, "Characterizing a RISC-V SRAM-based FPGA Implementation against Single Event Upsets Using Fault Injection", Microelectronics Reliability, Elsevier, Vol. 78, November 2017, pp. 205-211.
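The number of injections implied by a confidence level and an error margin can be estimated with the usual finite-population sample-size formula. Whether the campaign above used exactly this formula is an assumption, and the population size below (number of configuration bits) is a made-up placeholder.

```python
import math
from statistics import NormalDist

def sample_size(population, confidence, margin, p=0.5):
    """Injections needed for the given confidence level and error margin.

    p = 0.5 is the worst-case (most conservative) failure probability.
    """
    t = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided critical value
    n = population / (1 + margin ** 2 * (population - 1) / (t ** 2 * p * (1 - p)))
    return math.ceil(n)

# Hypothetical device with 10 million configuration bits.
print(sample_size(population=10_000_000, confidence=0.998, margin=0.015))
```

With these settings the result is on the order of ten thousand injections per benchmark, far fewer than an exhaustive campaign over every configuration bit.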


Selective TMR

• For soft processors (using LowRISC, same settings as in the previous slides).

• Each program uses different resources with different frequency.

• Reduce the use of resources and the power consumption by using TMR only in the most used resources.

• Create a set of different configurations and reconfigure the FPGA depending on the subset of programs to be run.

A. Ramos, R. G. Toral, P. Reviriego and J. A. Maestro, "An ALU protection methodology for soft processors on SRAM-based FPGAs," in IEEE Transactions on Computers, 2019 (in press).
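The selection step can be sketched as follows: profile how often each program uses each resource and triplicate only those above a usage threshold. The profiles, the resource names and the 20% threshold are all invented for illustration; the cited paper defines the actual methodology.

```python
def select_protected(usage, threshold):
    """Return the resources whose share of total usage reaches the threshold."""
    total = sum(usage.values())
    return frozenset(unit for unit, count in usage.items() if count / total >= threshold)

# Hypothetical per-program usage counts of ALU operations.
PROFILES = {
    "quicksort": {"add": 500, "cmp": 400, "shift": 95, "mul": 5},
    "matmul": {"add": 450, "mul": 450, "cmp": 50, "shift": 50},
}

# One FPGA configuration per distinct set of protected resources.
configurations = {
    program: select_protected(usage, threshold=0.2)
    for program, usage in PROFILES.items()
}
for program, protected in configurations.items():
    print(program, sorted(protected))
```

Programs that map to the same set of protected resources can share a configuration, so the FPGA only needs to be reconfigured when the workload moves to a different set.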


Selective TMR. Example: ALU


Translation Lookaside Buffer

• TLBs are based on a CAM (content-addressable memory) + RAM approach.

• Cache for virtual-to-physical page translation. There might be several levels of cache.

• Querying and retrieving information from the TLB has to be as fast as possible.

• Parity is used at some TLB levels. ECCs have a higher encoding/decoding overhead.


First solution: Shortened Hamming

• Shortening the Hamming code matrix so that:

• One of the parity bits only applies to the VPN bits.

• Correction is only executed when an error is detected.

• The other parity bits protect the VPN and PPN together.

• LowRISC:

A. Sánchez-Macián, P. Reviriego and J. A. Maestro, "Combined Modular Key and Data Error Protection for Content-Addressable Memories," in IEEE Transactions on Computers, vol. 66, no. 6, pp. 1085-1090, 1 June 2017.


Second solution: MSB for parity

• Parity is stored in the Most Significant Bit (MSB).

• If an SEU occurs, the error is propagated to the MSB, generating a remote VPN.

• Intrinsic protection increases (against false positives) due to the spatial locality of programs.

• Remote VPNs are less likely to be accessed before the entry in error is evicted.

• If the TLB already has a parity bit, it is also possible to provide protection against double-adjacent errors.

A. Sánchez-Macián, L. A. Aranda, P. Reviriego, V. Kiani and J. A. Maestro, "Enhancing Instruction TLB Resilience to Soft Errors," in IEEE Transactions on Computers, vol. 68, no. 2, pp. 214-224, 1 Feb. 2019.
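A possible reading of the MSB scheme, mirroring the bit-3 technique used for instructions, is sketched below: the stored tag carries the parity of the VPN in its MSB, and lookups compare against the folded query. The 20-bit VPN width and the function names are assumptions for illustration; the cited paper describes the actual design.

```python
VPN_BITS = 20       # illustrative width, not taken from the slides
MSB = VPN_BITS - 1

def parity(value):
    return bin(value).count("1") & 1

def fold(vpn):
    """Replace the MSB by the parity of the whole VPN (self-inverse)."""
    return (vpn & ~(1 << MSB)) | (parity(vpn) << MSB)

def matching_vpn(stored_tag):
    """VPN whose lookup would hit a given (possibly corrupted) stored tag."""
    return fold(stored_tag)

vpn = 0x12345
tag = fold(vpn)

# After any single bit flip in the stored tag, the entry answers for a VPN
# that differs from the original in the MSB, that is, a remote page.
for bit in range(VPN_BITS):
    wrong_vpn = matching_vpn(tag ^ (1 << bit))
    print(bit, hex(wrong_vpn), abs(wrong_vpn - vpn))
```

Because the wrongly matching VPN is at least a quarter of the address range away, a program with good spatial locality is unlikely to request it before the corrupted entry is evicted.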


Register Transfer Level Protection

• Take advantage of the resources already used by the system to provide protection.

• E.g. the LowRISC integer register set in Xilinx FPGAs.

• It is dual-port: operations requiring two operands need both to be read in the same cycle.

A. Ramos, A. Ullah, P. Reviriego and J. A. Maestro, "Efficient Protection of the Register File in Soft-Processors Implemented on Xilinx FPGAs," in IEEE Transactions on Computers, vol. 67, no. 2, pp. 299-304, 1 Feb. 2018


Register Transfer Level Protection

• Vivado implements the dual port memories duplicating RAM32M primitives.

• One is used for the first operand and the other one for the second.


Register Transfer Level Protection

• Parity is stored next to Block 11.

• It is checked during the read operation.

• If an operand parity does not match, reading is done from the other copy.

• If both operands use the same register, they are both read from the same copy.

• Use both clock edges.
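The read path described above can be modelled behaviourally: the register file already exists as two copies, so a parity bit per entry is enough to decide which copy to trust. This is a software sketch with invented class and method names; it ignores timing (the use of both clock edges) and the same-register case.

```python
def parity(value):
    return bin(value).count("1") & 1

class ProtectedRegisterFile:
    """Two copies of the registers, each entry stored with its parity bit."""

    def __init__(self, registers=32):
        self.copies = [[(0, 0)] * registers, [(0, 0)] * registers]

    def write(self, reg, value):
        entry = (value, parity(value))
        for copy in self.copies:
            copy[reg] = entry

    def _read_one(self, reg, preferred):
        value, stored_parity = self.copies[preferred][reg]
        if parity(value) != stored_parity:
            # Parity mismatch: fall back to the other copy.
            value, _ = self.copies[1 - preferred][reg]
        return value

    def read(self, rs1, rs2):
        """First operand normally comes from copy 0, second from copy 1."""
        return self._read_one(rs1, 0), self._read_one(rs2, 1)

    def inject(self, copy, reg, bit):
        """Flip one data bit in one copy, leaving the stored parity untouched."""
        value, stored_parity = self.copies[copy][reg]
        self.copies[copy][reg] = (value ^ (1 << bit), stored_parity)

rf = ProtectedRegisterFile()
rf.write(5, 0xDEADBEEF)
rf.write(6, 42)
rf.inject(copy=0, reg=5, bit=7)  # SEU in the copy used for the first operand
print([hex(v) for v in rf.read(5, 6)])
```

The only storage added is the parity bit per entry; correction comes from the duplicate that the dual-port implementation already provides.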


• Thank you!

• Questions?
