

Post on 16-Apr-2020


Use QED check:
Ra – original register
Ra’ – corresponding duplicated register
Ra ≠ Ra’ – ERROR DETECTED

QED: Effective Post-Silicon Validation and Debug
Eshan Singh, David Lin, PI: Subhasish Mitra, Robust Systems Group, Stanford University

Quick Error Detection (QED) Highlights

Structured and Effective
10^9X quicker detection, 4X coverage

Automatically localize logic bugs
No failure reproduction, no simulation

Broadly applicable
Cores, uncore, power management, logic & electrical, accelerators

Post-Silicon Validation Critical

[Figure: post-silicon bug count by year. Source: Intel]
Pre-silicon verification inadequate
“Post-silicon cost & complexity rising faster than design cost” – S. Yerramilli, V.P., Intel
Design → Pre-silicon Verification → Post-silicon Validation → High Volume Fab

Localization Dominates Cost

Run tests (OS, games) → Detect bugs → Localize bugs → Root-cause & fix
Debug time: 1-4 weeks per bug

Long Error Detection Latency: Localization Challenge

[Timeline: test execution, error occurred, error detected]
Error detection latency: time from when the error occurred to when the error is detected
Ideal: ~1,000 cycles
Reality: ~billions of cycles

Intel® 48-Core SCC

Symbolic QED Results

Fast QED using Hardware Support

QED

Original Tests (Test 1, Test 2, ..., Test N) → QED → QED family Tests (QED Test 1, QED Test 2, ..., QED Test N)
Wide variety, Diversity
Systematic, Automated
Error detection latency: guaranteed short
Coverage: improved
Software & hardware approaches

[Figure: detected error count (normalized to QED) vs. error detection latency (clock cycles; bins 0-10K and 1-10 Billion), QED vs. No-QED. Annotations: 10^6X, 4X.]

Software-only QED: no hardware modifications; bugs inside processor cores, bugs inside uncore components, bugs from power-management features

Hybrid QED: non-programmable accelerators; logic bugs and electrical bugs

Symbolic QED: automatically localize logic bugs; no additional hardware

Fast QED: 0.4% area overhead, very low runtimes

Fully automated logic bug localization using Bounded Model Checking (BMC)

No trace buffers → No area overhead

Effective for large SoCs

No failure reproduction, no simulation

Collaborator: Prof. Clark Barrett (NYU)

Traditional debug: weeks to months; long bug traces
Automatic S-QED (Symbolic QED): 20 mins. to 7 hours; 3- to 22-cycle bug traces

QED Transformation Examples

[Figure: <PLC mem[1..N]> operations on Core 1, Core 2, ..., Core N]

A’=A B’=B C’=C
A = B * 2
A’= B’* 2
Check(A==A’)

D’=D E’=E F’=F G’=G H’=H
E = F * G
E’= F’* G’
Check(E==E’)
H = D + E
H’= D’+ E’
Check(H==H’)

E’=E I’=I J’=J K’=K
I = E / 2
I’= E’/ 2
Check(I==I’)
Load J ← mem[7]
Load J’← mem[7’]
Check(J==J’)
K = J + 1
K’= J’+ 1
Check(K==K’)

Lock(1,1’)
Store mem[1] ← C
Store mem[1’] ← C’
Unlock(1,1’)

Lock(5,5’)
Store mem[5] ← H
Store mem[5’] ← H’
Unlock(5,5’)
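The register examples above follow one pattern: run each operation on the original registers, repeat it on the duplicated registers, then compare the two results. A minimal Python sketch of that pattern follows. The tuple instruction format and the names `qed_transform`, `run`, and `fault_at` are hypothetical, chosen only for illustration; this is not the actual QED tooling, and the injected bit flip is just a stand-in for a bug.

```python
import operator

OPS = {"*": operator.mul, "+": operator.add}

def shadow(name):
    """Name of the duplicated (primed) register for an original register."""
    return name + "'"

def _dup(src):
    """Map a source operand to its duplicate; integer constants are unchanged."""
    return shadow(src) if isinstance(src, str) else src

def qed_transform(program):
    """After each (dest, op, src1, src2), add the duplicated op and a check."""
    out = []
    for dest, op, a, b in program:
        out.append(("exec", dest, op, a, b))
        out.append(("exec", shadow(dest), op, _dup(a), _dup(b)))
        out.append(("check", dest, shadow(dest)))
    return out

def run(transformed, regs, fault_at=None):
    """Execute the transformed program.

    Returns the index of the first failing check, or None if all checks pass.
    fault_at (assumption for illustration) is the index of one 'exec' step
    whose result gets its lowest bit flipped.
    """
    regs = dict(regs)
    for name in list(regs):          # A'=A, B'=B, ... initialization
        regs[shadow(name)] = regs[name]

    def val(src):
        return regs[src] if isinstance(src, str) else src

    for i, step in enumerate(transformed):
        if step[0] == "exec":
            _, dest, op, a, b = step
            result = OPS[op](val(a), val(b))
            if i == fault_at:
                result ^= 1          # injected error
            regs[dest] = result
        else:
            _, r, r_dup = step
            if regs[r] != regs[r_dup]:
                return i
    return None

program = [("E", "*", "F", "G"), ("H", "+", "D", "E")]
transformed = qed_transform(program)
initial = {"D": 1, "F": 2, "G": 3}
print(run(transformed, initial))               # no fault injected
print(run(transformed, initial, fault_at=0))   # corrupt E = F * G
```

In this sketch a corrupted result is caught by the check that immediately follows the duplicated instruction, which is the intuition behind the short error detection latency.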

ALL Cores, ALL Threads:

<PLC mem[1..N]>
for ALL i,i’
    Lock(i)
    Lock(i’)
    Load X ← mem[i]
    Load X’← mem[i’]
    Check(X == X’)
    Unlock(i’)
    Unlock(i)
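The PLC loop above can be sketched in Python as follows. This is only an illustration under stated assumptions: the duplicated locations are modeled as a separate list `dup_mem`, there is one `threading.Lock` per location, and the function name `plc_pass` is hypothetical.

```python
import threading

def plc_pass(mem, dup_mem, locks, dup_locks):
    """One PLC pass: load each location and its duplicate, then compare.

    Returns the list of addresses whose original and duplicate differ.
    """
    mismatches = []
    for i in range(len(mem)):
        # Lock(i), Lock(i'); the nested 'with' releases i' first, then i,
        # matching the Unlock(i'), Unlock(i) order in the pseudocode.
        with locks[i]:
            with dup_locks[i]:
                x = mem[i]           # Load X  <- mem[i]
                x_dup = dup_mem[i]   # Load X' <- mem[i']
                if x != x_dup:       # Check(X == X')
                    mismatches.append(i)
    return mismatches

mem = [10, 20, 30]
dup_mem = list(mem)
locks = [threading.Lock() for _ in mem]
dup_locks = [threading.Lock() for _ in mem]

print(plc_pass(mem, dup_mem, locks, dup_locks))   # consistent memory
dup_mem[1] = 21                                   # model a corrupted duplicate
print(plc_pass(mem, dup_mem, locks, dup_locks))
```

Holding both locks while loading keeps a concurrent store pair (as in the Lock/Store/Unlock examples) from being observed half-written, which would otherwise look like a mismatch.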

IEEE TCAD comments (QED paper)

“All reviewers agree this will be a classic paper for years to come.”

“I will personally pay for page charges if you promise to thank me (anonymously) when you win a major award for this paper!”

Intel (Nagib Hakim, PE)

“QED is revolutionary... Intel is in the process of implementing a prototype of QED. This would enable a whole slew of applications.”

AMD (Jeff Rearick, Senior Fellow)

QED: “magical thinking needed” in ETS keynote.

Freescale (Sharad Kumar, Manager)

“QED is one such promising technique that we have evaluated and are adopting in our tools flow for multi-core debug.”

Proactive Load and Check

Control Flow Tracking Using Software Signatures (CFCSS-V)

CFCSS-V check inserted before Block 5:

if (last_signature == #3) or (last_signature == #4):
    last_signature = #5
else:
    ERROR_DETECTED!
<Block 5>

[Figure: control-flow graph of Block 1 to Block 5, each with a CFCSS-V check; the CFCSS-V check at Block 5 reports ERROR!]
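The signature check above generalizes to every block: each block verifies that the last recorded signature belongs to one of its legal predecessors, then records its own. A minimal Python sketch follows. Only the rule for Block 5 (predecessors 3 and 4) comes from the poster; the remaining edges in `ALLOWED_PREDECESSORS` and the names `check_trace` and `ControlFlowError` are assumptions made for illustration.

```python
class ControlFlowError(Exception):
    """Raised when a block is entered from an illegal predecessor."""

# Block id -> set of legal predecessor signatures (None = program entry).
# Only the entry for block 5 is taken from the poster; the rest is hypothetical.
ALLOWED_PREDECESSORS = {
    1: {None},
    2: {1},
    3: {1},
    4: {2},
    5: {3, 4},
}

def check_trace(trace, allowed=ALLOWED_PREDECESSORS):
    """Replay a sequence of executed block ids with signature checks.

    Returns the final signature, or raises ControlFlowError at the first
    block whose predecessor signature is not allowed.
    """
    last_signature = None
    for block in trace:
        if last_signature not in allowed[block]:
            raise ControlFlowError(
                f"block {block} entered from signature {last_signature}"
            )
        last_signature = block
    return last_signature

print(check_trace([1, 2, 4, 5]))      # legal path
try:
    check_trace([1, 2, 5])            # illegal jump from block 2 to block 5
except ControlFlowError as err:
    print("ERROR_DETECTED:", err)
```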

Freescale SoC Logic Bug

Error detection latency (cycles): Original 15 Billion; QED 9

[Figure: SoC block diagram with Core 0, Core 1, Core 2, Core 3, ..., Core N, interconnection network, shared caches, memory controllers, accelerators, other uncore components; Random Instruction Test Generator]

[Figure: cumulative memory bugs detected (0% to 100%) vs. error detection latency (cycles; 100 to 10 Billion), QED vs. original test. Annotation: 10^6X improved.]

8-Core Industrial Test

[Figure: cumulative bugs detected (0% to 100%) vs. error detection latency (clock cycles; 100 to >100M). QED median, max. EDL: 392, 3k. Original test median, max. EDL: 10M, 100M. Annotations: 10^4X, 2X.]

Power Management Bugs

[Figure: error detection latency (cycles; 0 to 20k) and area cost (0.05% to 0.4%) vs. PLC-H checkers count (0 to 140). 0.05% - 0.4% area impact.]

Fast QED

10^5X quicker detection

2X coverage

No intrusiveness

Runtime: 1.04X – 6X

MBIST reuse

Core, uncore, power management bugs

Uncore Bugs

[Figure: 48 processor cores at 0.9V, 800 MHz. Legend: No boot, Pass, QED unique detect, QED enhanced detect, QED quick detect.]

[Figure: cumulative bugs detected (0% to 100%) vs. error detection latency (cycles; 100 to 10M). Original median, max. EDL: 241k, 10M. QED median, max. EDL: 675, 8k. Annotations: 10^4X, 2X.]

Difficult Logic Bugs

QED Techniques

Hybrid QED

[Figure: coverage (percentage, 0% to 100%) vs. error detection latency (cycles; 1 to 10M). Hybrid QED: mean EDL = 705 cycles. Original: mean EDL = 124k cycles. Annotation: 10^2X improved.]

Accelerator validation and debug

Using high-level synthesis

Collaborator: Prof. Deming Chen (UIUC)

[Figure: cumulative bugs detected (0% to 100%) vs. bug trace length (cycles; 0 to >10M). Original min., mean, max.: 722, 1.9M, 11M. Symbolic QED min., mean, max.: 13, 20, 29. Annotations: 10^6X, 2X.]

Inputs to the BMC tool:
1. “Universal” property: QED check + initial state
2. Partial instances + QED modules
Output: logic bugs localized, automatically, overnight

1. “Universal” Property: QED Check
What property should the BMC tool check?
CMP Ra == Ra’
QED checks are compositional
Not design/implementation specific
Preserved across partial instances
Unlike traditional properties

2. Partial Instantiation
How to ensure the design fits in the BMC tool?
Systematically instantiate only the modules needed to activate the bug
BMC tool finds a bug trace

[Figure: full design with Core 0 to Core 7, L2 Bank 0 to L2 Bank 7, Memory controller 0 to Memory controller 3, I/O controllers, crossbar interconnect]

Reduce instances, keep at least 1 core, run each.

Partial instances (in the order shown):
Core 0, L2 Bank 0, crossbar interconnect
Core 0, L2 Bank 0, Memory controller 0, crossbar interconnect
Core 0, Core 1, L2 Bank 0, Memory controller 0, crossbar interconnect

Results (in the order shown): No Trace Found, Trace Found, Trace Found

Best Localization
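One way to picture the "reduce instances, keep at least 1 core, run each" loop is a greedy reduction: drop one module at a time and keep the drop whenever a bug trace is still found. The Python sketch below is only an illustration of that idea, not the poster's actual procedure. `reduce_instances` is a hypothetical name, and `finds_trace` is a stand-in for a real BMC run on a partial instance; here it simply pretends the bug needs four particular modules.

```python
def reduce_instances(modules, finds_trace):
    """Greedily shrink a design while the (stand-in) BMC run still finds a trace.

    modules: list of module names; names starting with "Core" count as cores.
    finds_trace: callable taking a frozenset of module names, returning bool.
    Returns the reduced module list, or None if the full design has no trace.
    """
    current = list(modules)
    if not finds_trace(frozenset(current)):
        return None
    changed = True
    while changed:
        changed = False
        for m in list(current):
            candidate = [x for x in current if x != m]
            has_core = any(x.startswith("Core") for x in candidate)
            # Keep at least 1 core; accept the removal only if a trace remains.
            if has_core and finds_trace(frozenset(candidate)):
                current = candidate
                changed = True
                break
    return current

# Hypothetical bug: activating it requires these four modules.
NEEDED = {"Core 0", "L2 Bank 0", "Memory controller 0", "Crossbar interconnect"}

def fake_bmc(instance):
    """Stand-in for a BMC run: a 'trace' exists iff all needed modules are present."""
    return NEEDED <= instance

design = ["Core 0", "Core 1", "L2 Bank 0",
          "Memory controller 0", "Crossbar interconnect"]
print(reduce_instances(design, fake_bmc))
```

The reduction is sound only because the QED check is preserved across partial instances, as stated above; a design-specific property might not make sense once modules are removed.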