Lifetime Reliability-Aware Task Allocation and Scheduling for MPSoC Platforms Lin Huang, Feng Yuan and Qiang Xu Reliable Computing Laboratory Department of Computer Science & Engineering The Chinese University of Hong Kong DATE’09

Post on 20-Jan-2016


Page 1

Lifetime Reliability-Aware Task Allocation and Scheduling for MPSoC Platforms

Lin Huang, Feng Yuan and Qiang Xu
Reliable Computing Laboratory
Department of Computer Science & Engineering
The Chinese University of Hong Kong

DATE’09

Page 2

Lifetime Reliability of Embedded Multiprocessor Platform

Multiprocessor system-on-a-chip (MPSoC)
• Platform-based design
• Hardware / software co-synthesis

Reliability issue
• IC product wear-out: lifetime reliability threats
• Time dependent dielectric breakdown (TDDB), electromigration (EM), stress migration (SM), negative bias temperature instability (NBTI)
• Soft errors

Page 3

Prior Work

Prior work in reliability-driven task allocation and scheduling
• Assumes a constant failure rate

Limitation of thermal-aware task scheduling
• Might improve the system’s lifetime reliability implicitly
• Not readily applicable, especially for heterogeneous MPSoC

Page 4

Problem Motivation Example

Electromigration

Suppose […] and all other parameters are the same

P1 ages much faster than P2, dominating the MPSoC lifetime

[Figure: MPSoC platform with processors P1 and P2]
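The condition compared on this slide did not survive extraction. As a rough illustration of how one differing parameter can make P1 age much faster than P2, the sketch below uses Black's equation for electromigration and assumes that the two processors differ only in operating temperature. The exponent `n`, the activation energy `ea_ev`, and both temperatures are placeholder values chosen for illustration, not figures from the slides.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def em_mttf_relative(temp_kelvin, current_density=1.0, n=2.0, ea_ev=0.9):
    """Relative electromigration MTTF from Black's equation (arbitrary units).

    n and ea_ev are illustrative placeholders, not values from the slides.
    """
    return current_density ** (-n) * math.exp(ea_ev / (BOLTZMANN_EV * temp_kelvin))

# Two processors that differ only in temperature (hypothetical operating points).
mttf_p1 = em_mttf_relative(358.15)  # P1 at 85 degrees Celsius
mttf_p2 = em_mttf_relative(338.15)  # P2 at 65 degrees Celsius
ratio = mttf_p2 / mttf_p1
print(f"P2 is expected to outlast P1 by a factor of {ratio:.2f}")
```

Because the temperature sits inside an exponential, a gap of a few tens of kelvin already changes the expected lifetime by a large factor, which is why the hotter processor tends to dominate the system lifetime.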

Page 5

Problem Formulation

Task allocation and scheduling

Aim: to maximize the expected service life (mean time to failure, MTTF) of the MPSoC system under the performance constraint

[Figure: a task graph (T0–T4) and an MPSoC platform (P1, P2) go through binding and scheduling; the output is a periodical schedule of T0–T4 on P1 and P2]

Page 6

Lifetime Reliability Estimation

Electromigration

Denote by R(t) the reliability of a single processor at time t

Expected service life

Weibull distribution
• Computed by existing hard error models
• Reflect some important factors (e.g., architecture properties)

[Figure: temperature variation example]
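The formulas on this slide were lost in extraction. As a reminder of the standard definitions the slide titles point to (not necessarily the exact expressions the authors showed), the expected service life is the integral of the reliability function, and a two-parameter Weibull distribution gives it in closed form. The symbols α (scale) and β (slope) are assumed names.

```latex
\mathrm{MTTF} = \int_0^{\infty} R(t)\,dt,
\qquad
R(t) = \exp\!\left[-\left(\frac{t}{\alpha}\right)^{\beta}\right]
\;\Longrightarrow\;
\mathrm{MTTF} = \alpha\,\Gamma\!\left(1 + \frac{1}{\beta}\right)
```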

Page 7

Main Approach – Simulated Annealing

Solution representation
• (schedule order sequence; resource assignment sequence)
• For example, (0, 1, 3, 2, 4; P2, P2, P2, P1, P1)
• Schedule order sequence: respects the partial order defined by the task graph
• Every solution corresponds to a feasible schedule

[Figure: schedule reconstruction, turning the solution into a periodical schedule of T0–T4 on P1 and P2]
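The reconstruction step can be sketched as a simple list scheduler. This is a minimal illustration, assuming that the i-th entry of the resource assignment sequence is the processor of the i-th task in the schedule order, and ignoring communication delays. The task graph (`preds`) and execution times (`exec_time`) are hypothetical, since the slide's figure did not survive.

```python
def reconstruct_schedule(order, assignment, exec_time, preds):
    """Turn (schedule order; resource assignment) into start and finish times.

    order[i] is the i-th task to place and assignment[i] is its processor.
    A task starts once its processor is free and all predecessors have finished.
    """
    proc_free = {}  # processor -> time at which it becomes idle
    finish = {}     # task -> finish time
    schedule = []
    for task, proc in zip(order, assignment):
        ready = max((finish[p] for p in preds.get(task, ())), default=0)
        start = max(ready, proc_free.get(proc, 0))
        finish[task] = start + exec_time[task]
        proc_free[proc] = finish[task]
        schedule.append((task, proc, start, finish[task]))
    return schedule

# Hypothetical task graph (task -> predecessors) and execution times.
preds = {1: [0], 2: [0], 3: [1], 4: [2, 3]}
exec_time = {0: 2, 1: 3, 2: 4, 3: 1, 4: 2}
order = [0, 1, 3, 2, 4]
assignment = ["P2", "P2", "P2", "P1", "P1"]

schedule = reconstruct_schedule(order, assignment, exec_time, preds)
makespan = max(end for _, _, _, end in schedule)
for task, proc, start, end in schedule:
    print(f"T{task} on {proc}: {start} -> {end}")
```

Because the order sequence already respects the precedence constraints, every predecessor has a finish time by the time a task is placed, so the loop never has to backtrack.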

Page 8

Main Approach – Simulated Annealing

Transforms of directed acyclic graph
• Expanded task graph
• Undirected complement graph

Lemma: Given a valid schedule order, swapping two adjacent nodes leads to another valid schedule order, provided there is an edge between these two nodes in the complement graph

[Figure: task graph, expanded task graph, and complement graph over T0–T4]
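A small sketch of the two graph transforms and the lemma follows. It assumes that the expanded task graph is the transitive closure of the task graph, which the surviving text does not confirm, and that the complement graph links every pair of tasks with no precedence relation in either direction. The edge list is hypothetical.

```python
from itertools import combinations

def expanded_edges(n, edges):
    """Transitive closure: (u, v) whenever v is reachable from u."""
    reach = {u: set() for u in range(n)}
    for u, v in edges:
        reach[u].add(v)
    for k in range(n):          # Warshall-style closure
        for u in range(n):
            if k in reach[u]:
                reach[u] |= reach[k]
    return {(u, v) for u in reach for v in reach[u]}

def complement_edges(n, closure):
    """Undirected pairs with no precedence relation in either direction."""
    return {frozenset((u, v)) for u, v in combinations(range(n), 2)
            if (u, v) not in closure and (v, u) not in closure}

def is_valid_order(order, closure):
    """True if every task appears after all tasks that must precede it."""
    pos = {t: i for i, t in enumerate(order)}
    return all(pos[u] < pos[v] for u, v in closure)

edges = [(0, 1), (2, 3), (3, 1), (3, 4)]  # hypothetical task graph
closure = expanded_edges(5, edges)
comp = complement_edges(5, closure)

order = [2, 3, 0, 4, 1]
i = 1  # swap positions 1 and 2, that is, tasks 3 and 0
swapped = order[:i] + [order[i + 1], order[i]] + order[i + 2:]
print(frozenset((3, 0)) in comp, is_valid_order(swapped, closure))
```

Checking membership in the complement graph is a constant-time test, so a move can be accepted or rejected without re-validating the whole order.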

Page 9

Main Approach – Simulated Annealing

Theorem: Starting from a valid schedule order A, we are able to reach any other valid schedule order B after a finite number of adjacent swaps
• For example: 2 3 0 4 1 → 2 0 3 4 1 → 0 2 3 4 1 → 0 2 3 1 4

[Figure: task graph, expanded task graph, and complement graph over T0–T4]
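One way to see why such a swap sequence always exists is a bubble-sort style construction: move each task of the target order leftwards to its final position. Every task it passes comes later in the target order, so neither can precede the other and the swap is allowed. The sketch below illustrates that argument (it is not claimed to be the authors' proof), with a hypothetical precedence relation `closure` chosen to be consistent with the orders in the example.

```python
def is_valid(order, closure):
    """True if the order respects every precedence pair (u before v)."""
    pos = {t: i for i, t in enumerate(order)}
    return all(pos[u] < pos[v] for u, v in closure)

def swap_path(start, target):
    """Orders visited while turning start into target using adjacent swaps only."""
    order = list(start)
    path = [tuple(order)]
    for i, task in enumerate(target):
        j = order.index(task)
        while j > i:  # bubble `task` left to its target position
            order[j - 1], order[j] = order[j], order[j - 1]
            j -= 1
            path.append(tuple(order))
    return path

# Hypothetical precedence pairs (already transitively closed).
closure = {(0, 1), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)}

path = swap_path([2, 3, 0, 4, 1], [0, 2, 3, 1, 4])
for step in path:
    print(step, is_valid(step, closure))
```

The number of swaps equals the number of task pairs ordered differently in the two sequences, so it is bounded by n(n-1)/2 for n tasks.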

Page 10

Main Approach – Simulated Annealing

Moves
• M1: Swap two adjacent nodes in both schedule order sequence and resource assignment sequence, if there is an edge between these two nodes in the complement graph
• M2: Swap two adjacent nodes in resource assignment sequence
• M3: Change the resource assignment of a task

[Figure: task graph, expanded task graph, and complement graph over T0–T4]
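The three moves can be sketched as in-place list operations. This is an illustrative reading of M1 to M3, assuming that the assignment sequence is aligned position by position with the schedule order; the precedence relation, complement graph, and processor names are hypothetical.

```python
import random

def is_valid(order, closure):
    pos = {t: i for i, t in enumerate(order)}
    return all(pos[u] < pos[v] for u, v in closure)

def move_m1(order, assign, comp, rng):
    """M1: swap two adjacent tasks and their processors if the complement graph allows it."""
    i = rng.randrange(len(order) - 1)
    if frozenset((order[i], order[i + 1])) not in comp:
        return False  # swap would violate a precedence constraint
    order[i], order[i + 1] = order[i + 1], order[i]
    assign[i], assign[i + 1] = assign[i + 1], assign[i]
    return True

def move_m2(assign, rng):
    """M2: swap the processors of two adjacent positions."""
    i = rng.randrange(len(assign) - 1)
    assign[i], assign[i + 1] = assign[i + 1], assign[i]

def move_m3(assign, processors, rng):
    """M3: move one task to a different processor."""
    i = rng.randrange(len(assign))
    assign[i] = rng.choice([p for p in processors if p != assign[i]])

# Hypothetical precedence closure and its complement graph over tasks 0..4.
closure = {(0, 1), (2, 3), (2, 1), (2, 4), (3, 1), (3, 4)}
comp = {frozenset(p) for p in [(0, 2), (0, 3), (0, 4), (1, 4)]}
processors = ["P1", "P2"]

rng = random.Random(0)
order = [2, 3, 0, 4, 1]
assign = ["P1", "P1", "P2", "P1", "P2"]
for _ in range(200):
    kind = rng.choice(["M1", "M2", "M3"])
    if kind == "M1":
        move_m1(order, assign, comp, rng)
    elif kind == "M2":
        move_m2(assign, rng)
    else:
        move_m3(assign, processors, rng)
    assert is_valid(order, closure)  # every neighbour is still a feasible order
```

Only M1 touches the schedule order, and it is guarded by the complement-graph test, so no move can produce an infeasible solution.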

Page 11

Main Approach – Simulated Annealing

Three moves are defined, so that
• Starting from a valid schedule order A, we are able to reach any other valid schedule order B after a finite number of adjacent swaps

Cost function
• First term guarantees that a schedule meets all tasks’ deadlines (weighted by a significantly large factor)
• Second term indicates the system lifetime

Page 12

Main Approach – Simulated Annealing

Key problem: Computation time

Source of time overhead
• Run the temperature simulator EVERY TIME we reach a new solution
  • Simulator is called 3×10^5 times
• Every time, trace the temperature variation for the entire service life
  • In the range of years
• Accurate calculation requires a fine-grained variation trace file
  • Significant variation within a very short time

An efficient cost computation strategy is essential!

SA parameters
initial temperature   10^2
end temperature       10^-5
cooling rate          0.95
iteration             10^3
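The call count can be checked against the SA parameters with a few lines of arithmetic. The sketch assumes geometric cooling (temperature multiplied by the cooling rate at each level) and reads "iteration" as the number of candidate solutions evaluated per temperature level; both are assumptions about how the parameters are used.

```python
def count_cost_evaluations(t_init=1e2, t_end=1e-5, cooling=0.95, iters_per_temp=10**3):
    """Count temperature levels and total cost evaluations of a geometric schedule."""
    levels = 0
    temp = t_init
    while temp > t_end:
        levels += 1
        temp *= cooling
    return levels, levels * iters_per_temp

levels, evaluations = count_cost_evaluations()
print(levels, evaluations)
```

Under these assumptions the schedule passes through a little over 300 temperature levels, which at 10^3 evaluations each is on the order of 3×10^5 simulator calls.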

Page 13

Revisit System Lifetime Reliability Estimation – Speedup I

It would be better if we could compute MTTF by tracing the temperature variation of only one period

Page 14

Revisit System Lifetime Reliability Estimation – Speedup I

A subdivision of time

[Figure: subdivision of the time axis]

Page 15

Revisit System Lifetime Reliability Estimation – Speedup I

Given […]

Aging effect in one period

Property: the aging effect in one period does not vary from period to period

This property enables us to trace the temperature variation of only ONE period
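The expressions on this slide were lost. One way to make the property concrete, offered as a sketch under assumptions rather than as the authors' formula, is to let the Weibull scale depend on temperature and accumulate aging interval by interval. Here Δt_i and T_i are the length and temperature of interval i, α(T) is a temperature-dependent scale, β is the slope, t_p is the schedule period, and A is the aging accumulated in one period; all of these symbol names are assumed.

```latex
A \;=\; \sum_{i \,\in\, \text{one period}} \frac{\Delta t_i}{\alpha(T_i)},
\qquad
R(m\,t_p) \;=\; \exp\!\left[-\Bigl(\,\sum_{k=1}^{m} A\Bigr)^{\beta}\right]
\;=\; \exp\!\left[-(m A)^{\beta}\right]
```

Since the schedule, and hence the temperature profile, repeats every period, the same A is added each time, so one traced period determines the reliability at every period boundary.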

Page 16

Revisit System Lifetime Reliability Estimation – Speedup I

The expected service life of one processor is […]

Provided there are no redundant processors in the system, the expected service life of the entire system is […]
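Both expressions are missing from the transcript. The sketch below shows one consistent way to obtain them, assuming a Weibull model in which a processor accumulates aging `A` per period of length `period`, that the period is tiny compared with the lifetime, and that all processors share the same slope `beta`. Under those assumptions a single processor has MTTF = (period / A) · Γ(1 + 1/β), and a system without redundancy fails at the first processor failure. This may differ from the expressions on the slide; the numeric values are hypothetical.

```python
import math

def processor_mttf(aging_per_period, period, beta):
    """Weibull mean with scale period / aging_per_period and slope beta."""
    return (period / aging_per_period) * math.gamma(1.0 + 1.0 / beta)

def system_mttf(agings, period, beta):
    """Series system: the product of Weibull reliabilities with a shared beta is Weibull again."""
    combined = sum(a ** beta for a in agings) ** (1.0 / beta)
    return processor_mttf(combined, period, beta)

def system_mttf_numeric(agings, period, beta, dt=1e-3, horizon=50.0):
    """Midpoint-rule integral of the product of per-processor reliabilities."""
    acc = 0.0
    for k in range(int(horizon / dt)):
        t = (k + 0.5) * dt
        acc += math.exp(-sum((a * t / period) ** beta for a in agings)) * dt
    return acc

agings = [0.5, 0.2]  # hypothetical one-period aging of two processors
period, beta = 1.0, 2.0
closed = system_mttf(agings, period, beta)
numeric = system_mttf_numeric(agings, period, beta)
print(closed, numeric)
```

The closed form and the numerical integral agree, and the system value is below the MTTF of the faster-aging processor, matching the earlier observation that the fastest-aging core dominates the platform lifetime.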

Page 17

Revisit System Lifetime Reliability Estimation – Speedup II

Given […]

Instead of computing the aging effect in every period, we propose to compute the aging effect of several periods at one time
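The idea of handling several periods per step can be illustrated with a numerical MTTF integration whose step spans `group` periods instead of one. This is a sketch of the general idea under the same assumed Weibull aging model (aging `A` per period, slope `beta`); how the authors actually group periods is not recoverable from the transcript, and the numbers are hypothetical.

```python
import math

def mttf_by_period_groups(aging_per_period, period, beta, group, tol=1e-12):
    """Integrate R(t) with a midpoint rule whose step covers `group` periods."""
    step = group * period
    total, k = 0.0, 0
    while True:
        periods_elapsed = (k + 0.5) * group
        r = math.exp(-((periods_elapsed * aging_per_period) ** beta))
        total += r * step
        if r < tol:  # reliability has decayed to a negligible level
            return total
        k += 1

aging, period, beta = 1e-4, 1.0, 2.0   # hypothetical values
fine = mttf_by_period_groups(aging, period, beta, group=1)
coarse = mttf_by_period_groups(aging, period, beta, group=100)
exact = (period / aging) * math.gamma(1.0 + 1.0 / beta)
print(fine, coarse, exact)
```

With these values the coarse run needs about one hundredth of the loop iterations while landing on essentially the same result, because the reliability changes very little over a handful of periods.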

Page 18

Revisit System Lifetime Reliability Estimation – Speedup III

Accurate calculation requires setting the length of the time intervals to a very small value

Use steady temperature rather than accurate temporal temperature

[Figure: temperature variation example for a task schedule on P1, P2, P3 over time t; legend: Task Type 1, Task Type 2]

Page 19

Revisit System Lifetime Reliability Estimation – Speedup IV

Need to run temperature simulator every time we reach a new solution

There can be at most […] kinds of processor usage combinations in task schedules
• Given […] = 3, […] = 4, we need only 255 pre-computations, each for a steady temperature

Estimate processors’ temperature for various processor usage combinations in pre-calculation phase only

[Figure: task schedule on P1, P2, P3 over time t; legend: Task Type 1, Task Type 2]
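The bound and the two symbols in the example were lost in extraction. One reading that reproduces the figure of 255 is that each processor is either idle or running a task of one of k power types, with the all-idle case excluded, giving (k + 1)^n − 1 combinations for n processors; k = 3 and n = 4 then yield 255. Which symbol equals 3 and which equals 4 is an assumption here. The enumeration below checks that arithmetic.

```python
from itertools import product

def usage_combinations(num_processors, num_task_types):
    """All per-processor states (0 = idle, 1..k = task power type), minus all-idle."""
    states = range(num_task_types + 1)
    return [c for c in product(states, repeat=num_processors) if any(c)]

combos = usage_combinations(num_processors=4, num_task_types=3)
print(len(combos))
```

Each tuple in `combos` would correspond to one run of the temperature simulator in the pre-calculation phase.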

Page 20

Revisit System Lifetime Reliability Estimation – Speedup IV

Time slot
• The set of processors in use
• The power consumption of the tasks running on these processors
• Categorize the tasks into types according to power consumption
• E.g., […] (annotated with the index of each processor in use and its task power consumption type)

Page 21

Revisit System Lifetime Reliability Estimation – Speedup IV

Pre-calculate the steady temperature of processor […] in time slot […]

The aging effect in unit time in this case is therefore […]

The aging effect of P1 in this schedule in a period is […]
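With the formulas missing, the sketch below shows how a table of pre-calculated steady temperatures could be turned into a per-period aging value: look up the temperature for each time slot's usage combination, convert it to an aging rate, and weight it by the slot length. The Arrhenius-style rate, the activation energy, the temperature table, and the slot list are all hypothetical stand-ins for whatever hard-error model and simulator output are actually used.

```python
import math

BOLTZMANN_EV = 8.617333262e-5  # Boltzmann constant in eV/K

def aging_rate(temp_kelvin, ea_ev=0.9, scale=1.0):
    """Aging per unit time, 1 / alpha(T), for an assumed Arrhenius-style alpha(T)."""
    return scale * math.exp(-ea_ev / (BOLTZMANN_EV * temp_kelvin))

def aging_in_period(slots, steady_temp, proc):
    """Sum of slot length times aging rate at the slot's steady temperature."""
    return sum(length * aging_rate(steady_temp[usage][proc]) for length, usage in slots)

# Key: task power type on (P1, P2), 0 = idle. Value: steady temperatures in kelvin.
steady_temp = {
    (1, 0): {"P1": 350.0, "P2": 330.0},
    (1, 2): {"P1": 360.0, "P2": 365.0},
    (0, 2): {"P1": 335.0, "P2": 355.0},
}
# One schedule period as (slot length, usage combination) pairs.
slots = [(2.0, (1, 0)), (3.0, (1, 2)), (1.0, (0, 2))]

a_p1 = aging_in_period(slots, steady_temp, "P1")
a_p2 = aging_in_period(slots, steady_temp, "P2")
print(a_p1, a_p2)
```

Evaluating a new schedule then costs only a few dictionary lookups and exponentials, with no simulator call inside the annealing loop.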

Page 22

Revisit System Lifetime Reliability Estimation – Summary

A summary of speedup techniques
• Rewrite MTTF expression in terms of aging effect in one period
• Compute the aging effect of several periods at one time
• Approximate aging effect in one period based on the task changes and using steady temperature
• Call temperature estimation simulator in the pre-calculation phase only

The time consumption of pre-calculation can be reduced even further

Page 23

Experimental Setup

Random task graphs generated by TGFF
• Task numbers range from 20 to 260

Hypothetical MPSoC platforms
• Processor core numbers range from 2 to 8
• Homogeneous / Heterogeneous

Take the electromigration model in [Goel-IEEEPress07] as an example
• Note that our model also applies to other failure mechanisms

Compare our method with a thermal-aware task scheduling algorithm proposed in [Xie-JVLSISP06]

Page 24

Accuracy

Comparison between approximated MTTF and accurate value

Page 25

Lifetime Reliability of Various Platforms with Various Task Graphs

Platform        Task           Deadline   Thermal-aware   Simulated Annealing
Description     Description
M-PE   Co-PE    Task   Edge               MTTF            0% DR            5% DR            10% DR
                                                          MTTF    Δ(%)     MTTF    Δ(%)     MTTF    Δ(%)
2      0        22     23      535        492.5           492.5   0        582.3   18.2     582.3   18.2
4      0        49     76      1106       216.1           226.9   5.0      247.3   14.4     263.4   21.9
2      2        49     76      697        137.4           161.3   17.4     171.2   24.6     185.6   35.0
6      0        76     106     918        228.9           239.9   4.8      256.7   12.2     273.3   19.4
2      4        76     106     676        97.2            125.1   28.7     137.9   41.9     150.0   54.4
8      0        131    190     1227       227.2           235.8   3.8      250.9   10.4     265.6   16.9
2      6        131    190     984        88.0            130.4   48.2     143.7   63.3     160.0   81.8

Δ: Difference ratio between MTTF of simulated annealing and that of thermal aware

DR: Deadline Relaxation
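The Δ column can be reproduced from the MTTF columns if Δ is read as the relative improvement of simulated annealing over the thermal-aware baseline, in percent. That reading is inferred from the numbers rather than stated as a formula on the slide. The check below uses three (thermal-aware, simulated annealing) pairs taken from the table.

```python
def delta_percent(mttf_sa, mttf_thermal):
    """Relative MTTF gain of simulated annealing over the thermal-aware schedule."""
    return (mttf_sa - mttf_thermal) / mttf_thermal * 100.0

# (thermal-aware MTTF, simulated annealing MTTF) pairs from the table.
pairs = [(492.5, 582.3), (216.1, 226.9), (88.0, 160.0)]
deltas = [round(delta_percent(sa, thermal), 1) for thermal, sa in pairs]
print(deltas)
```

The three values match the table entries 18.2, 5.0, and 81.8.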

Page 26

Lifetime Reliability of 8-Processor Platforms

Task Description     Platform                Deadline   Thermal Aware   Simulated Annealing
Task #   Edge #                                         MTTF            DR (%)   MTTF    Δ(%)
101      142         8 Core Homogeneous      1059       240.1           0        247.8   3.2
                                                                        5        264.3   10.1
                                                                        10       279.6   16.5
101      142         8 Core Heterogeneous    809        91.6            0        129.0   40.8
                                                                        5        146.0   59.3
                                                                        10       160.5   75.4
131      190         8 Core Homogeneous      1227       227.2           0        235.8   3.8
                                                                        5        250.9   10.4
                                                                        10       265.6   16.9
131      190         8 Core Heterogeneous    984        88.0            0        130.4   48.2
                                                                        5        143.7   63.3
                                                                        10       160.0   81.8
251      366         8 Core Homogeneous      2014       191.4           0        203.4   6.3
                                                                        5        216.6   13.2
                                                                        10       230.2   20.3
251      366         8 Core Heterogeneous    1693       85.7            0        124.2   44.9
                                                                        5        137.9   60.8
                                                                        10       151.1   76.3

Page 27

Efficiency

The simulated annealing process requires 50-200s of CPU time on an Intel(R) Core(TM) 2 CPU at 2.13GHz for each case
• 4 processors, 49 tasks – 84s
• 8 processors, 101 tasks – 158s

The CPU time spent on pre-calculation ranges from 3s to 160s

Page 28

Conclusion

Technology advancement has brought an adverse impact on the lifetime reliability of MPSoC embedded systems

Prior work on task allocation and scheduling does not explicitly take wearout failure into account

We propose an analytical model to estimate the lifetime reliability of multiprocessor platforms under periodical tasks

We present a novel lifetime reliability-aware algorithm based on the simulated annealing technique

We propose several speedup techniques to simplify the design space exploration process with satisfactory solution quality

Experimental results demonstrate the effectiveness of the proposed approach

Page 29

Lifetime Reliability-Aware Task Allocation and Scheduling for MPSoC Platforms

Thank you for your attention !