TRANSCRIPT
Architecture-Parametric Timing Analysis
Jan Reineke · Johannes Doerfert
20th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), April 15-17, 2014, Berlin, Germany
Computer Science, Saarland University
Architecture-Configuration Challenge: At Design Time
2
[Slide diagram: four cores, each with a local cache, connected via a bus to a shared cache. Design-time configuration annotations: maximal core frequency, local- and shared-cache sizes, bus latency and bandwidth.]
Architecture-Configuration Challenge: At Runtime
3
[Slide diagram: the same quad-core platform. Runtime configuration annotations: dynamic frequency/voltage scaling on the cores, TDMA configuration of the bus, and partitioning or turning off the shared cache.]
Architecture-Configuration Challenge: Many-Core
4
[Slide diagram: four quad-core clusters (cores with local caches, a bus, and a shared cache), connected via a network on chip to main memory. Annotations: NoC configuration, latency, and bandwidth.]
Configuration affects implementation cost, energy consumption, and worst-case execution times!
Architecture-Parametric Timing Analysis
5
[Slide diagram: the many-core platform from the previous slide.]
Embedded Software + Configurable Platform → Parametric WCET?
Desiderata: • Precise • Efficiently evaluable
Architecture-Parametric Timing Analysis: "Black-Box" Approach
6
Conventional non-parametric timing analysis: a black-box WCET analysis takes a software binary and a configuration and returns a WCET.
[Slide diagram: the software binary and configurations 1-5 are each fed to the black-box WCET analysis, yielding WCETs 1-5; "Generalize from Examples" then produces a parametric WCET.]
Architecture-Parametric Timing Analysis: "Black-Box" Approach
7
[Same diagram as on the previous slide.]
1. How to generalize from examples?
2. Where to sample the black box?
Requirements for Sound and Efficient Generalization
8
Necessary: Execution times should be monotone in the parameters: "higher frequencies yield shorter execution times"; "smaller caches yield longer execution times".
Desirable for efficiency: Execution times should depend linearly on the parameters: "doubling the processor frequency will decrease execution time by a factor of two".
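These two properties can be illustrated with a minimal sketch (not the paper's timing model; event names and counts are made-up assumptions): a timing model that is linear in per-event latencies, and hence also monotone in them.

```python
# Minimal sketch (assumed, not the paper's model): a timing model that is
# linear in per-event latencies, hence also monotone in them.

def execution_time(counts, latencies):
    """Linear timing model: total time = sum of event count * per-event latency."""
    return sum(counts[e] * latencies[e] for e in counts)

counts = {"alu": 1000, "load": 200}   # hypothetical event counts of a program
lat = {"alu": 1.0, "load": 4.0}       # hypothetical per-event latencies (cycles)

t = execution_time(counts, lat)
t_fast = execution_time(counts, {e: l / 2 for e, l in lat.items()})
assert t_fast == t / 2  # linearity: doubling the frequency halves execution time
```

Monotonicity alone guarantees soundness of generalization; linearity is what makes the "double the frequency, half the time" extrapolation exact.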
Requirements for Sound and Efficient Generalization
9
Generalize from examples:

|cache| = 0,    c = 1, m = 0:  WCET = 13
|cache| = 0,    c = 0, m = 1:  WCET = 5
|cache| = 4096, c = 1, m = 0:  WCET = 10
|cache| = 4096, c = 0, m = 1:  WCET = 3

|cache| >= 4096?  Yes: WCET = 10*c + 3*m.  No: WCET = 13*c + 5*m.
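The generalized result on this slide can be written out as executable code: the parametric WCET is an affine formula selected by the cache-size condition (a sketch of the slide's example only, not the paper's analysis).

```python
# Sketch of the slide's generalized WCET: an affine formula per cache-size
# region, selected by a condition on |cache|.

def parametric_wcet(cache_size, c, m):
    """Piecewise-affine WCET generalized from the four samples above."""
    if cache_size >= 4096:
        return 10 * c + 3 * m
    return 13 * c + 5 * m

# The formula reproduces all four sampled configurations exactly:
assert parametric_wcet(0, 1, 0) == 13
assert parametric_wcet(0, 0, 1) == 5
assert parametric_wcet(4096, 1, 0) == 10
assert parametric_wcet(4096, 0, 1) == 3
```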
1. How to Generalize from Examples? Reduction to Parametric Linear Programming
10
Formulate a series of parametric linear programs, encoding:
• Configurations/WCETs obtained from the black box
• Properties that allow generalization
See paper for details!
[Slide repeats the example and the black-box diagram from the previous slides.]
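As a toy illustration of what the generalization recovers (an assumption for exposition, not the paper's PLP encoding): when the samples in a region isolate one coefficient each, the affine coefficients can be read off directly; the general case, with arbitrary samples and monotonicity constraints, is what the parametric linear programs solve.

```python
# Toy illustration (not the paper's PLP encoding): recover the affine
# coefficients of WCET = a*c + b*m per cache-size region from unit samples.

samples = [  # (cache_size, c, m, WCET), as on the previous slide
    (0, 1, 0, 13), (0, 0, 1, 5), (4096, 1, 0, 10), (4096, 0, 1, 3),
]

def fit_region(region):
    """Read off a and b of WCET = a*c + b*m from the unit samples in one region."""
    a = next(w for (_, c, m, w) in region if (c, m) == (1, 0))
    b = next(w for (_, c, m, w) in region if (c, m) == (0, 1))
    return a, b

assert fit_region([s for s in samples if s[0] < 4096]) == (13, 5)
assert fit_region([s for s in samples if s[0] >= 4096]) == (10, 3)
```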
"Black-Box" Approach
11
2. Where to sample the black box?

2. Where to Sample the Black Box?
12
Wanted: a small set of configurations ("small" = fast analysis) that yields a precise parametric WCET ("precise" = "close" to the black box "everywhere").
2. Where to Sample the Black Box? Incremental Sampling
13
[Slide diagram: the software binary and the sampled configurations 1-5 with their WCETs feed "Generalize from Examples", which produces a parametric WCET and a parametric lower bound. If their difference is less than epsilon, return the parametric WCET bound; otherwise, feed a violating configuration back to the black-box WCET analysis as a new sample and repeat.]
Theorem: The algorithm terminates.
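The refinement loop on this slide can be sketched in a drastically simplified setting (an assumption for illustration, not the paper's Algorithm 3): a black box that is monotone non-decreasing in a single integer parameter, where monotonicity lets sampled values bound all unsampled ones by step functions.

```python
# Sketch of the incremental-sampling idea in a simplified 1-D setting.
# Assumptions (not from the paper): black box is monotone non-decreasing in
# one integer parameter; bounds are step functions implied by monotonicity.

def bounds_from_samples(samples):
    """Lower/upper step-function bounds implied by monotonicity."""
    def lower(x):  # largest sampled value at or below x
        return max(v for s, v in samples.items() if s <= x)
    def upper(x):  # smallest sampled value at or above x
        return min(v for s, v in samples.items() if s >= x)
    return lower, upper

def incremental_sampling(black_box, lo, hi, epsilon):
    samples = {lo: black_box(lo), hi: black_box(hi)}  # two initial samples
    while True:
        lower, upper = bounds_from_samples(samples)
        # configuration where the bounds are furthest apart
        worst = max(range(lo, hi + 1), key=lambda x: upper(x) - lower(x))
        if upper(worst) - lower(worst) < epsilon:
            return upper                    # precise enough: return upper bound
        samples[worst] = black_box(worst)   # refine at the violating configuration

wcet = incremental_sampling(lambda x: 2 * x, 0, 16, epsilon=5)
assert all(wcet(x) >= 2 * x for x in range(17))  # sound upper bound everywhere
```

Termination mirrors the slide's theorem in this toy setting: each iteration samples a new configuration, and the finite parameter space bounds the number of iterations.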
Target for Prototype: A Parameterized Precision-Timed Architecture
14
Parameterized version of the PTARM, a predictable microarchitecture developed within the PRET project.
6 parameters that control:
• latencies of arithmetic and branch instructions,
• latencies of loads and stores to the scratchpads and to DRAM,
• sizes of the instruction and data scratchpads.
For experimental evaluation:
• Black-box WCET analysis based on OTAWA
• Parameterized PTARM simulator
Experimental Evaluation: Precision of the Black Box
15
Mälardalen benchmarks, minus those using floating point, recursion, or complex switch statements.
[Slide chart: Black Box/Simulator ratio per benchmark.]
TABLE II. BRIEF SUMMARY OF THE BENCHMARKS.
Name           Size [byte]  Brief description
adpcm          26852        Adaptive pulse code modulation algorithm.
bs             4248         Binary search for the array of 15 integer elements.
bsort100       2779         Bubblesort program.
crc            5168         Cyclic redundancy check computation on 40 bytes of data.
fdct           8863         Fast Discrete Cosine Transform.
fibcall        3499         Simple iterative Fibonacci calculation, to calculate fib(30).
insertsort     3892         Insertion sort on a reversed array of size 10.
janne_complex  1564         Nested loop program.
jfdctint       16028        Discrete-cosine transformation on an 8x8 pixel block.
matmult        3737         Matrix multiplication of two 20x20 matrices.
ns             10436        Search in a multi-dimensional array.
nsichneu       11835        Simulates an extended Petri net.
qsort-exam     4535         Non-recursive version of the quicksort algorithm.
statemate      52618        Automatically generated code.
GNU ARM toolchain including GCC version 4.3.2. Time measurements were performed on an Intel Core i7 920 running at 2.67 GHz with 12 GB of RAM. We chose 64 KB as the maximal value for both the instruction and the data scratchpad memories. Linear parameters, modeling latencies, may take any rational value between 0 and 10.
B. Evaluation Results
Our first evaluation goal is to confirm that the black-box WCET analysis over-approximates the timing of the parameterized PTARM, and to evaluate its precision. To this end, we determine for each benchmark the ratio between the black-box WCET estimate and the execution time determined using the PTARM simulator in a single simulation run with the inputs that are provided with the Mälardalen benchmarks. As the value analysis in the black box is very simple and thus bound to be imprecise, we perform this comparison with all linear parameters, including the DRAM latencies, set to 1. This eliminates the influence of the value analysis from the results. The results of this analysis are illustrated in Table III. For some benchmarks the black-box estimate is very close to the simulation result, yet for others the ratio is extremely large. This is due to imprecise loop bounds and other constraints on the control flow, and to the fact that the input exercised during simulation does not represent the worst-case input, e.g., in the sorting tasks.
Next, we evaluate how the number of black-box WCET samples affects the precision of the parametric analysis results on unsampled parameter vectors. To this end, we modify Algorithm 3 to report upper and lower bounds whenever a new sample has been taken. This yields two ASTs, $\underline{\Phi}_{P,i}$ and $\overline{\Phi}_{P,i}$, corresponding to the lower and upper bounds on the black box, for each benchmark $P$ in the set of benchmarks $\mathcal{P}$ and number of samples $i$. Then, we sample the parameter space uniformly at random 100 times. For each of the randomly drawn parameter valuations $(\lambda_j, \mu_j)$, we evaluate the black box $\mathit{BB}_P$ and the upper and lower bounds $\overline{\Phi}_{P,i}$, $\underline{\Phi}_{P,i}$, and determine their ratios:
\[
r^{\mathrm{over}}_{P,i,j} := \frac{\llbracket \overline{\Phi}_{P,i} \rrbracket(\lambda_j, \mu_j)}{\mathit{BB}_P(\lambda_j, \mu_j)}
\quad\text{and}\quad
r^{\mathrm{under}}_{P,i,j} := \frac{\llbracket \underline{\Phi}_{P,i} \rrbracket(\lambda_j, \mu_j)}{\mathit{BB}_P(\lambda_j, \mu_j)}.
\]
We summarize these ratios by taking their geometric means $r^{\mathrm{over}}_i$, $r^{\mathrm{under}}_i$ over all benchmarks found in Table III. In Figure 3, we depict $r^{\mathrm{over}}_i$ and $r^{\mathrm{under}}_i$ for $i$ between two² and 26. The experiment was performed with a precision target of $(\epsilon = 1024, \tau = (0, 0))$³, which ensures that an arbitrary number of samples can be taken. We observe a strong precision improvement on samples 3 to 7. After the first 7 samples, the obtained lower bound is very close to the actual black-box values for all benchmarks. The upper bound comes within 5% of the black box after 16 samples for most benchmarks, which is reflected by the geometric mean in the figure. For most benchmarks, the algorithm chooses to sample the black box at a different instruction scratchpad size than before, at samples 12 and 16, which yields a significant precision improvement.

²We start at two samples, because Algorithm 3 performs two samples before entering the refinement loop.
³Where $\tau$ refers to the monotone parameters $\mu_{\text{I-SPM-Size}}$ and $\mu_{\text{D-SPM-Size}}$.

TABLE III. PRECISION OF THE BLACK-BOX WCET ANALYSIS.
Name           Black Box (cycles)  Simulator (cycles)  Ratio
adpcm          9989637             1598152             6.25
bs             318                 279                 1.14
bsort100       998109              8293                120.36
crc            248231              116995              2.12
fdct           11262               11069               1.02
fibcall        1140                1131                1.01
insertsort     4965                2949                1.68
janne_complex  4048                753                 5.38
jfdctint       14016               13951               1.00
matmult        755274              745669              1.01
ns             42550               42549               1.00
nsichneu       32339               15551               2.08
qsort-exam     2132100             11125               191.65
statemate      108766              2809                38.72

[Fig. 3: Ratios between upper bound and black box, $r^{\mathrm{over}}_i$ and $r^{\mathrm{over}}_{i,\mathrm{random}}$, and ratio between lower bound and black box, $r^{\mathrm{under}}_i$, in terms of the number of samples $i$.]
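The summary statistic used here, the geometric mean of the per-benchmark ratios, can be sketched as follows (the example ratios are made up):

```python
# Sketch of the summary statistic: geometric mean of per-benchmark ratios.
import math

def geometric_mean(ratios):
    """exp of the mean log: the appropriate average for multiplicative ratios."""
    return math.exp(sum(math.log(r) for r in ratios) / len(ratios))

assert abs(geometric_mean([1.0, 4.0]) - 2.0) < 1e-9
```

Unlike the arithmetic mean, the geometric mean treats over- and under-estimation symmetrically: a benchmark with ratio 2 and one with ratio 1/2 average out to 1.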
As a baseline for Algorithm 3, we determine how well upper bounds based on a random set of samples approximate the black box. In Figure 3, $r^{\mathrm{over}}_{i,\mathrm{random}}$ denotes the ratio between these upper bounds, based on $i$ random samples, and the black box. Random sampling yields much less precise estimates. In addition, computation times increase dramatically, which explains why we only perform this experiment for up to 15 samples. This demonstrates that some form of "intelligent" sampling is required for the black-box approach to be precise and efficient.
To evaluate analysis efficiency, we determined the analysis time of each benchmark up to and including the $i$-th sample. We decompose this analysis time into three components:
1) Invocations of the black box.
2) Operations on affine selection trees, e.g., minimization.
3) Invocations of PIPLIB.
In Figure 5 we show the geometric mean of the analysis time over all benchmarks up to the $i$-th sample. The bars are stacked, meaning that the top of the uppermost bar reflects the overall analysis time. As the black box is called once for each sample, its contribution to the overall analysis time grows linearly. A superlinear growth is observed for the other two components, which is expected, as the problems to be solved grow with each sample. PIPLIB's contribution grows strongest, moving from the smallest to the largest share of analysis time. As observed in Figure 3, 16 samples usually result in very precise parametric upper bounds. On our benchmarks, 16 samples are processed in less than 2.2 seconds on average.
Experimental Evaluation: Precision in Terms of Number of Samples
16
[Slide: Fig. 3, the ratios between upper/lower bounds and the black box in terms of the number of samples $i$.]
In Figure 3 we have shown how the actual precision depends on the number of samples. Next, we determine how many samples are required to meet a certain precision guarantee. To this end, we determine for each benchmark and for several precision targets $(\epsilon, (\tau, \tau))$⁴ the number of samples $s_P(\epsilon, \tau)$ that Algorithm 3 takes until termination. We summarize these values in Figure 6. Three benchmarks, nsichneu, adpcm, statemate, ran out of memory for smaller values of $\tau$. For the tightest precision requirement, $\epsilon = 32$, we report the maximum required number of samples $s^{\max}_{32}(\tau) = \max_{P \in \mathcal{P}} s_P(32, \tau)$ over all benchmarks for which the analysis terminated successfully, for $\tau$ between 8192 and 262144. For the loosest precision requirement, $\epsilon = 1024$, on the other hand, we report the minimum required number of samples $s^{\min}_{1024}(\tau)$ over all benchmarks. For $\epsilon$ between 32 and 1024, $s_P(\epsilon, \tau)$ is expected to lie between $s^{\min}_{1024}(\tau)$ and $s^{\max}_{32}(\tau)$. We also report the median number of samples $s^{\mathrm{median}}_{256}(\tau)$ over all benchmarks for $\epsilon = 256$. For most benchmarks, the number of required samples is quite insensitive to $\epsilon$: the median for $\epsilon = 256$ is close to the minimum for $\epsilon = 1024$.
VIII. APPLICATION TO COMMERCIAL MICROARCHITECTURES

We have applied APTA to a parameterized version of the PTARM, a precision-timed architecture. It has been designed with the specific goal of reconciling performance and predictability. And indeed, as we demonstrate, precise and efficient architecture-parametric WCET analysis is feasible for the PTARM. Naturally, the question arises whether our approach is also applicable to other existing academic and commercial microarchitectures and whether a similar parameterization is reasonable for such microarchitectures.

⁴Where $(\tau, \tau)$ refers to the monotone parameters $\mu_{\text{I-SPM-Size}}$ and $\mu_{\text{D-SPM-Size}}$.

[Fig. 6: Number of samples required to reach the precision guarantee $(\epsilon, (\tau, \tau))$, for $\tau$ from 262144 down to 8192. Some benchmarks, namely nsichneu, adpcm, statemate, ran out of memory for $\epsilon = 32$ and are thus not included in $s^{\max}_{32}(\tau)$. The median and minimum, $s^{\mathrm{median}}_{256}(\tau)$ and $s^{\min}_{1024}(\tau)$, could be determined for all values of $\tau$.]

To answer the second question first, we believe that the parameterization of the PTARM is quite typical for both commercial and research platforms: aspects that are often configurable in single-core processors are the processor frequency, the sizes of local memories (caches or scratchpads), and the interconnect, affecting memory access latencies, corresponding directly to the parameters in the PTARM. To use multi-core architectures in a hard real-time context, most of their shared resources will have to be partitioned (an approach promoted among others in the MERASA and Predator projects [15], [3]) in space and/or time: caches can be partitioned along their ways [16], buses and other interconnect can be partitioned in time by time-division multiple access (TDMA) arbitration [17], and access to DRAM memory can be partitioned in time [18] and space [10]. Partitioning in space intuitively induces monotone parameters, whereas time-based partitioning may sometimes be modeled with linear parameters; however, caveats exist, as discussed below.

For our approach to be applicable to a particular platform, its parameterized timing model needs to be linear or at least monotone in all of its parameters. This is, unfortunately, not the case for canonical models of many complex microarchitectures. In contrast to LRU, FIFO cache replacement is known to suffer from Belady's anomaly, i.e., under FIFO, increasing the cache's size may lead to a decrease in performance. For platforms including such non-monotone features, monotone parameterized timing models can be developed, however, at the cost of a loss in precision. Alternatively, if the goal is to find a system configuration statically, our analysis may also be performed on timing models that are not guaranteed to be monotone. This yields parametric WCET estimations that are not necessarily safe. Once a system configuration has been determined based on such an estimation, the black box can still be used to verify whether the timing constraints can be met. If not, the parametric WCET estimation can be refined accordingly and the process iterated.

In addition to monotonicity, our approach requires timing to be decomposed into contributions that can be attributed to different components. Such a decomposition is natural for so-called fully timing-compositional architectures [3], [19]. An example of a fully timing-compositional commercial architecture is the ARM7 [3]. For more complex architectures such as the Infineon TriCore or the PowerPC 755, it is possible to create conservative compositional models. The resulting loss in precision may, however, be substantial, depends on the particular architecture and decomposition [19], and is a topic of ongoing research.
Experimental Evaluation: Versus Random Sampling
17
[Slide: Fig. 3 again, highlighting the random-samples curve $r^{\mathrm{over}}_{i,\mathrm{random}}$.]
Experimental Evaluation: Analysis Time in Terms of Number of Samples
18
[Fig. 5: Analysis time (in ms) in terms of the number of samples; stacked bars for PIPLIB, AST operations, and black-box invocations.]
Experimental Evaluation: Analysis Time in Terms of Number of Samples

Fig. 4. Ratios between upper bound and black box, r^over_i and r^over_i,random, and ratio between lower bound and black box, r^under_i, in terms of the number of samples i.
Fig. 5. Analysis time (in ms) in terms of the number of samples, broken down into PIPLIB, AST operations, and black-box queries.
As can be observed in Figure 4, 16 samples usually result in very precise parametric upper bounds. On our benchmarks, 16 samples are processed in less than 2.2 seconds on average.
In Figure 4 we have shown how the actual precision depends on the number of samples. Next, we determine how many samples are required to meet a certain precision guarantee. To this end, we determine for each benchmark P and for several precision targets (ε, (τ, τ))⁴ the number of samples s_P(ε, τ) that Algorithm 3 takes until termination. We summarize these values in Figure 6. Three benchmarks, nsichneu, adpcm, and statemate, ran out of memory for smaller values of τ. For the tightest precision requirement, ε = 32, we report the maximum required number of samples s^max_32(τ) = max_{P ∈ 𝒫} s_P(32, τ) over all benchmarks for which the analysis terminated successfully, for τ between 8192 and 262144. For the loosest precision requirement, ε = 1024, on the other hand, we report the minimum required number of samples s^min_1024(τ) over all benchmarks. For ε between 32 and 1024, s_P(ε, τ) is expected to lie between s^min_1024(τ) and s^max_32(τ). We also report the median number of samples s^median_256(τ) over all benchmarks for ε = 256. For most benchmarks, the number of required samples is quite insensitive to ε: the median for ε = 256 is close to the minimum for ε = 1024.
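The sampling loop the paragraph describes can be caricatured in one dimension: for a monotone black box, refine a sample grid until adjacent samples pin the function down to within a gap τ. This is only an illustrative sketch under simplifying assumptions (a single integer parameter, a scalar precision target); Algorithm 3 in the paper operates on multi-dimensional parameter spaces and parametric bounds.

```python
def sample_until_precise(f, lo, hi, tau):
    """Refine a sample grid on [lo, hi] for a monotone black box f
    until the largest jump between adjacent samples is at most tau
    (or no interval can be bisected further)."""
    xs = [lo, hi]
    while True:
        # intervals still wide enough to be bisected
        candidates = [(abs(f(b) - f(a)), a, b)
                      for a, b in zip(xs, xs[1:]) if b - a > 1]
        if not candidates:
            return xs
        worst, a, b = max(candidates)
        if worst <= tau:
            return xs
        # bisect the interval with the largest jump
        xs = sorted(xs + [(a + b) // 2])

# Hypothetical monotone cost function, purely for illustration:
print(sample_until_precise(lambda x: 10 * x, 0, 8, 20))  # → [0, 2, 4, 6, 8]
```

Between any two adjacent samples, monotonicity bounds the function by its values at the endpoints, which is what makes the per-interval jump an error bound.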
VIII. APPLICATION TO COMMERCIAL MICROARCHITECTURES
We have applied APTA to a parameterized version of the PTARM, a precision-timed architecture. It has been designed with the specific goal of reconciling performance and predictability. And indeed, as we demonstrate, precise and efficient architecture-parametric WCET analysis is feasible for the PTARM. Naturally, the question arises whether our approach is also applicable to other existing academic and commercial microarchitectures and whether a similar parameterization is reasonable for such microarchitectures.

⁴Where (τ, τ) refers to the monotone parameters μI-SPM-Size and μD-SPM-Size.

Fig. 6. Number of samples required to reach precision guarantee (ε, (τ, τ)). Some benchmarks, namely nsichneu, adpcm, and statemate, ran out of memory for ε = 32 and are thus not included in s^max_32(τ). The median and minimum, s^median_256(τ) and s^min_1024(τ), could be determined for all values of τ.
To answer the second question first, we believe that the parameterization of the PTARM is quite typical for both commercial and research platforms: aspects that are often configurable in single-core processors are the processor frequency, the sizes of local memories (caches or scratchpads), and the interconnect, affecting memory access latencies, corresponding directly to the parameters in the PTARM. To use multi-core architectures in a hard real-time context, most of their shared resources will have to be partitioned (an approach promoted among others in the MERASA and Predator projects [15], [3]) in space and/or time: caches can be partitioned along their ways [16], buses and other interconnect can be partitioned in time by time division multiple access (TDMA) arbitration [17], and access to DRAM memory can be partitioned in time [18] and space [10]. Partitioning in space intuitively induces monotone parameters, whereas time-based partitioning may sometimes be modeled with linear parameters; however, caveats exist, as discussed below.
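As an illustration of why time-based partitioning can yield linear parameters, consider a bus arbitrated by TDMA with one slot per core and period: the worst-case access latency grows linearly in the TDMA period. The sketch below uses invented numbers and a deliberately simplified arbitration model, not one from the paper.

```python
def tdma_worst_case_latency(period, slot_len, access_time):
    """Worst case under single-slot TDMA: the request arrives just
    after the core's own slot has closed, so it waits almost a full
    period for the next slot -- linear in `period`."""
    assert access_time <= slot_len, "access must fit into one slot"
    return (period - slot_len) + access_time

# Doubling the period (e.g. to accommodate more cores) increases
# the worst-case latency linearly:
print(tdma_worst_case_latency(100, 10, 5))  # → 95
print(tdma_worst_case_latency(200, 10, 5))  # → 195
```

One caveat the text alludes to: if an access does not complete within a single slot, the effective latency jumps by whole periods, and the simple linear dependence breaks down.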
For our approach to be applicable to a particular platform, its parameterized timing model needs to be linear or at least monotone in all of its parameters. This is, unfortunately, not the case for canonical models of many complex microarchitectures. In contrast to LRU, FIFO cache replacement is known to suffer from Belady's anomaly, i.e., under FIFO, increasing the cache's size may lead to a decrease in performance. For platforms including such non-monotone features, monotone parameterized timing models can be developed, albeit at the cost of a loss in precision. Alternatively, if the goal is to find a system configuration statically, our analysis may also be performed on timing models that are not guaranteed to be monotone. This yields parametric WCET estimations that are not necessarily safe. Once a system configuration has been determined based on such an estimation, the black box can still be used to verify whether the timing constraints can be met. If not, the parametric WCET estimation can be refined accordingly and the process iterated.
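Belady's anomaly under FIFO is easy to reproduce: on the classic reference string, a four-block cache misses more often than a three-block one. This is a self-contained demonstration, not code from the paper.

```python
from collections import deque

def fifo_misses(refs, capacity):
    """Simulate a fully associative FIFO cache; return the miss count."""
    cache = deque()  # front = oldest entry
    misses = 0
    for block in refs:
        if block not in cache:
            misses += 1
            if len(cache) == capacity:
                cache.popleft()  # evict the oldest entry (FIFO)
            cache.append(block)
    return misses

# Belady's classic reference string:
refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print(fifo_misses(refs, 3))  # → 9
print(fifo_misses(refs, 4))  # → 10: larger cache, more misses
```

A timing model in which latency grows with the miss count is therefore not monotone in the cache-size parameter under FIFO replacement.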
In addition to monotonicity, our approach requires timing to be decomposed into contributions that can be attributed to different components. Such a decomposition is natural for so-called fully timing-compositional architectures [3], [19]. An example of a fully timing-compositional commercial architecture is the ARM7 [3]. For more complex architectures such as the Infineon TriCore or the PowerPC 755, it is possible to create conservative compositional models. The resulting loss in precision may, however, be substantial and depends on the particular architecture and decomposition [19]; it is a topic of ongoing research.
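For a fully timing-compositional architecture, a bound can be obtained by simply summing the worst-case contribution of each component. The sketch below is a minimal illustration with invented numbers; the component list and penalties are assumptions, not figures from the paper.

```python
def compositional_bound(core_cycles, misses, miss_penalty,
                        bus_accesses, max_bus_block):
    """Sound only under full timing compositionality: absent timing
    anomalies and domino effects, each component's worst-case
    contribution can be accounted for independently and summed."""
    return (core_cycles
            + misses * miss_penalty
            + bus_accesses * max_bus_block)

# Hypothetical task: 10,000 core cycles, 120 cache misses at a
# 40-cycle penalty, 30 bus accesses blocked for at most 95 cycles.
print(compositional_bound(10_000, 120, 40, 30, 95))  # → 17650
```

On non-compositional architectures such additive accounting is unsound without conservative adjustments, which is exactly where the precision loss mentioned above arises.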
Conclusions and Future Work

First general framework for architecture-parametric timing analysis.

Future Work:
• Parametric schedulability analysis
• Integrate into a design-space exploration
• Study applicability to commercial microarchitectures
• "White-box" approach
Thank you for your attention!