chapter 5 multiprocessors using timed petri netswlodek/pdf/18-intech.pdf · in this chapter, timed...

Chapter 5

Performance Analysis of Shared-Memory Bus-BasedMultiprocessors Using Timed Petri Nets

Wlodek M. Zuberek

Additional information is available at the end of the chapter

http://dx.doi.org/10.5772/intechopen.75589

Provisional chapter

Performance Analysis of Shared-Memory Bus-BasedMultiprocessors Using Timed Petri Nets

Wlodek M. Zuberek

Additional information is available at the end of the chapter

Abstract

In shared-memory bus-based multiprocessors, the number of processors is often limitedby the (shared) bus; when the utilization of the bus approaches 100%, processors spend anincreasing amount of time waiting to get access to the bus (and shared memory) and thisdegrades their performance. The limitations imposed by the bus depend upon manyparameters, and different parameters affect the performance in different ways. This chap-ter uses timed Petri nets to model shared-memory bus-based multiprocessors at theinstruction execution level and shows how the performance of processors and the systemare affected by different modeling parameters. Discrete-event simulation of the developednet models is used to get performance results.

Keywords: shared-memory multiprocessors, bus-based multiprocessors, timed Petri nets,performance analysis, discrete-event simulation

1. Introduction

Due to continuous progress in manufacturing technologies, the performance of microproces-sors has been steadily improving over several decades, doubling every 18 months (the so-called Moore’s law [1]). The capacity of memory chips has also been doubling every 18 months,but the performance has been improving less than 10% per year [2]. The performance gap [3]between the processor and its memory have been doubling approximately every 6 years, andan increasing part of the processor’s time is being spent on waiting for the completion ofmemory operations. Although multilevel cache memories are used to reduce the averagelatencies of memory accesses, matching the performances of the processor and the memory isan increasingly difficult task. In effect, it is often the case that more than 50% of processor

© 2016 The Author(s). Licensee InTech. This chapter is distributed under the terms of the Creative Commons

Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,

distribution, and eproduction in any medium, provided the original work is properly cited.

DOI: 10.5772/intechopen.75589

© 2018 The Author(s). Licensee IntechOpen. This chapter is distributed under the terms of the CreativeCommons Attribution License (http://creativecommons.org/licenses/by/3.0), which permits unrestricted use,distribution, and reproduction in any medium, provided the original work is properly cited.

cycles are spent waiting for the completion of memory accesses [4]. A model of a pipelinedprocessor at the instruction execution level is used in this chapter to study the mismatch ofprocessor and memory performances.

This model of a processor is then used for performance analysis of shared-memory bus-basedmultiprocessors. The main objective of this analysis is to study the degradation of the pro-cessor’s performance when the utilization of the (shared) bus approaches 100%. This perfor-mance degradation limits the number of processors in bus-based systems.

Modeling and analysis of shared-memory bus-based systems requires a flexible formalism thatcan easily handle concurrent activities as well as synchronization of different events andprocesses that occur in such systems. Petri nets [5, 6] are such formal models.

As formal models, Petri nets are bipartite directed graphs in which the two types of verticesrepresent, in a very general sense, conditions and events. An event can occur only when allconditions associated with it (represented by arcs directed to the event) are satisfied. Anoccurrence of an event usually satisfies some other conditions, indicated by arcs directed fromthe event. Hence, an occurrence of one event causes some other event (or events) to occur, andso on.

In order to study the performance aspects of systems modeled by Petri nets, the durations ofmodeled activities must also be taken into account. This can be done in different ways,resulting in different types of temporal nets [7]. In timed Petri nets [8], occurrence times areassociated with events, and the events occur in real time (as opposed to instantaneous occur-rences in other models). For timed nets with constant or exponentially distributed occurrencetimes, the state graph of a net is a Markov chain (or an embedded Markov chain). If the statespace of a timed net is finite and reasonably small, the stationary probabilities of states can bedetermined by standard methods [9]. Then these stationary probabilities are used for thederivation of many performance characteristics of the model [10]. In other cases, discrete eventsimulation [11] is used to find the performance characteristics of a timed net.

In this chapter, timed Petri nets are used to model shared-memory bus-based multiprocessorsystems. Section 2 recalls basic concepts of Petri nets and timed Petri nets. Section 3 discusses amodel of a pipelined processor and its performance as a function of modeling parameters.Shared-memory bus-based systems are described and analyzed in Section 4. Section 5 concludesthe chapter.

2. Timed Petri nets

In Petri nets, concurrent activities are represented by tokens that can move within a (static)graph-like structure of the net. More formally, a marked place/transition Petri netM is definedas a pairM ¼ N ;m0ð Þ, where the structure N is a bipartite directed graph, N ¼ P;T;Að Þ, withtwo types of vertices, a set of places P and a set of transitions T, and a set of directed arcs Aconnecting places with transitions and transitions with places, A ⊆ P�T ∪ T�P. The initial

Petri Nets in Science and Engineering76

marking function m0 assigns nonnegative numbers of tokens to places of the net,m0 : P ! 0; 1;…f g. Marked nets can be equivalently defined as M ¼ P;T;A;m0ð Þ.A place is shared if it is connected to more than one transition. A shared place p is free-choice ifthe sets of places connected by directed arcs to all transitions sharing p are identical. A sharedplace p is (dynamically) conflict-free if for each marking reachable from the initial marking, atmost one transition sharing p is enabled. If a shared place p is not free-choice and not conflict-free, the transitions sharing p are conflicting.

In timed nets [8], occurrence times are associated with transitions, and transition occurrencesare real-time events, i.e., tokens are removed from input places at the beginning of the occur-rence period and are deposited to the output places at the end of this period. All occurrences ofenabled transitions are initiated in the same instants of time in which the transitions becomeenabled (although some enabled transitions may not initiate their occurrences). If, during theoccurrence period of a transition, the transition becomes enabled again, a new, independentoccurrence can be initiated, which will overlap with the other occurrence(s). There is no limiton the number of simultaneous occurrences of the same transition (sometimes this is calledinfinite occurrence semantics). Similarly, if a transition is enabled “several times” (i.e., itremains enabled after initiating an occurrence), it may start several independent occurrencesin the same time instant.

More formally, a timed Petri net is a triple, T ¼ M; c; fð Þ, where M is a marked net, c is achoice function which assigns probabilities to transitions in free-choice classes, or relativefrequencies of occurrences to conflicting transitions, c ! 0; 1½ �, and f is a timing function,which assigns an (average) occurrence time to each transition of the net, f : T ! Rþ, whereRþ is the set of nonnegative real numbers.

The occurrence times of transitions can be either deterministic or stochastic (i.e., described bysome probability distribution function). In the first case, the corresponding timed nets arereferred to as D-timed nets [12]; in the second, for the (negative) exponential distribution ofoccurrence times, the nets are called M-timed nets (Markovian nets) [13]. In both cases, theconcepts of state and state transitions have been formally defined and used in the derivation ofdifferent performance characteristics of the model. In simulation applications, other distribu-tions can also be used, for example, the uniform distribution (U-timed nets) is sometimes aconvenient option. In timed Petri nets, different distributions can be associated with differenttransitions in the same model providing flexibility that is used in simulation examples thatfollow.

In timed nets, the occurrence times of some transitions may be equal to zero, which means thatthe occurrences are instantaneous; all such transitions are called immediate (while the othersare called timed). Since immediate transitions have no tangible effects on the (timed) behaviorof the model, it is convenient to “split” the set of transitions into two parts, the set of immedi-ate and the set of timed transitions, and to first perform all occurrences of the (enabled)immediate transitions, and then (still in the same time instant), when no more immediatetransitions are enabled, to start the occurrences of (enabled) timed transitions. It should benoted that such a convention effectively introduces the priority of immediate transitions over

Performance Analysis of Shared-Memory Bus-Based Multiprocessors Using Timed Petri Netshttp://dx.doi.org/10.5772/intechopen.75589

77

the timed ones, so the conflicts of immediate and timed transitions are not allowed in timednets. Detailed characterization of the behavior of timed nets with immediate and timed transi-tions is given in [8].

3. Pipelined processors

A timed Petri net model of a pipelined processor [14] at the level of instruction execution isshown in Figure 1 (as usual, timed transitions are represented by solid bars, and immediatetransitions by thin bars). It is assumed that the first level cache does not delay the processor,while cache misses (at the first level cache) introduce a delay of tc processor cycles. Forsimplicity, only two levels of cache memory are represented in the model; it appears that sucha simplification does not affect the results in a significant way [15].

Place Pnxt is marked when the processor is ready to execute the next instruction. Pnxt is a free-choice place with three possible outcomes that model issuing an instruction without anyfurther delay (Ts0 with the choice probability ps0), a single-cycle pipeline stall (modeled byTd1 with the choice probability ps1 associated with Ts1), and a two-cycle pipeline stall(modeled by Td2 and then Td1 with the choice probability ps2 assigned to Ts2). Other pipelinestalls could be represented in a similar way, if needed.

Marked place Cont indicates that an instruction is ready to be issued to the execution pipeline.It is assumed that once the instruction enters the pipeline, it will progress through all the stagesand, eventually, leave the pipeline. Since the details of pipeline implementation are not impor-tant for performance analysis of the processor, they are not represented here. Only the firststage of the execution pipeline is shown as timed transition Trun.

Done is another free-choice place which determines if the executing instruction results in acache miss or not. Transition Tnxt occurs (with the corresponding probability) if cache missdoes not occur and the processor can continue fetching and issuing instructions. Cache miss isrepresented by Tsel. The choice probability associated with Tsel determines the instruction

Figure 1. Instruction-level Petri net model of a pipelined processor.


runlength, nℓ, i.e., the average number of instructions between two consecutive cache misses; ifthis choice probability is equal to 0.1, the runlength is equal to 10; if it is equal to 0.2, therunlength is 5; and so on.

Psel is another free-choice place; it models the hits and misses of the second-level cache. Theprobability associated with transition Tloc represents the hit ratio of the second-level cache (theoccurrence time of Tloc is the average access time to the second-level cache, tc) while the missratio is associated with transition Tmem which represents accesses to the main memory (withthe occurrence time tm).

Typical values of modeling parameters used in this chapter are shown in Table 1.

All temporal data in Table 1 (i.e., cache and memory access times) are in processor cycles.

Processor utilization as a function of h1, the hit rate of the first-level cache, is shown in Figure 2for two values of the second-level cache access time, tc ¼ 5, and tc ¼ 10. It should not be

Symbol Parameter Value

h1 First-level cache hit rate 0.9

h2 Second-level cache hit rate 0.8

tp First-level cache access time 1

tc Second-level cache access time 5

tm Main memory access time 25

ps1 Prob. of one-cycle pipeline stall 0.1

ps2 Prob. of two-cycle pipeline stall 0.05

Table 1. Modeling parameters and their typical values.

Figure 2. Processor utilization as a function of first-level cache hit rate for h2 ¼ 0:8, ps ¼ 0:2.


79

surprising that processor utilization is quite sensitive to the values of h1, but is much lesssensitive to the values of tc.

Processor utilization as a function of h2, the hit rate of the second-level cache, is shown inFigure 3 for two values of the main memory access time, tm ¼ 25 and tm ¼ 50. Processorutilization is rather insensitive to values of h2, and does not change much with tm.

Processor utilization as a function of the probability of pipeline stalls ps ¼ ps1 þ 2ps2 is shownin Figure 4 for three combinations of values of tc and tm.

Again, processor utilization is rather insensitive to the probability of pipeline stalls as well asthe values of tc and tm.

Figure 3. Processor utilization as a function of second-level cache hit rate for h1 ¼ 0:9, ps ¼ 0:2.

Figure 4. Processor utilization as a function of probability of pipeline stalls for h1 ¼ 0:9, h2 ¼ 0:8.


For pipelined processors shown in Figure 1, processor utilization can be estimated using thefollowing formula:

up ¼ 11þ ps1 þ 2∗ps2 þ 1� h1ð Þ∗ tc þ 1� h2ð Þ∗tmð Þ : (1)

For the values of modeling parameters shown in Table 1, processor utilization is:

up ¼ 11þ 0:1þ 0:1þ 0:1∗ 5þ 0:2∗25ð Þ ≈ 0:45: (2)

The estimated values agree quite well with the values shown in Figures 2–4.

4. Shared-memory bus–based systems

An outline of a shared-memory bus-based multiprocessor is shown in Figure 5. The system iscomposed of n identical processors, which access the shared memory using a system bus. Toreduce the average access time to the shared memory, the processors use (multilevel) cachememories. It is assumed that memory consistency is provided by a cache coherence mecha-nism [16], which usually increases the miss ratio of accessing caches (and is otherwise notrepresented in the model).

A timed Petri net model of a shared-memory bus-based multiprocessor is shown in Figure 6. Itcontains models of n processors (only two are shown in Figure 6), which are copies of themodel shown in Figure 1 except for the main memory (transition Tmem) which becomesshared memory in Figure 6. The remaining part of Figure 6 is modeling the bus that coordi-nates accesses of processors to the shared memory.

Figure 5. A shared-memory bus-based multiprocessor.


81

When a processor i, i ¼ 1,…, n, requests an access to shared memory, place Pri becomesmarked. If the bus is available (i.e., if place Bus is marked), the occurrence of transition Taiindicates that processor i begins its access to shared memory. Transitions Ta1, ..., Tan consti-tute a conflict class with a fair resolution of conflicts (i.e., all conflicting processors have thesame probability of being selected for accessing memory). In real systems, accessing the sharedbus is often based on priorities assigned to processors; such priorities could easily berepresented using inhibitor arcs in Petri nets.

Place Pmem collects memory access requests from all processors (occurrences of transitionsTai). The end of memory access (i.e., the end of the occurrence of Tmem) is indicated by anoccurrence of transition Tei of the processor which initiated memory access. The occurrence ofTei also returns a token to Bus, allowing another access to shared memory to be executed.

Figure 7 shows the utilization of processors and the bus as functions of the number of pro-cessors in a shared-memory system for the values of modeling parameters shown in Table 1.

Figure 6. A timed Petri net model of bus-based shared-memory multiprocessor.


In Figure 7, the bus utilization approaches 100% for about five processors. Moreover, thedegradation of processors’ performance due to increasing waiting times for accessing the bus(and shared memory) is well illustrated in Figure 7.

The average waiting time (in processor cycles) of accessing shared memory (i.e., the times fromrequesting memory access in place Pri to granting this access by an occurrence of Tai) is shownin Figure 8 as a function of the number of processors in the system.

Figure 8 shows that the waiting times increase almost linearly with the number of processorswhen this number is greater than 5, i.e., when the bus (and shared memory) is utilized inalmost 100%.

Figure 7. Processor and bus utilization as functions of the number of processors for h1 ¼ 0:9, h2 ¼ 0:8, ps ¼ 0:2.

Figure 8. The average waiting time for accessing shared memory as a function of the number of processors for h1 ¼ 0:9,h2 ¼ 0:8, ps ¼ 0:2.


83

If the value of the second-level cache hit rate, h2, increases (and the other parameters do notchange), the number of accesses to main memory is reduced, so that the performances ofprocessors and the whole system improve. Figure 9 shows the utilization of processors andthe bus as functions of the number of processors in the system for h2 ¼ 0:9. It also shows thatthe reduced (in comparison with Figure 7) utilization of the bus allows the increase of thenumber of processors without significant degradation of their performance.

By a similar argument, reduced hit rate at the first-level cache, h1, increases the number ofaccesses to the second-level cache as well as to the main memory, and this results in reducedperformance of the system. Figure 10 shows the utilization of processors and the bus asfunctions of the number of processors in the system for h1 ¼ 0:8. It provides a good illustrationof the degradation of performance when compared with Figure 7.




The number of processors for which the bus is used to almost 100% can be estimated by thefollowing formula:

np ¼ 1þ ps1 þ 2∗ps2 þ 1� h1ð Þ∗ tc þ 1� h2ð Þ∗tmð Þ1� h1ð Þ∗ 1� h2ð Þ∗tm : (3)

For the case shown in Figure 7, this number is:

1þ 0:1þ 0:1þ 0:1∗ 5þ 0:2∗25ð Þ0:1∗0:2∗25

¼ 1:2þ 1:00:5

¼ 4:4: (4)

For the case shown in Figure 9, this value is 8.2.

There are several ways in which the number of processors can be increased in bus-basedsystems without sacrificing the processors’ performance. The simplest approach is to introducethe second bus which allows two concurrent accesses to shared memory, provided the mem-ory is dual port (it allows two concurrent accesses). Figure 11 outlines a dual bus shared-memory system.

Petri net model of a dual bus system is the same as in Figure 6; the only difference is the initialmarking of place Bus, which now requires two tokens to represent two concurrent accesses toshared memory. It should be observed observed that, for a small number of processors, theutilization of each bus in Figure 12 is one half of that in Figure 7, and also the number ofprocessors that can be used in such a dual bus system without degradation of their perfor-mance is twice as large as in a single bus system (Figure 7).

The results shown in Figure 12 are very similar to those shown in Figure 9. The second bus in ashared-memory system allows to perform two concurrent accesses to the shared memory.From a single processor’s performance point of view, this effect is similar to reducing two

Figure 11. A dual bus shared-memory multiprocessor.


85

times the number of accesses to the shared memory, and this is the effect of reducing two timesthe miss rate for the second-level cache (which is shown in Figure 9).

If dual port memory cannot be used, the shared memory can be split into several independentmodules which can be accessed concurrently by the processors provided that the bus is alsosplit into sections associated with each module, with processors accessing all such sections, asshown in Figure 13 for four independent memory modules. The main difference between amultibus system (Figure 11) and a system with split bus is in accessing the shared memory; ina multiple bus system, the whole shared memory is accessed by each bus, while in a split bussystem (Figure 13), each section of the bus accesses only one memory module. In the system

Figure 12. Processor and bus utilization as functions of the number of processors—dual bus system with h1 ¼ 0:9,h2 ¼ 0:8, ps ¼ 0:2.

Figure 13. A shared-memory multiprocessor with multiple memory modules.


Figu

re14.Pe

trin

etmod

elof

ashared

-mem

orymultip

rocessor

with

multip

lemem

orymod

ules.


87

shown in Figure 13, up to four (the number of memory modules) memory accesses can beperformed concurrently, but if two (or more) processors request access to the same memorymodule, the requests are served one after another.

Petri net model of a system outlined in Figure 13 is shown in Figure 14 where only twoprocessors and two memory modules are detailed.

In Figure 14, there is a free-choice place Pri for each processor i, i ¼ 1,…, n. This free-choiceplace selects the requested memory module by transitions Tij, j ¼ 1,…, 4, and forwards thememory access request to the selected memory module (place Pij). If the selected module isavailable, i.e., if place Busj is marked, the access to shared memory is initiated by the occur-rence of Taij. When this memory access is completed, the occurrence of Teij releases thememory modules (by returning a token to Busj) and resumes instruction execution in theprocessor that requested the memory access.

If memory module is not available when it is requested, the memory access is delayed (in Pij)until the requested module becomes available.

It is possible that more than one processor becomes waiting for the same memory module. Theselection of the processor which will get access first is random with the same probabilityassigned to all waiting processors. In real systems, there is usually some priority scheme thatdetermines the order in which the waiting processors access the bus. Such a priority schemecould easily be modeled if it is needed (for example, for studying the starvation effect whichcan be created when the system is overloaded).

In Figure 14, the selection of memory modules is random, with the same probabilities for allmodules. If this policy is not realistic, a different memory accessing policy can be implemented,

Figure 15. Processor and bus utilization as functions of the number of processors—system with four memory modulesand h1 ¼ 0:9, h2 ¼ 0:8, ps ¼ 0:2.


for example, the probabilities of accessing consecutive memory modules by each processorcould be used to model sequential processing of large arrays, and so on.

Figure 15 shows the utilization of processors and busses as functions of the number of pro-cessors in a system outlined in Figure 13.

In Figure 15, even for 20 processors, the average utilization of the bus is close to 80%, so thesystem can accommodate more processors.

5. Concluding remarks

The chapter uses timed Petri nets to model shared-memory bus-based architectures at the level ofinstruction execution to study the effects of modeling parameters on the performance of thesystem. Themodels are rather simple with straightforward representation of modeling parameters.

Performance results presented in this chapter have been obtained by the simulation of devel-oped Petri net models. However, the model shown in Figure 7 has only 10 states, so itsanalytical solution (for different values of modeling parameters) can be easily obtained andcompared with simulation results to verify their accuracy. Table 2 shows such a comparison ofprocessor utilization for several combinations of parameters h1 and h2. In all cases, thesimulation-based results are very close to the analytical ones.

The models of multiprocessor systems are usually composed of many copies of the samesubmodel of a processor and possibly other elements of the system. Colored Petri nets [17]can significantly simplify such models by eliminating copies of similar subsystems. Analysis ofcolored Petri nets is, however, much more complex than that of ordinary Petri nets.

Finally, it should be noted that the performance of real-life multiprocessor systems very rarelycan be described by a set of parameters that remain stable for any significant period of time.The basic parameters like the hit rates depend upon the executed programs as well as theirdata, and can change very quickly in a significant way. Consequently, the characteristicspresented in this chapter can only be used as some insight into the complex behavior ofmultiprocessor systems.

h1 h2 Simulated results Analytical results

0.8 0.8 0.3341 0.3333

0.8 0.9 0.3846 0.3846

0.9 0.8 0.4763 0.4762

0.9 0.9 0.5255 0.5263

Table 2. A comparison of simulation and analytical results.


89

Author details

Wlodek M. Zuberek

Address all correspondence to: [email protected]

Department of Computer Science, Memorial University, Canada

References

[1] Hamilton S. Taking Moore’s law into the next century. IEEE Computer. 1999;32(1):43-48

[2] Patterson DA, Hennessy JL. Computer Architecture—A Quantitative Approach. 4th ed.San Mateo, CA: Morgan Kaufmann; 2006

[3] Wilkes MV. The memory gap and the future of high-performance memories. ACM Archi-tecture News. 2001;29(1):2-7

[4] Mutlu O, Stark J, Wilkerson C, Patt YN. Runahead execution: An effective alternative tolarge instruction windows. IEEE Micro. 2003;23(6):20-25

[5] Murata T. Petri nets: Properties, analysis and applications. Proceedings of IEEE. 1989;77(4):541-580

[6] Reisig W. Petri nets—An introduction (EATCS Monographs on Theoretical ComputerScience. Vol. 4). New York, NY: Springer-Verlag; 1985

[7] Popova-Zeugmann L. Time and Petri Nets. Berlin, Heidelberg: Springer-Verlag; 2013

[8] Zuberek WM. Timed petri nets—Definitions, properties and applications. Microelectron-ics and Reliability (Special Issue on Petri Nets and Related Graph Models). 1991;31(4):627-644

[9] Allen AA. Probability, Statistics and Queueing Theory with Computer Science Applica-tions. 2nd ed. San Diego, CA: Academic Press; 1991

[10] Jain R. The Art of Computer Systems Performance Analysis. New York, NY: Wiley Inter-science; 1991

[11] Pooch UW, Wall JA. Discrete Event Simulation. Boca Raton, FL: CRC Press; 1993

[12] Zuberek WM. D-timed petri nets and modelling of timeouts and protocols. Transactionsof the Society for Computer Simulation. 1987;4(4):331-357

[13] Zuberek WM. M-timed petri nets, priorities, preemptions, and performance evaluation ofsystems. In: Advances in Petri Nets 1985 (LNCS 222). Berlin, Heidelberg: Springer-Verlag;1986. pp. 478-498

[14] Ramamoorthy CV, Li HF. Pipeline architecture. ACM Computing Surveys. 1977;9(1):61-102


[15] Zuberek WM. Modeling and analysis of simultaneous multithreading. In: Proceedings ofthe 14th International Conference on Analytical and Stochastic Modeling Techniques andApplications (ASMTA-07), a part of the 21st European Conference on Modeling andSimulation (ECMS’07); Prague, Czech Republic; 2007. pp. 115-120

[16] Suh T, Lee H-HS, Blough DM. Integrating cache coherence protocols for heterogeneousmultiprocessor system. Part 2. IEEE Micro. 2004;24(5):55-69

[17] Jensen K, Kristensen LM. Coloured Petri Nets—Modeling and Validation of ConcurrentSystems. Berlin, Heidelberg: Springer-Verlag; 2009


91

chapter 5 multiprocessors using timed petri netswlodek/pdf/18-intech.pdf · in this chapter, timed...

Documents