processor architecture.zmp

Upload: amrendra-kumar-mishra

Post on 03-Apr-2018


  • 7/29/2019 Processor Architecture.zmp

    1/8

    1. CISC vs. RISC Architecture:

RISC architectures lend themselves more towards pipelining than CISC architectures for many reasons. Because RISC architectures have a smaller set of instructions than CISC architectures, the time required to fetch and decode in a pipelined CISC architecture is unpredictable. The variation in instruction length with CISC hinders the fetch and decode sections of a pipeline: a single-byte instruction following an 8-byte instruction must be handled so as not to slow down the whole pipeline. In RISC architectures the fetch and decode cycle is more predictable, and most instructions have similar length.

CISC architectures, by their very name, also have more complex instructions with complex addressing modes. This makes the whole cycle of processing an instruction more complex. Pipelining requires that the whole fetch-to-execute cycle can be split into stages where each stage does not interfere with the next and each instruction can be doing something at each stage. RISC architectures, because of their simplicity and small set of instructions, are simple to split into stages. CISC instructions, being more complex, are harder to split into stages: stages that are important for one instruction may not be required for another.

The rich set of addressing modes available in CISC architectures can cause data hazards when pipelining is introduced. Data hazards that are unlikely to occur in RISC architectures, thanks to the smaller instruction set and the use of load/store instructions to access memory, become a problem in CISC architectures. CISC instructions that write results back to memory need to be handled carefully. Forwarding solutions that allow a result written to a register to be available as input to the next instruction become more complex when memory locations, addressable in various modes, can also be accessed. Write-after-read hazards must also be handled: a CISC instruction may auto-increment a register early in its stages while that register is still needed by the previous instruction at a later stage.

CISC's added complexity makes for longer pipelines to accommodate more decoding and checking. Using the Larson and Davidson equation from their paper Cost-effective Design of Special-Purpose Processors: A Fast Fourier Transform Case Study for calculating the optimum number of pipeline stages for a processor, it can be shown that RISC architectures suit smaller pipelines. Keeping the values for instruction stream length and logic gates per stage fixed, it can be shown that the optimum pipeline length increases with the size of the fetch-execute cycle.

This is because with a large number of logic gates in the fetch-execute cycle, the additional gates required for staging have less impact. As RISC architectures have simpler instruction sets than CISC, the number of gates involved in the fetch-execute cycle will be far lower than in a CISC architecture. Therefore RISC architectures tend to have smaller optimum pipeline lengths than more general processors.

RISC architectures do suit pipelining more than CISC architectures, and do lend themselves to smaller pipelines. This does not mean, however, that CISC architectures cannot gain from pipelining, or that a large number of pipeline stages is bad (although the cost of flushing a long pipeline becomes a concern).

Other features typically found in RISC architectures are:

Uniform instruction format (fixed-size instructions), using a single word with the opcode in the same bit positions in every instruction, demanding less decoding.

Instructions are executed in a single clock cycle.


The execution unit is much faster due to simple and uniform instructions.

A large number of general-purpose registers, to avoid storing variables in stack memory.

Only load and store instructions refer to memory.

Fewer, simpler instructions rather than complex ones.

A simple and small set of addressing modes, to simplify references to operands.

Few data types in hardware; some CISCs have byte-string instructions or support complex numbers, which is so far unlikely to be found on a RISC.
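The fixed-format property listed above is easy to see in code. The sketch below decodes a hypothetical 32-bit instruction word with a 6-bit opcode and three 5-bit register fields; the field layout is invented for illustration and is not a real ISA. Because every field sits at the same bit positions in every instruction, decoding is just a handful of shifts and masks.

```python
# Decode a fixed 32-bit instruction word with simple bit masks.
# The layout (6-bit opcode, three 5-bit register fields) is
# hypothetical, chosen only to show why a uniform format needs
# so little decoding logic.

def decode(word: int) -> dict:
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31-26
        "rs":     (word >> 21) & 0x1F,  # bits 25-21
        "rt":     (word >> 16) & 0x1F,  # bits 20-16
        "rd":     (word >> 11) & 0x1F,  # bits 15-11
    }

# Build a word with opcode 8 and registers 1, 2, 3, then decode it.
word = (0x08 << 26) | (1 << 21) | (2 << 16) | (3 << 11)
fields = decode(word)
```

A variable-length CISC format, by contrast, cannot know where one field ends and the next begins until earlier bytes have been examined.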

2. Von Neumann and Harvard Architecture:

The von Neumann architecture is a design model for a stored-program digital computer that uses a processing unit and a single separate storage structure to hold both instructions and data. A single bus used to transfer both instructions and data leads to the von Neumann bottleneck, which limits throughput (data transfer rate) between the CPU and memory. This seriously limits the effective processing speed when the CPU is required to perform minimal processing on large amounts of data: the CPU is continuously forced to wait for needed data to be transferred to or from memory. Since CPU speed and memory size have increased much faster than the throughput between them, the bottleneck has become more of a problem.

    Figure 1: Von Neumann architecture

The most obvious characteristic of the Harvard architecture is that it has physically separate signals and storage for code and data memory. It is possible to access program memory and data memory simultaneously, thereby creating potentially faster throughput and less of a bottleneck. Typically, code (or program) memory is read-only and data memory is read-write; therefore, it is impossible for program contents to be modified by the program itself. In a computer using the Harvard architecture, the CPU can both read an instruction and perform a data memory access at the same time, even without a cache. A Harvard architecture computer can thus be faster for a given circuit complexity, because instruction fetches and data accesses do not contend for a single memory pathway.
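A toy model can make the contrast concrete. In the sketch below (illustrative names, not a real machine) instruction memory and data memory are separate structures, so a fetch and a data access complete in the same cycle without contending for one pathway.

```python
# Toy Harvard machine: code memory and data memory are separate
# structures, so an instruction fetch and a data access can both
# happen in one "cycle". All names are illustrative.

class HarvardMachine:
    def __init__(self, program, data):
        self.imem = list(program)   # read-only code memory
        self.dmem = list(data)      # read-write data memory
        self.pc = 0

    def cycle(self, data_addr):
        # Both accesses occur in the same cycle: no shared bus.
        instr = self.imem[self.pc]
        value = self.dmem[data_addr]
        self.pc += 1
        return instr, value

m = HarvardMachine(["LOAD", "ADD"], [10, 20])
```

In a von Neumann model the two accesses would have to be serialized over the single memory pathway.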

    Figure 2: Harvard architecture

    2.1 SIMD Processing:

Some DSPs have multiple data memories in distinct address spaces to facilitate SIMD and VLIW processing. SIMD exploits data-level parallelism by operating on a small to moderate


number of data items in parallel. The true SIMD architecture contains a single control unit (CU) with multiple processing elements (PEs) acting as arithmetic units (AUs). In this arrangement the arithmetic units are slaves to the control unit: the AUs cannot fetch or interpret any instructions; they are merely units with capabilities for addition, subtraction, multiplication, and division. Each AU has access only to its own memory. Hence, if an AU needs information held by a different AU, it must put in a request to the CU, and the CU must manage the transfer. The advantage of this type of architecture is the ease of adding more memory and AUs to the computer.

    Figure 3: SIMD processing

    The disadvantage can be found in the time wasted by the CU managing all memory

    exchanges.

Not all algorithms can be vectorized. For example, a flow-control-heavy task like code parsing would not benefit from SIMD.

Currently, implementing an algorithm with SIMD instructions usually requires human labor; most compilers do not generate SIMD instructions from a typical C program, for instance. Vectorization in compilers is an active area of computer science research. (Compare vector processing.)

SIMD computers require less hardware than MIMD computers (a single control unit). However, SIMD processors are specially designed, and tend to be expensive and to have long design cycles.

Not all applications are naturally suited to SIMD processing. Conceptually, MIMD computers cover the SIMD case.
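The lock-step behaviour described above can be sketched as follows, with each list element standing in for one PE's local memory. The operation names are invented for illustration.

```python
# Conceptual sketch of true SIMD: one control unit broadcasts a
# single operation, and every processing element applies it to its
# own local data in lock step.

def simd_step(op, pe_memories):
    ops = {"add1": lambda x: x + 1, "double": lambda x: x * 2}
    f = ops[op]                       # the CU selects one operation
    return [f(x) for x in pe_memories]  # every PE executes the same op

pe_memories = [3, 5, 7, 9]            # one value per processing element
result = simd_step("double", pe_memories)
```

No PE chooses its own instruction; per-element control flow is exactly what this model cannot express, which is why flow-control-heavy tasks do not vectorize.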

2.2 MIMD Processing:

A MIMD computer has many interconnected processing elements, each of which has its own control unit (see Fig. 4). Each processing unit works on its own data with its own instructions. Tasks executed by different processing units can start or finish at different times. They may send results to a central location and may share memory space. They are not lock-stepped as in SIMD computers, but run asynchronously.


    Figure 4: MIMD processing

    3. Very long instruction word (VLIW) Processors:

VLIW refers to a CPU architecture designed to take advantage of instruction-level parallelism (ILP). A processor that executes every instruction one after the other (i.e. a non-pipelined scalar architecture) may use processor resources inefficiently, potentially leading to poor performance. The performance can be improved by:

executing different sub-steps of sequential instructions simultaneously (this is pipelining), or

executing multiple instructions entirely simultaneously, as in superscalar architectures.

Further improvement can be achieved by executing instructions in an order different from the order they appear in the program; this is called out-of-order execution.

As often implemented, these techniques all come at a cost: increased hardware complexity. Before executing any operations in parallel, the processor must verify that the instructions do not have interdependencies, for example that a first instruction's result is used as a second instruction's input. Clearly such a pair cannot execute at the same time, and the second instruction cannot be executed before the first. Modern out-of-order processors have increased the hardware resources that schedule instructions and determine interdependencies. In a VLIW processor, by contrast, determining the order of execution of operations (including which operations can execute simultaneously) is handled by the compiler, so the processor does not need the scheduling hardware that the three techniques described above require. As a result, VLIW CPUs offer significant computational power with less hardware complexity (but greater compiler complexity) than is associated with most superscalar CPUs.
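The compile-time scheduling a VLIW machine depends on can be sketched as a simple bundler: pack instructions into a long instruction word until one reads a register written earlier in the same bundle, then start a new bundle. The three-tuple encoding (dest, src1, src2) and the bundle width are hypothetical.

```python
# Greedy compile-time bundling for a hypothetical 3-slot VLIW.
# A new bundle starts when the current one is full or when an
# instruction reads a register written earlier in the bundle.

def bundle(instrs, width=3):
    bundles, current, written = [], [], set()
    for dest, src1, src2 in instrs:
        if len(current) == width or src1 in written or src2 in written:
            bundles.append(current)       # emit the long instruction word
            current, written = [], set()
        current.append((dest, src1, src2))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles

prog = [("r1", "r8", "r9"),   # r1 <- r8 op r9
        ("r2", "r8", "r9"),   # independent of the first
        ("r3", "r1", "r2")]   # reads r1, r2: must wait
b = bundle(prog)
```

All of this dependence analysis happens before the program runs, which is precisely the scheduling hardware the VLIW processor omits.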

    3.1 Superscalar:

A superscalar CPU architecture implements a form of parallelism called instruction-level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate. A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an


    execution resource within a single CPU such as an arithmetic logic unit, a bit shifter, or a

    multiplier.

    Figure 5: Superscalar architecture employing ILP

While a superscalar CPU is typically also pipelined, pipelining and superscalar architecture are considered different performance-enhancement techniques. The superscalar technique is traditionally associated with several identifying characteristics (within a given CPU core):

Instructions are issued from a sequential instruction stream.

CPU hardware dynamically checks for data dependencies between instructions at run time (versus software checking at compile time).

The CPU accepts multiple instructions per clock cycle.
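A minimal sketch of that run-time check: two instructions from the sequential stream may issue together only if the second does not read the first's destination register. The (dest, srcs) encoding is illustrative, not a real ISA.

```python
# Dual-issue check in a sketched superscalar front end: pair two
# sequential instructions only when the second has no read-after-write
# dependence on the first.

def can_dual_issue(first, second):
    dest, _ = first
    _, srcs = second
    return dest not in srcs          # the check hardware performs at run time

i1 = ("r1", ["r2", "r3"])            # r1 <- r2 op r3
i2 = ("r4", ["r1", "r5"])            # reads r1: must wait a cycle
i3 = ("r6", ["r7", "r8"])            # independent: can pair with i1
```

In a VLIW design this same decision is made once, by the compiler; here the hardware repeats it every cycle.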

    3.2 Pipeline

An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time). The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. This allows the computer's control circuitry to issue instructions at the processing rate of the slowest step, which is much faster than the time needed to perform all steps at once. Fig. 6 shows a pipeline system of five stages (Fetch, Decode, Execute, Memory access and Write Back) for processing instructions in a microprocessor. After the first four clock cycles, once the five-stage pipeline is full, one processed instruction completes at every clock cycle. The speed of a pipeline can be measured in terms of CPI (clock cycles per instruction), which should ideally be 1; but due to latency and pipeline hazards (explained later) the value of CPI is greater than one in all practical cases.


    Figure 6: Pipeline operation for five stages

Assume all stages in the pipeline have the same delay t_c. Then:

Pipeline clock cycle time = t_c

Let there be k stages and n instructions to be executed.

Processing time (pipelined):      T_k = [k + (n - 1)] t_c
Processing time (non-pipelined):  T_1 = n k t_c

Speed-up factor:  S_k = T_1 / T_k = nk / [k + (n - 1)]

Efficiency:       E_k = S_k / k = n / [k + (n - 1)]

Pipeline throughput = number of instructions executed per unit time
                    = n / {[k + (n - 1)] t_c}

If the delays of the pipeline stages are unequal, let the delay of the largest-delay stage be t_cw; this becomes the pipeline clock cycle time. The processing time to execute n instructions is then

T_k = [k + (n - 1)] t_cw              (pipelined)
T_1 = n (d_1 + d_2 + . . . + d_k)     (non-pipelined)

Ideally, if there are N stages in a pipeline architecture, then

Instruction execution time (pipelined) = Instruction execution time (non-pipelined) / N
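The equal-delay formulas above translate directly into code; the parameter values in the example are arbitrary.

```python
# Pipeline timing metrics for k equal-delay stages, cycle time t_c,
# and n instructions, following the formulas in the text.

def pipeline_metrics(k, n, t_c):
    Tk = (k + (n - 1)) * t_c         # pipelined execution time
    T1 = n * k * t_c                 # non-pipelined execution time
    speedup = T1 / Tk                # S_k = nk / (k + n - 1)
    efficiency = speedup / k         # E_k = n / (k + n - 1)
    throughput = n / Tk              # instructions per unit time
    return Tk, T1, speedup, efficiency, throughput

# Example: 5 stages, 100 instructions, 1 ns per stage.
Tk, T1, S, E, thr = pipeline_metrics(k=5, n=100, t_c=1e-9)
```

Note that as n grows, S_k approaches but never reaches k, which is why the ideal speed-up equals the number of stages only in the limit.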

Most modern CPUs are driven by a clock. The CPU consists internally of logic and memory (flip-flops). When the clock signal arrives, the flip-flops take their new values, and the logic then requires a period of time to decode the new values. Then the next clock pulse arrives and the flip-flops again take their new values, and so on. By breaking the logic into smaller pieces and inserting flip-flops between the pieces of logic, the delay before the logic gives valid outputs is reduced. In this way the clock period can be reduced. For example, the classic RISC pipeline is broken into five stages with a set of flip-flops between each stage:

1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access
5. Register write back
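With no hazards, instruction i occupies stage s in cycle i + s; that single rule is the entire schedule of the ideal five-stage pipeline, as the sketch below tabulates.

```python
# Cycle-by-cycle occupancy of the classic five-stage pipeline.
# With no hazards, instruction i is in stage s during cycle i + s
# (0-indexed), so after the pipeline fills one instruction
# completes per cycle.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n_instructions):
    table = {}
    for i in range(n_instructions):
        for s, stage in enumerate(STAGES):
            table[(i, stage)] = i + s   # cycle in which instr i is in stage
    return table

t = schedule(3)
```

The last of n instructions writes back in cycle (n - 1) + 4, matching the T_k = [k + (n - 1)] t_c formula above for k = 5.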


    1. Instruction fetch

The instruction fetch on these machines had a latency of one cycle. During the Instruction Fetch stage, a 32-bit instruction was fetched from the cache. The PC predictor sends the Program Counter (PC) to the instruction cache to read the current instruction. At the same time, the PC predictor predicts the address of the next instruction by incrementing the PC by 4 (all instructions were 4 bytes long). This prediction was always wrong in the case of a taken branch, jump, or exception (see delayed branches, below). Later machines would use more complicated and accurate algorithms (branch prediction and branch target prediction) to guess the next instruction address.

    2. Decode

Unlike earlier microcoded machines, the first RISC machines had no microcode. Once fetched from the instruction cache, the instruction bits were shifted down the pipeline, so that simple combinational logic in each pipeline stage could produce the control signals for the datapath directly from the instruction bits. As a result, very little decoding is done in the stage traditionally called the decode stage.

If the instruction decoded was a branch or jump, the target address of the branch or jump was computed in parallel with reading the register file. The branch condition is computed after the register file is read, and if the branch is taken or the instruction is a jump, the PC predictor in the first stage is assigned the branch target rather than the incremented PC that had been computed.

    3. Execute

    Instructions on these simple RISC machines can be divided into three latency classes

    according to the type of the operation:

Register-register operations (single-cycle latency): add, subtract, compare, and logical operations. During the execute stage, the two arguments were fed to a simple ALU, which generated the result by the end of the execute stage.

Memory references (two-cycle latency): all loads from memory. During the execute stage, the ALU added the two arguments (a register and a constant offset) to produce a virtual address by the end of the cycle.

Multi-cycle instructions (many-cycle latency): integer multiply and divide and all floating-point operations. During the execute stage, the operands to these operations were fed to the multi-cycle multiply/divide unit. The rest of the pipeline was free to continue execution while the multiply/divide unit did its work. To avoid complicating


the writeback stage and issue logic, multi-cycle instructions wrote their results to a separate set of registers.

    4. Memory Access

During this stage, the results of data-processing instructions produced by the execute stage are forwarded to the next stage. If the instruction is a load, the data is read from the data cache or data memory; if the instruction is a store, the register data is written to data memory at the address computed by the execute stage.

    5. Writeback

During this stage, both single-cycle and two-cycle instructions write their results into the register file.

    Pipeline Hazards:

> Data hazards occur when an instruction, scheduled blindly, would attempt to use data before the data is available in the register file (e.g. an instruction depends on the result of a previous instruction).

> Control hazards occur whenever there is a change in the normal execution flow of the program, caused by events such as branches, interrupts, exceptions, and returns from interrupts. The hazard arises because branches, interrupts, etc. are not caught until the instruction is executed; by that time, the following instructions have already entered the pipeline and need to be flushed out.

> Structural hazards occur when two instructions might attempt to use the same resource at the same time. Classic RISC pipelines avoided these hazards by replicating hardware. In particular, branch instructions could have used the ALU to compute the target address of the branch; to avoid contending with arithmetic instructions for the ALU, such pipelines computed the branch target with a separate adder instead.
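The data-hazard case above can be sketched as a simple scan over adjacent instructions: a read-after-write hazard exists when an instruction reads a register that the previous instruction writes. The (dest, srcs) encoding is hypothetical.

```python
# Detect read-after-write (RAW) hazards between adjacent instructions
# in a straight-line program. Each instruction is (dest, srcs).

def raw_hazards(program):
    hazards = []
    for i in range(1, len(program)):
        prev_dest, _ = program[i - 1]
        _, srcs = program[i]
        if prev_dest in srcs:
            hazards.append(i)        # instruction i must stall or forward
    return hazards

prog = [("r1", ["r2"]),              # r1 <- r2
        ("r3", ["r1"]),              # reads r1 just written: hazard
        ("r4", ["r5"])]              # independent
h = raw_hazards(prog)
```

Real pipelines resolve each detected case by stalling or by forwarding the result directly between stages, rather than waiting for the register-file write.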

    Advantages of Pipelining:

1. The cycle time of the processor is reduced, thus increasing the instruction issue rate in most cases.

2. Some combinational circuits, such as adders or multipliers, can be made faster by adding more circuitry. If pipelining is used instead, it can save circuitry compared with a more complex combinational circuit.

    Disadvantages of Pipelining:

1. A non-pipelined processor executes only a single instruction at a time. This prevents branch delays (in effect, every branch is delayed) and problems with serial instructions being executed concurrently. Consequently the design is simpler and cheaper to manufacture.

2. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent. This is because extra flip-flops must be added to the data path of a pipelined processor.

3. A non-pipelined processor has a more stable instruction bandwidth. The performance of a pipelined processor is much harder to predict and may vary more widely between different programs.
