processor architecture.zmp

Upload: amrendra-kumar-mishra

Post on 03-Apr-2018


  • 7/29/2019 Processor Architecture.zmp

    1/8

    1. CISC vs. RISC Architecture:

RISC architectures lend themselves more towards pipelining than CISC architectures for many reasons. Because RISC architectures have a smaller set of instructions than CISC architectures, the time required to fetch and decode in a pipelined CISC architecture is unpredictable. The variation in instruction length with CISC hinders the fetch and decode sections of a pipeline: a single-byte instruction following an 8-byte instruction must be handled so as not to slow down the whole pipeline. In RISC architectures the fetch and decode cycle is more predictable, and most instructions have similar length.

CISC architectures, by their very name, also have more complex instructions with complex addressing modes. This makes the whole cycle of processing an instruction more complex. Pipelining requires that the whole fetch-to-execute cycle can be split into stages where each stage does not interfere with the next and each instruction can be doing something at each stage. RISC architectures, because of their simplicity and small set of instructions, are simple to split into stages. CISC instructions, being more complex, are harder to split into stages: stages that are important for one instruction may not be required for another.

The rich set of addressing modes available in CISC architectures can cause data hazards when pipelining is introduced. Data hazards that are unlikely to occur in RISC architectures, thanks to the smaller instruction set and the use of load/store instructions to access memory, become a problem in CISC architectures. CISC instructions that write results back to memory need to be handled carefully. Forwarding solutions that allow a result written to a register to be available as input to the next instruction become more complex when memory locations, addressable in various modes, can also be accessed. Write-after-read hazards must also be handled: a CISC instruction may auto-increment a register early in its stages while that register is still needed by the previous instruction at a later stage.

CISC's added complexity makes for longer pipelines to accommodate more decoding and checking. Using the Larson and Davidson equation from their paper Cost-effective Design of Special-Purpose Processors: A Fast Fourier Transform Case Study for calculating the optimum number of pipeline stages for a processor, it can be shown that RISC architectures suit smaller pipelines. Keeping the values for instruction stream length and logic gates per stage fixed, it can be shown that the optimum pipeline length increases with the size of the fetch-execute cycle.

This is because with a large number of logic gates in the fetch-execute cycle, the additional gates required for staging have less impact. As RISC architectures have simpler instruction sets than CISC, the number of gates involved in the fetch-execute cycle will be far lower than in a CISC architecture. Therefore RISC architectures tend to have smaller optimum pipeline lengths than more general processors.

RISC architectures do suit pipelining more than CISC architectures, and do lend themselves to smaller pipelines. This does not mean, however, that CISC architectures cannot gain from pipelining, or that a large number of pipeline stages is bad (although the cost of flushing a long pipeline becomes a concern).

Other features typically found in RISC architectures are:

Uniform instruction format (fixed-size instructions), using a single word with the opcode in the same bit positions in every instruction, demanding less decoding.

Instructions are executed in a single clock cycle.


The execution unit is much faster due to simple and uniform instructions.

A large number of general-purpose registers, to avoid storing variables in stack memory.

Only load and store instructions refer to memory.

Fewer, simpler instructions rather than complex ones.

A simple and small set of addressing modes, to simplify references to operands.

Few data types in hardware; some CISCs have byte-string instructions or support complex numbers, which is so far unlikely to be found on a RISC.
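The fixed-format property listed above is easy to see in code. The sketch below decodes a hypothetical 32-bit instruction word with a 6-bit opcode and three 5-bit register fields; the field layout is invented for illustration and is not a real ISA. Because every field sits at the same bit positions in every instruction, decoding is just a handful of shifts and masks.

```python
# Decode a fixed 32-bit instruction word with simple bit masks.
# The layout (6-bit opcode, three 5-bit register fields) is
# hypothetical, chosen only to show why a uniform format needs
# so little decoding logic.

def decode(word: int) -> dict:
    return {
        "opcode": (word >> 26) & 0x3F,  # bits 31-26
        "rs":     (word >> 21) & 0x1F,  # bits 25-21
        "rt":     (word >> 16) & 0x1F,  # bits 20-16
        "rd":     (word >> 11) & 0x1F,  # bits 15-11
    }

# Build a word with opcode 8 and registers 1, 2, 3, then decode it.
word = (0x08 << 26) | (1 << 21) | (2 << 16) | (3 << 11)
fields = decode(word)
```

A variable-length CISC format, by contrast, cannot know where one field ends and the next begins until earlier bytes have been examined.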

2. Von Neumann and Harvard Architecture:

The von Neumann architecture is a design model for a stored-program digital computer that uses a processing unit and a single separate storage structure to hold both instructions and data. A single bus used to transfer both instructions and data leads to the von Neumann bottleneck, which limits throughput (data transfer rate) between the CPU and memory. This seriously limits the effective processing speed when the CPU is required to perform minimal processing on large amounts of data: the CPU is continuously forced to wait for needed data to be transferred to or from memory. Since CPU speed and memory size have increased much faster than the throughput between them, the bottleneck has become more of a problem.

    Figure 1: Von Neumann architecture

The most obvious characteristic of the Harvard architecture is that it has physically separate signals and storage for code and data memory. It is possible to access program memory and data memory simultaneously, thereby creating potentially faster throughput and less of a bottleneck. Typically, code (or program) memory is read-only and data memory is read-write; therefore, it is impossible for program contents to be modified by the program itself. In a computer using the Harvard architecture, the CPU can both read an instruction and perform a data memory access at the same time, even without a cache. A Harvard architecture computer can thus be faster for a given circuit complexity, because instruction fetches and data accesses do not contend for a single memory pathway.
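A toy model can make the contrast concrete. In the sketch below (illustrative names, not a real machine) instruction memory and data memory are separate structures, so a fetch and a data access complete in the same cycle without contending for one pathway.

```python
# Toy Harvard machine: code memory and data memory are separate
# structures, so an instruction fetch and a data access can both
# happen in one "cycle". All names are illustrative.

class HarvardMachine:
    def __init__(self, program, data):
        self.imem = list(program)   # read-only code memory
        self.dmem = list(data)      # read-write data memory
        self.pc = 0

    def cycle(self, data_addr):
        # Both accesses occur in the same cycle: no shared bus.
        instr = self.imem[self.pc]
        value = self.dmem[data_addr]
        self.pc += 1
        return instr, value

m = HarvardMachine(["LOAD", "ADD"], [10, 20])
```

In a von Neumann model the two accesses would have to be serialized over the single memory pathway.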

    Figure 2: Harvard architecture

    2.1 SIMD Processing:

Some DSPs have multiple data memories in distinct address spaces to facilitate SIMD and VLIW processing. SIMD exploits data-level parallelism by operating on a small to moderate


number of data items in parallel. The true SIMD architecture contains a single control unit (CU) with multiple processing elements (PEs) acting as arithmetic units (AUs). In this arrangement the arithmetic units are slaves to the control unit: the AUs cannot fetch or interpret any instructions; they are merely units with capabilities for addition, subtraction, multiplication, and division. Each AU has access only to its own memory. Hence, if an AU needs information held by a different AU, it must put in a request to the CU, and the CU must manage the transfer. The advantage of this type of architecture is the ease of adding more memory and AUs to the computer.

    Figure 3: SIMD processing

    The disadvantage can be found in the time wasted by the CU managing all memory

    exchanges.

Not all algorithms can be vectorized. For example, a flow-control-heavy task like code parsing would not benefit from SIMD.

Currently, implementing an algorithm with SIMD instructions usually requires human labor; most compilers do not generate SIMD instructions from a typical C program, for instance. Vectorization in compilers is an active area of computer science research. (Compare vector processing.)

SIMD computers require less hardware than MIMD computers (a single control unit). However, SIMD processors are specially designed, and tend to be expensive and to have long design cycles.

Not all applications are naturally suited to SIMD processing. Conceptually, MIMD computers cover the SIMD case.
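The lock-step behaviour described above can be sketched as follows, with each list element standing in for one PE's local memory. The operation names are invented for illustration.

```python
# Conceptual sketch of true SIMD: one control unit broadcasts a
# single operation, and every processing element applies it to its
# own local data in lock step.

def simd_step(op, pe_memories):
    ops = {"add1": lambda x: x + 1, "double": lambda x: x * 2}
    f = ops[op]                       # the CU selects one operation
    return [f(x) for x in pe_memories]  # every PE executes the same op

pe_memories = [3, 5, 7, 9]            # one value per processing element
result = simd_step("double", pe_memories)
```

No PE chooses its own instruction; per-element control flow is exactly what this model cannot express, which is why flow-control-heavy tasks do not vectorize.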

2.2 MIMD Processing:

A MIMD computer has many interconnected processing elements, each of which has its own control unit (see Fig. 4). Each processing unit works on its own data with its own instructions. Tasks executed by different processing units can start or finish at different times. They may send results to a central location and may share memory space. They are not lock-stepped as in SIMD computers, but run asynchronously.


    Figure 4: MIMD processing

    3. Very long instruction word (VLIW) Processors:

VLIW refers to a CPU architecture designed to take advantage of instruction-level parallelism (ILP). A processor that executes every instruction one after the other (i.e. a non-pipelined scalar architecture) may use processor resources inefficiently, potentially leading to poor performance. The performance can be improved by:

executing different sub-steps of sequential instructions simultaneously (this is pipelining), or

executing multiple instructions entirely simultaneously, as in superscalar architectures.

Further improvement can be achieved by executing instructions in an order different from the order they appear in the program; this is called out-of-order execution.

As often implemented, these techniques all come at a cost: increased hardware complexity. Before executing any operations in parallel, the processor must verify that the instructions do not have interdependencies, for example that a first instruction's result is used as a second instruction's input. Clearly such a pair cannot execute at the same time, and the second instruction cannot be executed before the first. Modern out-of-order processors have increased the hardware resources that schedule instructions and determine interdependencies. In a VLIW processor, by contrast, determining the order of execution of operations (including which operations can execute simultaneously) is handled by the compiler, so the processor does not need the scheduling hardware that the three techniques described above require. As a result, VLIW CPUs offer significant computational power with less hardware complexity (but greater compiler complexity) than is associated with most superscalar CPUs.
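The compile-time scheduling a VLIW machine depends on can be sketched as a simple bundler: pack instructions into a long instruction word until one reads a register written earlier in the same bundle, then start a new bundle. The three-tuple encoding (dest, src1, src2) and the bundle width are hypothetical.

```python
# Greedy compile-time bundling for a hypothetical 3-slot VLIW.
# A new bundle starts when the current one is full or when an
# instruction reads a register written earlier in the bundle.

def bundle(instrs, width=3):
    bundles, current, written = [], [], set()
    for dest, src1, src2 in instrs:
        if len(current) == width or src1 in written or src2 in written:
            bundles.append(current)       # emit the long instruction word
            current, written = [], set()
        current.append((dest, src1, src2))
        written.add(dest)
    if current:
        bundles.append(current)
    return bundles

prog = [("r1", "r8", "r9"),   # r1 <- r8 op r9
        ("r2", "r8", "r9"),   # independent of the first
        ("r3", "r1", "r2")]   # reads r1, r2: must wait
b = bundle(prog)
```

All of this dependence analysis happens before the program runs, which is precisely the scheduling hardware the VLIW processor omits.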

    3.1 Superscalar:

A superscalar CPU architecture implements a form of parallelism called instruction-level parallelism within a single processor. It therefore allows faster CPU throughput than would otherwise be possible at a given clock rate. A superscalar processor executes more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to redundant functional units on the processor. Each functional unit is not a separate CPU core but an


    execution resource within a single CPU such as an arithmetic logic unit, a bit shifter, or a

    multiplier.

    Figure 5: Superscalar architecture employing ILP

While a superscalar CPU is typically also pipelined, pipelining and superscalar architecture are considered different performance-enhancement techniques. The superscalar technique is traditionally associated with several identifying characteristics (within a given CPU core):

Instructions are issued from a sequential instruction stream.

CPU hardware dynamically checks for data dependencies between instructions at run time (versus software checking at compile time).

The CPU accepts multiple instructions per clock cycle.
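A minimal sketch of that run-time check: two instructions from the sequential stream may issue together only if the second does not read the first's destination register. The (dest, srcs) encoding is illustrative, not a real ISA.

```python
# Dual-issue check in a sketched superscalar front end: pair two
# sequential instructions only when the second has no read-after-write
# dependence on the first.

def can_dual_issue(first, second):
    dest, _ = first
    _, srcs = second
    return dest not in srcs          # the check hardware performs at run time

i1 = ("r1", ["r2", "r3"])            # r1 <- r2 op r3
i2 = ("r4", ["r1", "r5"])            # reads r1: must wait a cycle
i3 = ("r6", ["r7", "r8"])            # independent: can pair with i1
```

In a VLIW design this same decision is made once, by the compiler; here the hardware repeats it every cycle.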

    3.2 Pipeline

An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time). The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. This allows the computer's control circuitry to issue instructions at the processing rate of the slowest step, which is much faster than the time needed to perform all steps at once. Fig. 6 shows a pipeline system of five stages (Fetch, Decode, Execute, Memory access and Write Back) for processing instructions in a microprocessor. After the first four clock cycles, once the five-stage pipeline is full, one processed instruction completes at every clock cycle. The speed of a pipeline can be measured in terms of CPI (clock cycles per instruction), which should ideally be 1; but due to latency and pipeline hazards (explained later) the value of CPI is greater than one in all practical cases.


    Figure 6: Pipeline operation for five stages

Assume all stages in the pipeline have the same delay t_c. Then:

Pipeline clock cycle time = t_c

Let there be k stages and n instructions to be executed.

Processing time (pipelined):      T_k = [k + (n - 1)] t_c
Processing time (non-pipelined):  T_1 = n k t_c

Speed-up factor:  S_k = T_1 / T_k = nk / [k + (n - 1)]

Efficiency:       E_k = S_k / k = n / [k + (n - 1)]

Pipeline throughput = number of instructions executed per unit time
                    = n / {[k + (n - 1)] t_c}

If the delays of the pipeline stages are unequal, let the delay of the largest-delay stage be t_cw; this becomes the pipeline clock cycle time. The processing time to execute n instructions is then

T_k = [k + (n - 1)] t_cw              (pipelined)
T_1 = n (d_1 + d_2 + . . . + d_k)     (non-pipelined)

Ideally, if there are N stages in a pipeline architecture, then

Instruction execution time (pipelined) = Instruction execution time (non-pipelined) / N
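The equal-delay formulas above translate directly into code; the parameter values in the example are arbitrary.

```python
# Pipeline timing metrics for k equal-delay stages, cycle time t_c,
# and n instructions, following the formulas in the text.

def pipeline_metrics(k, n, t_c):
    Tk = (k + (n - 1)) * t_c         # pipelined execution time
    T1 = n * k * t_c                 # non-pipelined execution time
    speedup = T1 / Tk                # S_k = nk / (k + n - 1)
    efficiency = speedup / k         # E_k = n / (k + n - 1)
    throughput = n / Tk              # instructions per unit time
    return Tk, T1, speedup, efficiency, throughput

# Example: 5 stages, 100 instructions, 1 ns per stage.
Tk, T1, S, E, thr = pipeline_metrics(k=5, n=100, t_c=1e-9)
```

Note that as n grows, S_k approaches but never reaches k, which is why the ideal speed-up equals the number of stages only in the limit.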

Most modern CPUs are driven by a clock. The CPU consists internally of logic and memory (flip-flops). When the clock signal arrives, the flip-flops take their new values, and the logic then requires a period of time to decode the new values. Then the next clock pulse arrives and the flip-flops again take their new values, and so on. By breaking the logic into smaller pieces and inserting flip-flops between the pieces of logic, the delay before the logic gives valid outputs is reduced. In this way the clock period can be reduced. For example, the classic RISC pipeline is broken into five stages with a set of flip-flops between each stage:

1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access
5. Register write back
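With no hazards, instruction i occupies stage s in cycle i + s; that single rule is the entire schedule of the ideal five-stage pipeline, as the sketch below tabulates.

```python
# Cycle-by-cycle occupancy of the classic five-stage pipeline.
# With no hazards, instruction i is in stage s during cycle i + s
# (0-indexed), so after the pipeline fills one instruction
# completes per cycle.

STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def schedule(n_instructions):
    table = {}
    for i in range(n_instructions):
        for s, stage in enumerate(STAGES):
            table[(i, stage)] = i + s   # cycle in which instr i is in stage
    return table

t = schedule(3)
```

The last of n instructions writes back in cycle (n - 1) + 4, matching the T_k = [k + (n - 1)] t_c formula above for k = 5.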


    1. Instruction fetch

The instruction fetch on these machines had a latency of one cycle. During the Instruction Fetch stage, a 32-bit instruction was fetched from the cache. The PC predictor sends the Program Counter (PC) to the instruction cache to read the current instruction. At the same time, the PC predictor predicts the address of the next instruction by incrementing the PC by 4 (all instructions were 4 bytes long). This prediction was always wrong in the case of a taken branch, jump, or exception (see delayed branches, below). Later machines would use more complicated and accurate algorithms (branch prediction and branch target prediction) to guess the next instruction address.

    2. Decode

Unlike earlier microcoded machines, the first RISC machines had no microcode. Once fetched from the instruction cache, the instruction bits were shifted down the pipeline, so that simple combinational logic in each pipeline stage could produce the control signals for the datapath directly from the instruction bits. As a result, very little decoding is done in the stage traditionally called the decode stage.

If the instruction decoded was a branch or jump, the target address of the branch or jump was computed in parallel with reading the register file. The branch condition is computed after the register file is read, and if the branch is taken or the instruction is a jump, the PC predictor in the first stage is assigned the branch target rather than the incremented PC that had been computed.

    3. Execute

    Instructions on these simple RISC machines can be divided into three latency classes

    according to the type of the operation:

Register-register operations (single-cycle latency): add, subtract, compare, and logical operations. During the execute stage, the two arguments were fed to a simple ALU, which generated the result by the end of the execute stage.

Memory references (two-cycle latency): all loads from memory. During the execute stage, the ALU added the two arguments (a register and a constant offset) to produce a virtual address by the end of the cycle.

Multi-cycle instructions (many-cycle latency): integer multiply and divide and all floating-point operations. During the execute stage, the operands to these operations were fed to the multi-cycle multiply/divide unit. The rest of the pipeline was free to continue execution while the multiply/divide unit did its work. To avoid complicating


the writeback stage and issue logic, multi-cycle instructions wrote their results to a separate set of registers.

    4. Memory Access

During this stage, the results of data-processing instructions produced by the execute stage are forwarded to the next stage. If the instruction is a load, the data is read from the data cache or data memory; if the instruction is a store, the register data is written to data memory at the address computed by the execute stage.

    5. Writeback

During this stage, both single-cycle and two-cycle instructions write their results into the register file.

    Pipeline Hazards:

> Data hazards occur when an instruction, scheduled blindly, would attempt to use data before the data is available in the register file (e.g. an instruction depends on the result of a previous instruction).

> Control hazards occur whenever there is a change in the normal execution flow of the program, caused by events such as branches, interrupts, exceptions, and returns from interrupts. The hazard arises because branches, interrupts, etc. are not caught until the instruction is executed; by that time, the following instructions have already entered the pipeline and need to be flushed out.

> Structural hazards occur when two instructions might attempt to use the same resource at the same time. Classic RISC pipelines avoided these hazards by replicating hardware. In particular, branch instructions could have used the ALU to compute the target address of the branch; to avoid contending with arithmetic instructions for the ALU, such pipelines computed the branch target with a separate adder instead.
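The data-hazard case above can be sketched as a simple scan over adjacent instructions: a read-after-write hazard exists when an instruction reads a register that the previous instruction writes. The (dest, srcs) encoding is hypothetical.

```python
# Detect read-after-write (RAW) hazards between adjacent instructions
# in a straight-line program. Each instruction is (dest, srcs).

def raw_hazards(program):
    hazards = []
    for i in range(1, len(program)):
        prev_dest, _ = program[i - 1]
        _, srcs = program[i]
        if prev_dest in srcs:
            hazards.append(i)        # instruction i must stall or forward
    return hazards

prog = [("r1", ["r2"]),              # r1 <- r2
        ("r3", ["r1"]),              # reads r1 just written: hazard
        ("r4", ["r5"])]              # independent
h = raw_hazards(prog)
```

Real pipelines resolve each detected case by stalling or by forwarding the result directly between stages, rather than waiting for the register-file write.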

    Advantages of Pipelining:

1. The cycle time of the processor is reduced, thus increasing the instruction issue rate in most cases.

2. Some combinational circuits, such as adders or multipliers, can be made faster by adding more circuitry. If pipelining is used instead, it can save circuitry compared with a more complex combinational circuit.

    Disadvantages of Pipelining:

1. A non-pipelined processor executes only a single instruction at a time. This prevents branch delays (in effect, every branch is delayed) and problems with serial instructions being executed concurrently. Consequently the design is simpler and cheaper to manufacture.

2. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent. This is because extra flip-flops must be added to the data path of a pipelined processor.

3. A non-pipelined processor has a more stable instruction bandwidth. The performance of a pipelined processor is much harder to predict and may vary more widely between different programs.
