
ARM Cortex A8 Pipeline

EE126 Wei Wang

• Cortex A8 is a processor core designed by ARM Holdings.

• Application: Apple A4, Samsung Exynos 3110.

What’s the pipeline architecture in Cortex A8?

Deeper pipeline and superscalar pipeline.

Deeper Pipeline

In a pipeline, the clock speed is limited by the longest stage, because the cycle time is set to the duration of that stage. A deeper pipeline splits each stage into shorter sub-stages, so the cycle time can be shorter. Each instruction still passes through every sub-stage, but with a shorter cycle more instructions complete per unit of time.
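As a rough illustration of why splitting stages helps, the sketch below compares a 3-stage pipeline with a 14-stage one. The stage delays, the optional latch overhead, and the function names are assumptions made up for this example; they are not Cortex A8 figures.

```python
def clock_period(stage_delays_ns, latch_overhead_ns=0.0):
    """The clock period is set by the slowest stage (plus any register overhead)."""
    return max(stage_delays_ns) + latch_overhead_ns


def total_time(stage_delays_ns, n_instructions, latch_overhead_ns=0.0):
    """Time for n instructions in an ideal pipeline with no stalls:
    fill the pipeline once, then complete one instruction per cycle."""
    period = clock_period(stage_delays_ns, latch_overhead_ns)
    cycles = len(stage_delays_ns) + n_instructions - 1
    return cycles * period


# Hypothetical delays: IF, ID and EXE take 3, 5 and 6 ns of logic.
shallow = [3.0, 5.0, 6.0]
# The same 14 ns of logic split evenly into 14 sub-stages of 1 ns each.
deep = [1.0] * 14

if __name__ == "__main__":
    for name, stages in (("3-stage", shallow), ("14-stage", deep)):
        print(name, "period:", clock_period(stages), "ns,",
              "1000 instructions:", total_time(stages, 1000), "ns")
```

Under these assumed numbers the deep pipeline finishes the same work in far fewer nanoseconds because its cycle is much shorter. Passing a non-zero `latch_overhead_ns` shows how per-stage overhead eats into that gain.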

Why does it break one cycle into several cycles?

[Figure: the stages IF, ID and EXE split into the sub-stages F0 F1 F2, D0 D1 D2 D3 D4 and E0 E1 E2 E3 E4 E5.]

Superscalar Pipeline

• Two instructions are executed at the same time.

• It is a form of instruction-level parallelism, which is faster than a normal pipeline.

[Figure: timing diagram over cycles 0 to 9 comparing a simple 4-stage pipeline (IF ID EX WB) with a superscalar pipeline.]
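The benefit of issuing two instructions at once can be sketched with a simple cycle count. This is an idealized model that assumes independent instructions and no stalls; the function name and parameters are hypothetical.

```python
import math


def cycles_needed(n_instructions, n_stages=4, issue_width=1):
    """Cycles to run n independent instructions with no stalls.

    Each cycle, up to `issue_width` instructions enter the pipeline,
    and every instruction spends one cycle in each of `n_stages` stages.
    """
    if n_instructions == 0:
        return 0
    issue_groups = math.ceil(n_instructions / issue_width)
    return n_stages + issue_groups - 1


if __name__ == "__main__":
    print("single issue:", cycles_needed(6, n_stages=4, issue_width=1))
    print("dual issue:  ", cycles_needed(6, n_stages=4, issue_width=2))
```

For long instruction streams the ratio between the two counts approaches the issue width, which is the best case; dependencies between instructions reduce it.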

Cortex A8 Pipeline Main Architecture:

[Figure: block diagram of the Cortex A8 pipeline, with the following labels.]

14-Stage Integer Pipeline (F0 F1 F2 D0 D1 D2 D3 D4 E0 E1 E2 E3 E4 E5):

• Instruction Fetch
• Instruction Decode
• Instruction Execute and Load/Store: Architecture Register File, ALU Pipeline 0, MUL Pipeline 0, ALU Pipeline 1, Load/Store Pipeline 0/1, Integer register writeback

10-Stage NEON Pipeline (M0 M1 M2 M3 N1 N2 N3 N4 N5 N6):

• NEON Instruction Decode
• NEON Register File
• Integer ALU Pipeline, Integer MUL Pipeline, Integer shift Pipeline
• Non-IEEE FP ADD Pipeline, Non-IEEE FP MUL Pipeline, IEEE FP Engine
• Load/Store Permute Pipeline
• NEON register writeback
• Load/Store Data Queue, NEON Store Data

• Execution stages: 6 stage pipeline.

[Figure: the Instruction Execute and Load/Store section, stages E0 E1 E2 E3 E4 E5. The Architecture Register File feeds four pipelines, which end in integer register writeback.]

• ALU Pipeline: Shift, ALU+Flags, Sat, BP Update, WB
• Multiply Pipeline: MUL1, MUL2, MUL3, ACC, WB
• ALU Pipeline: Shift, ALU+Flags, Sat, BP Update, WB
• Load/Store Pipeline: AGU, WB

• It provides extensive support for key forwarding paths. Result data is forwarded from the outputs of the shift, ALU and MUL stages as soon as it is produced, so intermediate execution-stage results can be forwarded. In a simple pipeline, by contrast, only the final execution-stage result can be forwarded.

The execute section has two symmetric ALU pipelines, a multiply pipeline and an address generator for loads and stores.

1. For the ALU pipeline:

E0 access the register file;

E1 shift if needed;

E2 ALU function;

E3 complete saturation if needed;

E4 handle any change in control flow;

E5 write back to the register file.

2. For the MUL pipeline:

E1-E3 implement the multiply;

E4 perform the addition (accumulate);

E5 write back.
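Combining the stage list above with the forwarding paths described earlier, the following toy model estimates how many bubbles a dependent instruction needs. It assumes the consumer reads its operand at the start of its E2 stage, that it issues one cycle after the producer, and that a result can be used in the cycle after the stage that produces it. These are simplifying assumptions for illustration, not documented Cortex A8 timings, and all names are hypothetical.

```python
# Execute stage after which each pipeline's result exists, following the
# stage list in the text: ALU function in E2, MUL addition in E4.
RESULT_STAGE = {"alu": 2, "mul": 4}
WRITEBACK_STAGE = 5  # E5 writes the register file


def stall_cycles(producer, forward_intermediate, operand_stage=2):
    """Bubbles needed before a dependent instruction issued right after `producer`.

    The producer is in stage k at cycle k; the consumer, issued one cycle
    later and delayed by s bubbles, is in stage k at cycle k + 1 + s.
    It needs operand_stage + 1 + s > ready_after, so s >= ready_after - operand_stage.
    """
    if forward_intermediate:
        ready_after = RESULT_STAGE[producer]   # forwarded as soon as it is produced
    else:
        ready_after = WRITEBACK_STAGE          # only the final stage is forwarded
    return max(0, ready_after - operand_stage)


if __name__ == "__main__":
    for unit in ("alu", "mul"):
        print(unit,
              "with intermediate forwarding:", stall_cycles(unit, True),
              "final stage only:", stall_cycles(unit, False))
```

In this model, forwarding intermediate results removes the bubbles after an ALU instruction entirely and shortens the wait after a multiply.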

• Deep pipelines and superscalar pipelines perform well. Why not keep increasing the number of sub-stages and the number of parallel instructions?

• What are the limitations?

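• Data Dependency

Data Independency: the two instructions use different registers and can execute at the same time.

MUL t3,t2,t1
ADD t6,t5,t4

Data Dependency: ADD reads t3, which MUL writes.

MUL t3,t2,t1
ADD t6,t3,t4

[Figure: timing diagram over cycles 0 to 5 in which BUBBLE slots delay the Add.]

Solution: Stall the adder until the multiplier has finished.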
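The stall can be sketched with a small in-order issue model. The latencies in `LATENCY`, the dual-issue width, and the tuple format `(op, dst, srcs)` are assumptions chosen for this example, with the first register taken to be the destination. The model tracks only the read-after-write case shown on this slide.

```python
# Hypothetical result latencies in cycles (illustrative values only).
LATENCY = {"MUL": 3, "ADD": 1}


def issue_cycles(program, issue_width=2):
    """Return the issue cycle of each instruction.

    Instructions issue in program order, at most `issue_width` per cycle,
    and an instruction waits until all of its source registers are ready.
    """
    ready_at = {}            # register -> first cycle its new value can be read
    cycles = []
    cycle, issued_this_cycle = 0, 0
    for op, dst, srcs in program:
        earliest = max([cycle] + [ready_at.get(r, 0) for r in srcs])
        if earliest > cycle:                    # operand not ready: insert bubbles
            cycle, issued_this_cycle = earliest, 0
        elif issued_this_cycle == issue_width:  # issue slots used up this cycle
            cycle, issued_this_cycle = cycle + 1, 0
        cycles.append(cycle)
        issued_this_cycle += 1
        ready_at[dst] = cycle + LATENCY[op]
    return cycles


independent = [("MUL", "t3", ("t2", "t1")), ("ADD", "t6", ("t5", "t4"))]
dependent = [("MUL", "t3", ("t2", "t1")), ("ADD", "t6", ("t3", "t4"))]

if __name__ == "__main__":
    print("independent:", issue_cycles(independent))
    print("dependent:  ", issue_cycles(dependent))
```

With these assumed latencies the independent pair issues together, while the dependent ADD waits until the MUL result is available.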

• Output dependency:

• An output dependency occurs if two instructions executing in parallel write to the same location. An error occurs if the second instruction writes before the first one, because the location is then left holding the first instruction's result.

MUL t3,t2,t1;
ADD t3,t4,t5;

• Antidependency:

• An antidependency exists if an instruction uses a location as an operand while a following one is writing into that location; if the first one is still using the location when the second one writes into it, an error occurs.
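The three kinds of dependency described so far can be told apart by comparing the registers each instruction reads and writes. A minimal sketch, assuming each instruction is given as a hypothetical `(dst, srcs)` pair with a single destination register:

```python
def classify(first, second):
    """Return the kinds of dependency from `first` to a later `second`.

    Each instruction is a (dst, srcs) pair of register names.
    """
    dst1, srcs1 = first
    dst2, srcs2 = second
    kinds = set()
    if dst1 in srcs2:
        kinds.add("data")     # second reads what first writes
    if dst1 == dst2:
        kinds.add("output")   # both write the same location
    if dst2 in srcs1:
        kinds.add("anti")     # second overwrites an operand of first
    return kinds


if __name__ == "__main__":
    mul = ("t3", ("t2", "t1"))                # MUL t3,t2,t1
    print(classify(mul, ("t6", ("t3", "t4"))))   # ADD t6,t3,t4
    print(classify(mul, ("t3", ("t4", "t5"))))   # ADD t3,t4,t5
    print(classify(("t6", ("t3", "t4")), mul))   # ADD t6,t3,t4 followed by MUL t3,t2,t1
```

A pair of instructions can fall into more than one category at once, which is why the function returns a set.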


• Solution for output dependency and antidependency: use a different register.

• Alternative way to handle dependencies: the compiler can generate instruction sequences with fewer dependencies.
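The "use a different register" fix can be sketched as a generic renaming pass: every write gets a fresh register, and later reads are redirected to it. This illustrates the general idea only and is not a claim about how the Cortex A8 or any particular compiler implements it; the register names `p0`, `p1` and the function are hypothetical.

```python
def rename(program, free_regs):
    """Give every write a fresh register and redirect later reads to it.

    `program` is a list of (op, dst, srcs) tuples; `free_regs` must hold
    at least one unused register name per instruction.
    """
    mapping = {}              # original name -> most recent replacement
    free = list(free_regs)
    renamed = []
    for op, dst, srcs in program:
        new_srcs = tuple(mapping.get(r, r) for r in srcs)  # read the latest copy
        new_dst = free.pop(0)                              # fresh destination
        mapping[dst] = new_dst
        renamed.append((op, new_dst, new_srcs))
    return renamed


if __name__ == "__main__":
    # Output dependency from the text: both instructions write t3.
    program = [("MUL", "t3", ("t2", "t1")), ("ADD", "t3", ("t4", "t5"))]
    for instr in rename(program, ["p0", "p1"]):
        print(instr)
```

After renaming, the two writes go to different registers, so their order no longer matters. A true data dependency is preserved, because the reader is pointed at the renamed destination.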

• Summary: The Cortex A8 achieves high speed by combining a deeper pipeline with a superscalar pipeline.

Thank you
