ARM Cortex A8 Pipeline
EE126 Wei Wang
• Cortex A8 is a processor core designed by ARM Holdings.
• Application: Apple A4, Samsung Exynos 3110.
What’s the pipeline architecture in Cortex A8?
Deeper pipeline and superscalar pipeline.
Deeper Pipeline
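In a pipeline, the clock period is set by the longest stage: every stage is given one cycle, and that cycle must be long enough for the slowest stage. A deeper pipeline splits each stage into shorter sub-stages, so the cycle time shrinks and the clock frequency rises. Each instruction still passes through every sub-stage, but one instruction can complete per (shorter) cycle, so more instructions finish in the same amount of time.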
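The effect of splitting stages can be checked with a small calculation. The sketch below is only an illustration: the stage delays are made-up numbers in nanoseconds, not measured Cortex A8 values, and the pipeline is assumed to be ideal (no stalls, no latch overhead).

```python
def clock_period(stage_delays):
    """The clock must be slow enough for the slowest stage."""
    return max(stage_delays)


def total_time(stage_delays, n_instructions):
    """Time for n instructions in an ideal pipeline with no stalls."""
    n_stages = len(stage_delays)
    # Fill the pipeline once, then one instruction completes per cycle.
    cycles = n_stages + n_instructions - 1
    return cycles * clock_period(stage_delays)


# Hypothetical delays (ns), chosen only for illustration.
shallow = [3.0, 5.0, 6.0]        # IF, ID, EXE
deep = [1.0] * 14                # F0-F2, D0-D4, E0-E5

print("clock period:", clock_period(shallow), "vs", clock_period(deep))
print("1000 instructions:", total_time(shallow, 1000), "vs", total_time(deep, 1000))
```

With these assumed numbers the 3-stage pipeline needs a 6 ns cycle while the 14-stage pipeline needs 1 ns, so a long instruction stream finishes several times sooner on the deeper pipeline.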
[Figure: the IF, ID and EXE stages are split into the sub-stages F0-F2, D0-D4 and E0-E5.]
Why does it break one cycle into several cycles?
[Figure: timing diagram of the sub-stages F0-F2, D0-D4 and E0-E5 over clock cycles 0-9.]
Superscalar Pipeline
[Figure: a simple 4-stage pipeline (IF, ID, EX, WB) compared with a superscalar pipeline in which two instructions are executed at the same time.]
• It is a form of instruction-level parallelism: by issuing more than one instruction per cycle, it achieves higher throughput than a normal (single-issue) pipeline.
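As a rough illustration of the gain, the sketch below counts cycles for an ideal pipeline. It assumes all instructions are independent and that the pipeline never stalls; the function name and the 4-stage default are illustrative choices, not Cortex A8 parameters.

```python
import math


def cycles_needed(n_instructions, n_stages=4, issue_width=1):
    """Cycles to run independent instructions through an ideal pipeline."""
    # With issue_width instructions entering per cycle, issuing takes
    # ceil(n / width) cycles; the last group then drains the pipeline.
    issue_cycles = math.ceil(n_instructions / issue_width)
    return n_stages + issue_cycles - 1


scalar = cycles_needed(100, issue_width=1)
dual = cycles_needed(100, issue_width=2)
print(scalar, dual)
```

Under these assumptions, dual issue comes close to halving the cycle count for long instruction streams; real code falls short of that because of the dependencies discussed later.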
Cortex A8 Pipeline Main Architecture:
[Figure: Cortex A8 pipeline block diagram]
• 14-stage integer pipeline: Instruction Fetch (F0-F2), Instruction Decode (D0-D4), Instruction Execute and Load/Store (E0-E5). Execute blocks: Architectural Register File, ALU Pipeline 0, MUL Pipeline 0, ALU Pipeline 1, Load/Store Pipeline 0/1, integer register writeback.
• 10-stage NEON pipeline (M0-M3, N1-N6): NEON Instruction Decode, NEON Register File, Integer ALU Pipeline, Integer MUL Pipeline, Integer Shift Pipeline, Non-IEEE FP ADD Pipeline, Non-IEEE FP MUL Pipeline, IEEE FP Engine, Load/Store Permute Pipeline, NEON register writeback, Load/Store Data Queue, NEON Store Data.
• Execution stages: 6 stage pipeline.
[Figure: Instruction Execute and Load/Store, stages E0-E5. All pipelines read the Architectural Register File and end in integer register writeback.]
• ALU Pipeline 0 and ALU Pipeline 1: Shift, ALU + Flags, Sat, BP Update, WB.
• Multiply Pipeline: MUL1, MUL2, MUL3, ACC, WB.
• Load/Store Pipeline: AGU, WB.
• It extensively supports key forwarding paths. Result data is forwarded from the outputs of the shift, ALU and MUL stages as soon as it is produced, so intermediate execution-stage results can be forwarded. In a simple pipeline, by contrast, only the final execution-stage result can be forwarded.
The execute unit has two symmetric ALU pipelines, a multiply pipeline and an address generator for loads and stores.
1. For the ALU pipeline:
E0 access the register file;
E1 shift, if needed;
E2 ALU function;
E3 complete saturation, if needed;
E4 resolve any change in control flow;
E5 write back to the register file.
2. For the MUL pipeline:
E1-E3 implement the multiply;
E4 perform the addition (accumulate);
E5 write back.
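The benefit of forwarding intermediate results can be estimated with a toy timing model built from the two stage lists above. This is a sketch under stated assumptions, not the actual Cortex A8 interlock logic: the producer is assumed to be in stage Ei during cycle i, a value produced in one cycle is usable from the next cycle, and the dependent instruction is assumed to need its operand in E2.

```python
STAGES = ["E0", "E1", "E2", "E3", "E4", "E5"]

# Stage whose output holds the result, read off the stage lists above.
RESULT_STAGE = {"ALU": "E2", "MUL": "E4"}


def stall_cycles(producer, consumer_needs="E2", forwarding=True):
    """Bubbles between a producer and a dependent instruction issued after it."""
    if forwarding:
        # Result is taken directly from the stage that produces it.
        ready = STAGES.index(RESULT_STAGE[producer])
        needed = STAGES.index(consumer_needs)
    else:
        # Result is only visible after write-back, and the consumer
        # reads its operands from the register file in E0.
        ready = STAGES.index("E5")
        needed = STAGES.index("E0")
    earliest_issue = ready - needed + 1   # cycle in which the consumer may issue
    return max(0, earliest_issue - 1)     # back-to-back issue would be cycle 1


print(stall_cycles("ALU"), stall_cycles("MUL"), stall_cycles("ALU", forwarding=False))
```

In this model an ALU result can feed the next instruction without any bubble, a multiply result costs a few bubbles, and without forwarding every dependent instruction waits for the write-back stage.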
• Deep pipelines and superscalar pipelines give good performance. Why not keep increasing the number of sub-stages and the number of parallel instructions?
• What are the limitations?
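• Data Dependency
No data dependency (the two instructions can run in parallel): MUL t3,t2,t1; ADD t6,t5,t4;
Data dependency (ADD reads t3, which MUL writes): MUL t3,t2,t1; ADD t6,t3,t4;
[Figure: timing diagram over cycles 0-5; the dependent ADD is delayed by bubbles.]
Solution: Stall the adder until the multiplier has finished.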
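The check that decides whether the ADD must be stalled can be sketched in a few lines. The helper names below are hypothetical, and the code assumes the operand order used in the examples above: the first register is the destination and the rest are sources.

```python
def parse(instr):
    """Split 'MUL t3,t2,t1' into (opcode, destination, [sources])."""
    opcode, operands = instr.strip().rstrip(";").split(maxsplit=1)
    regs = [r.strip() for r in operands.split(",")]
    return opcode, regs[0], regs[1:]


def has_data_dependency(first, second):
    """True if `second` reads the register that `first` writes."""
    _, dest, _ = parse(first)
    _, _, sources = parse(second)
    return dest in sources


print(has_data_dependency("MUL t3,t2,t1", "ADD t6,t5,t4"))  # independent pair
print(has_data_dependency("MUL t3,t2,t1", "ADD t6,t3,t4"))  # ADD reads t3
```

When the function returns True, the second instruction cannot execute alongside the first and has to wait for its result.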
• Output dependency:
• An output dependency occurs if two parallel instructions write to the same location. An error occurs if the second instruction completes before the first one, because the location then ends up holding the first instruction's result.
MUL t3,t2,t1;ADD t3,t4,t5;
• Antidependency:
• An antidependency exists if an instruction uses a location as an operand while a following one is writing into that location; if the first one is still using the location when the second one writes into it, an error occurs.
MUL t3,t2,t1;ADD t2,t4,t5;
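Both cases can be detected by comparing register names. The sketch below uses hypothetical helper names and assumes, as in the examples above, that the first operand is the destination and the remaining operands are sources.

```python
def parse(instr):
    """Split 'MUL t3,t2,t1;' into (opcode, destination, [sources])."""
    opcode, operands = instr.strip().rstrip(";").split(maxsplit=1)
    regs = [r.strip() for r in operands.split(",")]
    return opcode, regs[0], regs[1:]


def classify(first, second):
    """List the name dependencies that `second` has on `first`."""
    _, dest1, sources1 = parse(first)
    _, dest2, _ = parse(second)
    kinds = []
    if dest2 == dest1:
        kinds.append("output dependency")   # both write the same register
    if dest2 in sources1:
        kinds.append("antidependency")      # second overwrites an operand of first
    return kinds


print(classify("MUL t3,t2,t1;", "ADD t3,t4,t5;"))
print(classify("MUL t3,t2,t1;", "ADD t2,t4,t5;"))
```

An empty list means the pair has neither kind of conflict; a true data dependency (the second instruction reading the first one's result) is a separate check and is not covered here.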
• Solution for output dependency and antidependency: use a different register as the destination of the second instruction.
Output dependency: MUL t3,t2,t1; ADD t3,t4,t5; becomes MUL t3,t2,t1; ADD t6,t4,t5;
Antidependency: MUL t3,t2,t1; ADD t2,t4,t5; becomes MUL t3,t2,t1; ADD t6,t4,t5;
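The rewrite shown in these examples can be sketched as a small function. It is a minimal illustration, assuming the first operand is the destination and that the caller supplies a register known to be free; the function and parameter names are hypothetical.

```python
def parse(instr):
    """Split 'ADD t3,t4,t5;' into (opcode, destination, [sources])."""
    opcode, operands = instr.strip().rstrip(";").split(maxsplit=1)
    regs = [r.strip() for r in operands.split(",")]
    return opcode, regs[0], regs[1:]


def rename_destination(first, second, free_register):
    """Give `second` a new destination if it conflicts with `first`."""
    _, dest1, sources1 = parse(first)
    opcode2, dest2, sources2 = parse(second)
    # Conflict: same destination (output dependency) or the destination
    # is still being read by the first instruction (antidependency).
    if dest2 == dest1 or dest2 in sources1:
        dest2 = free_register
    return f"{opcode2} {dest2},{','.join(sources2)}"


print(rename_destination("MUL t3,t2,t1;", "ADD t3,t4,t5;", "t6"))
print(rename_destination("MUL t3,t2,t1;", "ADD t2,t4,t5;", "t6"))
```

Any later instruction that reads the old destination would have to be rewritten to read the new register as well; the sketch leaves that step out.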
Alternative way to handle dependencies: the compiler can generate instruction sequences with fewer dependencies.
• Summary: The Cortex A8 achieves high speed by combining a deeper pipeline with a superscalar pipeline.
Thank you