
SOCSA Slides: Microprocessors

© Institute for Integrated Systems Technische Universität München www.lis.ei.tum.de

Case Study: Microprocessors

System-on-Chip

Solutions & Architectures A. Herkersdorf


Microprocessors

Motivation

Classification and Characteristics

Look Inside

How to Increase CPU Performance


Motivation (1)

Processor-based Digital Systems: Computers with fully programmable, general-purpose processors (PCs, laptops, workstations)

Primary purpose / function is data processing (incl. Web servers, bank servers)

Hardware & software evolve rather independently

However, most processors are deployed in "embedded systems"

Game consoles, PDAs, cell phones, printers, household appliances, …

Cars, industrial robots, …


Motivation (2)

Network Equipment: Internet Router

Routing table entries grow exponentially

Link rates:

2.5 Gb/s: 6.5 Mpps

10 Gb/s: 25 Mpps

Megabyte memories with gigabyte/s access bandwidth!

[Figure: number of routing table entries (axis 40 k to 120 k) over the years 1990 to 2001. Source: http://telstra.net/ops/bgptable.html]


Motivation (3)

Network Equipment: Internet Router

Processing requirements (in MIPS) per packet vary substantially depending on application

Tens of thousands of effective MIPS!

Processors in the tens-of-GHz class

Source: Jenkins, "NPU Co-Processors", 2000

            OC-3     OC-12    OC-48    OC-192
b / s       155 M    622 M    2.5 G    10 G
pkts / s    420 K    1.7 M    6.8 M    27 M
s / pkt     2.4 µ    600 n    150 n    37 n

NP case study will tell us how to tackle this challenge!
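The packet rates in the table can be reproduced with a few lines of Python. This is only a sketch: the 46-byte packet size is an assumption inferred from the table itself (155 Mb/s divided by 420 K packets/s is about 368 bits), and the 1000 MIPS processor in the last function call is a hypothetical value chosen for illustration.

```python
# Assumption: back-to-back 46-byte packets (inferred from the table, not stated on the slide).
PACKET_BITS = 46 * 8

LINK_RATES_BPS = {"OC-3": 155e6, "OC-12": 622e6, "OC-48": 2.5e9, "OC-192": 10e9}


def packets_per_second(link_rate_bps, packet_bits=PACKET_BITS):
    """Worst-case packet arrival rate at full line rate."""
    return link_rate_bps / packet_bits


def seconds_per_packet(link_rate_bps, packet_bits=PACKET_BITS):
    """Time budget for processing one packet at full line rate."""
    return packet_bits / link_rate_bps


def instruction_budget(link_rate_bps, mips, packet_bits=PACKET_BITS):
    """Instructions a processor with the given MIPS rating can spend per packet."""
    return mips * 1e6 * seconds_per_packet(link_rate_bps, packet_bits)


for name, rate in LINK_RATES_BPS.items():
    print(f"{name}: {packets_per_second(rate) / 1e6:.2f} Mpkt/s, "
          f"{seconds_per_packet(rate) * 1e9:.0f} ns per packet")

# Hypothetical 1000 MIPS processor on an OC-192 link: fewer than 40 instructions per packet.
print(f"{instruction_budget(10e9, 1000):.1f} instructions per packet")
```

Under these assumptions a 1000 MIPS processor has only a few dozen instructions per packet at OC-192, which is why the slide asks for tens of thousands of effective MIPS.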


Microprocessor SoC: PowerPC 405GP

[Block diagram: PowerPC 405GP microprocessor SoC. Blocks and labels shown:]

CPU with 32K I-Cache, 32K D-Cache, MMU, Trace, JTAG

Processor Local Bus (PLB): up to 133 MHz, 128-bit, with Arb and PLB Monitor

SRAM Ctlr. with 128 KB SRAM

DDR266 SDRAM Controller: 266 MHz, 32/64-bit with ECC

PCI-X Bridge: 66-133 MHz 64-bit PCI-X, 33-66 MHz 32/64-bit PCI; 128-bit master, 128-bit slave

10/100 Ethernet MAC with MAL: 1 MII or 2 RMII interfaces

DMA Controller

RAM/ROM/Peripheral controller and external bus master cntlr.: up to 66 MHz, 32-bit address / 32-bit data

OPB Bridge to the On-chip Peripheral Bus (OPB), 33-66 MHz: UART (2), I2C (2), GPIO

Timers, GPT, Interrupt Controller, 13 external interrupts

[Overlay: memory hierarchy. CPU, Cache (fast & small SRAM), Local Bus, slower & larger (S)DRAM, I/O Subsystem (SCSI, PCI, etc), Disk, Tape]


Real-World Case Studies

[Figure labels: Sonet/SDH Transmission, LAN/SAN Switch, Internet Router, Sonet/SDH Transmission]

Control Processors


Classification and Characteristics

Type         Application                  Characteristics                             Remarks
CISC         Personal Computer            Complex, variable-length instructions       Intel x86-based
RISC         Embedded control             Load/store instructions for memory access   MIPS, PowerPC
DSP          xDSL Modem                   HW multiply for digital filters             TI
VLIW         Set Top Box                  Instruction parallelism at compile time     Parallel video pixel processing
Superscalar  Network Protocol Processing  Instruction parallelism at run time
ASIP         Embedded control             Application-specific instructions           Tensilica


Implementation Strategies for SOC (1)

"Real" Component: System on Board

"Virtual" Component: System on Silicon


Implementation Strategies for SOC (2)

Soft VC Firm VC Hard VC

VHDL

Architectural extensions

Speed/Area optimized


Soft VC CPU in FPGA SOC

Example: XILINX MicroBlaze CPU

[Block diagram: MicroBlaze Core with SDRAM Ctrl., RS-232, GPIO (buttons), GPIO (LEDs), UserLogic (OPB-Master), Debug Logic, Local SRAM]

MicroBlaze: 32 bit RISC, 200 MHz, 166 DMIPS

Extensions: I-Cache, D-Cache, HW Multiplier

For comparison:

Hard VC PowerPC 405: 32 bit RISC, 400 MHz, 600 DMIPS
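To compare the two cores independently of their clock rates, the DMIPS figures can be normalized per MHz. The following is a small sketch that uses only the numbers quoted on the slide; the function and dictionary names are illustrative.

```python
# (clock in MHz, Dhrystone MIPS) as quoted on the slide.
CORES = {
    "MicroBlaze (soft VC)": (200, 166),
    "PowerPC 405 (hard VC)": (400, 600),
}


def dmips_per_mhz(clock_mhz, dmips):
    """Architectural efficiency: work done per clock cycle, independent of frequency."""
    return dmips / clock_mhz


for name, (clock_mhz, dmips) in CORES.items():
    print(f"{name}: {dmips_per_mhz(clock_mhz, dmips):.2f} DMIPS/MHz")
```

With these figures the hard core is ahead both in clock frequency (2x) and in work per cycle (1.5 versus 0.83 DMIPS/MHz).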


What is "Machine Structure"?

Coordination of many levels of abstraction

[Figure: stack of abstraction levels, top to bottom, with one region marked "Today's primary focus":]

Software: Applications, Operating System, Compiler, Assembler

Instruction Set Architecture

Hardware: Processor, Memory, I/O system

Datapath & Control

Digital Design

Circuit Design

transistors


Levels of Representation

High Level Language Program (e.g., C)

    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;

Compiler

Assembly Language Program (e.g., MIPS)

    lw $t0, 0($2)
    lw $t1, 4($2)
    sw $t1, 0($2)
    sw $t0, 4($2)

Assembler

Machine Language Program (MIPS)

    0000 1001 1100 0110 1010 1111 0101 1000
    1010 1111 0101 1000 0000 1001 1100 0110
    1100 0110 1010 1111 0101 1000 0000 1001
    0101 1000 0000 1001 1100 0110 1010 1111

Machine Interpretation

Control Signal Specification
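To check that the four MIPS load/store instructions really implement the C swap, they can be run on a toy interpreter. This is a minimal sketch, not a MIPS simulator: it only understands `lw` and `sw`, assumes 4-byte words, and assumes that register `$2` holds the byte address of `v[k]` with the array starting at address 0.

```python
# The swap sequence from the slide as (opcode, register, offset, base register).
SWAP = [
    ("lw", "$t0", 0, "$2"),
    ("lw", "$t1", 4, "$2"),
    ("sw", "$t1", 0, "$2"),
    ("sw", "$t0", 4, "$2"),
]


def run(program, regs, mem):
    """Execute lw/sw instructions on a register dict and a byte-addressed memory dict."""
    for op, reg, offset, base in program:
        addr = regs[base] + offset
        if op == "lw":
            regs[reg] = mem[addr]   # load word: memory -> register
        elif op == "sw":
            mem[addr] = regs[reg]   # store word: register -> memory
        else:
            raise ValueError(f"unsupported opcode: {op}")


def swap(v, k):
    """Swap v[k] and v[k+1] by running the MIPS sequence (array assumed at address 0)."""
    mem = {4 * i: x for i, x in enumerate(v)}
    regs = {"$2": 4 * k}
    run(SWAP, regs, mem)
    return [mem[4 * i] for i in range(len(v))]


print(swap([10, 20, 30, 40], 1))  # [10, 30, 20, 40]
```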


Instruction Set Architecture (ISA)

Defines the interface between software & hardware

Visible hardware state (registers & memory)

A set of instructions that operate on that state

Given an ISA

The hardware implements it

The software uses it

Old SW can use new HW and vice versa

Keep in mind the difference: ISA vs. HW implementation

x86: 80x86, Pentium

[Figure: the Instruction Set as the layer between Software & OS and Hardware]


ISA: What Programmers See

Instruction Set, Registers, Memory Address Space

[Figure: memory address space, labeled FFFxxx000 and 000xxxFFF]

Intel's most frequently used instructions [Hennessy]:
• Load
• Conditional branch
• Compare
• Store
• Add
• And
• Sub
• Move reg-reg

From a total instruction set of ~140

[Figure: i386 register set [Intel]]


Basic System Architecture

[Figure: processor with L1 cache, L2 cache and RAM]

Memory access:
Registers / L1 cache: 1 cycle
L2 cache: 10 cycles
ext. mem: 50 cycles

Spatial and temporal locality of data and code are the reasons why memory hierarchies perform well!
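The effect of locality can be made visible with a toy cache model. The sketch below simulates a direct-mapped cache and measures the hit rate for three access patterns; the cache geometry (64 lines of 32 bytes) and the access patterns are hypothetical values chosen for illustration, not taken from the slides.

```python
def hit_rate(addresses, num_lines=64, line_bytes=32):
    """Fraction of accesses that hit in a direct-mapped cache (hypothetical geometry)."""
    tags = [None] * num_lines          # block number currently stored in each cache line
    hits = 0
    for addr in addresses:
        block = addr // line_bytes     # which memory block the address belongs to
        index = block % num_lines      # the single cache line this block may occupy
        if tags[index] == block:
            hits += 1
        else:
            tags[index] = block        # miss: fetch the block, evicting the old one
    return hits / len(addresses)


# Spatial locality: walk an array word by word (8 words share one 32-byte line).
sequential = [4 * i for i in range(4096)]
# Temporal locality: loop ten times over a 1 KB array that fits into the 2 KB cache.
loop = [4 * i for i in range(256)] * 10
# No locality: every access touches a new block that maps to the same cache line.
strided = [2048 * i for i in range(4096)]

for name, trace in [("sequential", sequential), ("loop", loop), ("strided", strided)]:
    print(f"{name}: hit rate {hit_rate(trace):.4f}")
```

In this model the sequential walk hits 7 out of 8 times, the repeated loop misses only during its first pass, and the strided pattern never hits, so every access pays the full external memory latency.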


Look Inside

[Figure: microprocessor block diagram with ALU, accumulator, control, register block, status, program counter, instr cache, data cache, address i/o, data i/o, internal address bus, internal data bus, external address bus, external data bus]


Microprocessor Architecture

[Figure: microprocessor block diagram with ALU, accumulator, control, register block, status, program counter, instr cache, data cache, address i/o, data i/o, internal and external address and data buses]


Microprocessor Architecture

[Figure: microprocessor block diagram with ALU, accumulator, control, register block, status, program counter, instr cache, data cache, address i/o, data i/o, internal and external address and data buses, annotated with the instruction phases]

Instruction fetch (IF)

Instruction decode (ID)

Operand fetch (OF)

Execution (EX)

Write back (WB)


Microprocessor Architecture

[Figure: microprocessor block diagram with ALU, accumulator, control, register block, status, program counter, instr cache, data cache, address i/o, data i/o, internal and external address and data buses, annotated with the memory accesses]

Memory load (ld)

Memory store (st)


Pipelining

Sequential machine: CPI = 4 - 5

add r3,r2,r1     IF ID OF EX WB
ld r1,0(r0)      IF ID M WB
st r3, 4(r0)     IF ID OF M

Pipelined processor

add r3,r2,r1     IF ID OF EX M  WB
ld r1,0(r0)         IF ID OF EX M  WB
st r3, 4(r0)           IF ID OF EX M  WB

Individual instruction may take longer, …

… multiple instructions execute faster: CPI approaches 1
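The cycle counts behind the two diagrams can be sketched as follows. The assumptions are that every phase takes exactly one clock cycle and that the pipeline never stalls; the phase counts per instruction and the six pipeline stages are read off the slide.

```python
# Phases each instruction needs on the sequential machine, as drawn on the slide.
SEQUENTIAL_PHASES = {"add": 5, "ld": 4, "st": 4}
# Stages of the pipelined processor: IF ID OF EX M WB.
PIPELINE_STAGES = 6


def sequential_cycles(program):
    """Each instruction runs to completion before the next one starts."""
    return sum(SEQUENTIAL_PHASES[op] for op in program)


def pipelined_cycles(num_instructions, stages=PIPELINE_STAGES):
    """Fill the pipeline once, then complete one instruction per cycle (no stalls assumed)."""
    return stages + num_instructions - 1 if num_instructions else 0


def cpi(cycles, num_instructions):
    return cycles / num_instructions


program = ["add", "ld", "st"]
print(sequential_cycles(program))       # 13
print(pipelined_cycles(len(program)))   # 8

long_program = program * 1000
print(f"{cpi(sequential_cycles(long_program), len(long_program)):.3f}")
print(f"{cpi(pipelined_cycles(len(long_program)), len(long_program)):.3f}")
```

For three instructions the pipeline saves 5 of 13 cycles; for long instruction streams the fill time becomes negligible and the CPI approaches 1, even though each single instruction now passes through all six stages.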


CPU Pipeline


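[Figure: pipeline stages IF, ID, EXE, MEM, WB under a common Pipeline Control, separated by buffers (D flip-flops with inputs D, outputs Q, clocked by clk). Each stage contributes a logic delay ΣT_logic, each buffer a clock-to-Q delay T_c2q and a setup time T_stp.]

$$T_{clk} = T_{c2q} + T_{logic}^{max} + T_{stp}$$

$$f_{max} = \frac{1}{T_{clk}}$$

Single-scalar = 1 ALU, $CPI_{min} = 1.0$

$$\text{instr. rate [MIPS]} = \frac{f\,[\text{MHz}]}{CPI}$$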
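The timing relations on this slide translate directly into code. The sketch below computes the minimum clock period from the slowest stage and derives the maximum frequency and instruction rate; all delay values are hypothetical and only serve to illustrate the calculation.

```python
def min_clock_period_ns(t_c2q_ns, stage_logic_delays_ns, t_stp_ns):
    """T_clk = T_c2q + max(sum of logic delays per stage) + T_stp."""
    return t_c2q_ns + max(stage_logic_delays_ns) + t_stp_ns


def f_max_mhz(t_clk_ns):
    """Maximum clock frequency in MHz for a clock period given in ns."""
    return 1e3 / t_clk_ns


def instr_rate_mips(f_mhz, cpi):
    """Instruction rate [MIPS] = f [MHz] / CPI."""
    return f_mhz / cpi


# Hypothetical delays in ns; the slowest stage (EXE) limits the clock.
stage_delays = {"IF": 1.8, "ID": 1.5, "EXE": 2.2, "MEM": 2.0, "WB": 1.2}
t_clk = min_clock_period_ns(0.15, stage_delays.values(), 0.15)
f_max = f_max_mhz(t_clk)

print(f"T_clk = {t_clk:.2f} ns, f_max = {f_max:.0f} MHz")
print(f"{instr_rate_mips(f_max, 1.0):.0f} MIPS at CPI = 1.0")
```

The design consequence is that balancing the logic delay across stages matters more than speeding up an already fast stage, because only the slowest stage enters the formula.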


Pipelining

Prerequisite for effective pipelining:

Regularity in the sequence of individual instruction phases

Small, regular instruction set

Few, simple addressing modes

Deep pipelining:

Eases processor speed scaling

Increases vulnerability to pipeline problems

Data hazards

Branch conflicts


Data Hazard

Dependencies back in time cause data hazards

add r3,r2,r1     IF ID OF EX M  WB
sub r7,r3,r1        IF ID OF EX M  WB
and r6,r3,r2           IF ID OF EX M  WB

The sub and the and instruction both fetch operand r3 (OF) before add has written it back (WB)

Eliminate reverse time dependency by stalling

[Pipeline diagram: the same three instructions with stall cycles inserted, so that r3 is fetched only after add has written it back]
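How many stall cycles the add/sub/and sequence needs can be estimated with a small scheduling model. This is a sketch under stated assumptions, not the exact behavior drawn on the slide: the pipeline is IF ID OF EX M WB, instructions issue in order, there is no forwarding, and a result can be fetched in OF no earlier than the cycle after the producer's WB. A stall is modeled as a delayed start of the dependent instruction, which gives the same completion time as holding it inside the pipeline.

```python
# 0-based stage positions in the pipeline IF ID OF EX M WB.
OF_STAGE = 2
WB_STAGE = 5


def schedule(program):
    """Return the IF cycle of every instruction (in-order issue, no forwarding assumed)."""
    starts = []
    readable_from = {}   # register -> first cycle in which its new value may be fetched
    for _op, dest, sources in program:
        start = starts[-1] + 1 if starts else 0
        for reg in sources:
            if reg in readable_from:
                # Delay the instruction until its OF stage sees the written-back value.
                start = max(start, readable_from[reg] - OF_STAGE)
        starts.append(start)
        readable_from[dest] = start + WB_STAGE + 1
    return starts


def stall_cycles(program):
    """Total stall cycles compared with issuing one instruction per cycle."""
    starts = schedule(program)
    return starts[-1] - (len(starts) - 1) if starts else 0


PROGRAM = [
    ("add", "r3", ("r2", "r1")),
    ("sub", "r7", ("r3", "r1")),
    ("and", "r6", ("r3", "r2")),
]

print(schedule(PROGRAM))       # [0, 4, 5]
print(stall_cycles(PROGRAM))   # 3
```

In this model only sub has to wait; once it has been delayed, the and instruction behind it already finds r3 written back.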


Branching

bcctr r3         IF ID OF EX M  WB
shr r7, r1          IF ID OF EX M  WB
and r6,r3,r2           IF ID OF EX M  WB

Deviation from sequential program execution

Stall, or exploit advanced concepts like "branch prediction"

If r3 points back in address space, it's more likely that the branch is taken

[Figure: address space with addr1 and addr2; bcctr r3 with r3 = addr1]
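The heuristic on this slide (a branch whose target lies at a lower address is probably a loop and therefore probably taken) is a simple static prediction rule. The sketch below applies it to a hypothetical loop; the addresses and the iteration count are made up for illustration.

```python
def predict_taken(branch_addr, target_addr):
    """Static rule: predict taken if the target lies before the branch (backward jump)."""
    return target_addr < branch_addr


def accuracy(branch_addr, target_addr, outcomes):
    """Fraction of actual outcomes (True = taken) that match the static prediction."""
    guess = predict_taken(branch_addr, target_addr)
    return sum(1 for taken in outcomes if taken == guess) / len(outcomes)


# Hypothetical loop: the branch at 0x1040 jumps back to 0x1000 nine times, then falls through.
LOOP_BRANCH, LOOP_START = 0x1040, 0x1000
loop_outcomes = [True] * 9 + [False]

print(predict_taken(LOOP_BRANCH, LOOP_START))            # True
print(accuracy(LOOP_BRANCH, LOOP_START, loop_outcomes))  # 0.9
```

The rule mispredicts only the final loop exit, so its accuracy improves with the number of iterations.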


Performance

What is performance? Example: Porsche vs. bus from Munich to Stuttgart

Vehicle   Top speed [km/h]   Distance [km]   Travel time [h]   Capacity [person]   Throughput [pkm/h]
Porsche   260                200             0.77              2                   520
Bus       120                200             1.67              46                  5520

What matters in CPU performance:

Fastest possible execution of a single instruction?

Shortest program execution time (many instructions)?
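The two performance measures in the Porsche vs. bus example, latency (travel time) and throughput (person-kilometers per hour), can be recomputed from the table values. This is a small sketch with illustrative function names.

```python
DISTANCE_KM = 200  # Munich to Stuttgart, as on the slide

# vehicle -> (top speed in km/h, capacity in persons)
VEHICLES = {"Porsche": (260, 2), "Bus": (120, 46)}


def travel_time_h(distance_km, speed_kmh):
    """Latency: how long a single trip takes."""
    return distance_km / speed_kmh


def throughput_pkm_per_h(capacity_persons, speed_kmh):
    """Throughput: person-kilometers delivered per hour of driving."""
    return capacity_persons * speed_kmh


for name, (speed, capacity) in VEHICLES.items():
    print(f"{name}: {travel_time_h(DISTANCE_KM, speed):.2f} h, "
          f"{throughput_pkm_per_h(capacity, speed)} pkm/h")
```

The Porsche wins on latency by about a factor of two, while the bus wins on throughput by more than a factor of ten. The same tension exists between the execution time of a single instruction and the execution time of a whole program.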


Processor Performance

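Ultimately interested in CPU execution time: the time the CPU needs to complete a certain program, task or function

$$\text{CPU time} = \frac{\text{Clock cycles}}{\text{Program}} \times \frac{\text{Seconds}}{\text{Clock cycle}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Clock cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Clock cycle}}$$

Instructions / Program: specific for your application; estimate/count after compilation

Clock cycles / Instruction (CPI): processor architecture and memory hierarchy dependent

Seconds / Clock cycle = $1 / f_{cpu}$: processor data sheet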
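The CPU time equation is a plain product of three factors, which the following sketch evaluates. The instruction count, CPI and clock frequency used in the example call are hypothetical.

```python
def cpu_time_s(instruction_count, cpi, f_cpu_hz):
    """CPU time = Instructions/Program x Clock cycles/Instruction x Seconds/Clock cycle."""
    seconds_per_cycle = 1.0 / f_cpu_hz
    return instruction_count * cpi * seconds_per_cycle


# Hypothetical program: 2 million instructions, CPI of 1.5, 200 MHz clock.
print(f"{cpu_time_s(2e6, 1.5, 200e6) * 1e3:.1f} ms")
```

Because the factors multiply, halving the CPI has the same effect on execution time as doubling the clock frequency.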


Processor Performance

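$$CPI = CPI_{CPU} + CPI_{MEM}$$

$$CPI_{MEM} = CPI_{Iaccess} + CPI_{Daccess}$$

$$CPI_{MEM} = IFreq \times L1_{miss\_rate} \times (L1_{miss\_penalty} + L2_{miss\_rate} \times L2_{miss\_penalty}) + DaccFreq \times L1_{miss\_rate} \times (L1_{miss\_penalty} + L2_{miss\_rate} \times L2_{miss\_penalty})$$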


Processor Performance - Example

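Pipelined RISC CPU: $CPI_{CPU} = 1.2$

Two-level cache hierarchy:

$L1_{miss\_rate} = 5\%$; $L1_{miss\_penalty} = 10$ cycles

$L2_{miss\_rate} = 3\%$; $L2_{miss\_penalty} = 50$ cycles

$DaccFreq = 20\%$, and $IFreq = 1$ (one instruction fetch per instruction)

$CPI_{MEM} = 0.69$

The 0.15% of instruction/data accesses that go to system memory ($5\% \times 3\%$), together with the L1 misses served by L2, degrade overall performance (CPU execution time) by 57%:

$$\frac{CPI_{miss}}{CPI_{no\_miss}} = \frac{1.89}{1.2} = 1.57$$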
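The numbers of this example can be verified with a direct transcription of the CPI formula. The sketch assumes one instruction fetch per instruction (an instruction fetch frequency of 1), which is the value that reproduces the slide's result of 0.69; the parameter names are illustrative.

```python
def cpi_mem(ifreq, dacc_freq, l1_miss_rate, l1_miss_penalty, l2_miss_rate, l2_miss_penalty):
    """Memory stall cycles per instruction for a two-level cache hierarchy."""
    # Average penalty per memory access: an L1 miss costs the L1 penalty,
    # and the fraction that also misses in L2 additionally costs the L2 penalty.
    stall_per_access = l1_miss_rate * (l1_miss_penalty + l2_miss_rate * l2_miss_penalty)
    return (ifreq + dacc_freq) * stall_per_access


CPI_CPU = 1.2
memory_cpi = cpi_mem(ifreq=1.0, dacc_freq=0.20,
                     l1_miss_rate=0.05, l1_miss_penalty=10,
                     l2_miss_rate=0.03, l2_miss_penalty=50)
total_cpi = CPI_CPU + memory_cpi

print(f"CPI_MEM  = {memory_cpi:.2f}")             # 0.69
print(f"CPI      = {total_cpi:.2f}")              # 1.89
print(f"slowdown = {total_cpi / CPI_CPU:.3f}")    # 1.575
```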


Microprocessor Performance

[Xilinx]


How to Increase CPU Performance ?

• Pipelining

• Application specific ISA extensions

• Multiple ALUs and Control units

• Superscalar

• VLIW (Very Long Instruction Word)

• Multithreading

• Memory hierarchy design


DSP Architecture

HW multiply unit

[Figure: microprocessor block diagram with a Multiply unit, accumulator, control, register block, status, program counter, instr cache, data cache, address i/o, data i/o, internal and external address and data buses]


SIMD / MIMD

[Figure: microprocessor block diagram with Datapath, accumulator, control, register block, status, program counter, instr cache, data cache, address i/o, data i/o, internal and external address and data buses]

Single Instruction Multiple Data:
• single control / multiple datapaths

Multiple Instruction Multiple Data:
• multiple controls


VLIW – Very Long Instruction Word

[Figure: VLIW instruction issue]

Sequential program: ..., Instr i-2, Instr i-1, Instr i, Instr i+1, Instr i+2, ...

Optimizing Compiler

Very long instruction word: InstrDP1 | InstrDP2 | InstrDP3 | InstrDP4 | ... | InstrDPn-1 | InstrDPn

Datapath: DP 1, DP 2, DP 3, DP 4, ..., DP n-1, DP n, with shared Registers

Determined at compile time
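What "determined at compile time" means can be illustrated with a toy bundling pass that packs a sequential instruction stream into long instruction words. This is a deliberately simplified sketch and not how a production VLIW compiler works: it uses greedy list scheduling, treats every register conflict (read-after-write, write-after-write, write-after-read) as forcing a later bundle, and assumes that all datapaths can execute every instruction. The `or` instruction is added to the slide's examples for illustration.

```python
def pack_bundles(program, width):
    """Greedily pack instructions (op, dest, sources) into bundles of at most `width` slots."""
    bundles = []   # each bundle is a list of instruction indices issued together
    placed = []    # bundle number chosen for each instruction so far
    for j, (_op, dest, sources) in enumerate(program):
        earliest = 0
        for i, (_op_i, dest_i, sources_i) in enumerate(program[:j]):
            # Conservative rule: any RAW, WAW or WAR conflict forces a later bundle.
            if dest_i in sources or dest_i == dest or dest in sources_i:
                earliest = max(earliest, placed[i] + 1)
        b = earliest
        while b < len(bundles) and len(bundles[b]) >= width:
            b += 1                     # skip bundles whose slots are all taken
        while len(bundles) <= b:
            bundles.append([])
        bundles[b].append(j)
        placed.append(b)
    return bundles


PROGRAM = [
    ("add", "r3", ("r2", "r1")),
    ("sub", "r7", ("r3", "r1")),   # needs r3 from add
    ("and", "r6", ("r3", "r2")),   # needs r3 from add
    ("or",  "r8", ("r4", "r5")),   # independent of the others
]

print(pack_bundles(PROGRAM, width=2))  # [[0, 3], [1, 2]]
print(pack_bundles(PROGRAM, width=1))  # [[0], [1], [2], [3]]
```

With two datapaths the four instructions fit into two long instruction words, because the independent `or` fills the slot next to `add`. The hardware needs no dependency checking at run time, since the compiler has already resolved it.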


Superscalar

[Figure: superscalar pipeline]

Instr fetch (IF i ... IF i+3)

Instr pre-decode (ID1 i ... ID1 i+3)

Instr distribute

DP 1: ID2 i+2, OF i+2, EX i+2, WB i+2
DP 2: ID2 i, OF i, EX i, WB i
DP 3: ID2 i+3, OF i+3, EX i+3, WB i+3
DP 4: ID2 i+1, OF i+1, EX i+1, WB i+1

Decided at run time


Multithreading in Hardware

[Figure: microprocessor block diagram with ALU, accumulator, control, register block, status, program counter, instr cache, data cache, address i/o, data i/o, internal and external address and data buses]

Multiple register banks


Multithreading in Software

[Figure: microprocessor block diagram with ALU, accumulator, control, register block, status, program counter, instr cache, data cache, address i/o, data i/o, internal and external address and data buses, plus three additional sets of register block, status and program counter]

Load/save register status