sh-4r technical overview - dcemulationdcemulation.org/1-newsdump/qrandom/dc stuff/info/sh... · the...
TRANSCRIPT
SH-4R The High-End Multimedia architecture 2
SH-4R RISC Multimedia enhancements 2
Two-way Superscalar support - The leading performance per MHz 2
Integrated Floating Point Unit co-processor (FPU) - Embedding your PC Software 3
DSP/SIMD instruction - The true Multimedia acceleration 3
16-bit Instruction length - The smallest memory footprint 4
The SH-4R architecture features 5
Register configuration 5
Instruction set - The full compatibility 6
Flexible data format 6
Five-stage pipeline 6
Optimised cache architecture 7
Store queues - high speed burst transfers 7
Very large address space 8
Low power consumption 8
Optimised technology & quality 8
SH-4R Architecture
Technical Overview
2
SH-126 MIPS
SH-278 MIPS
CacheMAC
SH-3260 MIPS
MMU
SH-51000 MIPS
SIMPFPU
SH-DSP130 MIPS
DSP
SH-3-DSP260 MIPS
DSP
SH-62000 MIPS
VLIW
SH-4R>480 MIPS
Superscalar FPU
NEW
ROM FLASH
Deeply Embedded
MMU Cache
Multimedia Core
64-bit
Instruction Set Upward Compatibility
DSP S
uppo
rt
Instruction Queue
Decoder 1 Decoder 2
Integer Unit
LoadStoreUnit
FloatingPointUnit
BranchUnit
Dedicated Core for Ded i cated Markets
SH-4R Supersca lar Arch i tec ture
SH-4R – The High-end Multimedia architecture
The standard RISC (Reduced Instruction Set Computing) features
have been extended by Hitachi to create an enhanced microprocessor
core dedicated for embedded applications requiring DSP and
multimedia accelerations as well as high performance general
purpose functions.
The SH-4R core is a first-class choice for multimedia applications
such as car infotainment systems, game consoles, multimedia
terminals, video conferencing, interactive TV and many others.
Algorithms such as speech codecs, advanced audio MP3, voice
recognition, MPEG motion video processing and 3D-graphics
calculations benefit directly from this dedicated architecture.
SH-4R RISC Multimedia enhancements
The SH-4R architecture has been optimised to provide higher performances and lower power consumption at a reduced price, whilst keeping
the full software compatibility with the former SH4 core. This means that moving to the SH-4R is really the simplest way to boost your system
performance without adding additional cost associated with redesign of your hardware and software environment.
The main improvements in the SH-4R core are the higher frequency supported, and the change of the cache architecture (doubled in size and
two-way set associative). This considerably reduces the cache misses and significantly increases the overall system performance. Also the DMA
controllers have been extended to handle more media stream transfers.
The new technology reduces the power consumption considerably and allows running at a higher speed whist reducing the cost/performances
ratio.
Two-way Superscalar support – The leading performance per MHz
The Superscalar implementation allows the decoding and
execution of two instructions in parallel. This generates a
maximum performance of up to two instructions per cycle (1.5
instruction average) including FPU.
The Superscalar architecture allows an integer processing power of
more than 1.8MIPS/MHz (Dhrystone1.1)
Stage 1
A single DIF butterfly
Stage 4
16-tap, 40-sampleblock FIR(1.6 MACs/cycle)
1024-point,radix-2 FFT(35.4k cycles)
In1
In2
Out1
Out2-1 W=a+jb
= x + ... +
yi
yi+1
yi+2
yi+3
c0c1c2c3
xi
xi+1
xi+2
xi+3
xi+1
xi
xi+1
xi+2
xi-2
xi-1
xi
xi+1
xi-3
xi-2
xi-1
xi-
=x
In1_rIn1_iIn2_rIn2_i
Out1_rOut1_iOut2_rOut2_i
10ab
01-ba
10-a-b
01b-a
x
c12c13c14c15
xi-12
xi-11
xi-10
xi-9
xi-13
xi-12
xi-11
xi-10
xi-14
xi-13
xi-12
xi-11
xi-15
xi-14
xi-13
xi-12
rayd
Rnr1 r2 r3 0
d1
d2
d3
0
x=I
Multiplication3D graphicsgeometry
Multiplication
Surface judgementbrightness calculation source
Inner Product
A single SuperH instructionFTRV can perform thismatrix-vector multiplicationevery 4 cycles
A single SuperH instructionFIPR can perform thisinner product multiplicationevery cycle
= x
Y1Y2Y31
x1x2x31
c11c21c31c41
c12c22c32c42
c13c23c33c43
c14c24c34c44
3
Integrated FPU co-processor – Embedding your PC Software
The SH-4R Core includes an IEEE-754 compliant floating point co-processor, which supports single-precision (32-bits) and double-precision
(64-bits) numbers. The SH-4R FPU can process an impressive seven million polygons per MHz.
The FPU is a major building block required in systems running multimedia applications coming from the PC market. Most of the software
developers programming on a PC-based environment are using FPU variables for their convenience. The transfer of this software on the
SH-4R architecture will be straightforward and directly optimised by the compiler by using the FPU.
DSP / SIMD instruction – The true Multimedia acceleration
The matrix math enhancements contribute to the graphics quality and are also suited to accelerate certain signal-processing algorithms.
The SH-4R instruction set includes several SIMD (Single Instruction – Multiple Data) instructions which dramatically accelerate algorithms
based on vector and matrix arithmetic. Vectors and matrixes with a dimension of four are operands and coefficients of these single-issue
instructions. The 128-bit floating-point vector engine executes the highly efficient SIMD instructions like matrix operations (FTRV) and inner
product of two vectors (FIPR) required for 3D graphics processing and data (de)compression.
The other hardware DSP implementations supported are scalar square root, absolute value, divide, and multiply-and-accumulate operations.
SH-4R FPU used to accelerate MP3, MPEG-4, Speech Recognition, VoIP algorithms.SH-4R running at 240MHz decodes Windows Media, Audio and Video at 24 frames/sec, on a QVGA LCD and 16 bits per pixel.
Power fu l 4 -way FPU for 3 -D graph i c s
Versat i l i t y o f the SH-4R ar i thmet i c acce lerator
4
Lower memory footpr in t w i th SH
16-bit Instruction length - The smallest memory footprint
A fixed 16-bit instruction length offers a very high code density to save cost for storage memory (external RAM, ROM and cache) and solves
the bandwidth bottleneck problem of conventional 32-bit RISC architectures. On a 32-bit memory access, two instructions can be fetched in
parallel, reducing the necessary memory accesses by a factor of two.
There is an average code size ratio of 2:3 between a SuperH processor and other RISC processors. Providing a lower total system cost.
Memory footprint saved.
Effective memory footprintrequired with SH.
SDRAM
Flash
Standard RISC32-bit Opcode
SDRAM
Flash
SH RISC16-bit Opcode
The SH-4R architecture features
The SH-4R Core employs a common 32-bit RISC archi-
tecture designed by Hitachi to meet the requirement of
multimedia and high performance embedded applica-
tions.
The SH architecture has the following basic features:
• Load/Store
• Register orientation
• 32-bit internal data path
• RISC-type instruction set
• 4 Gbyte address space
• Five-stage RISC instruction pipeline
Register configuration
• General Purpose 32-bit Register bank
• 32-bit Control Registers
• 32-bit System Registers
• 32-bit Shadow registers
The SH architecture enables arithmetic and logical
instructions to operate normally on the 32-bit general-
purpose registers. Special load/store instructions are
provided to transfer data from memory to registers
and vice versa. The diagram on the right shows the
basic general purpose 32-bit register bank which is
used for source and destination operands. In addition
to the 16 registers the SH-4R has 8 x 32-bit shadow
registers, which can be accessed in the so-called
privilege mode. As well as the general purpose
registers, the SH architecture supports four system
registers providing a Program Counter (PC), Procedure
Register (PR), and two 32-bit Multiply and Accumulate
Registers (MACH/MACL). The block of control
registers contains the Status Register (SR), the Global
Base Register (GBR), the Vector Base Register (VBR),
the Saved Status Register (SSR) and the Saved Program
Counter (SPC).
5
I-Cache (16KB)
CPG
INTC
SCI(CH1,2)
RTC
TMU
26-bit Address 64-bit Data
SH-4 CPU Core FPUUBC
D-Cache (32KB)ITLB UTLBCache &TLB Controller
Lower 32-bit Data
Lower 32-bit Data
BSC DMAC
External BusInterface
32-b
it A
ddre
ss (
Inst
ruct
ion)
Per
iphe
ral A
ddre
ss
16-b
it P
erip
hera
l Dat
a B
us
32-b
it D
ata
(Ins
truc
tion
)
Upp
er 3
2-bi
t D
ata
32-b
it A
ddre
ss (
Dat
a)
29-b
it A
ddre
ss
32-b
it D
ata
32-b
it D
ata
Add
ress
64-b
it D
ata
64-b
it D
ata
64-b
it D
ata
(Sto
re)
32-b
it D
ata
(Sto
re)
32-b
it D
ata
(Loa
d)
29-bit Address
32-bit Data
32-bit Data
MACH (Multiply and Add Accumulator High)
MACL (Multiply and Add Accumulator Low)
PC (Procedure Register)
31
SR (Status Register)
GBR (Global Base Register)
VBR (Vector Base Register)
31
ROGENERAL PURPOSE
REGISTERS
CONTROLREGISTERS
SYSTEMREGISTERS
PROGRAMCOUNTER
R1
R13
R14
R2
31
PC (Program Counter)31
0
0
0
0
31
31
31
0
0
0
ALL
R15, SP (Stack Pointer)
SR (Saved Status Register)
RO (Shadow Registers)
R2
SPC (Saved Program Counter)
SH3, SH3-DSP, SH4
R7
SH7750R Funct iona l B lock Diagram
SH Arch i tec ture Genera l Purpose Reg i s ter Bank ,Contro l Reg i s ters and System Reg i s ters
6
IFIDEXMA
= Instruction Fetch= Instruction Decode= Execution= Memory Access
WBmm^)
= Write Back= Multiplier Operation= If instruction also uses the multiplier pipeline, contention can occur
Next instruction
Third instruction in series
: Slot
: Slot
IF ID EX .....
IF ID EX .....
Instruction A IF ID EX
Next instruction
Third instruction in series
.....
: Slot
IF ID EX .....
IF ID EX .....
IF ID EX .....
Instruction A IF ID EX
Next instruction^)
Third instruction^)
IF ID EX MA WB .....
IF ID EX MA WB .....
MULS.W IF ID EX MA mm mm
Inst ruc t ion P ipe l in ing Examples
Instruction set – The full upward compatibility
The instruction set has been carefully chosen to provide a high-
level language orientation, thus simplifying programming of the
individual devices.
A major strength of the SH-4R core is the instruction set’s
upward compatibility with SH1, SH2 and SH3, which allows
customers to move easily from one SH core to another. Each
SH core is designed and targeted for specific applications with
different requirements such as integration, performance and
cost.
The SH-4R supports up to 91 types of instructions including data
transfer, arithmetic operations (with on-chip multiplier), logic
operations, shift, branch, system control and DSP/FPU classes.
The SH-4R C and C++ compilers are optimised to use the
maximum benefit of each class of instructions.
Flexible data format
Memory data formats are classified into bytes,
words and longwords. The SH-4R supports big
and little endian mode. This choice is a major
advantage to simplify and ease the connection of
external devices around the SH-4R processors
Five-stage pipeline
The advantage of using a RISC approach can be seen by the
pipeline mechanism, allowing very high clock frequencies. The
pipelining mechanism of the SH provides a single cycle peak
throughput for the basic instructions (two in the case of the
Superscalar SH-4R). The pipeline is automatically reduced if an
instruction does not need all stages, and extended if an instruction
needs some more latency cycles to be completed or if pipeline
contention occurs. To reduce pipeline penalties, a delay-slot
mechanism has been provided, reducing pipeline-breakage.
SH architecture provides a full software compatibility from 10MIPS to more than 1000 MIPS
SH-156 types
SH-262 types
SH-368 types
SH-491 types
32-bit multiplier/
accumulator
MMU instructions
SH3-DSP 160 typesMMU & DSP instructions
SH-DSP 154 typesDSP instructions
FPU, Graphicsinstructions
Word0
Longword
Byte0 Byte1 Byte2 Byte3
Word1
Rn
31 23
AddressA
AddressA
Big endian
AddressA+4
AddressA+8
AddressA+8
AddressA+4
AddressA
AddressA+1
15
AddressA+2
7 0
AddressA+3
Word1
Longword
Byte3 Byte2 Byte1 Byte0
Word0
31 23
AddressA+11
Little endian
AddressA+10
15
AddressA+9
7 0
AddressA+8
Inst ruc t ion Set Upward Compat ib i l i t y
Memory data formats , Byte , Word and Longword Al ignment
Running at 200MHzWind River Systems Inc.,JWorks 4.0Operating System: VxWorks
SH7750R
SH7750S
800
Com
pres
sion
Tim
e
700
600
500
400
300
200
100
0
_202
_jess
_201
_com
pres
s
_202
_jack
_202
_java
c
_202
_db
_222
_mpe
gaud
io
PerformanceImprovements
27%13%
30% 16%
23%19%
7
Optimised cache architecture
The SH-4R cache architecture has been significantly optimised over the SH4 Core to boost the real time capabilities and overall system
performances.The instruction cache and operand cache are handled separately. The global cache architecture is a two-way set associative to
reduce cache misses, whilst keeping the latency to a minimum. The cache is doubled (16Kbyte instruction cache and 32KB operand cache).
The SH-4R includes large on-chip caches including copy-back and write-through
buffers to close the performance gap between the processing speed of the CPU and
the bandwidth of external SDRAMs.
This optimisation is especially important for applications using intensive memory
transfers, such as Java-based systems. Significant performance improvements are
achieved by optimising the SH4 cache architecture.
Impressive results of the new cache architecture running on operating systems suchas Linux, VxWorks™, Windows®CE or QNX
Direct
Two-way
Four-way
2 4 8 16 32Cache size
Miss rate
Fast data t rans fers to Frame buf fers us ing the SH-4R Store Queues
Spec JVM98 Benchmark - Compress ion t ime
Store queues – high speed burst transfers
The SH4(R) supports two 32-byte store
queues (SQ) to perform high-speed burst
writes to external memory. While the
contents of one SQ are being transferred to
external memory, the other SQ can be
written to without a penalty cycle. This
functionality is especially useful to transfer
video, graphic or display list data to the
frame buffer of the graphic processor.
GeometryProcessor
Direct Access
Display Controller
Display
Frame Buffer
SDRAM
SH-4R CP
U In
terf
ace
Loca
l SH
Bus
Fram
e B
uffe
rIn
terf
ace
Graphic Processor
Primitives of
Display List
SQ 1-32 Bytes
SQ 2-32 Bytes
Two-way se t -assoc ia t i ve cache :for opt imised la tency/cache miss ba lance
August 2002 Printed in Europe 19-040B
EUROPEAN HEADQUARTERS
Hitachi Europe Ltd.Whitebrook Park, Lower Cookham Road, Maidenhead, Berkshire SL6 8YA United KingdomTel: +44-1628 585000 Fax: +44-1628 585160 Email: [email protected]
(Please visit our website for contact details of our Hitachi Sales Offices in EMEA)
www.hitachi-eu.com/semiconductors/
Low power consumption
SH is a low power processor family designed to produce
best-in-class performance/watt ratio. The SH-4R core
features an advanced power saving mechanism with built-in
power down modes:
• Sleep, on-chip peripherals still run
• Standby, on-chip peripherals halt
• Module stand-by, specified module haltLatest technology further reduces the low power consumption ofthe SH-4R
Area 0
26-bitExternal Memory Space
32-bit32-bit
0x00000000
0x80000000
0xA0000000
0xC0000000
0xE0000000
2 GByte virtual space,
cacheable
0.5 GByte cacheable
0.5 GByte non-cacheable
0.5 GByte virtual cacheable
0.5 GByte control non-cacheable
2 GByte virtual space,
cacheableArea U0
0xE 000 0000
0xE 400 0000Address errorUser ModePrivileged Mode
Address error
Store queue area
Area 1
Area 2
Area 3
Area 4
Area 5
Area 6
Area 7
Area P0
Area P1
Area P2
Area P3
Area P4
Physical AddressSpace
Large SH-4R Memory Map
Very large address space
The large address space of the SH-4R facilitates the connection/mapping of external devices. The uniform and unsegmented address space
is managed by a Memory Management Unit (MMU) which translates addresses between the physical address space and the virtual one
required by operating systems. Up to 4Gbytes address space (448-Mbyte external memory space) is available.
The SH-4R architecture incorporates two processor modes: user mode and privileged mode. Normal program execution is done in user mode,
privileged mode is normally entered when an exception occurs.
Optimised technology and & quality
The SH-4R core from Hitachi is a leading architecture recognised
in applications running in difficult environments such as
automotive. Over the years, Hitachi has built a strong expertise by
providing high reliability devices in extended operating conditions
on the latest technology process. That is why the SH-4R is also
qualified for wide temperature ranges (-40C to + 85C).