Title: Hitachi SuperH SH-4
1. Hitachi SuperH SH-4
- By Herman Sheremetyev
- 5/10/2002
2. Inspiration
I was inspired to do this presentation on the
Hitachi SH-4 processor because this is the
processor used in the Sega Dreamcast video game
system. I own a Dreamcast and after being
assigned this project I became very interested in
its internal workings. As a result of my
research I found that there was quite a bit of
software ported to this platform, starting with a
NetBSD port and followed by a Linux port which
can actually transform the Dreamcast into a
usable X terminal. These ports were possible
largely because Hitachi released the complete
specifications as well as a Programmer's Manual
for the processor. What follows are excerpts from
the Hitachi Hardware Manual that briefly describe
the SH-4's most interesting aspects, which I
loosely tailored to the Dreamcast implementation.
3. Sources
- Most of the information in this presentation is
taken from the Hitachi Hardware Manual on the SH-4
family of processors
- The manual can be found at
http://www.julesdcdev.com/ and probably on the
Hitachi website
4. Features Summary
- The SH-4 (SH7750 Series (SH7750, SH7750S)) has
been developed as the top-end model in the
SuperH RISC engine family, featuring a 128-bit
graphic engine for multimedia applications and
360 MIPS performance.
5. Features
- In addition to single- and double-precision
floating-point operation capability, the on-chip
FPU has a 128-bit graphic engine that enables
32-bit floating-point data to be processed 128
bits at a time.
- It also supports 4 × 4 array operations and inner
product operations, enabling a performance of 1.4
GFLOPS to be achieved.
6. Features
- Operating frequency is 200 MHz.
- A superscalar architecture is employed that
enables simultaneous execution of two
instructions (including FPU instructions).
- An 8-kbyte instruction cache and 16-kbyte data
cache are also provided, and the on-chip memory
management unit (MMU) handles translation from
the 4-Gbyte virtual address space to the physical
address space.
7. Registers
- Sixteen 32-bit general registers (and eight
32-bit shadow registers)
- Seven 32-bit control registers
- Four 32-bit system registers
- Register operands are always longwords (32 bits).
When a memory operand is only a byte (8 bits) or a
word (16 bits), it is sign-extended into a
longword when loaded into a register.
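The sign-extension rule above can be sketched in a few lines. This is only an illustration of the arithmetic, not hardware behaviour taken from the manual; the helper name `sign_extend` is made up for this example.

```python
def sign_extend(value: int, bits: int) -> int:
    """Sign-extend the low `bits` bits of `value` into a 32-bit register image."""
    mask = (1 << bits) - 1
    value &= mask                      # keep only the byte or word that was loaded
    sign_bit = 1 << (bits - 1)
    if value & sign_bit:
        # Sign bit set: fill the upper bits of the longword with ones.
        value |= 0xFFFFFFFF & ~mask
    return value

print(hex(sign_extend(0x80, 8)))      # 0xffffff80
print(hex(sign_extend(0x7F, 8)))      # 0x7f
print(hex(sign_extend(0x8000, 16)))   # 0xffff8000
```

A byte of 0x80 therefore arrives in the register as a negative longword, while 0x7F stays positive.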
8. Data Formats in Memory
- Memory data formats are classified into bytes,
words, and longwords. Memory can be accessed in
8-bit byte, 16-bit word, or 32-bit longword form.
A memory operand less than 32 bits in length is
sign-extended before being loaded into a
register.
- A word operand must be accessed starting from a
word boundary (even address of a 2-byte unit:
address 2n), and a longword operand starting from
a longword boundary (even address of a 4-byte
unit: address 4n). An address error will result
if this rule is not observed.
- A byte operand can be accessed from any address.
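As a rough sketch of the alignment rule, the check below treats an access as legal only when the address is a multiple of the operand size. The names `OPERAND_BYTES` and `is_aligned` are hypothetical; the real processor raises an address error exception rather than returning a flag.

```python
# Operand sizes in bytes, as listed on this slide.
OPERAND_BYTES = {"byte": 1, "word": 2, "longword": 4}

def is_aligned(address: int, kind: str) -> bool:
    """True if `address` is a legal starting address for the given operand kind."""
    return address % OPERAND_BYTES[kind] == 0

print(is_aligned(0x1001, "byte"))      # True: bytes can start anywhere
print(is_aligned(0x1002, "word"))      # True: address 2n
print(is_aligned(0x1002, "longword"))  # False: not address 4n
```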
9. Endianness
- Big endian or little endian byte order can be
selected for the data format. Big endian is the
preferred method of operation.
- The byte order cannot be changed dynamically.
- Bit positions are numbered left to right from
most-significant to least-significant. Thus, in a
32-bit longword, the leftmost bit, bit 31, is the
most significant bit and the rightmost bit, bit
0, is the least significant bit.
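To make the two byte orders and the bit numbering concrete, here is a small sketch using Python's standard `struct` module. It only shows how the same longword is laid out in memory under each convention; the helper `bit` is made up for this example.

```python
import struct

value = 0x12345678

big = struct.pack(">I", value)     # most significant byte at the lowest address
little = struct.pack("<I", value)  # least significant byte at the lowest address

def bit(longword: int, n: int) -> int:
    """Return bit n of a 32-bit longword (bit 31 is the MSB, bit 0 the LSB)."""
    return (longword >> n) & 1

print(big.hex())     # 12345678
print(little.hex())  # 78563412
print(bit(0x80000000, 31), bit(0x80000000, 0))  # 1 0
```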
10. Operand and Instruction Caches
- The operand cache consists of 512 cache lines,
each composed of a 19-bit tag, validity bit (V),
dirty bit (U), and 32-byte data.
- The instruction cache consists of 256 cache
lines, each composed of a 19-bit tag, validity
bit (V), and 32-byte data (16 instructions).
- (Tag: stores the upper 19 bits of the 29-bit
external memory address of the data line to be
cached.)
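The line counts above are consistent with the cache sizes quoted earlier (512 × 32 bytes = 16 kbytes, 256 × 32 bytes = 8 kbytes). The sketch below checks that arithmetic and splits an address into tag, line index, and byte offset. The tag position (upper 19 bits of a 29-bit address) comes from the slide; taking the line index from the bits directly above the 5-bit offset is an assumption of this example, as are all the names.

```python
LINE_BYTES = 32          # bytes of data per cache line (from the slide)
OPERAND_LINES = 512      # operand cache lines
INSTRUCTION_LINES = 256  # instruction cache lines

def cache_bytes(lines: int) -> int:
    """Total data capacity of a cache with the given number of lines."""
    return lines * LINE_BYTES

def split_address(address: int, lines: int = OPERAND_LINES):
    """Return (tag, index, offset) for a 29-bit external memory address."""
    address &= (1 << 29) - 1
    offset = address & (LINE_BYTES - 1)   # byte within the 32-byte line
    index = (address >> 5) & (lines - 1)  # assumed: index bits sit just above the offset
    tag = address >> 10                   # upper 19 bits of the 29-bit address
    return tag, index, offset

print(cache_bytes(OPERAND_LINES))      # 16384
print(cache_bytes(INSTRUCTION_LINES))  # 8192
print(split_address(0x0C001234))       # (196612, 145, 20)
```

Note that under this assumption the tag and index overlap by a few bits, since 19 + 9 + 5 is more than 29.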
11. Cache-Memory coherence
- Coherency between cache and external memory
should be assured by software.
- Several cache operation instructions are
provided, including a prefetch instruction.
12. Cache operations (operand cache only)
- Invalidate instruction: OCBI @Rn. Cache
invalidation (no write-back)
- Purge instruction: OCBP @Rn. Cache invalidation
(with write-back)
- Write-back instruction: OCBWB @Rn. Cache
write-back
- Allocate instruction: MOVCA.L R0,@Rn. Cache
allocation
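The difference between invalidate, purge, and write-back is easiest to see in a toy model. The sketch below is a deliberately simplified software model, not the SH-4 cache: `ToyCache`, its dict-based "memory", and the `store` helper are all invented for illustration, and MOVCA.L is not modelled.

```python
class ToyCache:
    """Toy copy-back cache in front of a dict that stands in for external memory."""

    def __init__(self, memory: dict):
        self.memory = memory
        self.lines = {}  # address -> {"data": value, "dirty": bool}

    def store(self, address: int, value: int) -> None:
        # A store updates only the cache and marks the line dirty.
        self.lines[address] = {"data": value, "dirty": True}

    def ocbi(self, address: int) -> None:
        # Invalidate: drop the line, any dirty data is lost.
        self.lines.pop(address, None)

    def ocbwb(self, address: int) -> None:
        # Write-back: copy dirty data to memory, keep the line valid.
        line = self.lines.get(address)
        if line and line["dirty"]:
            self.memory[address] = line["data"]
            line["dirty"] = False

    def ocbp(self, address: int) -> None:
        # Purge: write back if dirty, then invalidate.
        self.ocbwb(address)
        self.ocbi(address)

memory = {0x100: 1}
cache = ToyCache(memory)
cache.store(0x100, 2)
cache.ocbi(0x100)
print(memory[0x100])  # 1: the dirty value was discarded
cache.store(0x100, 3)
cache.ocbp(0x100)
print(memory[0x100])  # 3: the dirty value reached memory before invalidation
```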
13. Floating Point Unit (FPU)
- Conforms to the IEEE 754 standard
- 32 single-precision floating-point registers (can
also be referenced as 16 double-precision
registers)
- Two rounding modes: Round to Nearest and Round to
Zero
- Two denormalization modes: Flush to Zero and
Treat Denormalized Number
- Six exception sources: FPU Error, Invalid
Operation, Divide By Zero, Overflow, Underflow,
and Inexact
- Comprehensive instructions: single-precision,
double-precision, graphics support, system
control
14. FPU Data Formats
- A floating-point number consists of the following
three fields:
- Sign (s)
- Exponent (e)
- Fraction (f)
- 32-bit Single-Precision (s1, e8, f23)
- 64-bit Double-Precision (s1, e11, f52)
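The field widths listed above can be checked by pulling a value apart with Python's standard `struct` module. This is a sketch of the IEEE 754 layout in general, not anything specific to the SH-4; the function names are made up for this example.

```python
import struct

def decode_single(x: float):
    """Return (sign, exponent, fraction) of x as a 32-bit single (s1, e8, f23)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

def decode_double(x: float):
    """Return (sign, exponent, fraction) of x as a 64-bit double (s1, e11, f52)."""
    bits = struct.unpack(">Q", struct.pack(">d", x))[0]
    return bits >> 63, (bits >> 52) & 0x7FF, bits & ((1 << 52) - 1)

print(decode_single(1.0))   # (0, 127, 0)
print(decode_single(-2.5))  # (1, 128, 2097152)
print(decode_double(1.0))   # (0, 1023, 0)
```

The exponent fields are biased, which is why 1.0 shows up with exponent 127 in single precision and 1023 in double precision.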
15. FPU Rounding
- Round to Nearest: The value is rounded to the
nearest expressible value. If the unrounded
value is $2^{E_{max}}(2 - 2^{-P})$ or more, the result
will be infinity with the same sign as the
unrounded value.
- Round to Zero: The digits below the round bit of
the unrounded value are discarded. If the
unrounded value is larger than the maximum
expressible absolute value, the value will be the
maximum expressible absolute value.
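The two rounding modes can be compared in software. The sketch below relies on the fact that `struct.pack` with the `f` format converts a Python double to the nearest single-precision value, and derives round-to-zero from that by stepping one representable value back toward zero when needed. It also evaluates the overflow threshold 2^Emax × (2 − 2^−P), assuming Emax = 127 and P = 24 for single precision (values taken from the IEEE 754 single format, not from the slide). All names are illustrative, and inputs outside the finite single-precision range are not handled.

```python
import struct

def _bits(x: float) -> int:
    # Converts the double to single precision, rounding to nearest.
    return struct.unpack(">I", struct.pack(">f", x))[0]

def _value(bits: int) -> float:
    return struct.unpack(">f", struct.pack(">I", bits))[0]

def round_to_nearest(x: float) -> float:
    return _value(_bits(x))

def round_to_zero(x: float) -> float:
    nearest = round_to_nearest(x)
    if abs(nearest) > abs(x):
        # Nearest rounded away from zero: step back one representable value.
        # Decrementing the bit pattern shrinks the magnitude for either sign.
        nearest = _value(_bits(nearest) - 1)
    return nearest

E_MAX, P = 127, 24  # assumed single-precision parameters
overflow_threshold = 2.0 ** E_MAX * (2 - 2.0 ** -P)
largest_single = 2.0 ** E_MAX * (2 - 2.0 ** -(P - 1))

x = 1 + 2.0 ** -24 + 2.0 ** -30  # just above the midpoint of two singles
print(round_to_nearest(x) == 1 + 2.0 ** -23)  # True
print(round_to_zero(x) == 1.0)                # True
print(largest_single < overflow_threshold < 2.0 ** 128)  # True
```

The last line shows that the threshold sits halfway between the largest finite single-precision value and the next power of two.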
16. FPU Graphics Support
- The SH7750 Series supports two kinds of graphics
functions:
- instructions for geometric operations
- pair single-precision transfer instructions that
enable high-speed data transfer.
17. FPU Geometric functions
- Geometric operation instructions perform
approximate-value computations. To enable
high-speed computation with a minimum of
hardware, the SH7750 Series ignores comparatively
small values in the partial computation results
of four multiplications.
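The "four multiplications" refer to operations like the inner product and 4 × 4 array operations mentioned in the feature list. The sketch below shows only the mathematics: a 4-element inner product and a 4 × 4 matrix applied to a 4-vector as four inner products. It runs in ordinary double precision and does not reproduce the hardware's approximation; the function names are made up for this example.

```python
def inner_product(a, b):
    """4-element inner product: four multiplications followed by a sum."""
    assert len(a) == len(b) == 4
    return sum(x * y for x, y in zip(a, b))

def transform(matrix, vector):
    """Apply a 4x4 matrix to a 4-vector as four inner products."""
    return [inner_product(row, vector) for row in matrix]

scale_and_shift = [
    [2.0, 0.0, 0.0, 1.0],  # x' = 2x + 1
    [0.0, 2.0, 0.0, 0.0],  # y' = 2y
    [0.0, 0.0, 2.0, 0.0],  # z' = 2z
    [0.0, 0.0, 0.0, 1.0],  # w' = w
]
print(inner_product([1.0, 2.0, 3.0, 4.0], [5.0, 6.0, 7.0, 8.0]))  # 70.0
print(transform(scale_and_shift, [1.0, 2.0, 3.0, 1.0]))  # [3.0, 4.0, 6.0, 1.0]
```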
18. FPU Pair Single-Precision Data Transfer
- In addition to the geometric operation
instructions, the SH7750 Series also supports
high-speed data transfer instructions.
- These instructions enable two single-precision (2
× 32-bit) data items to be transferred; that is,
the transfer performance of these instructions is
doubled.
19. Instruction Format
- The instruction set is implemented with 16-bit
fixed-length instructions.
- Operations are basically executed using
registers.
- Except for bit-manipulation operations such as
logical AND that are executed directly in memory,
operands in an operation that requires memory
access are loaded into registers and the
operation is executed between the registers.
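To give a feel for how a register-to-register operation fits into 16 bits, the sketch below packs two register numbers and an opcode into one halfword. The bit layouts used here (ADD Rm,Rn as 0011 nnnn mmmm 1100 and MOV Rm,Rn as 0110 nnnn mmmm 0011) are quoted from memory of the SuperH programming manual rather than from this presentation, so treat them as an assumption to verify; the function names are hypothetical.

```python
def _check(reg: int) -> None:
    if not 0 <= reg <= 15:
        raise ValueError("register number must be 0..15")

def encode_add(rm: int, rn: int) -> int:
    """ADD Rm,Rn -> 0011 nnnn mmmm 1100 (layout assumed, see text)."""
    _check(rm)
    _check(rn)
    return 0x300C | (rn << 8) | (rm << 4)

def encode_mov(rm: int, rn: int) -> int:
    """MOV Rm,Rn -> 0110 nnnn mmmm 0011 (layout assumed, see text)."""
    _check(rm)
    _check(rn)
    return 0x6003 | (rn << 8) | (rm << 4)

def nibbles(code: int):
    """Split a 16-bit instruction word into its four 4-bit fields."""
    return (code >> 12) & 0xF, (code >> 8) & 0xF, (code >> 4) & 0xF, code & 0xF

print(hex(encode_add(1, 2)))        # 0x321c
print(hex(encode_mov(1, 2)))        # 0x6213
print(nibbles(encode_add(1, 2)))    # (3, 2, 1, 12)
```

With sixteen general registers, each register operand needs exactly four bits, which leaves eight bits for the opcode in a two-register instruction.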
20. Instruction Format (cont'd)
- Delayed Branches: Except for the two branch
instructions BF and BT, branch instructions and
RTE are delayed branches. (In a delayed branch,
the instruction following the branch is executed
before the branch destination instruction.)
- Constant Values: An 8-bit constant value can be
specified by the instruction code and an
immediate value. 16-bit and 32-bit constant
values can be defined as literal constant values
in memory.
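The delayed-branch rule is easier to follow with a trace. The toy interpreter below is not an SH-4 simulator: the program format and the `run` function are invented for this example, only an unconditional delayed branch (`bra`) is modelled, and a branch inside a delay slot is not handled. It records the order in which instruction slots execute.

```python
def run(program):
    """Execute a toy program and return the indices of instructions in execution order."""
    trace = []
    pc = 0
    while pc < len(program):
        op, target = program[pc]
        trace.append(pc)
        if op == "bra":
            # Delayed branch: the next instruction (the delay slot)
            # executes before control reaches the branch target.
            if pc + 1 < len(program):
                trace.append(pc + 1)
            pc = target
        elif op == "halt":
            break
        else:
            pc += 1
    return trace

program = [
    ("nop", None),   # 0
    ("bra", 4),      # 1: delayed branch to index 4
    ("add", None),   # 2: delay slot, still executed
    ("sub", None),   # 3: skipped by the branch
    ("halt", None),  # 4: branch destination
]
print(run(program))  # [0, 1, 2, 4]
```

Compilers and assembly programmers typically fill the delay slot with useful work so that the branch costs no wasted instruction.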
21. Addressing Modes
- Register direct
- Register indirect (supports post-increment,
pre-decrement, and displacement)
- Indexed register indirect, i.e. the effective
address is the sum of the contents of registers
Rn and R0.
- Immediate
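As a sketch of how some of these modes compute an effective address, the helpers below operate on a dict standing in for the register file. The function names are made up, and the displacement is given directly in bytes as a simplification; any scaling the hardware applies to the encoded displacement is ignored here.

```python
MASK = 0xFFFFFFFF  # addresses wrap at 32 bits

def register_indirect(regs, rn):
    """@Rn: the register holds the address."""
    return regs[rn]

def post_increment(regs, rn, size):
    """@Rn+: use the address, then advance the register by the operand size."""
    address = regs[rn]
    regs[rn] = (address + size) & MASK
    return address

def pre_decrement(regs, rn, size):
    """@-Rn: step the register back by the operand size, then use it."""
    regs[rn] = (regs[rn] - size) & MASK
    return regs[rn]

def displacement(regs, rn, disp_bytes):
    """@(disp,Rn): register plus a byte displacement (simplified)."""
    return (regs[rn] + disp_bytes) & MASK

def indexed(regs, rn):
    """@(R0,Rn): sum of the contents of Rn and R0."""
    return (regs[rn] + regs["R0"]) & MASK

regs = {"R0": 0x10, "R1": 0x1000}
print(hex(post_increment(regs, "R1", 4)), hex(regs["R1"]))  # 0x1000 0x1004
print(hex(pre_decrement(regs, "R1", 4)))                    # 0x1000
print(hex(displacement(regs, "R1", 8)))                     # 0x1008
print(hex(indexed(regs, "R1")))                             # 0x1010
```

Post-increment and pre-decrement together give the usual push and pop pattern for a stack that grows downward.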
22. Instruction Set
- Over 100 different instructions including FP,
mostly variations on MOV, ADD, etc. to
accommodate different addressing modes.
- Instruction mnemonic format:
- OP.Sz SRC,DEST
- OP: Operation code
- Sz: Size
- SRC: Source
- DEST: Source and/or destination operand
23. Instruction Level Parallelism
- The SH7750 Series is a 2-ILP
(instruction-level-parallelism) superscalar
pipelining microprocessor.
- Instruction execution is pipelined, and two
instructions can be executed in parallel.
- Parallel execution depends on the instructions;
not all instructions can be executed in parallel
with all others.
24. Pipelining
- The instruction pipeline has 5 stages:
- Instruction fetch (I)
- Decode and register read (D)
- Execution (EX/SX/F0/F1/F2/F3)
- Data access (NA/MA)
- Write-back (S/FS)
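A back-of-the-envelope model shows why a 5-stage pipeline with two-way issue helps. The sketch below assumes an idealized pipeline with no stalls, where every pair of instructions can issue together; real code will do worse because not all instructions can be paired. The function name and parameters are made up for this example.

```python
import math

STAGES = 5  # fetch, decode, execute, data access, write-back

def ideal_cycles(instructions: int, issue_width: int = 1) -> int:
    """Cycles to finish a stall-free instruction stream in an idealized pipeline."""
    if instructions == 0:
        return 0
    # The first issue group needs STAGES cycles to drain;
    # each further group completes one cycle later.
    groups = math.ceil(instructions / issue_width)
    return groups + STAGES - 1

print(ideal_cycles(10))     # 14 with single issue
print(ideal_cycles(10, 2))  # 9 with ideal two-way issue
```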
25. ILP Illustration
http://www.hitachisemiconductor.com/sic/jsp/japan/eng/products/mpumcu/32bit/image/2_way.gif
26. Direct Memory Access
- The SH7750 Series includes an on-chip
four-channel direct memory access controller
(DMAC).
- The DMAC can be used in place of the CPU to
perform high-speed data transfers among external
devices equipped with DACK (DMA transfer end
notification), external memories, memory mapped
external devices, and on-chip peripheral modules
(except the DMAC, BSC, and UBC).
- Using the DMAC reduces the burden on the CPU and
increases the operating efficiency of the chip.
27. Serial Communication Interface (SCI)
- The SH7750 is equipped with a single-channel
serial communication interface (SCI) and a
single-channel serial communication interface with
built-in FIFO registers (SCI with FIFO: SCIF).
- The SCI can handle both asynchronous and
synchronous serial communication. A function is
also provided for serial communication between
processors (multiprocessor communication
function).
28. Smart Card Interface
- An IC card (smart card) interface conforming to
ISO/IEC 7816-3 (Identification Card) is supported
as a serial communication interface (SCI)
extension function.
- Switching between the normal serial communication
interface and the smart card interface is carried
out by means of a register setting.