Title: Computer architecture
1Computer architecture
- Lecture 6: Processor structure
- Piotr Bilski
2Processor tasks
- Instruction fetching
- Instruction interpretation
- Data fetching
- Data processing
- Data saving
- These tasks justify the existence of registers (temporary storage space)
3Internal processor structure
- ALU: status flags, shifter, complementer, arithmetic and Boolean logic
- Registers
- Control Unit
4Block diagram of the Pentium 3 processor
5Block diagram of the P6 core (Pentium Pro), 1995
- Front-end of the processor
- Core
- Completion unit
6Register types
- Accessible to the user (addressing, data etc.)
- Inaccessible to the user (control, status)
- This categorization is not formal!
7Registers accessible by the user
- General Purpose Registers (GPR)
- Data
- Addressing (segment pointer, stack, indexing)
- Condition codes (status indicators, flags): read-only!
8Control and status registers
- Basic: Program Counter (PC), Instruction Register (IR), Memory Address Register (MAR), Memory Buffer Register (MBR)
- Program Status Word (PSW)
- Interrupt Vector Register
- Page Table Pointer
9Program Status Word
(Figure: 16-bit PSW layout, bits 0 to 15, with the fields S, Z, P, R, O, I, N and OTHER)
- S: sign bit
- Z: bit set if the operation result is zero
- P: carry bit
- R: logical comparison result bit
- O: overflow bit
- I: enable/disable interrupt execution
- N: supervisor mode
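A minimal sketch of how the S, Z, P and O bits could be derived after a 16-bit addition. The flag names follow the slide; the function name and the 16-bit word size are assumptions made for illustration, and the bit positions inside the PSW are deliberately left out.

```python
def add16_with_flags(a: int, b: int):
    """Add two 16-bit values and derive a few PSW-style flags (illustrative)."""
    mask = 0xFFFF
    full = (a & mask) + (b & mask)      # result before truncation
    result = full & mask                # what fits in the 16-bit register
    sign_a = (a >> 15) & 1
    sign_b = (b >> 15) & 1
    sign_r = (result >> 15) & 1
    flags = {
        "S": bool(sign_r),              # sign bit of the result
        "Z": result == 0,               # result is zero
        "P": full > mask,               # carry out of the top bit
        # signed overflow: both inputs share a sign that the result lacks
        "O": sign_a == sign_b and sign_r != sign_a,
    }
    return result, flags
```

For example, adding 1 to 0x7FFF sets S and O, while adding 1 to 0xFFFF sets Z and P.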
10Registers in the Motorola MC68000 processor
- Data and address registers (32-bit)
- Specialization: 8 data registers (D0-D7) and 9 address registers (two of them are used alternately in the user and supervisor modes)
- Address bus 24-bit, data bus 16-bit
- A7 register used as a Stack Pointer (SP)
- Status register (SR) 16-bit (its lower byte is the condition code register, CCR)
- Program counter (PC) 32-bit
- Instructions are stored at even addresses
11Registers in the Intel 8086 Processor
- 16-bit address and data registers
- Data/General Purpose Registers (AX, BX, CX, DX)
- Pointer and index registers (SP, BP, SI, DI)
- Segment registers (CS, DS, SS, ES)
- Instruction pointer
- Status register (flags)
12Intel 8086 Registers (cont.)
- AX: accumulator
- BX: base
- CX: counting
- DX: data
- SP: stack pointer
- BP: base pointer
- SI: source index
- DI: destination index
13Intel 386 - Pentium processors: register organization
- 32-bit data and address registers
- Eight General Purpose Registers (EAX, EBX, ECX, EDX, ESP, EBP, ESI, EDI)
- For backward compatibility, the lower halves of these registers are accessible as 16-bit registers
- 32-bit status register
- 32-bit instruction pointer
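The backward-compatible register overlay can be sketched as follows. The class and attribute names are hypothetical; the masks mirror how AX, AH and AL alias parts of EAX (the 8-bit halves exist only for the A, B, C and D registers).

```python
class Reg32:
    """Illustrative model of a 32-bit register with 16-bit and 8-bit views."""

    def __init__(self, value=0):
        self.e = value & 0xFFFFFFFF          # full 32-bit view, e.g. EAX

    @property
    def x(self):                             # low 16 bits, e.g. AX
        return self.e & 0xFFFF

    @x.setter
    def x(self, value):
        # writing the 16-bit view leaves the upper 16 bits untouched
        self.e = (self.e & 0xFFFF0000) | (value & 0xFFFF)

    @property
    def h(self):                             # bits 8..15, e.g. AH
        return (self.e >> 8) & 0xFF

    @property
    def l(self):                             # bits 0..7, e.g. AL
        return self.e & 0xFF
```

Writing through the 16-bit view changes only the lower half, which is why 16-bit code keeps working on the wider registers.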
14Floating-point registers of the Pentium processor
- Eight 80-bit numerical registers
- 16-bit control register
- 16-bit status register
- 16-bit tag word (describes the type of content of each floating-point register)
- 48-bit instruction pointer
- 48-bit data pointer
15EFLAGS register
(Figure: EFLAGS layout, bits 0 to 31. Flags from bit 0 upward: CF, PF, AF, ZF, SF, TF, IF, DF, OF, IOPL, NT, RF, VM, AC, VIF, VIP, ID)
- TF: trap flag
- IF: interrupt enable flag
- DF: direction flag
- IOPL: input/output privilege level
- RF: resume flag
- AC: alignment check
- ID: identification flag
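A short sketch of decoding an EFLAGS value into named flags. The bit positions are the documented x86 ones; the dictionary and function names are made up for this example.

```python
# Bit position of each single-bit flag in EFLAGS.
EFLAGS_BITS = {
    "CF": 0, "PF": 2, "AF": 4, "ZF": 6, "SF": 7, "TF": 8, "IF": 9,
    "DF": 10, "OF": 11, "NT": 14, "RF": 16, "VM": 17, "AC": 18,
    "VIF": 19, "VIP": 20, "ID": 21,
}

def decode_eflags(value):
    """Return a dict of flag name -> value for a 32-bit EFLAGS word."""
    flags = {name: bool((value >> bit) & 1) for name, bit in EFLAGS_BITS.items()}
    flags["IOPL"] = (value >> 12) & 0b11   # two-bit field, bits 12 and 13
    return flags
```

IOPL is the only multi-bit field here, so it is returned as a number from 0 to 3 rather than as a boolean.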
16Registers in the Athlon 64 processor
- Compatibility with the x86-64 architecture (40-bit physical address space, 48-bit virtual address space)
- Data and address registers 64-bit
- 8 general purpose registers (RAX, RBX, RCX, RDX, RBP, RSI, RDI, RSP), which also work in the 32-bit compatibility mode
- 8 additional general purpose registers (R8-R15) in 64-bit mode, as in the Opteron
- 16 SSE registers (XMM0-XMM15)
- 8 floating-point registers x87, 80-bit
17Registers in the PowerPC processor
- 32 general purpose registers (64-bit) and an exception register (XER)
- 32 registers for the floating point unit (64-bit) and a status and control register (FPSCR)
- Branch processing unit registers: 32-bit condition register, 64-bit count and link registers
18Instruction mode
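(Figure: instruction cycle state diagram. States, in order:)
- Instruction address calculation
- Instruction fetch
- Instruction decoding
- Argument address calculation (optionally with indirect addressing)
- Argument fetching (repeated for multiple arguments)
- Data operation
- Argument address calculation (optionally with indirect addressing)
- Writing argument (repeated for multiple results)
- Interrupts checking
- Interrupt handling (skipped when there are no interrupts)
- Transitions back: return to data, or instruction executed, fetch the next one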
19Instruction fetching cycle
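(Figure: data flow in the instruction fetching cycle. Inside the processor, PC, MAR, MBR, IR and the CU are connected to memory through the address, control and data buses)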
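The data flow of the fetch cycle can be sketched as a sequence of register transfers. This is a simplified model: the processor is a plain dictionary, memory is a list of words, and the PC is assumed to advance by one word per instruction.

```python
def fetch(cpu, memory):
    """Illustrative fetch cycle using the PC, MAR, MBR and IR registers."""
    cpu["MAR"] = cpu["PC"]             # address of the next instruction -> address bus
    cpu["MBR"] = memory[cpu["MAR"]]    # memory read: word arrives over the data bus
    cpu["PC"] = cpu["PC"] + 1          # PC now points at the following instruction
    cpu["IR"] = cpu["MBR"]             # instruction moves to IR for decoding
    return cpu
```

In real hardware the control unit issues the read signal over the control bus; here that step is implied by the list lookup.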
20Indirect mode
(Figure: data flow in the indirect mode. Inside the processor, MAR, MBR and the CU are connected to memory through the address, control and data buses)
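A hedged sketch of the indirect step: the address field of the instruction is used to read the effective address from memory. The key name `IR_address`, standing for the address part of the instruction register, is an assumption of this example.

```python
def indirect(cpu, memory):
    """Illustrative indirect cycle: replace a pointer with the address it holds."""
    cpu["MAR"] = cpu["IR_address"]       # address field of the instruction -> MAR
    cpu["MBR"] = memory[cpu["MAR"]]      # read the effective address from memory
    cpu["IR_address"] = cpu["MBR"]       # instruction now holds a direct address
    return cpu
```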
21Interrupt mode
(Figure: data flow in the interrupt mode. Inside the processor, PC, MAR, MBR and the CU are connected to memory through the address, control and data buses)
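A minimal sketch of the interrupt step, in which the current PC is saved to memory and replaced by the address of the handler. The parameters `save_address` and `routine_address` are hypothetical; where the PC is actually saved (for example on the stack) depends on the processor.

```python
def interrupt(cpu, memory, save_address, routine_address):
    """Illustrative interrupt cycle: save the PC, then jump to the handler."""
    cpu["MBR"] = cpu["PC"]               # PC contents go to the buffer register
    cpu["MAR"] = save_address            # where the return address will be stored
    cpu["PC"] = routine_address          # next fetch starts the interrupt routine
    memory[cpu["MAR"]] = cpu["MBR"]      # memory write of the saved PC
    return cpu
```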
22Pipeline
- Problem: during the instruction cycle only one instruction is processed
- Solution: divide the cycle into smaller fragments
- Condition: there must be time instants when no main memory access is required!
(Figure: timeline of Cycle 1, Cycle 2 and Cycle 3)
23Pipeline example - laundry
Sequential: 3 hours / cycle, 9 hours for all
CYCLE 1: LA DR PA   CYCLE 2: LA DR PA   CYCLE 3: LA DR PA
Pipelined: 3 hours / cycle, 5 hours for all !!
CYCLE 1: LA DR PA
CYCLE 2:    LA DR PA
CYCLE 3:       LA DR PA
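The 9-hour and 5-hour totals can be checked with a small calculation, assuming every stage takes one hour and a new cycle may start as soon as the first stage is free. The function names are made up for this example.

```python
def sequential_time(loads, stages, hours_per_stage=1):
    """Each load finishes all its stages before the next load starts."""
    return loads * stages * hours_per_stage

def pipelined_time(loads, stages, hours_per_stage=1):
    """The first load fills the pipeline; each later load adds one stage time."""
    return (stages + loads - 1) * hours_per_stage
```

With 3 loads and 3 stages, `sequential_time(3, 3)` gives 9 and `pipelined_time(3, 3)` gives 5, matching the two diagrams.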
24Prefetch
(Figure: two-stage pipeline. Simplified view: Instruction fetch passes an instruction to Execution, which produces a result. Expanded view: Execution can send a new address to Instruction fetching, either stage may have to wait, and a prefetched instruction may be discarded)
- NOTE: the speedup is less than twofold, as the memory access lasts longer than the instruction execution
25Basic phases of the instruction cycle
- Instruction fetching (FI)
- Instruction decoding (DI)
- Operand address calculation (CO)
- Operand fetching (FO)
- Instruction execution (EI)
- Writing outcome (WO)
     1  2  3  4  5  6  7  8  9  10 11
I1   FI DI CO FO EI WO
I2      FI DI CO FO EI WO
I3         FI DI CO FO EI WO
I4            FI DI CO FO EI WO
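The timing diagram can be generated programmatically. The sketch below assumes an ideal pipeline in which every phase takes one time unit and a new instruction enters on every unit; the function names are illustrative.

```python
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def pipeline_diagram(n_instructions):
    """Return one text row per instruction, shifted by one column each."""
    rows = []
    for i in range(n_instructions):
        label = "I%-3d" % (i + 1)        # "I1  ", "I2  ", ...
        offset = "   " * i               # each stage column is 3 characters wide
        rows.append(label + offset + " ".join(STAGES))
    return rows

def total_cycles(n_instructions):
    """Time units until the last instruction leaves the pipeline."""
    return len(STAGES) + n_instructions - 1

if __name__ == "__main__":
    print("\n".join(pipeline_diagram(4)))
```

For four instructions and six phases, `total_cycles(4)` is 9 time units instead of the 24 needed without overlap.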
26Branches and pipelining
     1  2  3  4  5  6  7  8  9  10 11 12 13
I1   FI DI CO FO EI WO
I2      FI DI CO FO EI WO
I3         FI DI CO FO
I4            FI DI CO
I5               FI DI
I6                  FI
I21                    FI DI CO FO EI WO
I22                       FI DI CO FO EI WO
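The cost of a taken branch in this diagram can be estimated with a short sketch. It assumes that the branch outcome becomes known in the EI phase, as the diagram suggests, and that every instruction fetched behind the branch is then discarded; the function names are hypothetical.

```python
STAGES = ["FI", "DI", "CO", "FO", "EI", "WO"]

def branch_penalty(resolve_stage="EI"):
    """Number of instructions fetched behind a taken branch and then discarded."""
    return STAGES.index(resolve_stage)

def total_cycles(useful_instructions, taken_branches=0, resolve_stage="EI"):
    """Ideal pipeline time plus one penalty per taken branch."""
    ideal = len(STAGES) + useful_instructions - 1
    return ideal + taken_branches * branch_penalty(resolve_stage)
```

Four useful instructions (I1, I2, I21, I22) with one taken branch give `total_cycles(4, 1)` equal to 13, against 9 time units without the branch.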
27Pipeline implementation algorithm
28Problems of the pipelining
- Successive pipeline stages don't take the same amount of time
- Transferring data between the buffers may significantly increase pipeline execution time
- Dependencies between registers and memory can be minimized during pipeline optimization, but only at a high cost
29Efficiency of the pipelining
- Cycle execution time
- Time required to execute all the instructions
- Instruction pipeline speedup ratio
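One common textbook way to define these three quantities is given below. The symbols are assumptions, since the slide does not name them: k is the number of pipeline stages, n the number of instructions, \tau_i the delay of stage i, and d the delay of the buffer between stages.

```latex
% Cycle execution time: the slowest stage plus the buffer delay
\tau = \max_{1 \le i \le k} \tau_i + d

% Time required to execute all n instructions in a k-stage pipeline
T_k = \left[ k + (n - 1) \right] \tau

% Speedup ratio relative to execution without pipelining (T_1 = n k \tau)
S_k = \frac{T_1}{T_k} = \frac{n k \tau}{\left[ k + (n - 1) \right] \tau} = \frac{n k}{k + n - 1}
```

For large n the ratio approaches k, so the number of stages is the upper bound on the speedup.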
30Example of the pipeline efficiency
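As an illustration (the numbers are assumed here, not taken from the lecture), the speedup of an ideal k-stage pipeline over n instructions can be evaluated as nk / (k + n - 1):

```python
def speedup(n_instructions, n_stages):
    """Ideal pipeline speedup: time without overlap divided by pipelined time."""
    unpipelined = n_instructions * n_stages
    pipelined = n_stages + n_instructions - 1
    return unpipelined / pipelined

if __name__ == "__main__":
    for n in (4, 100, 10_000):
        print(n, round(speedup(n, 6), 2))
```

The ratio grows with the number of instructions but never exceeds the number of stages.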
31Modern Processors Pipelines
- Pentium 3: 10 stages
- Athlon: 10 stages for ALU, 15 stages for FPU
- Pentium M: 12 stages
- Athlon 64 / 64 X2: 12 stages for ALU, 17 stages for FPU
- Pentium 4 Northwood: 20 stages (hyperpipeline!!)
- Pentium 4 Prescott: 31 stages
- Core2Duo: 14 stages
32Hazards
- They are pipelining disturbances
- There are data, resource and control hazards
33Branch handling
- Multiple pipelines
- Prefetch of the instruction
- Loop buffer
- Branch prediction
- Delayed branch
34Multiple pipelines
- Both instructions that may follow a branch are loaded into two pipelines and processed simultaneously
- The main problem is to gain memory access for both instructions
35Prefetch and loop buffer
Prefetch
- When a branch instruction is decoded, the target instruction is fetched. It is stored until the branch is executed
Loop buffer
- A buffer memory is created to store consecutive instructions
- It is useful when conditional branch instructions and loops are involved
36Conditional Branch Prediction
- Static: branch never taken (Sun SPARC, MIPS), branch always taken, prediction by operation code
- Dynamic: taken/not taken switch, branch history table
37Static prediction
- The simplest method, used as a fallback, for instance in the Motorola MPC7450 processor
- The Pentium 4 allowed inserting code suggesting whether the static prediction should assume that the branch is taken or not (a so-called prediction hint)
38Dynamic prediction of the conditional branches
- The history of each conditional branch instruction is stored
- It is represented by bits stored in the cache memory
- Every instruction has its own history bits
- Another solution is a table storing information about conditional branch results
39History bits prediction
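A widely used form of history-bit prediction is the two-bit saturating counter, sketched below. This is a common scheme chosen for illustration and may differ in detail from the state diagram shown in the lecture; the class name is hypothetical.

```python
class TwoBitPredictor:
    """Two history bits: states 0-1 predict not taken, states 2-3 predict taken."""

    def __init__(self, state=0):
        self.state = state

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        # move one step toward the observed outcome, saturating at 0 and 3
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)
```

Because two wrong guesses are needed to flip the prediction, a loop branch is mispredicted only once per loop exit rather than twice.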
40Branch history table
Branch instruction address | History bits | Target instruction
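The three columns of the table map naturally onto a small lookup structure. The sketch below is an assumption-laden model: entries are kept in a dictionary keyed by branch address, the history is a two-bit counter starting at 1, and the target address is stored instead of the target instruction itself.

```python
class BranchHistoryTable:
    """Illustrative table: branch address -> history bits and branch target."""

    def __init__(self):
        self.entries = {}

    def lookup(self, address):
        """Return the predicted target, or None to continue sequentially."""
        entry = self.entries.get(address)
        if entry is None or entry["history"] < 2:
            return None
        return entry["target"]

    def update(self, address, taken, target):
        """Record the actual outcome of the branch at the given address."""
        entry = self.entries.setdefault(address, {"history": 1, "target": target})
        if taken:
            entry["history"] = min(3, entry["history"] + 1)
            entry["target"] = target
        else:
            entry["history"] = max(0, entry["history"] - 1)
```

A real table has a fixed number of entries and must replace old ones; that detail is omitted here.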
41Local Branch Prediction
- Requires a separate history buffer for each instruction, although the history table can be common to all instructions
- Pentium MMX, Pentium 2 and 3 processors have local prediction circuits with 4 history bits and 16 positions for every type of instruction
- Local prediction efficiency is estimated at 97%
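One way to read "4 history bits and 16 positions" is a two-level scheme: the last four outcomes of a branch select one of 16 two-bit counters. The sketch below follows that reading as an assumption; it is not a description of the actual Intel circuit, and all names are hypothetical.

```python
class LocalPredictor:
    """Per-branch history register indexing a per-branch table of counters."""

    def __init__(self, history_bits=4):
        self.history_bits = history_bits
        self.history = {}    # branch address -> last outcomes packed into an int
        self.counters = {}   # branch address -> 2**history_bits two-bit counters

    def _slot(self, address):
        h = self.history.setdefault(address, 0)
        table = self.counters.setdefault(address, [1] * (1 << self.history_bits))
        return h, table

    def predict(self, address):
        h, table = self._slot(address)
        return table[h] >= 2

    def update(self, address, taken):
        h, table = self._slot(address)
        table[h] = min(3, table[h] + 1) if taken else max(0, table[h] - 1)
        mask = (1 << self.history_bits) - 1
        self.history[address] = ((h << 1) | int(taken)) & mask
```

Unlike a single counter, this structure learns short repeating patterns such as a branch that alternates between taken and not taken.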
42Global Branch Prediction
- A common history for all branch instructions is stored in memory. This makes it possible to consider dependencies between different branch instructions
- Rarely a better solution than local prediction
- Hybrid solutions: a shared unit combining global prediction and the history table (AMD processors, Pentium M, Core, Core 2)
43Branch Prediction Unit
- A processor circuit responsible for predicting disturbances in sequential code execution
- Often connected with the micro-operation cache memory
- In the Pentium 4 processor, the branch prediction buffer has 4096 entries, in the Pentium 3 only 512. Therefore the former has a 33 percent better hit ratio than the latter
44Location of the Branch Prediction Unit