Title: Kiloinstruction Processors
1Kilo-instruction Processors
Mateo Valero, UPC
HPCA-10, Madrid, February 14-17, 2004
2Motivation
Technology works against ILP: faster clock rates => lower ILP
Justin Rattner, Intel-MRL, Keynote lecture,
Micro-32
3The trends are changing
- 1990s architecture
- Short pipelines
- Low memory latencies
- 2010 architectures
- Long pipelines
- 30-50 stages
- Power-Thermal-Wire delay aware architecture
- Long memory latencies
- 500 to 1000 cycles
- ISCA-2003 50 to 160
M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
4Memory Wall Problem
[Figure: annotated ratios 0.6X and 0.45X.]
Memory latency has enormous impact on IPC
M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
5Reducing Memory Latency
- Technology
- Caches
- Prefetching
- Hardware, Software and combined
- Assisted/SSMT Threads
- Kilo-instruction Processor
6Kilo-instruction Processors
- Our goals:
- Better tolerate increasing memory latency
- Further improve ILP, even for such longer memory latency
- Allow additional optimizations enabled by the new architecture (see below)
- Our proposal: Kilo-instruction Processors
- Out-of-order processors with thousands of instructions in flight (very large instruction windows)
- Intelligent use of resources (resource requirements growing much slower than window size)
7Kilo-instruction Processor
- It is not:
- A heavy processor?
- Cyber-205 like processor
- Vector Processor
- Blue-Gene like
- Multiscalar, Trace Processor
- Raw, Imagine, Levo, TRIPS
- It is:
- An affordable out-of-order superscalar processor having thousands of in-flight instructions
8Outline
- Motivation
- Increasing the number of in-flight instructions
- Kilo-instruction Processor Ingredients
- Multi-Checkpointing the ROB
- Out-of-Order Commit
- Early Release of Resources
- Ephemeral Registers
- Load Queues
- Locality Exploitation
- Instruction Queues
- LSQ
- Cross-pollination with other techniques
- Kilo-processor and multiprocessor systems
- kilo-vector processor
- Kilo-SMT processor
- Further Improvements
- Branch prediction
- kilo-valpred processor
9ROB Activity
[Figure: ROB, register file, and IQ contents for a sequence containing load 1, load 2, branches 1-3, and instructions a and b, shown for a 128-entry and a 1024-entry ROB.]
M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
10Integer, 8-way, L2 1MB
[Figure: results chart; annotated ratios: 1.22X, 1.1X, 1.86X, 0.6X, 1.41X.]
Research Proposal to Intel (July 2001) and presentation to Intel-MRL, Feb. 2002
Cristal et al., Large Virtual ROBs by Processor Checkpointing, TR UPC-DAC, July 2002
M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003
11Floating-point, 8-way, L2 1MB
[Figure: results chart; annotated ratios: 2.34X, 2X, 4.58X, 3.91X, 0.45X.]
Research Proposal to Intel (July 2001) and presentation to Intel-MRL, Feb. 2002
Cristal et al., Large Virtual ROBs by Processor Checkpointing, TR UPC-DAC, July 2002
M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003
12Scalability
- Thousands of in-flight instructions and in-order commit make designs impractical
- ROB: needs to maintain a copy of every in-flight instruction
- IQs: instructions depending on long-latency instructions remain in these queues for a long time
- LSQs: instructions remain in the queue until commit
- Registers: a new physical register for each instruction producing a new value
- We would like to get the IPC of thousands of in-flight instructions without drastically increasing resource requirements
M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
13Late Allocation/Early Release of Registers
[Figure: ROB, IQ, register file (R1, R2), and virtual registers (a, b, c) for a sequence with loads 1-2 and branches 1-3, illustrating early release of registers.]
T. Monreal et al., Delaying physical register allocation through virtual-physical registers, MICRO 1999
T. Monreal et al., Late allocation and early release of physical registers, IEEE-TC (to appear)
14Nearby Distant Parallelism
[Figure: ROB and register file divided between nearby parallelism (load, X, f(X), branch) and distant parallelism (speculative, replayable loads and branches).]
Balasubramonian et al., Dynamically Allocating Processor Resources, ISCA 2001
15Dynamic Vectorization
[Figure: ROB between ROB_head and ROB_tail holding a load, a branch (br), and instructions C.I. 1 and C.I. 2, alongside the register file.]
A. Pajuelo et al., Control-Flow Independence Reuse via Dynamic Vectorization, UPC-DAC
17Checkpointing the ROB
- Checkpointing to support precise exceptions
- Quite well established and used technique
- W.M.Hwu and Y.N.Patt, ISCA 1987
- Checkpointing to early release resources
- Quite recent concept
- Cherry: J. Martínez et al., MICRO, Nov. 2002
- Large VROB: A. Cristal et al., TR-UPC-DAC, July 2002
M. Valero. NSF Workshop on Computer Architecture.
ISCA Conference. San Diego, June 2003
18Cherry
[Figure: ROB containing a load and a branch, divided at the point of no return (PNR) into an irreversible region, whose resources Cherry releases early, and a reversible region.]
Martínez et al., Cherry: Checkpointed Early Resource Recycling, MICRO 2002
19Multi-Checkpoint
[Figure: ROB, IQ, and checkpointing table. Checkpoint 1 is associated with load 1 and Checkpoint 2 with branch 2; each table entry records PC, status, counter, ... Instructions commit out of order (OOO commit) and Checkpoint 1 is gang-committed.]
Cristal et al., Large Virtual ROBs by Processor Checkpointing, TR UPC-DAC, July 2002
Research Proposal to Intel (July 2001) and presentation to Intel-MRL, Feb. 2002
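The checkpointing table on this slide can be illustrated with a minimal Python sketch. It is not the authors' design: the class and method names are hypothetical, and the only ideas taken from the slide are that each entry keeps a PC and a counter and that a whole checkpoint is committed at once. The sketch assumes the counter tracks unfinished instructions and that the oldest checkpoint is retired only after a younger one exists to serve as the recovery point.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Checkpoint:
    pc: int           # where fetch restarts after a rollback
    pending: int = 0  # instructions of this checkpoint not yet finished


class CheckpointTable:
    """Hypothetical checkpoint table; the oldest entry is on the left."""

    def __init__(self):
        self.entries = deque()

    def take(self, pc):
        cp = Checkpoint(pc)
        self.entries.append(cp)
        return cp

    def dispatch(self):
        # A new instruction is associated with the youngest checkpoint.
        cp = self.entries[-1]
        cp.pending += 1
        return cp

    def finish(self, cp):
        # Instructions may finish in any order.
        cp.pending -= 1
        return self._gang_commit()

    def _gang_commit(self):
        # Retire the oldest checkpoints whose instructions have all finished,
        # keeping the youngest entry because it can still receive instructions.
        committed = []
        while len(self.entries) > 1 and self.entries[0].pending == 0:
            committed.append(self.entries.popleft().pc)
        return committed

    def rollback(self, cp):
        # Misprediction or exception: drop younger checkpoints, restart at cp.
        while self.entries[-1] is not cp:
            self.entries.pop()
        cp.pending = 0
        return cp.pc


table = CheckpointTable()
c1 = table.take(pc=0x100)      # e.g. taken at load 1
i1, i2 = table.dispatch(), table.dispatch()
c2 = table.take(pc=0x140)      # e.g. taken at branch 2
i3 = table.dispatch()
print(table.finish(i3))        # [] (younger work finished first, nothing commits)
print(table.finish(i1))        # []
print(table.finish(i2))        # [256], i.e. Checkpoint 1 (PC 0x100) gang-commits
```

No per-instruction ROB entry is consulted at commit time in this sketch; only the per-checkpoint counter is, which is the property that lets resource use grow more slowly than the window.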
21Early Release of Resources
[Figure: timeline from fetch to commit spanning a memory latency of, e.g., 1000 cycles.]
T. Karkhanis and J. Smith, A day in the life of a data cache miss, Workshop on Memory Performance Issues, ISCA-2002
M. Valero. NSF Workshop on Computer Architecture. ISCA Conference. San Diego, June 2003
22Registers
- Register file is a critical component of a modern superscalar processor
- Large number of entries to support out-of-order execution and memory latency
- Large number of ports to increase issue width
- Power and access time are key issues for register file design
- It is always beneficial to reduce the number of physical registers
23Physical Registers
- Conventional renaming scheme
- Virtual-Physical Registers
- Early Release
- Ephemeral Registers: checkpoint plus virtual-physical registers
[Figure: for each scheme, the intervals during which a physical register is allocated but unused versus used.]
T. Monreal et al., Delaying physical register allocation through virtual-physical registers, MICRO 1999
M. Moudgill et al., Register renaming and dynamic speculation: an alternative approach, MICRO 1993
T. Monreal et al., Late allocation and early release of physical registers, IEEE-TC (to appear)
J. Martínez et al., Ephemeral Registers, Technical Report CSL-TR-2003-1035, 2003
24 State of Registers (FP, ROB2048)
A. Cristal et al., A case for resource-conscious out-of-order processors, IEEE TCCA CA Letters, Vol. 2, Oct. 2003
26IQs and Kilo processors
- Increasing the number of IQ entries increases the power, area and access time
- Wake-up and selection logic need to be done efficiently
- Kilo-instruction processors may have many in-flight instructions
- We need a new organization for the IQs in order to have affordable kilo-instruction processors
27Execution Time of Instructions
- Lebeck et al., A large, fast instruction window for tolerating cache misses, ISCA-29, 2002.
- Brekelbaum et al., Hierarchical scheduling windows, MICRO-35, 2002.
- Cristal et al., Out-of-Order Commit Processors, TR UPC-DAC-2003-44, July 2003; HPCA-10, Feb. 2004
[Figure: ROB, IQ, and secondary buffer, with numbered steps 1-3.]
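The IQ plus secondary-buffer organization can be sketched roughly as follows. This is an illustration under stated assumptions, not the mechanism of any of the cited papers: the class `TwoLevelQueue` and its methods are hypothetical, registers are assumed to be already renamed (each written once), and an instruction is parked as soon as any source depends on an outstanding long-latency load.

```python
class TwoLevelQueue:
    """Hypothetical small IQ backed by a larger, slower secondary buffer."""

    def __init__(self):
        self.iq = []         # instructions eligible for wake-up/select
        self.secondary = {}  # missing load's register -> parked instructions
        self.blocked = {}    # register -> missing load it transitively waits on

    def load_miss(self, dest):
        # A load missed in the cache: its result register becomes a "root".
        self.blocked[dest] = dest
        self.secondary[dest] = []

    def insert(self, dest, srcs):
        # Park the instruction if any source waits on an outstanding miss.
        for src in srcs:
            if src in self.blocked:
                root = self.blocked[src]
                self.blocked[dest] = root
                self.secondary[root].append((dest, srcs))
                return "secondary"
        self.iq.append((dest, srcs))
        return "iq"

    def load_return(self, dest):
        # The miss is serviced: reinsert parked instructions in program order.
        # An instruction still waiting on another miss is simply parked again.
        parked = self.secondary.pop(dest)
        self.blocked = {r: root for r, root in self.blocked.items() if root != dest}
        return [(instr, self.insert(*instr)) for instr in parked]


q = TwoLevelQueue()
q.load_miss("r1")
print(q.insert("r2", ["r1"]))  # secondary (depends on the missing load)
print(q.insert("r3", ["r2"]))  # secondary (transitive dependence)
print(q.insert("r4", ["r5"]))  # iq (independent work keeps the IQ busy)
print(len(q.load_return("r1")))  # 2 instructions reinserted
```

The point of the sketch is that the fast queue only ever holds instructions that could issue soon, so its size need not scale with the number of in-flight instructions.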
28Load/Store Queues
- Efficient and affordable memory disambiguation is mandatory for kilo-instruction processors
- We need to guarantee that loads and stores arrive at the memory in the correct order
- Increasing the number of in-flight instructions can make the load/store queues a true bottleneck, both in latency and power
29 State of LD Queues (specFP, ROB2048)
A. Cristal, et al, A case for
resource-conscious out-of-order processors, IEEE
TCCA CA Letters, Vol. 2, October 2003
30 State of ST Queues (specFP, ROB2048)
A. Cristal, et al, A case for
resource-conscious out-of-order processors, IEEE
TCCA CA Letters, Vol. 2, October 2003
31Search Filtering
- Determine independence without associative search on addresses
- Use a Bloom filter to control associative search
- Approximate tracking (false positives are possible)
- No false negatives => no mispredictions
- Associatively search only if the hashed bit in the filter is set to 1
S. Sethumadhavan et al., Scalable Hardware Memory Disambiguation for High ILP Processors, Micro-36, 2003
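The filtering idea above can be sketched in a few lines of Python. This is a minimal illustration, not the design from the cited paper: the class name, the hash function, and the table size are assumptions, and counters are used instead of single bits so that a store can be removed when it leaves the queue.

```python
class StoreAddressFilter:
    """Hypothetical Bloom-style filter over in-flight store addresses."""

    def __init__(self, num_entries=64):
        self.num_entries = num_entries
        self.counters = [0] * num_entries  # counting variant (assumption)

    def _index(self, addr):
        # Assumed hash: word address modulo the table size.
        return (addr >> 2) % self.num_entries

    def insert_store(self, addr):
        self.counters[self._index(addr)] += 1

    def remove_store(self, addr):
        self.counters[self._index(addr)] -= 1

    def load_needs_search(self, addr):
        # True: a store may conflict, so do the associative search.
        # False: definitely no matching store (no false negatives).
        return self.counters[self._index(addr)] > 0


f = StoreAddressFilter(num_entries=64)
f.insert_store(0x1000)
print(f.load_needs_search(0x1000))  # True: same address, search required
print(f.load_needs_search(0x1004))  # False: search skipped
print(f.load_needs_search(0x1100))  # True: aliases with 0x1000 (false positive)
f.remove_store(0x1000)
print(f.load_needs_search(0x1000))  # False: the store has left the queue
```

A false positive only costs an unnecessary associative search, while the absence of false negatives means a load that skips the search can never miss a conflicting store.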
32Putting It All Together
[Figure: results as a function of physical registers, virtual registers, and memory latency, with IQs of 128 entries.]
A. Cristal et al., Kilo-instruction Processors. Invited paper. ISHPC-V, Tokyo, LNCS-2858. October 20-22, 2003
34Kilo-processor and multiprocessor systems
First results Ideal Network
M. Galluzzi et al. A First glance at
Kiloinstruction Based Multiprocessors Invited
Paper. ACM Computing Frontiers Conference.
Ischia, Italy, April 10-12, 2004
35Kilo-processor and multiprocessor systems
Impact of the network-ROB 64
M. Galluzzi et al. A First glance at
Kiloinstruction Based Multiprocessors Invited
Paper. ACM Computing Frontiers Conference.
Ischia, Italy, April 10-12, 2004
36Kilo-processor and multiprocessor systems
First Results
M. Galluzzi et al. A First glance at
Kiloinstruction Based Multiprocessors Invited
Paper. ACM Computing Frontiers Conference.
Ischia, Italy, April 10-12, 2004
37Kilo-processor and multiprocessor systems
Network latency, Radix, 250 cyc. latency
M. Galluzzi et al. A First glance at
Kiloinstruction Based Multiprocessors Invited
Paper. ACM Computing Frontiers Conference.
Ischia, Italy, April 10-12, 2004
38Kilo-vector processor
[Figure: execution-time bars. Program: 20 + 80. Vector: 20 + 8 (speedup 3.5). Kilo: 5 + 8 (speedup 7.7).]
F. Quintana et al., Kilo-vector processors, UPC-DAC
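The two speedups on this slide follow from the bar lengths. The short Python check below assumes the numbers are execution-time components in the same arbitrary units and that speedup is total baseline time divided by total new time; the variable names are illustrative.

```python
def speedup(baseline_parts, new_parts):
    """Speedup as total baseline time over total new time."""
    return sum(baseline_parts) / sum(new_parts)


program = (20, 80)     # baseline program
vector = (20, 8)       # with a vector unit
kilo_vector = (5, 8)   # with a kilo-instruction window added

print(round(speedup(program, vector), 2))       # 3.57
print(round(speedup(program, kilo_vector), 2))  # 7.69
```

These match the slide's 3.5 and 7.7 after rounding. Once the 80-unit part has shrunk to 8, the remaining 20-unit part dominates, which is why reducing it to 5 roughly doubles the overall speedup.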
40Kilo-valpred processor
T. Ramírez et al. Kilo-value prediction
processor UPC-DAC
41Kilo and Control Independence
- More opportunities to find control-independent instructions
- Squash reuse
- Control-independent instruction reexecution removal
- Savings
- Power/energy
- Execution bandwidth
- Resources
- Helps to go far ahead in the instruction window faster
42UPC contribution to kilo processors
- We started our work in June 2001
- Grant proposal to Intel-MRL (Konrad Lai and Ronny Ronen), January 28th, 2002
- Presentation to Intel-MRL in February 2002
- A. Cristal et al., Large virtual ROBs by processor checkpointing, Technical Report UPC-DAC-2002-39, July 2002. (Rejected for Micro-2002)
- Multiple Checkpointers
- Out-of-order Commit, No need for ROB
- Early release of registers and loads
- A. Cristal and M. Valero, ROBs virtuales utilizando checkpointing. Spanish Workshop on Parallelism. Lleida, Sept. 2002
- Same as the previous report, but in Spanish
- A. Cristal, J. Martínez, M. Valero and J. Llosa, Ephemeral Registers, Technical Report CSL-TR-2003-1035, 2003. Rejected for ISCA 2003 and Micro 2003
- Checkpoint, Early Release, Late allocation of registers
- Presentation to Intel-MRL in March 2003
- A. Cristal, J. Martínez, J. Llosa and M. Valero, A case for resource-conscious out-of-order processors, IEEE TCCA Computer Architecture Letters, Vol. 2, October 2003
- Underutilization of resources
43UPC contribution to kilo processors
- A. Cristal et al., A case for resource-conscious out-of-order processors: Towards Kilo-instruction in-flight processors. MEDEA Workshop, Sept. 2003 and ACM-CAN, March 2004
- A. Cristal et al., Kilo-instruction Processors. Invited paper. ISHPC-V, Tokyo, LNCS-2858. October 20-22, 2003
- A. Cristal et al., Future ILP Processors. Invited paper. IJHPCN, to be published
- A. Cristal et al., Out-of-Order Commit Processors, Technical Report UPC-DAC-2003-44, July 2003. HPCA-10, Madrid, Feb. 2004
- Remove-Reinsert Mechanism
- Simple reinsert mechanism
- M. Galluzzi et al., A First Glance at Kiloinstruction Based Multiprocessors. Invited paper. ACM Computing Frontiers Conference. Ischia, Italy, April 10-12, 2004
- Much new work done at this moment
44Talks about Kilo processors, from UPC
- Presentation in Barcelona, to Intel-MRL in February 2002
- Spanish Workshop on Parallelism. Lleida, Sept. 2002
- Presentation to Intel-MRL in March 2003
- Invited presentation. NSF Panel On the Future of Computer Architecture Research: Wise Views and Fresh Perspectives. San Diego, June 2003
- Invited Lecture. PA3CT Conference. Edegem, Belgium, September 22-23, 2003
- MEDEA Workshop. New Orleans, September 2003
- Invited Lecture. ISHPC-V. The 5th International Symposium on High Performance Computing. Tokyo, Japan, October 20-22, 2003
- Keynote lecture. Seminar on Compilers and Architecture. IBM Haifa. November 11th, 2003
- Invited lecture. Intel MRL. Haifa, Israel. Nov. 12th, 2003
- HPCA-10, Madrid, February 14-18, 2004
- Keynote lecture. HPCA-10. Madrid, February 14-18, 2004
- Invited lecture. ACM Computing Frontiers. Ischia, April 2004
- ACM Invited lecture. ENCAR México, May 2004
- More future presentations scheduled
45Memory Latency
- N. Jouppi and P. Ranganathan, The relative importance of memory latency, bandwidth and branch prediction. Workshop on Mixing Logic and DRAM: Chips that Compute and Remember, during ISCA-24, 1997
- S. Srinivasan and A. Lebeck, Load latency tolerance in dynamically scheduled processors, Micro-31, 1998
- K. Skadron, P. Ahuja, M. Martonosi and D. Clark, Branch prediction, instruction window size and cache size: Performance tradeoffs and simulation techniques, IEEE-TC, pp. 1260-1281, 1999.
46Large Reorder Buffers
- G. Sohi, S. Breach, and T. N. Vijaykumar, Multiscalar processors, ISCA-22, 1995.
- E. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith, Trace processors, MICRO-30, 1997
- H. Akkary and M. Driscoll, A dynamic multithreaded processor, Micro-31, 1998
- R. Balasubramonian, S. Dwarkadas, and D. Albonesi, Dynamically allocating processor resources between nearby and distant ILP, ISCA, June 2001.
- Save some resources allocated for eager execution
- P. Ranganathan, V. Pai, and S. Adve, Using speculative retirement and large instruction windows to narrow the performance gap between memory consistency models, SPAA, 1997
- J. M. Tendler, S. Dodson, S. Fields, H. Lee, and B. Sinharoy, Power4 System Microarchitecture, IBM Journal of Research and Development, pp. 5-25, January 2002.
47Checkpointing
- W.M. Hwu and Y. N. Patt, Checkpoint repair for out-of-order execution machines, ISCA-14, 1987.
- Checkpointing as a recovery mechanism
- Early Release of Resources
- A. Cristal, M. Valero, and J. Llosa, Large virtual ROBs by processor checkpointing, Technical Report UPC-DAC-2002-39, July 2002.
- Multiple Checkpointers
- Out-of-order Commit, No need for ROB
- Early release of registers and loads
- J.F. Martínez, J. Renau, M.C. Huang, M. Prvulovic, and J. Torrellas, Cherry: checkpointed early resource recycling in out-of-order microprocessors, MICRO-35, Nov. 2002.
- One checkpoint
- Early release of resources
48Register File
- M. Moudgill, K. Pingali and S. Vassiliadis, Register renaming and dynamic speculation: an alternative approach, In Proceedings of the 26th annual international symposium on Microarchitecture, 1993.
- Early Release of Registers
- T. Monreal, A. González, M. Valero, J. González, V. Viñals, Delaying Physical Register Allocation through Virtual-Physical Registers, In Proceedings of the 32nd annual international symposium on Microarchitecture, 1999.
- Virtual Registers, Late allocation of registers
- A. Cristal, J. Martínez, M. Valero and J. Llosa, Ephemeral Registers, Technical Report CSL-TR-2003-1035, 2003.
- Checkpoint, Early Release, Late allocation of registers
- T. Monreal et al., Late allocation and early release of physical registers, IEEE-TC (to appear)
49Instruction Queues
- S. Palacharla, N.P. Jouppi, and J.E. Smith, Complexity-effective superscalar processors, ISCA-24, 1997.
- Divide the instruction queues into a set of FIFO queues
- A.R. Lebeck, J. Koppanalil, T. Li, J. Patwardhan, and E. Rotenberg, A large, fast instruction window for tolerating cache misses, ISCA-29, 2002.
- Remove-Reinsert Mechanism
- Keep the load dependence of all instructions
- E. Brekelbaum, J. Rupley, C. Wilkerson, and B. Black, Hierarchical scheduling windows, MICRO-35, 2002.
- Two clusters, a slow/big one, and a faster/small one for critical instructions
- A. Cristal, D. Ortega, J. Llosa and M. Valero, Out-of-Order Commit Processors, Technical Report UPC-DAC-2003-44, July 2003. HPCA-10, Madrid, Feb. 2004
- Remove-Reinsert Mechanism
- Simple reinsert mechanism
50References for LSQ for Large ROB
- A. Cristal, M. Valero, and J. Llosa, Large virtual ROBs by processor checkpointing, Technical Report UPC-DAC-2002-39, July 2002
- J.F. Martínez, J. Renau, M.C. Huang, M. Prvulovic, and J. Torrellas, Cherry: checkpointed early resource recycling in out-of-order microprocessors, MICRO-35, 2002
- H. Akkary, R. Rajwar and S. T. Srinivasan, Checkpoint Processing and Recovery: Towards Scalable Large Instruction Window Processors, Micro-36, 2003
- S. Sethumadhavan, R. Desikan, D. Burger, C.R. Moore and S. W. Keckler, Scalable Hardware Memory Disambiguation for High ILP Processors, Micro-36, 2003
51Conclusion
- Affordable Kilo-instruction Processors
- Checkpointing and resource-conscious architectures
- Out-of-order commit
- Ephemeral registers
- Two-level instruction queues
- Early release of loads
- Load/store queue management
- New ideas to watch for
- Better branch predictors
- Predication and Multi-path execution
- Control and data independent instructions
- Reuse of large blocks of instructions
- New processor paradigms
- Kilo-based multiprocessor systems
- Kilo-vector processors
- Kilo-SMT processors
- Kilo-valpred processors
52Acknowledgments
- Yale Patt
- Alex Veidenbaum
- Guri Sohi
- Mark Hill
- Wen-mei Hwu
- Mon Beivide
- Valentín Puente
- José Angel Gregorio
- Teresa Monreal
- Victor Viñals
- Intel, Konrad Lai and Ronny Ronen
- Adrián Cristal
- José Martínez
- Josep Llosa
- Daniel Ortega
- Fran Cazorla
- Enrique Fernández
- Ayose Falcón
- Alex Pajuelo
- Marco Galluzzi
- Tanausu Ramírez
- Jim Smith
53 54Processor-DRAM Gap (latency)
[Figure: performance (log scale, 1 to 1000) versus time, 1980-2000. CPU (µProc, Moore's Law): 60%/yr. DRAM: 7%/yr. Processor-memory performance gap grows 50%/year.]
D.A. Patterson, New directions in Computer Architecture, Berkeley, June 1998
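The gap figure follows from the two growth rates, reading the slide's numbers as percent per year (an assumption about the extracted text). A quick check in Python, with illustrative variable names:

```python
cpu_growth = 1.60   # processor performance: +60% per year
dram_growth = 1.07  # DRAM performance: +7% per year

# The gap is the ratio of the two curves, so it compounds at their quotient.
gap_growth = cpu_growth / dram_growth
print(f"{gap_growth - 1:.1%}")  # 49.5%
```

So the processor-memory gap widens by roughly 50% each year, as the slide states.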
55Runahead Execution
[Figure: ROB in runahead mode. A checkpoint is taken at the L2 cache miss of load 1; the entries for load 1, a, branch 1, load 2, and b are marked INV.]
On an L2 cache miss:
- Generate bogus value
- Invalidate dependent registers
- Continue execution
Effects:
- Virtually increases ROB size
- Prefetches data of future loads
Mutlu et al., Runahead Execution: An Alternative, HPCA03
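The INV propagation on this slide can be sketched as a small Python function. It is a simplified illustration, not the mechanism of the cited paper: the function name and the `(kind, dest, srcs)` instruction encoding are hypothetical, and a load whose address registers are valid is assumed to simply issue as a prefetch.

```python
def runahead(instructions, miss_dest):
    """Walk the instructions after an L2 miss, propagating INV.

    instructions: list of (kind, dest, srcs) tuples in program order.
    miss_dest: destination register of the load that missed in L2.
    Returns the set of INV registers and the loads issued as prefetches.
    """
    inv = {miss_dest}  # the missing load produces a bogus value
    prefetched = []
    for kind, dest, srcs in instructions:
        if any(src in inv for src in srcs):
            # A source is bogus: mark the result INV and skip execution.
            if dest is not None:
                inv.add(dest)
            continue
        if dest is not None:
            inv.discard(dest)  # overwritten with a valid value
        if kind == "load":
            prefetched.append(dest)  # valid address: acts as a prefetch
    return inv, prefetched


window = [
    ("alu", "a", ["r1"]),     # depends on the missing load
    ("branch", None, ["a"]),  # cannot be resolved in runahead mode
    ("load", "r2", ["r9"]),   # independent load: prefetched
    ("alu", "b", ["r2"]),
]
inv, prefetched = runahead(window, miss_dest="r1")
print(sorted(inv))  # ['a', 'r1']
print(prefetched)   # ['r2']
```

When the original miss returns, the processor would restore the checkpoint and re-execute. The sketch only models the runahead pass itself.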
56Kilo and Control Independence
- Larger windows improve:
- The probability of finding the reconvergence point
- The correct detection of control-independent instructions, because the wrong path is completely executed
- The execution of more control-independent instructions for later reuse
[Figure: wrong path and correct path up to the reconvergence point (RP) and the control-independent (CI) instructions, for current instruction windows versus kilo-instruction windows.]
57Kilo and Control Independence
- The larger the window the more opportunities to
find the reconvergence point.
Current instruction windows
58Grant Proposal to Intel January 28th, 2002
- In the first semester we worked on the smart register file and the associated ISA, and evaluated the proposed architecture with a few kernels. We showed speedups around 20 in the tested kernels. At the end of the first semester, we began the work on wide registers. From April to August 2001, we investigated three different approaches to using register files with wide ports (i.e., ports that allow several consecutive registers to be read in a single access). The first one tried to find subgraphs in the data dependence graph that have the same shape. The drawback of this approach is that it requires moving loads above stores in order to have significant coverage, so some type of dependence speculation, which adds non-negligible complexity, is required. We also studied the potential to exploit wide registers by looking at instructions in a window of 32 instructions. For Spec95 programs, we found that 48.9% and 52.3% of the operands were not wide for integer and FP codes, respectively. We continue working on an approach that tries to group the two values of all two-operand instructions in a single wide register.
- Since August 2001, we have been working on committing instructions out of order, which allows processor resources to be freed in advance and the execution of new instructions to continue. The main idea is as follows: when the processor finds an old instruction in the ROB with a large latency and the ROB is full, the processor removes this instruction by checkpointing the state of the processor at the last committed instruction. The processor continues its work normally and moves all instructions that depend on the checkpointed instruction to the checkpointing table. In case of misspeculation or an exception in either the checkpointed instruction or any instruction dependent on it, the checkpointed state is restored. The design of the mechanism is still in progress. We are building a simulation environment that will permit us to evaluate the proposal.
- The work we plan to do during this year concerning the out-of-order commit mechanism is the following:
- To finish the simulator and start the evaluation of different alternatives for the implementation of the out-of-order commit mechanism.
- To optimize the mechanism for those branches where the branch predictor fails frequently.
- To study new organizations for the load-store queues.
- To use the concept of virtual registers to optimize the register file organization.
- Concerning the work on wide registers, we are going to finish the design of the mechanism and evaluate it.