Title: Computer Architecture
1 Computer Architecture
- Lab 5.1
- Prof. Jerry Breecher
- CSCI 240
- Fall 2001
2 What you will do in this lab.
- The purpose of this lab is to let you use some of
the concepts you've acquired about Pipelines.
You will be examining the code produced by the
compiler to better understand what it can do.
- You have only one task before you:
- Use the tool provided, called soak.c, to
determine properties of the memory subsystem.
Using this tool, and a lot of ingenuity, you can
find out the following information:
- Size of the L1 cache
- Size of the L2 cache
- Data Access speed of the L1 cache
- Data Access speed of the L2 cache
- Data Access speed of the Main Memory
- The level of associativity of the L1 and L2
caches.
- The time lost due to a TLB miss.
- The level of associativity of the TLB.
- The time required for a mis-aligned data read.
Wow! Is this open ended or what!!
3 What you will do in this lab.
- What is a verbal lab?
- You prepare, document and tie up all the pieces
of your lab just as if you were handing it in.
Instead, you and your teammate talk over the
results with me. The discussion will be
professional, the way I would talk about a
problem with a junior colleague.
- You are expected to bring to this discussion:
- All notes you've written about the problem. It
is NOT acceptable to say "I think the answer was
42."
- Many of you at the last verbal discussion said
things like "I don't know why the answer came out
the way it did." This again is not the way a
professional discussion takes place. You are
expected to have your facts straight and to
understand what it is you accomplished. Your job
is to tame this piece of silicon and know what
it's doing.
4 Where To Get Documentation
- There is an absolutely stupendous manual devoted
to the Pentium III architecture (which is what we
have in the lab):
- The Intel Architecture Optimization Reference
Manual
- http://developer.intel.com/design/pentiumii/manuals/245127.htm
- Local copy at
- http://babbage.clarku.edu/jbreecher/docs/Intel Architecture Optimization Reference Manual.pdf
- For Pentium 4, the manual is
- Pentium 4 and Xeon Processor Optimization
- http://developer.intel.com/design/pentium4/manuals/248966.htm
- Local copy at
- http://babbage.clarku.edu/jbreecher/docs/Pentium 4 Xenon Processor Optimization.pdf
- In both these manuals,
- Chapter 1 contains lots of great information
about The Pipeline used in the processors.
- Chapter 2 contains guidelines for Optimizing
Performance.
- There are also excellent coding examples
throughout.
5 Task 1
- Steps To Accomplish This Task.
- 1. There is no way you can do this task until you
have a complete and thorough understanding of the
memory hierarchy. You can gain this
understanding in the lecture or by reading the
book.
- 2. Develop a plan! This is a big undefined
project. You need to figure out a plan for each
of the pieces. I will actually do Parts A and E
as examples for you.
- 3. Sit and think. What are the inputs you want
to use for Part B? Try them. Then sit and think
some more. Do your results match your picture of
how the cache works?
- You will be evaluated on your methodology!
6 About Soak
- Say "soak" and you will get lots of information
about the program. Here's a bit about the
inputs:
- soak <total_mem> <step> <Mega-touches>
- Total memory is the span of memory, in bytes, to
be touched.
- Step is the number of bytes jumped on each memory
touch.
- Mega-touches is how many million memory reads you
will do.
- For example, "soak 128 32 100" touches
memory locations
- 0, 32, 64, 96, 0, 32, 64, 96, ... for a total of
100,000,000 touches. Since there are 4 touches
in a cycle, there will be a total of
25,000,000 cycles.
- Here's where to find the code for soak.c:
- http://babbage.clarku.edu/jbreecher/docs/soak.c
- You may wish to read this code. It's bigger than
the throwaway tidbits you've seen so far.
7 About Soak
- The relevant part of the soak program:

    get_current_time( start_seconds );
    for ( j = 0; j < iterations; j++ )                /* The Outer Loop */
    {
        for ( i = 0; i < steps_per_iteration; i++ )   /* The Inner Loop */
        {
            /* A "touch" is defined as one cycle within this loop */
            new_ptr = (STRUCT *)( (int)memory_ptr + global + (step_size * i) );
            global  = new_ptr->trash;
        }
    }
    get_current_time( end_seconds );
8 About Soak
- The relevant part of the soak program:

        movl global,%ecx
        movl new_ptr,%eax
        .p2align 4,,7
    .L188:
        leal 1(%edx),%edi
        cmpl $0,-48(%ebp)
        jle .L187
        xorl %ebx,%ebx
        movl -48(%ebp),%edx
        .p2align 4,,7
    .L192:
        leal (%ecx,%esi),%eax
        addl %ebx,%eax
        movl (%eax),%ecx
        addl -44(%ebp),%ebx
        decl %edx
        jnz .L192
    .L187:
        movl %edi,%edx
        cmpl -32(%ebp),%edx
        jl .L188
        movl %eax,new_ptr
        movl %ecx,global
9 About Soak
- The relevant part of the soak program:

    .L188:
        leal 1(%edx),%edi          <-- Outer Loop
        cmpl $0,-48(%ebp)          <-- Outer Loop
        jle .L187                  <-- Outer Loop
        xorl %ebx,%ebx             <-- Outer Loop
        movl -48(%ebp),%edx        <-- Outer Loop
    .L192:
        leal (%ecx,%esi),%eax      <-- Inner Loop
        addl %ebx,%eax             <-- Inner Loop
        movl (%eax),%ecx           <-- Inner Loop  <-- Memory Touch
        addl -44(%ebp),%ebx        <-- Inner Loop
        decl %edx                  <-- Inner Loop
        jnz .L192                  <-- Inner Loop
    .L187:
        movl %edi,%edx             <-- Outer Loop
        cmpl -32(%ebp),%edx        <-- Outer Loop
        jl .L188                   <-- Outer Loop
10 Other Support Material
- Useful Rabbit Codes
- L2_LINES_IN: Number of lines allocated in (loaded
into) L2. These are requests that miss the L2
and go to main memory.
- L2_RQSTS: Number of requests to the L2 cache;
this includes both requests that hit the cache
and those that miss and must then go to main
memory.
- INST_RETIRED: Number of instructions retired.
- MISALIGN_MEM_REF: Number of instructions that
accessed memory not on the correct mod boundary.
This causes the hardware to do extra work to
bring the data in.
- There are also the various rabbit codes you've
used in previous labs.
- "rabbit soak .." produces all codes. "rabbit -g
2 soak ." gets most of these codes.
- These types are more fully described in
- IA32 SDM Vol3 System Programmers Guide.pdf
starting on page A-22.
11 Other Support Material
- How To Write A Shell Command
- In planning your tests, it's easier to write a
shell script as a way of remembering what you did
and as a way of repeating some or all of an
experiment. Here are the steps you might follow:
- vi my_commands
- cat my_commands
    soak 2048 32 100
    soak 4096 32 100
- chmod 700 my_commands
- my_commands
12 About The Caches
- This is what I get on johnson when I run
arch_params:
- Cache and TLB
- Instruction TLB ... 4 kb pages, 4-way set
associative, 32 entries
- Data TLB .......... 4 kb pages, 4-way set
associative, 64 entries
- L2 cache .......... 256 kb, 8-way set
associative, 32 byte line size
- L1 instruction cache 16 kb, 4-way set
associative, 32 byte line size
- L1 data cache ..... 16 kb, 4-way set
associative, 32 byte line size
- I believe that at least one of these numbers is
wrong, though it could simply be the silicon
outsmarting me.
- The way that data is replaced in a cache causes
results to be rather tricky. For example, let's
suppose that the L1 cache is 16,384 bytes in
size. When you run a test that asks for exactly
this much memory (soak 16384 32 100) you get one
time. In the best of all worlds, running a test
touching more memory (soak 20000 32 100) would
give a completely different time. But it's not
that simple. The reason is that in the 20000
byte test, only some of the memory may be kicked
out. So some values hit in the cache and some
don't, giving an unusual timing.
13 About The Caches
- Memory replacement is defined as Pseudo-LRU.
We will talk about this in class.
- It's possible to get different timings for the
same test!! This happens when one time the data
fits in the cache, and another time it doesn't.
I don't know why this is.
- So to get proper timings, it's necessary to go to
a memory size much larger than the next smaller
cache. What do I mean by that?
- Doing the TLB test is tricky (as if the other
tests aren't!) You need to develop an access
pattern that touches many more pages than are in
the TLB, and compare that with a run where the
total pages touched do fit in the TLB.
14 Task 1
- Here's the way to do Part A
- How do you tell that you're getting the data from
one memory level rather than another? Well,
they take different amounts of time! So try
touching different amounts of memory; when the
amount of memory touched no longer fits in the
cache, then the time to get that memory will be
larger than before.

    soak  2048 32 100     5.5 nanoseconds
    soak  4096 32 100     5.4 nanoseconds
    soak  8192 32 100     5.2 nanoseconds
    soak 12288 32 100     5.4 nanoseconds
    soak 16384 32 100     5.3 nanoseconds
    soak 20480 32 100     9.2 nanoseconds
    soak 24576 32 100     9.2 nanoseconds
15 Task 1
- Here's the way to do Part E
- In this Part, you're figuring out the cost of
doing a memory access. Let's run soak in a mode
that ensures that we miss the L2 cache every time;
we'll ask for lots of memory.
- rabbit soak 2000000 32 100

    Soak Version November 5, 2001
    Touching 2000000 bytes of memory for 1600 iterations
    2000000 bytes allocated at 0x40141000
    13.749 seconds elapsed for 100000000 memory touches
    137.5 nanoseconds per touch

- This says there are 100,000,000 memory touches in
13.7 seconds, or 7,300,000 touches per second.

    Event                                      Events             Events/sec
    ----------------------------------------   ----------------   ----------------
    0x24  36  l2_lines_in                      2414030            7323989.60
    0x2e  46  l2_rqsts                         2406704            7318328.50