1
Computer Architecture

Lab 5.1
Prof. Jerry Breecher
CSCI 240
Fall 2001

2
What you will do in this lab.

The purpose of this lab is to let you use some of
the concepts you've acquired about pipelines.
You will be examining the code produced by the
compiler to better understand what it can do.
You have only one task before you:
Use the tool provided, called soak.c, to
determine properties of the memory subsystem.
Using this tool, and a lot of ingenuity, you can
find out the following information:
  A. Size of the L1 cache
  B. Size of the L2 cache
  C. Data access speed of the L1 cache
  D. Data access speed of the L2 cache
  E. Data access speed of the main memory
  F. The level of associativity of the L1 and L2 caches
  G. The time lost due to a TLB miss
  H. The level of associativity of the TLB
  I. The time required for a misaligned data read

Wow! Is this open ended or what!!

3
What you will do in this lab.

What is a verbal lab?
You prepare, document and tie up all the pieces
of your lab just as if you were handing it in.
Instead, you and your teammate talk over the
results with me. The discussion will be
professional, the way I would talk about a
problem with a junior colleague.
You are expected to bring to this discussion
all notes you've written about the problem. It
is NOT acceptable to say "I think the answer was
42."
Many of you at the last verbal discussion said
things like "I don't know why the answer came out
the way it did." This, again, is not the way a
professional discussion takes place. You are
expected to have your facts straight and to
understand what it is you accomplished. Your job
is to tame this piece of silicon and know what
it's doing.

4
Where To Get Documentation

There is an absolutely stupendous manual devoted
to the Pentium III architecture (which is what we
have in the lab):
  The Intel Architecture Optimization Reference Manual
  http://developer.intel.com/design/pentiumii/manuals/245127.htm
  Local copy at
  http://babbage.clarku.edu/jbreecher/docs/Intel Architecture Optimization Reference Manual.pdf
For Pentium 4, the manual is:
  Pentium 4 and Xeon Processor Optimization
  http://developer.intel.com/design/pentium4/manuals/248966.htm
  Local copy at
  http://babbage.clarku.edu/jbreecher/docs/Pentium 4 Xenon Processor Optimization.pdf
In both these manuals,
  Chapter 1 contains lots of great information
  about the pipeline used in the processors.
  Chapter 2 contains guidelines for optimizing
  performance.
There are also excellent coding examples
throughout.

5
Task 1

Steps To Accomplish This Task.
1. There is no way you can do this task until you
have a complete and thorough understanding of the
memory hierarchy. You can gain this
understanding in the lecture or by reading the
book.
2. Develop a plan! This is a big undefined
project. You need to figure out a plan for each
of the pieces. I will actually do Parts A and E
as examples for you.
3. Sit and think. What are the inputs you want
to use for Part B? Try them. Then sit and think
some more. Do your results match your picture of
how the cache works?
You will be evaluated on your methodology!

6
About Soak

Say "soak" and you will get lots of information
about the program. Here's a bit about the
inputs:
    soak <total_mem> <step> <Mega-touches>
Total memory is the span of memory, in bytes, to be touched.
Step is the number of bytes jumped on each memory
touch.
Mega-touches is how many million memory reads you
will do.
For example, "soak 128 32 100" touches
memory locations
0, 32, 64, 96, 0, 32, 64, 96, ... for a total of
100,000,000 touches. Since there are
4 touches in a cycle, there will be a total of
25,000,000 cycles.
Here's where to find the code for soak.c:
http://babbage.clarku.edu/jbreecher/docs/soak.c
You may wish to read this code. It's bigger than
the throwaway tidbits you've seen so far.

7
About Soak

The relevant part of the soak program:

    get_current_time( &start_seconds );
    for ( j = 0; j < iterations; j++ )                /* The Outer Loop */
    {
        for ( i = 0; i < steps_per_iteration; i++ )   /* The Inner Loop */
        {
            /* A "touch" is defined as one cycle within this loop */
            new_ptr = (STRUCT *)( (int)memory_ptr + global + (step_size * i) );
            global  = new_ptr->trash;
        }
    }
    get_current_time( &end_seconds );

8
About Soak

The relevant part of the soak program, as compiled:

        movl global,%ecx
        movl new_ptr,%eax
        .p2align 4,,7
    .L188:
        leal 1(%edx),%edi
        cmpl $0,-48(%ebp)
        jle .L187
        xorl %ebx,%ebx
        movl -48(%ebp),%edx
        .p2align 4,,7
    .L192:
        leal (%ecx,%esi),%eax
        addl %ebx,%eax
        movl (%eax),%ecx
        addl -44(%ebp),%ebx
        decl %edx
        jnz .L192
    .L187:
        movl %edi,%edx
        cmpl -32(%ebp),%edx
        jl .L188
        movl %eax,new_ptr
        movl %ecx,global

9
About Soak

The relevant part of the soak program, annotated:

    .L188:
        leal 1(%edx),%edi        <-- Outer Loop
        cmpl $0,-48(%ebp)        <-- Outer Loop
        jle .L187                <-- Outer Loop
        xorl %ebx,%ebx           <-- Outer Loop
        movl -48(%ebp),%edx      <-- Outer Loop
    .L192:
        leal (%ecx,%esi),%eax    <-- Inner Loop
        addl %ebx,%eax           <-- Inner Loop
        movl (%eax),%ecx         <-- Inner Loop  <-- Memory Touch
        addl -44(%ebp),%ebx      <-- Inner Loop
        decl %edx                <-- Inner Loop
        jnz .L192                <-- Inner Loop
    .L187:
        movl %edi,%edx           <-- Outer Loop
        cmpl -32(%ebp),%edx      <-- Outer Loop
        jl .L188                 <-- Outer Loop

10
Other Support Material

Useful Rabbit Codes
  L2_LINES_IN       Number of lines allocated in (loaded
                    into) L2. These are requests that miss the L2
                    and go to main memory.
  L2_RQSTS          Number of requests to the L2 cache.
                    This includes both requests that hit the cache
                    and those that miss and must then go to main
                    memory.
  INST_RETIRED      Number of instructions retired.
  MISALIGN_MEM_REF  Number of instructions that
                    accessed memory not aligned on the correct boundary.
                    This causes the hardware to do extra work to
                    bring the data in.
There are also the various rabbit codes you've
used in previous labs.
"rabbit soak ..." produces all codes. "rabbit -g
2 soak ..." gets most of these codes.
These types are more fully described in
"IA32 SDM Vol3 System Programmers Guide.pdf"
starting on page A-22.

11
Other Support Material

How To Write A Shell Command
In planning your tests, it's easier to write a
shell script as a way of remembering what you did
and as a way of repeating some or all of an
experiment. Here are the steps you might follow:
    vi my_commands
    cat my_commands
        soak 2048 32 100
        soak 4096 32 100
    chmod 700 my_commands
    my_commands

12
About The Caches

This is what I get on johnson when I run
arch_params:
  Cache and TLB
  Instruction TLB ... 4 kb pages, 4-way set associative, 32 entries
  Data TLB .......... 4 kb pages, 4-way set associative, 64 entries
  L2 cache .......... 256 kb, 8-way set associative, 32 byte line size
  L1 instruction cache 16 kb, 4-way set associative, 32 byte line size
  L1 data cache ..... 16 kb, 4-way set associative, 32 byte line size
I believe that at least one of these numbers is
wrong, though it could simply be the silicon
outsmarting me.
The way that data is replaced in a cache makes
the results rather tricky. For example, let's
suppose that the L1 cache is 16,384 bytes in
size. When you run a test that asks for exactly
this much memory (soak 16384 32 100) you get one
time. In the best of all worlds, running a test
touching more memory (soak 20000 32 100) would
give a completely different time. But it's not
that simple. The reason is that in the 20000
byte test, only some of the memory may be kicked
out. So some values hit in the cache and some
don't, giving an unusual timing.

13
About The Caches

Memory replacement is defined as Pseudo-LRU.
We will talk about this in class.
It's possible to get different timings for the
same test!! This happens when one time the data
fits in the cache, and another time it doesn't.
I don't know why this is.
So to get proper timings, it's necessary to go to
a memory size much larger than the next smaller
cache. What do I mean by that?
Doing the TLB test is tricky (as if the other
tests aren't!). You need to develop an access
pattern that touches many more pages than are in
the TLB, and compare that with a run where the
total pages touched do fit in the TLB.

14
Task 1

Here's the way to do Part A
How do you tell that you're getting the data from
one memory level rather than another? Well,
they take different amounts of time! So try
touching different amounts of memory. When the
amount of memory touched no longer fits in the
cache, then the time to get that memory will be
larger than before.

    soak  2048 32 100     5.5 nanoseconds
    soak  4096 32 100     5.4 nanoseconds
    soak  8192 32 100     5.2 nanoseconds
    soak 12288 32 100     5.4 nanoseconds
    soak 16384 32 100     5.3 nanoseconds
    soak 20480 32 100     9.2 nanoseconds
    soak 24576 32 100     9.2 nanoseconds
15
Task 1

Here's the way to do Part E
In this Part, you're figuring out the cost of
doing a memory access. Let's run soak in a mode
that ensures that we miss the L2 cache every time:
we'll ask for lots of memory.
    rabbit soak 2000000 32 100
    Soak Version November 5, 2001
    Touching 2000000 bytes of memory for 1600 iterations
    2000000 bytes allocated at 0x40141000
    13.749 seconds elapsed for 100000000 memory touches
    137.5 nanoseconds per touch
This says there are 100,000,000 memory touches in
13.7 seconds, or 7,300,000
touches per second.
    Event                     Events        Events/sec
    ------------------------  ------------  ----------------
    0x24  36  l2_lines_in     2414030       7323989.60
    0x2e  46  l2_rqsts        2406704       7318328.50