Title: Computer Organization
1. Memory
- Computer Organization
- CS 140
2. The Lab - Overview
- The purpose of this lab is to gain an understanding of memory and its behavior.
- You will do this by running code that tickles the processor, making it interact in a particular way, and from this you learn characteristics about that processor. You're doing experiments on the hardware.
- Run CPUID or CPU-Z on several machines and interpret the results.
- Use the supplied program to measure speed from various storage levels. Use VTune to analyze the results.
- Write your own code that uses a large amount of memory.
3. Task 1: Run CPUID on three machines and interpret results.
There's a method of measuring the hardware characteristics of a processor, its caches, and memory. You are to use this method to characterize and understand the behavior of these machines. A sample output (which happens to be for our lab machines) is shown below. You should be able to explain the size, entries, associativity, TLB, L1, L2, etc. In other words, you should understand all the words that appear in this report and why they are used where they are. Tools are described on the next page.

Cache and TLB
  Instruction TLB .......... 4K-Byte Pages, 4-way associative, 128 entries
  Data TLB ................. 4 Mb pages, 4-way associative, 32 entries
  64-byte cache line prefetching
  1st-level I-cache ........ 32K Bytes, 8-way associative, 64 byte line size
  1st-level D-cache ........ 32K Bytes, 8-way associative, 64 byte line size
  L2 cache ................. 3 Mb, 12-way associative, 64 byte line size
4. Task 1: Run CPUID on three machines and interpret results.
There are certainly a number of tools on the web; feel free to use them. I have written a program, cpuid.c, that runs on Intel processors on both Windows and LINUX. I have downloaded a tool named amdcpuid.exe from the AMD website. You may be able to find many other, even better tools.

Tools for use in your measurements
         Windows     LINUX
Intel    cpuz.exe    cpuid.c
AMD      cpuid.c     cpuid.c
5. Task 2: Measure performance of various storage levels
You've been learning about storage hierarchies. You've learned that the fastest storage is registers, followed by the L1 cache, the L2 cache, main memory, and disk. Well, how fast are they relative to each other? You will run a program, Lab7_TouchLots.c, that does a constant amount of work, but can perform that work in the amount of memory you specify. The work it performs is to swap two elements in a square array. Swapping each non-diagonal element with its mirror image across the diagonal is what's called a transpose. A megaswap is a million such swaps.

1 2
3 4
To transpose this matrix requires one swap. To perform a megaswap, we would need to do 1,000,000 transposes.

1 2 3
4 5 6
7 8 9
To transpose this matrix requires three swaps. To perform a megaswap, we would need to do 333,333 transposes.
6. Task 2: Measure performance of various storage levels
VTune is the most amazing tool. Built by Intel, it uses hooks into the processor and the OS to determine an amazing number of characteristics about a running program. For instance, it can measure processor branch mispredictions, L1 and L2 cache misses, CPI, and other goodies. VTune can be downloaded from
http://software.intel.com/en-us/articles/intel-software-evaluation-center/
One of your tasks with VTune will be to figure out what it is measuring. You now know enough about processors to understand the language. You will need to read the documentation and make the tool work for you.
7. Task 2: Measure performance of various storage levels
[Figure: VTune screenshot showing the number of L2 cache misses and the number of L1 data cache misses]
8. Task 2: Measure performance of various storage levels
jbreecher_at_younger/public/comp_org/Lab07 a.out 1000 10000000
We are about to allocate a memory of size 10000000 bytes
The memory allocation succeeded
The array elements are long longs, so each element is 8 bytes
The array we're using has 9999392 bytes, 1249924 elements, 1118 rows 1118 columns
Each transpose has 624403 swaps and there will be 160 Transposes
There will be a total of 999.9045 MegaSwaps in the given time
Elapsed time 2.345834 seconds
 for a total of 999.9045 MegaSwaps

This is a sample output. It says that a memory size of approximately 10,000,000 bytes supports an array that does 100 megaswaps in about 2.35 seconds. As you vary the memory size, what happens to the time it takes to complete a constant number of swaps? Remember to change only one variable at a time.
9. Task 2: Measure performance of various storage levels
[Figure: two plots of Time/Swap versus Memory Size, one labeled "Dream On!" and one labeled "More Real"]
Wouldn't it be nice if, as you increase memory in the array, the time/swap increased linearly? Dream on! The data is NOT nice and clean and will require interpretation.
10. Task 2: Measure performance of various storage levels
The same thing goes for cache misses vs. memory size. This is because there's no simple way to map how data fits into the cache. You'll remember that the cache is a physical cache: the sets are determined by the physical address rather than the virtual address.
So what you can do, using VTune, is determine the relationship between cache misses and time. This will help you build a model of the cost of data hits and misses for both the L1 and the L2 caches.
[Figure: plot of Time/Swap versus Cache Misses, labeled "This IS true!"]
11. Task 2: Measure performance of various storage levels
For a typical cache hierarchy, the cost of doing a data access (whether it hits in the L1, the L2, or in memory) is determined by the equation

$$\text{Average-Time-For-Memory-Access} = P_{L1} T_{L1} + P_{L2} T_{L2} + P_{M} T_{M}$$

where $P_{L1}$, $P_{L2}$, $P_{M}$ are the probabilities of the data access hitting in the L1, L2, and memory, so that $P_{L1} + P_{L2} + P_{M} = 1$, and $T_{L1}$, $T_{L2}$, $T_{M}$ are the times for the data to be retrieved from the L1, L2, and memory.
What's nice about VTune is that it gives you all these values when you run the experiment with different memory sizes. Plug what you know into the equations and solve the simultaneous equations. Start small, with everything hitting in the L1 cache. Now you know the $T_{L1}$ time. Keep going.
12. Task 2: Measure speed from various storage levels
This is the relevant code:

StartTime = GetTimeNow();
// We want to transpose the matrix. That means that for all
// elements where Row <> Column, we want to swap
// E[Row][Column] and E[Column][Row]
for ( Counter = 0; Counter < TotalTransposes; Counter++ )
{
    for ( Row = 0; Row < Dimension; Row++ )
    {
        for ( Column = Row; Column < Dimension; Column++ )
        {
            if ( Row == Column ) continue;
            Source      = Row * Dimension + Column;
            Destination = Column * Dimension + Row;
            Temp                   = MemoryPtr[Source];
            MemoryPtr[Source]      = MemoryPtr[Destination];
            MemoryPtr[Destination] = Temp;
        }   // End of for Column
    }       // End of for Row
}           // End of for Counter
13. Task 3: Writing your own code.
- Your task is to write a program that adds up all the elements in an array.
- Here are the rules:
  - You must be able to add the elements in a 1,000,000,000-byte array.
  - You may have to add all these elements more than once so that accurate timing can be achieved; going through those 1,000,000,000 bytes once might be too fast to measure easily. I want the execution of your code to take at least 3 seconds, and I want it to report how many times it completed the addition of all 10^9 bytes.
  - For timing, I recommend you use the timing tools and the start/stop methodology found in the code you got for Task 2.
  - You will achieve a top grade if your code accomplishes the addition faster than mine.
  - I will be evaluating your code using Windows on a machine in the Closet Lab.
  - There may be other rules we need to talk about.
  - But please don't ask me "can we do X?" If the rules don't say you can't do it, then you can do it.
14. Evaluation Sheet
- Lab 07
Your Name______________________
- Task 1
  - Results of the CPUID experiment can be explained. Questions about the results achieved can be articulated.
  - Important words include: L1 Data Cache, L1 Instruction Cache, L2 Cache, Main Memory, Associativity, Cache Line Size, TLB
- Task 2
  - Student has a value for the arguments in the equation
  - Time-to-load-memory
- Task 3
  - Code that adds all elements of a matrix has been coded and executed and the results explained. How do these results compare to the element swapping of Task 2?
  - We can run the program that adds the numbers in an array; it runs as fast as possible.
  - Students can explain what they've done to speed up the program.
15. Program Code
///////////////////////////////////////////////////////////////////////////
// The purpose of this program is to understand the costs of touching
// various kinds of storage:
//     registers,
//     processor cache,
//     main memory,
//     disk
// Arg 1 - The number of "Megaswaps"; a swap is an interchange
//         of two memory locations.
// Arg 2 - The memory size (in bytes) of the storage area that is being
//         worked on.
// Compiler Instructions:
//     Windows:  gcc -O3 Lab7_TouchLots.c -o Lab7_TouchLots
//     Linux:    gcc -O3 Lab7_TouchLots.c -lm -o Lab7_TouchLots
///////////////////////////////////////////////////////////////////////////
// #define LINUX
#define WINDOWS
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#ifdef WINDOWS
#include <wtypes.h>
#include <mmsystem.h>
#endif
#ifdef LINUX
#include <sys/time.h>   // needed for gettimeofday()
#endif
16. Program Code
///////////////////////////////////////////////////////////////////////////
// Prototypes and Globals
///////////////////////////////////////////////////////////////////////////
double GetTimeNow( );
long long *MemoryPtr;
///////////////////////////////////////////////////////////////////////////
// The main code is here
///////////////////////////////////////////////////////////////////////////
int main(int argc, char *argv[])
{
    int Row, Column, Dimension;
    int Source, Destination;
    long MemorySizeInBytes, SwapsInThisTranspose;
    long long TotalTransposes, Counter, Temp;
    unsigned int NumberOfMegaSwaps;
    double StartTime, EndTime;

    if ( argc < 3 )
    {
        printf( "Usage: Prog06 <Number of MegaSwaps> <MemorySize>\n" );
        exit(1);
    }
    // Get the arguments
    NumberOfMegaSwaps = atoi( argv[1] );
    MemorySizeInBytes = atol( argv[2] );
    if ( MemorySizeInBytes > 1700000000 )
    {
        printf( "You have requested too large a memory size\n" );
        exit(0);
    }
17. Program Code
    printf( "We are about to allocate a memory of size %ld bytes\n",
            MemorySizeInBytes );
    MemoryPtr = ( long long * ) malloc( MemorySizeInBytes );
    if ( MemoryPtr == 0 )
    {
        printf( "We couldn't allocate requested amount of memory.\n" );
        exit(1);
    }
    printf( "The memory allocation succeeded\n" );
    // Each element is a long long, so divide by the element size
    Dimension = (int)sqrt( (double)MemorySizeInBytes / sizeof( *MemoryPtr ) );
    printf( "The array elements are long longs, so each element is 8 bytes\n" );
    printf( "The array we're using has %ld bytes, %d elements, %d rows %d columns\n",
            (long)( sizeof( *MemoryPtr ) * Dimension * Dimension ),
            Dimension * Dimension, Dimension, Dimension );
    // Our unit of "work" is a single swap. But a large memory array has many more
    // swaps required to do the transpose. So we need to standardize the number of
    // swaps accomplished. We do this by getting the number of transposes to do:
    //     TotalTransposes = NumberOfMegaSwaps / SwapsInThisSizedTranspose
    SwapsInThisTranspose = Dimension * (Dimension - 1) / 2;
    TotalTransposes = (int)(((double)NumberOfMegaSwaps * 1000000) /
                            (double)SwapsInThisTranspose);
    printf( "Each transpose has %ld swaps and there will be %ld Transposes\n",
            SwapsInThisTranspose, (long)TotalTransposes );
    printf( "There will be a total of %.4f MegaSwaps in the given time\n",
            (double)SwapsInThisTranspose * (double)TotalTransposes / 1000000 );
18. Program Code
    StartTime = GetTimeNow();
    // We want to transpose the matrix. That means that for all elements where
    // Row <> Column, we want to swap E[Row][Column] and E[Column][Row]
    for ( Counter = 0; Counter < TotalTransposes; Counter++ )
    {
        for ( Row = 0; Row < Dimension; Row++ )
        {
            for ( Column = Row; Column < Dimension; Column++ )
            {
                if ( Row == Column ) continue;
                Source      = Row * Dimension + Column;   // Map array onto linear memory
                Destination = Column * Dimension + Row;
                Temp                   = MemoryPtr[Source];
                MemoryPtr[Source]      = MemoryPtr[Destination];
                MemoryPtr[Destination] = Temp;
            }   // End of for Column
        }       // End of for Row
    }           // End of for Counter
    EndTime = GetTimeNow();
    printf( "Elapsed time %f seconds\n", EndTime - StartTime );
    printf( " for a total of %.4f MegaSwaps \n",
            (double)SwapsInThisTranspose * (double)TotalTransposes / 1000000 );
    return 0;
}   // End of main
19. Program Code
///////////////////////////////////////////////////////////////////////////
// GetTimeNow()   Return time in seconds
///////////////////////////////////////////////////////////////////////////
double GetTimeNow( )
{
    double time_returned;
#ifdef WINDOWS
    static short first_time = 1;
    LARGE_INTEGER ticks = {0,0};
    LARGE_INTEGER ticks_per_second = {0,0};
    static double ticks_per_microsecond;
    if ( first_time == 1 )
    {
        // Read the counter frequency once and remember it
        QueryPerformanceFrequency( &ticks_per_second );
        ticks_per_microsecond = (float) ticks_per_second.LowPart / 1E6;
        first_time = 0;
    }
    QueryPerformanceCounter( &ticks );
    // Combine the low and high 32-bit halves of the counter
    time_returned  = ticks.LowPart / ticks_per_microsecond;
    time_returned += ldexp( ticks.HighPart, 32 ) / ticks_per_microsecond;
    time_returned /= 1E6;
#endif
#ifdef LINUX
    struct timeval tp;
    gettimeofday( &tp, NULL );
    time_returned  = tp.tv_usec;
    time_returned /= 1E6;
    time_returned += tp.tv_sec;
#endif
    return( time_returned );
}   // End of GetTimeNow