Title: QQ: Nanoscale Timing and Profiling
1 QQ: Nanoscale Timing and Profiling
James Frye, James G. King, Christine J. Wilson, Frederick C. Harris, Jr.
Department of Computer Science and Engineering, Brain Computation Lab, Biomedical Engineering
University of Nevada, Reno, NV 89557
4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS05)
2005 IPDPS Conference: 19th IEEE International Parallel and Distributed Processing Symposium
2 What is QQ
- QQ is a simple and efficient tool for measuring timing and memory use
- Developed for the examination of a massively parallel program
- Easily extensible to inspect other programs
3 QQ Development
- QQ was developed to optimize a parallel program used to simulate cortical neurons, the NeoCortical Simulator (NCS)
- Our goal for the summer of 2002 was to simulate 10^6 neurons with 10^9 synapses within a realistic run time
- Before optimization, NCS would run about 1.5 million synapses at a rate of 1 day of run time per simulated second of synaptic activity
- Clearly, optimization of NCS was needed
4 NeoCortical Simulator (NCS)
- Originated in the Brain Computation Lab led by Dr. Phil Goodman
- Incorporates membrane dynamics
- Uses simulated ion channels, when applied, to modulate changes in membrane voltage
- Compartment-based simulator
- Allows channel dynamics to drive the membrane voltage
5 NCS Biology
- Neuron: a brain cell and the basic unit, or compartment
- Synapse: the region of communication between compartments
- Channel: openings in the cellular membrane that allow the passage of various ions to induce a voltage gradient across the membrane
- Action potential: an electrical signal that translates to a chemical signal to the post-synaptic cell
6 Neurons
7 NCS Biology
- The membrane voltage determines the cell's firing rate
- Once the threshold voltage is reached, the cell sends an action potential to its connected synapses
8 2-Cell Model
9 No Channels
Sustained firing at maximum rate during a continuous stimulus
10 Ka Channel
Slows the initial response during a sustained stimulus
11 Km Channel
Prevents continuous bursting during a continuous stimulus
12 Kahp Channel
Dampens the effect while still allowing for some action potentials during a sustained stimulus
13 QQ Design
- QQ is designed so that all of its routines can be selectively compiled into a program
- In the QQ.h header file, each routine is defined inside a preprocessor conditional, so that if profiling is not enabled, the call reduces to an empty statement

    #ifdef QQ_ENABLE
    void QQInit (int);
    #else
    #define QQInit(dummy)
    #endif
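
The conditional-compilation pattern above can be tried in a single file. The following is a minimal sketch, not the actual QQ.h: only QQ_ENABLE and QQInit come from the slides, while qq_event_capacity and simulate_steps are hypothetical names used for illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for QQ.h. Only QQ_ENABLE and QQInit appear in the
   slides; everything else here is illustrative. */
#define QQ_ENABLE            /* remove this line to compile profiling out */

#ifdef QQ_ENABLE
int qq_event_capacity = 0;   /* assumed internal profiler state */

/* Enabled build: a real function that sets up the profiler. */
void QQInit(int nEvents)
{
    assert(nEvents > 0);
    qq_event_capacity = nEvents;
}
#else
/* Disabled build: the call site "QQInit(n);" becomes the empty statement. */
#define QQInit(dummy)
#endif

/* Application code is written once and never tests QQ_ENABLE itself. */
int simulate_steps(int steps)
{
    int done = 0;
    QQInit(1024);
    for (int i = 0; i < steps; i++)
        done++;
    return done;
}
```

Note that the function-like macro must be written without a space between the name and the opening parenthesis; with a space, the preprocessor would treat "(dummy)" as the replacement text.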
14 QQ Design
- Memory profiling routines also use the C preprocessor to intercept library calls

    #ifdef QQ_ENABLE
    #define malloc(arg) MemMalloc (MEM_KEY, arg)
    #endif

- The MemMalloc function records the allocation information, calls the real malloc function to do the actual allocation, and returns the result to the caller
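
A minimal sketch of this interception, under stated assumptions: the key enumeration, the mem_bytes and mem_calls tallies, and new_channel_state are hypothetical, and the real MemMalloc very likely records more than a byte count. Redefining a standard library name as a macro is not formally sanctioned by the C standard, although it works with common compilers when done after the header is included.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical key set and tallies; the slides only name MEM_KEY,
   KEY_CHANNEL and MemMalloc. */
enum { KEY_CHANNEL, KEY_SYNAPSE, KEY_COUNT };

size_t mem_bytes[KEY_COUNT];   /* bytes requested under each key */
size_t mem_calls[KEY_COUNT];   /* number of allocations under each key */

/* Defined before the macro below, so "malloc" here is still the library call. */
void *MemMalloc(int key, size_t size)
{
    assert(key >= 0 && key < KEY_COUNT);
    void *p = malloc(size);
    if (p != NULL) {
        mem_bytes[key] += size;
        mem_calls[key] += 1;
    }
    return p;
}

/* From here on, every malloc(...) in this file is charged to MEM_KEY. */
#define MEM_KEY KEY_CHANNEL
#define malloc(arg) MemMalloc(MEM_KEY, arg)

/* Ordinary-looking application code: the call below expands to MemMalloc. */
double *new_channel_state(size_t n)
{
    return malloc(n * sizeof(double));
}
```

The design choice worth noting is that the application source is unchanged: the same malloc call is either profiled or not, depending only on what the preprocessor sees.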
15 QQ Timing
- Extremely accurate measurement of execution speed
- In theory, fine-grained resolution down to a single clock cycle
- In practice, measurements are accurate to tens of cycles
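
The slides do not say how QQ reads the clock. As an illustration only, the sketch below assumes the x86 time-stamp counter read through the GCC/Clang intrinsic __rdtsc(), with a nanosecond fallback on other platforms. The names qq_ticks and qq_min_overhead are hypothetical. Measuring the gap between two back-to-back reads gives a rough idea of the fixed cost of taking a measurement, which is one plausible reason practical accuracy is tens of cycles rather than one.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime in the fallback path */
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

#if defined(__i386__) || defined(__x86_64__)
#include <x86intrin.h>            /* __rdtsc() on GCC and Clang */
#endif

/* Read a tick count. On x86 this is the CPU time-stamp counter;
   elsewhere it falls back to monotonic nanoseconds. */
uint64_t qq_ticks(void)
{
#if defined(__i386__) || defined(__x86_64__)
    return __rdtsc();
#else
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
#endif
}

/* Smallest gap seen between two back-to-back reads: an estimate of the
   fixed overhead of a single measurement. */
uint64_t qq_min_overhead(int trials)
{
    assert(trials > 0);
    uint64_t best = UINT64_MAX;
    for (int i = 0; i < trials; i++) {
        uint64_t t0 = qq_ticks();
        uint64_t t1 = qq_ticks();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    return best;
}
```

Taking the minimum over many trials filters out interrupts and cache misses, which inflate individual samples.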
16 Timing Measurements
- Measuring the impact of a one-line change in the calculation for the Km channel
- From

    I = unitaryG * strength * pow (m, mPower) * (ReversePot - CmpV);

- To

    I = unitaryG * strength * (ReversePot - CmpV);

- For the Km-type channel, mPower is always 1, so we were able to change the equation to streamline execution
- Wrapping the line in calls to QQ, we measure the effect of this single change

    QQStateOn (QQ_Km);
    I = unitaryG * strength * (ReversePot - CmpV);
    QQStateOff (QQ_Km);
17 Timing Measurements
- Note that both code versions give similar cycle counts on different processors, though the counts are more consistent and somewhat lower on the P4 than on the P3
- Times for similar counts scale inversely with processor clock speed, as expected
- The function call pays a heavy penalty on its first call. It is only called by the Km channel code in this program, so that time represents the first load of the code into cache
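
The relation between cycle counts and elapsed time is simple arithmetic: one cycle at f MHz lasts 1000 / f nanoseconds. A small sketch follows; cycles_to_ns is a hypothetical helper, not part of QQ, and the cycle count used in the note after it is illustrative rather than a measured result.

```c
#include <assert.h>

/* Convert a cycle count to nanoseconds for a given clock rate in MHz.
   One cycle at f MHz lasts 1000 / f nanoseconds. */
double cycles_to_ns(double cycles, double clock_mhz)
{
    assert(clock_mhz > 0.0);
    return cycles * 1000.0 / clock_mhz;
}
```

For example, 4400 cycles would take 5500 ns on an 800 MHz processor and 2000 ns on a 2200 MHz processor, the two clock rates used in the measurements.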
18 Timing Measurements
PIII 800 MHz
19 Timing Measurements
P4 2200 MHz
20 Expanding Timing Information
- QQ allows the user to record one additional item of information with the normal timing
- QQCount records an integer with the key
    QQCount( eventKey, integer_of_interest )
- QQValue records a double-precision floating-point value with the key
    QQValue( eventKey, double_of_interest )
- QQState records a state of on or off with the key
    QQStateOn( eventKey )
    QQStateOff( eventKey )
- These will be described during the discussion of the output format
21 QQ Memory
- Records the memory allocated by a specific code block, rather than the total allocation due to code and library calls, to single-byte accuracy
22 QQ Memory Example
- NCS implementation of ion channels
- Suppose we want to know the total memory used by all channels. Each channel function would require the channel key

    #define MEM_KEY KEY_CHANNEL

- Then, at any point in the program's execution, just call the MemPrint function to display memory use
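
The signature and output of MemPrint are not given here, so the sketch below uses hypothetical names (mem_charge, bytes_to_kb, mem_report_line) to show the idea of per-key tallies and a one-line report. Rounding up to whole kilobytes of 1024 bytes is an assumption; it appears consistent with the sample output on the next slide, where 300 Channel objects of 252 bytes are reported as 74 KB.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

enum { KEY_BRAIN, KEY_CHANNEL, KEY_SYNAPSE, KEY_COUNT };   /* hypothetical keys */

const char *key_names[KEY_COUNT] = { "Brain", "Channel", "Synapse" };
size_t key_bytes[KEY_COUNT];   /* running byte total per key */

/* Charge an allocation of `size` bytes to `key`, as an intercepted
   malloc would do before calling the real allocator. */
void mem_charge(int key, size_t size)
{
    assert(key >= 0 && key < KEY_COUNT);
    key_bytes[key] += size;
}

/* Round bytes up to whole KB, treating 1 KB as 1024 bytes (assumed). */
size_t bytes_to_kb(size_t bytes)
{
    return (bytes + 1023) / 1024;
}

/* Format one report line for `key` into `buf`; returns the snprintf result. */
int mem_report_line(int key, char *buf, size_t len)
{
    assert(key >= 0 && key < KEY_COUNT);
    return snprintf(buf, len, "%-12s %6zu KB",
                    key_names[key], bytes_to_kb(key_bytes[key]));
}
```

Charging 300 allocations of 252 bytes to KEY_CHANNEL gives 75600 bytes, which this sketch reports as 74 KB.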
23 Memory Usage Output
Memory Allocation: Total Allocated 988 KBytes

              Object  Number   Number   Object  Alloc  Total  Max
Item          Size    Created  Deleted  KB      KB     KB     KB
Brain         120     1        0        1       0      1      1
CellManager   44      1        0        1       1      1      1
Cell          16      100      0        2       0      2      2
Channel       252     300      0        74      0      74     74
Compartment   324     100      0        32      2      33     33
MessageMgr    16      1        0        1       205    205    205
MessageBus    0       0        0        0       1      1      1
Report        80      1        0        1       1      1      1
Stimulus      252     1        0        1       1      1      1
Synapse       44      10000    0        430     118    547    547
-----------------------------------------------------------------
1             2       3        4        5       6      7      8

Key
- 1: Internal name given to recording category
24 QQ Applications
- Brain Communication Server (BCS)
- NCS
25 Brain Communication Server
- Further experimentation with the simulator required that another application be developed to coordinate communication between NCS and numerous potential clients
  - virtual creatures
  - physical robots
  - visualization tools
(Diagram labels: NCS, BCS)
26 Optimizing BCS
Different applications make non-sequential requests. No single function was called in a loop iterating several times, so time needed to be measured over the course of execution, followed by an analysis of QQ's final output.
27 Parsing QQ's Output
- QQ uses a straightforward layout for the final output file
- The data can be easily extracted and displayed in a text report, as shown on the previous slide, or sent to a graphical display
- The following slides describe the output format and how to manage the information
28 QQ File Format
29 QQ Format: Data Close-Up
Node 0 Byte offset
Node 1 Byte offset
Node 2 Byte offset
Optional Info is the size of a double, but contains a State (int), a Count (int), or a Value (double)
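
A slot that is the size of a double but holds an int or a double maps naturally onto a C union. The sketch below is an assumption about how such an entry could be represented in memory. The type names, field widths, and field order are illustrative and are not the documented file format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical entry types; the slides name State, Count and Value. */
typedef enum { QQ_TYPE_STATE, QQ_TYPE_COUNT, QQ_TYPE_VALUE } QQType;

/* The optional-info slot: one double-sized block, read according to the
   type that the key table assigns to the entry's key. */
typedef union {
    int    state;   /* 1 = on, 0 = off */
    int    count;
    double value;
} QQInfo;

/* Assumed entry layout: key, optional info, event time. */
typedef struct {
    int      key;
    QQInfo   info;
    uint64_t time;
} QQEntry;

/* Return the optional info as a double, whatever its stored type. */
double qq_info_as_double(const QQEntry *e, QQType type)
{
    assert(e != NULL);
    switch (type) {
    case QQ_TYPE_STATE: return (double)e->info.state;
    case QQ_TYPE_COUNT: return (double)e->info.count;
    case QQ_TYPE_VALUE: return e->info.value;
    }
    return 0.0;
}
```

Because the union is as large as its largest member, every entry has the same size regardless of which kind of information it carries, which keeps the data section easy to index.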
30 Gathering the Results
- After reading a node's data section, entries with the same key can be gathered
- Using the key table, the user knows what is contained in the second block of a timing entry

Key  Second block  Event time
2    1             109342759
2    0             109342768

Example: Key 2 has type State. The second block contains integer 1 for on or integer 0 for off. Subtracting the event times gives the length of time spent in the on state.
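
The gathering step for State entries can be sketched as follows. StateEntry and qq_time_on are hypothetical names, and the entries are assumed to be already parsed and in time order. For the two entries in this example, the result is 109342768 - 109342759 = 9 ticks in the on state.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed in-memory form of a State entry after parsing: key, on/off flag,
   event time in ticks. */
typedef struct {
    int      key;
    int      state;   /* 1 = on, 0 = off */
    uint64_t time;
} StateEntry;

/* Sum the ticks that `key` spent in the on state. Entries must be in time
   order; an "on" with no matching "off" is ignored. */
uint64_t qq_time_on(const StateEntry *entries, size_t n, int key)
{
    assert(entries != NULL || n == 0);
    uint64_t total = 0, started = 0;
    int is_on = 0;
    for (size_t i = 0; i < n; i++) {
        if (entries[i].key != key)
            continue;                      /* gather only the requested key */
        if (entries[i].state == 1 && !is_on) {
            started = entries[i].time;
            is_on = 1;
        } else if (entries[i].state == 0 && is_on) {
            total += entries[i].time - started;
            is_on = 0;
        }
    }
    return total;
}
```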
31 Another Example

Key  Second block  Event time
4    -65.3477      109342735
4    -58.2367      109342819

Example: Key 4 has type Value. The second block contains a double-precision value passed in during execution. The value can be saved and displayed with the timing information, or sent to a separate graph. Timing is obtained the same way as before, by subtracting the event times.
32 NCS Performance Measurement
- QQ was able to home in on specific blocks of code and measure them at a resolution fine enough for easy interpretation
33 Optimization Targets
- QQ analysis quickly identified two major targets within the code
  - Synapses
  - Message passing
34 Synapses
- Synapses were by far the most common element of any NCS model, with the most memory usage
- Active only when an action potential is processed through the synapse
- Pass information between the nodes via message passing
35 Message Passing Overhead
- Using QQ, we were able to identify areas for improvement within NCS 3
- Many unneeded fields, requiring better encoding of their destination
- Fixed number of messages pre-allocated, far more than needed by the program
- Implemented a shared pool, with buffers allocated as needed
- Messages sent individually and processed multiple times
- Implemented a packet scheme: process a packet once for send, once for receive
- Process messages only when used
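
The shared pool with buffers allocated as needed can be illustrated with a simple free list. This is a single-threaded sketch under assumptions, not the NCS implementation. MsgBuf, MsgPool, the 256-byte payload, and the pool functions are all hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical message buffer; the real NCS message layout is not shown. */
typedef struct MsgBuf {
    struct MsgBuf *next;        /* link used only while on the free list */
    char           payload[256];
} MsgBuf;

typedef struct {
    MsgBuf *free_list;          /* buffers ready for reuse */
    size_t  allocated;          /* buffers ever obtained from malloc */
} MsgPool;

/* Get a buffer: reuse a returned one if possible, otherwise allocate. */
MsgBuf *pool_get(MsgPool *pool)
{
    assert(pool != NULL);
    MsgBuf *b = pool->free_list;
    if (b != NULL) {
        pool->free_list = b->next;
    } else {
        b = malloc(sizeof *b);
        if (b != NULL)
            pool->allocated++;
    }
    if (b != NULL)
        b->next = NULL;
    return b;
}

/* Return a buffer to the pool for later reuse. */
void pool_put(MsgPool *pool, MsgBuf *b)
{
    assert(pool != NULL && b != NULL);
    b->next = pool->free_list;
    pool->free_list = b;
}

/* Release every buffer currently held on the free list. */
void pool_destroy(MsgPool *pool)
{
    assert(pool != NULL);
    while (pool->free_list != NULL) {
        MsgBuf *b = pool->free_list;
        pool->free_list = b->next;
        free(b);
    }
}
```

Compared with a fixed pre-allocation, memory use here follows the peak number of messages actually in flight, since a buffer is only created when the free list is empty.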
36 Conclusions
- QQ allows profiling of nanoscale timing of code segments and memory usage analysis
- Fine-grained measurements of specific events
- Ability to measure memory at an object or event level with a small memory and performance footprint
- Simple and effective tool
37 Future Work
- New Opteron cluster
- BlueGene migration (how many processors?)
- Robotic integration
38 Q & A