QQ: Nanoscale Timing and Profiling (James Frye)

1
QQ: Nanoscale Timing and Profiling
James Frye, James G. King, Christine J. Wilson, Frederick C. Harris, Jr.
Department of Computer Science and Engineering, Brain Computation Lab, Biomedical Engineering
University of Nevada, Reno, NV 89557

4th International Workshop on Performance Modeling, Evaluation, and Optimization of Parallel and Distributed Systems (PMEO-PDS05)
2005 IPDPS Conference: 19th IEEE International Parallel and Distributed Processing Symposium
2
What is QQ
  • QQ is a simple and efficient tool for measuring
    timing and memory use
  • Developed for the examination of a massively
    parallel program
  • Easily extensible to inspect other programs

3
QQ Development
  • QQ was developed to optimize a parallel program
    used to simulate cortical neurons: the NeoCortical
    Simulator (NCS)
  • Our goal for the summer of 2002 was to simulate
    10^6 neurons with 10^9 synapses within a realistic
    run time
  • Before optimization, NCS would run about 1.5
    million synapses at a rate of 1 day per simulated
    second of synaptic activity
  • Clearly optimization of NCS was needed

4
NeoCortical Simulator (NCS)
  • Originated in the Brain Computation Lab led by
    Dr. Phil Goodman
  • Incorporates membrane dynamics
  • Utilizes simulated ion channels to modulate the
    membrane voltage changes (when applied)
  • Compartment based simulator
  • Allows for channel dynamics to drive the membrane
    voltage

5
NCS Biology
  • Neuron: a brain cell and the basic unit or
    compartment
  • Synapse: the region of communication between
    compartments
  • Channel: openings in the cellular membrane that
    allow the passage of various ions to induce a
    voltage gradient across the membrane
  • Action Potential: an electrical signal that
    translates to a chemical signal to the
    post-synaptic cell

6
Neurons
7
NCS Biology
  • The membrane voltage determines the cell's firing
    rate
  • Once the threshold voltage is reached, the cell sends
    an action potential to its connected synapses

8
2-Cell Model
9
No Channels
Sustained firing at maximum rate during a
continuous stimulus
10
Ka Channel
Slows the initial response during a sustained
stimulus
11
Km Channel
Prevents continuous bursting during a continuous
stimulus
12
Kahp Channel
Dampens the effect while still allowing for some
action potentials during a sustained stimulus
13
QQ Design
  • QQ is designed so that all of its routines can be
    selectively compiled into a program
  • In the QQ.h header file, each routine is defined
    with a preprocessor directive, so that if
    profiling is not enabled, it reduces to an empty
    statement.
  • #ifdef QQ_ENABLE
  • void QQInit (int);
  • #else
  • #define QQInit(dummy)
  • #endif
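
The conditional-compilation pattern described above can be sketched as a small, compilable example. This is a minimal illustration, not QQ's actual header: only `QQ_ENABLE` and `QQInit` come from the slides, while the call counter and `start_node` are hypothetical names added so the effect is observable. The disabled macro here expands to `((void)0)` rather than to nothing, a common variation that keeps the call valid in expression contexts.

```c
#include <stdio.h>

/* Counts real QQInit calls so the effect of the switch is observable.
   This counter is illustrative and not part of QQ. */
static int qq_init_calls = 0;

#ifdef QQ_ENABLE
/* Profiling build: QQInit is a real function. */
void QQInit(int node)
{
    qq_init_calls++;
    printf("QQ profiling enabled on node %d\n", node);
}
#else
/* Normal build: the call site expands to an empty expression statement,
   and its argument is never evaluated. */
#define QQInit(dummy) ((void)0)
#endif

/* Hypothetical caller: the same source line compiles in both builds. */
int start_node(void)
{
    QQInit(0);
    return qq_init_calls;
}
```

Compiled without `-DQQ_ENABLE`, `start_node` returns 0 because the call disappears at preprocessing time; compiled with it, the function is called and the counter increments.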

14
QQ Design
  • Memory profiling routines also use the C
    preprocessor to intercept library calls
  • #ifdef QQ_ENABLE
  • #define malloc(arg) MemMalloc (MEM_KEY, arg)
  • #endif
  • The MemMalloc function records allocation
    information, calls the malloc function to do the
    actual allocation, and returns the result to the
    caller
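
A minimal sketch of this interception follows, assuming a simple per-key tally. `MemMalloc` and `MEM_KEY` are named in the slides; the key enum, the tally arrays, and `make_channel_state` are hypothetical, and profiling is always on here for brevity. Note that redefining a standard library name as a macro is widely done for this purpose but is formally outside what the C standard guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical recording categories. */
enum { KEY_CHANNEL, KEY_SYNAPSE, KEY_COUNT };

static size_t mem_bytes[KEY_COUNT];  /* bytes requested per key   */
static size_t mem_count[KEY_COUNT];  /* allocations made per key  */

/* Defined before the macro below, so the malloc call inside
   still reaches the real library function. */
void *MemMalloc(int key, size_t size)
{
    void *p;
    assert(key >= 0 && key < KEY_COUNT);
    p = malloc(size);
    if (p != NULL) {              /* record only successful allocations */
        mem_bytes[key] += size;
        mem_count[key]++;
    }
    return p;
}

/* From here on, every malloc call in this file is charged to MEM_KEY. */
#define MEM_KEY KEY_CHANNEL
#define malloc(arg) MemMalloc(MEM_KEY, arg)

/* Hypothetical channel code: unchanged source, now profiled. */
double *make_channel_state(size_t n)
{
    return malloc(n * sizeof(double));
}
```

The calling code does not change at all; only the preprocessor decides whether the allocation is recorded.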

15
QQ Timing
  • Extremely accurate measurement of execution
    speed.
  • In theory fine-grained resolution to a single
    clock cycle.
  • In practice, measurements are accurate to tens of
    cycles

16
Timing Measurements
  • Measuring the impact of a line change in the
    calculation for the Km channel
  • From:
  • I = unitaryG * strength * pow (m, mPower) * (ReversePot - CmpV);
  • To:
  • I = unitaryG * strength * (ReversePot - CmpV);
  • For the Km-type channel, mPower is always 1, so we were
    able to change the equation to streamline the
    execution
  • Wrapping the line in calls to QQ, we measure the
    effect of this single change:
  • QQStateOn (QQ_Km);
  • I = unitaryG * strength * (ReversePot - CmpV);
  • QQStateOff (QQ_Km);
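
The wrapping technique can be sketched as follows. This is an illustration under stated assumptions, not QQ's implementation: the event log, its size, the key value, and `km_current` are hypothetical, and the timestamp comes from the C11 `timespec_get` call as a portable stand-in for a CPU cycle counter, so its overhead is far larger than the tens of cycles quoted for QQ. The expression assumes the usual multiply and subtract operators between the terms shown on the slide.

```c
#include <stddef.h>
#include <stdint.h>
#include <time.h>

#define QQ_MAX_EVENTS 1024
enum { QQ_Km = 2 };                 /* hypothetical key value */

typedef struct {
    int      key;
    int      state;                 /* 1 = on, 0 = off */
    uint64_t time;
} QQEvent;

static QQEvent qq_log[QQ_MAX_EVENTS];
static size_t  qq_used = 0;

/* Stand-in clock: nanoseconds from timespec_get (assumption; a real
   cycle-accurate tool would read a hardware counter instead). */
static uint64_t qq_ticks(void)
{
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return (uint64_t)ts.tv_sec * 1000000000u + (uint64_t)ts.tv_nsec;
}

/* Append one event; silently drop it if the log is full. */
static void qq_record(int key, int state)
{
    if (qq_used < QQ_MAX_EVENTS) {
        qq_log[qq_used].key = key;
        qq_log[qq_used].state = state;
        qq_log[qq_used].time = qq_ticks();
        qq_used++;
    }
}

void QQStateOn(int key)  { qq_record(key, 1); }
void QQStateOff(int key) { qq_record(key, 0); }

/* The measured line, bracketed by the two state calls. */
double km_current(double unitaryG, double strength,
                  double ReversePot, double CmpV)
{
    double I;
    QQStateOn(QQ_Km);
    I = unitaryG * strength * (ReversePot - CmpV);
    QQStateOff(QQ_Km);
    return I;
}
```

Each call to `km_current` leaves an on/off pair in the log; the difference between the two timestamps is the cost of the bracketed line plus the recording overhead.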

17
Timing Measurements
  • Note that both code versions give similar cycle
    counts on different processors, though more
    consistent and somewhat fewer on P4 than P3.
  • Times for similar counts are proportional to
    processor speed, as expected.
  • The function call pays a heavy penalty on its first
    call. In this program it is called only by the Km channel
    code, so that time represents the first load of the
    code into cache

18
Timing Measurements
PIII 800 MHz
19
Timing Measurements
P4 2200 MHz
20
Expanding Timing Information
  • QQ allows the user to record an additional item
    of information with the normal timing.
  • QQCount records an integer with the key
  • QQCount( eventKey, integer_of_interest )
  • QQValue records a double precision floating point
    value with the key
  • QQValue( eventKey, double_of_interest )
  • QQState records a state of on or off with the key
  • QQStateOn( eventKey ) QQStateOff( eventKey )
  • These will be described during discussion of the
    output format

21
QQ Memory
  • Records memory allocation dedicated to the
    code-block, rather than the total allocation due
    to code and library calls, to single-byte
    accuracy

22
QQ Memory Example
  • NCS implementation of ion channels
  • Suppose we want to know the total memory used by
    all channels. Each channel function would
    require the channel key:
  • #define MEM_KEY KEY_CHANNEL
  • Then at any point in the program execution, just
    call the MemPrint function to display memory use

23
Memory Usage Output
  • Memory Allocation: Total Allocated 988 KBytes

                 Object  Number   Number   Object  Alloc  Total  Max
    Item         Size    Created  Deleted  KB      KB     KB     KB
    Brain        120     1        0        1       0      1      1
    CellManager  44      1        0        1       1      1      1
    Cell         16      100      0        2       0      2      2
    Channel      252     300      0        74      0      74     74
    Compartment  324     100      0        32      2      33     33
    MessageMgr   16      1        0        1       205    205    205
    MessageBus   0       0        0        0       1      1      1
    Report       80      1        0        1       1      1      1
    Stimulus     252     1        0        1       1      1      1
    Synapse      44      10000    0        430     118    547    547
    -----------------------------------------------------------------
    1            2       3        4        5       6      7      8

  • Key
  • 1 - Internal name given to recording category
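
As a worked check on the table, the first KB column (object storage) is consistent with object size times number created, rounded up to whole kilobytes of 1024 bytes. That relationship is inferred from the numbers shown, not stated in the slides, and the function names below are hypothetical.

```c
#include <stddef.h>

/* Round a byte count up to whole kilobytes (1 KB = 1024 bytes). */
size_t bytes_to_kb(size_t bytes)
{
    return (bytes + 1023) / 1024;
}

/* Object storage for one table row: size in bytes times objects created.
   Assumes no objects were deleted, as in every row of the table. */
size_t object_kb(size_t object_size, size_t created)
{
    return bytes_to_kb(object_size * created);
}
```

For the Channel row, 252 bytes times 300 objects is 75600 bytes, which rounds up to 74 KB; for the Synapse row, 44 bytes times 10000 objects is 440000 bytes, which rounds up to 430 KB.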

24
QQ Applications
  • Brain Communication Server (BCS)
  • NCS

25
Brain Communication Server
  • Further experimentation with the simulator
    required another application be developed to
    coordinate communication between NCS and numerous
    potential clients
  • virtual creatures
  • physical robots
  • visualization tools

[Figure: NCS and BCS]
26
Optimizing BCS
Different applications make non-sequential
requests. No single function was called in a loop
iterating several times, so time needed to be
measured over the course of execution, followed by
an analysis of QQ's final output.
27
Parsing QQ's output
  • QQ uses a straightforward layout for the final
    output file
  • The data can be easily extracted and displayed in
    a text report as shown on the previous slide or
    sent to a graphical display
  • The following slides describe the output format
    and how to manage the information

28
QQ file format
29
QQ Format Data Close Up
Node 0 Byte offset
Node 1 Byte offset
Node 2 Byte offset
Where Optional Info is the size of a double, but
contains a State (int), a Count (int), or a Value
(double)
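
One way to picture an entry with a double-sized optional field is a C union, sketched below. The slides state only that the optional info is the size of a double and holds a State, a Count, or a Value; the type names, the field widths for key and time, and the helper functions are assumptions for illustration, and the real on-disk layout may differ.

```c
#include <assert.h>
#include <stdint.h>

/* Optional info: one slot, interpreted according to the key's type. */
typedef union {
    int    state;    /* 1 = on, 0 = off   */
    int    count;    /* integer of interest */
    double value;    /* double of interest  */
} QQInfo;

/* The slot must occupy exactly the size of a double. */
static_assert(sizeof(QQInfo) == sizeof(double),
              "optional info must be the size of a double");

/* Hypothetical entry: key, optional info (the "second block"), time. */
typedef struct {
    int      key;
    QQInfo   info;
    uint64_t time;
} QQEntry;

QQEntry qq_state_entry(int key, int on, uint64_t time)
{
    QQEntry e;
    e.key = key;
    e.info.state = on ? 1 : 0;
    e.time = time;
    return e;
}

QQEntry qq_value_entry(int key, double value, uint64_t time)
{
    QQEntry e;
    e.key = key;
    e.info.value = value;
    e.time = time;
    return e;
}
```

A reader of the file cannot tell from the entry alone which member of the union is valid, which is why a key table that maps each key to its type is needed.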
30
Gathering the Results
  • After reading a node's data section, entries with
    the same key can be gathered.
  • Using the key table, the user knows what is
    contained in the second block of a timing entry

Key   Info   Time
2     1      109342759
2     0      109342768
Example: Key 2 has type State. The second
block contains integer 1 for on or integer 0
for off. By subtracting the event times, the
length of time spent in the on state is
determined (here, 109342768 - 109342759 = 9).
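
The gathering step for State keys can be sketched as a single pass over the entries. This is a minimal illustration assuming the entries are already decoded and in time order; the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decoded State entry: key, on/off flag, event time. */
typedef struct {
    int      key;
    int      state;   /* 1 = on, 0 = off */
    uint64_t time;
} StateEntry;

/* Sum the time one key spent in the on state.
   Entries for other keys are skipped; an "on" without a later
   "off" contributes nothing. */
uint64_t qq_time_on(const StateEntry *entries, size_t n, int key)
{
    uint64_t total = 0;
    uint64_t on_since = 0;
    int is_on = 0;
    size_t i;

    assert(entries != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (entries[i].key != key)
            continue;
        if (entries[i].state == 1 && !is_on) {
            on_since = entries[i].time;
            is_on = 1;
        } else if (entries[i].state == 0 && is_on) {
            total += entries[i].time - on_since;
            is_on = 0;
        }
    }
    return total;
}
```

Applied to the two key 2 entries from the slide, the function returns 109342768 - 109342759 = 9.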
31
Another example
Key   Info       Time
4     -65.3477   109342735
4     -58.2367   109342819
Example: Key 4 has type Value. The second
block contains a double precision value passed in
during execution. The value can be saved and
displayed with timing information, or sent to
a separate graph. Timing is obtained the same as
before, by subtracting the event times
(here, 109342819 - 109342735 = 84).
32
NCS Performance Measurement
  • QQ was able to home in on specific blocks of code
    and measure them at a resolution fine enough
    for easy interpretation

33
Optimization Targets
  • QQ analysis quickly identified two major targets
    within the code
  • Synapses
  • Message Passing

34
Synapses
  • Synapses were by far the most common element of
    any NCS model with the most memory usage
  • Active only when an action potential was
    processed through the synapse
  • Pass information between the nodes via message
    passing

35
Message Parsing Overhead
  • Using QQ we were able to identify areas for
    improvement within NCS 3
  • Many unneeded fields requiring better encoding of
    their destination
  • Fixed number of messages pre-allocated, far more
    than needed by the program
  • Implemented a shared pool, buffers allocated as
    needed
  • Messages sent individually, processed multiple
    times
  • Implemented a packet scheme: process each packet once
    for send, once for receive
  • Process messages only when used

36
Conclusions
  • QQ allows profiling of nanoscale timing of code
    segments and memory usage analysis
  • Fine grained measurements of specific events
  • Ability to measure memory at an object or event
    level with a small memory and performance
    footprint
  • Simple and effective tool

37
Future Work
  • New Opteron cluster
  • BlueGene migration (how many processors?)
  • Robotic integration

38
Q & A