Title: APC523/AST523 Scientific Computation in Astrophysics
APC523/AST523 Scientific Computation in Astrophysics
- Lecture 4
- Programming for scientific computation
Topics covered today
- Computer Languages
- Good Programming Style
- Software Engineering for Grad Students
- Debugging
- Optimization
1. Computer Languages
- Primitive Languages
- Compiled (normal) Languages
- Interpreted (scripting) Languages
Primitive Languages
- E.g. machine code, assembler
- Forth, PostScript
- Require explicit instructions about how to do everything
- Extremely powerful, in the right hands
- Very tedious to use
C
    int main(void)
    {
        float x = 1;
        float y = 10;
        float z = x + y;
        return 0;
    }

N.b. an optimising compiler will ignore all of the floating point code (as it isn't actually used) and simply return 0.
PPC Assembler
Machine code:
    0x2d08:  0x3c003f80        N.b. 0x3f80 = 16256
             0x901e0020
             0x3c004120        N.b. 0x4120 = 16672
             0x901e001c
    0x2d18:  0xc1be0020
             0xc01e001c
             0xec0d002a
             0xd01e0018
Disassembled:
    0x2d08:  lis   r0,16256        r0 = 16256 << 16
             stw   r0,32(r30)      *(r30 + 32) = r0
             lis   r0,16672        r0 = 16672 << 16
             stw   r0,28(r30)      *(r30 + 28) = r0
    0x2d18:  lfs   f13,32(r30)     f13 = *(r30 + 32)
             lfs   f0,28(r30)      f0 = *(r30 + 28)
             fadds f0,f13,f0       f0 = f13 + f0
             stfs  f0,24(r30)      *(r30 + 24) = f0
Forth/PostScript
N.b. No concept of variable or datatype
Compiled (Normal) languages
- Fortran (IV, 77, 90, 95)
- C (K&R, 89, 99)
- C++
- Java (sort of)
- Ada
- Lisp
- Modula (2, 3, Oberon)
- Pascal
Characteristics of Compiled Languages
- Good performance
- Longish Edit-Compile-Link-Run cycle
- Types defined (and checked) at compile time
- Usually poor intrinsic support for datatypes
beyond arrays
Interpreted (Scripting) languages
- perl (4, 5, 6?)
- python
- Tcl (7, 8)
- IDL
- IRAF cl
- MATLAB
- lua
- ruby
- smalltalk
Characteristics of Interpreted Languages
- Poor/bad performance
- Short Edit-Run cycle
- Types often defined dynamically
- Good intrinsic support for datatypes
  - Arrays, dictionaries, lists
- Extensive libraries
  - FITS, SQL, XML
  - Graphics (FITS viewers, Tk)
Scripting + Compiled languages
- Combine the best features of scripting and compiled languages
  - Write the compute-intensive parts in e.g. C
  - Write the rest in e.g. Python
- Usually implemented via dynamically loaded (shared) libraries
- The interfaces can usually be machine-generated using a scripting language, or SWIG (or emacs)
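As a rough illustration of calling compiled code through a dynamic library, the sketch below uses Python's standard ctypes module to call cos() from the C maths library. The helper names load_libm and c_cos are invented for this example, and the fallback to Python's own math.cos when no library can be found is an assumption made only to keep the sketch usable everywhere. A real project would more likely generate its interfaces with a tool such as SWIG, as noted above.

```python
import ctypes
import ctypes.util
import math


def load_libm():
    """Try to load the C maths library as a dynamic library.

    Returns None when no such library can be located (e.g. on Windows).
    """
    name = ctypes.util.find_library("m")
    if name is None:
        return None
    try:
        return ctypes.CDLL(name)
    except OSError:
        return None


def c_cos(x):
    """Evaluate cos(x) with the compiled C routine when it is available."""
    libm = load_libm()
    if libm is None:
        # Assumption: fall back to Python's math module if libm is not found
        return math.cos(x)
    # Declare the C signature: double cos(double)
    libm.cos.restype = ctypes.c_double
    libm.cos.argtypes = [ctypes.c_double]
    return libm.cos(x)


print(c_cos(0.0))
```

The type declarations on restype and argtypes are the hand-written equivalent of what an interface generator produces automatically.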
Object-Oriented Design
Once we move beyond Forth and Fortran 77, we naturally start packaging data into containers. Rather than

    #define NPT 1000
    float x[NPT], y[NPT], z[NPT], vx[NPT], vy[NPT], vz[NPT];

we write

    typedef struct {
        float x, y, z;
        float vx, vy, vz;
    } POWER_POINT;
    POWER_POINT pts[NPT];

(Actually, we'd probably write something like

    POWER_POINT *pts = malloc(n*sizeof(POWER_POINT));
    assert (pts != NULL);
)
Is this Object-Oriented Design?
Object-Oriented Design
- Associate code (methods) with data
- Make usage of objects independent of (current) implementation
- Design in terms of networks of objects, rather than sets of function calls
- Support polymorphism where possible
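A minimal sketch of these points, assuming two made-up aperture classes (CircularAperture and BoxAperture are illustrative names, not part of any course code). Each class carries its own area() method, and total_area() works with either without knowing how they are implemented.

```python
import math


class CircularAperture:
    """Hypothetical aperture described by a radius (pixels)."""

    def __init__(self, radius):
        self.radius = radius

    def area(self):
        return math.pi * self.radius ** 2


class BoxAperture:
    """Hypothetical aperture described by a width and height (pixels)."""

    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height


def total_area(apertures):
    """Works for any object with an area() method (polymorphism)."""
    return sum(aperture.area() for aperture in apertures)


print(total_area([BoxAperture(2, 3), CircularAperture(1.0)]))
```

Adding a third aperture type later would not require any change to total_area(), which is the point of making usage independent of implementation.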
Python
    class actAtmosModel(object):
        """Describe an atmospheric model"""
        def __init__(self):
            self.skyOpticalDepth_dry = 3*actUtils.ACT_NO_VALUE
            self.CT2 = 3*actUtils.ACT_NO_VALUE

    def make_atmos(atmosFile, band, hard, alt0, az0, rand):
        """Make a model of the atmosphere, and project it onto the sky"""
        atModel = allocAndInitialize("atmos", actAtmosModel())
        if atmosFile[0] == ">":
            atmosFile = atmosFile[1:]
        size = 100                      # size of atmospheric patch, (m)
        pixscale = 0.1                  # size of pixels in simulation (m)
        npixel = int(size/pixscale + 0.5)
        at = actCalculateAtmosphere(npixel, pixscale, atModel.L_outer, rand)
C + SWIG
actImage.h
actUtils.i
Which language should I choose?
Write the compute-intensive parts in a compiled language (C, C++, F90). For interactive code, it's best to use a scripting language; such languages have a short debug-rerun cycle. It is possible to mix languages.
We'll discuss later how to know what should be in C/Fortran, and what can be in any language you like.
2. Elements of Programming Style
- Write code that expresses the algorithm
- Design datatypes that either map naturally to the problem, or to the algorithm, or to both
- Try to separate the program into as many self-contained parts as possible
- Don't over-abstract the problem
- Never believe that it's just a quick hack
Elements of style (IIa)
- Always declare variables, and add a comment if they're non-trivial. If the language permits, initialise the variables where you declare them
- Separate code from interfaces
  - Put all prototypes/typedefs in .h files
  - Always use module definitions
- Document your APIs (function signatures)
    /*
     * Return the detected power for a given pointing. The input model of the sky is
     * assumed to have been convolved with the beam
     */
    int
    actGetSample(actFilter band,                        // the band of interest
                 const actTelstate *restrict pointing,  // where are we pointing?
                 const actHardware *restrict hardware,  // the PSFs, scales, array geometries etc.
                 const actSky *restrict sky,            // sky model, convolved with the beam
                 const actAtmos *restrict atmos,        // atmosphere model
                 actArrayNoiseModel *restrict nm,       // model of noise to add (or NULL)
                 actRandom *restrict rand)              // a source of entropy (NULL if nm is NULL)
    {
        assert (band >= 0 && band < ACT_NBAND);
        assert (pointing != NULL);
        assert (hardware != NULL);
        const actArrayGeom *arr = hardware->arrayGeoms[band];
        const int nrow = arr->nrow;
        const int ncol = arr->ncol;
        /*
         * Image the sky with the array
         */
        double alt = 0, az = 0;         // alt, az of a pixel in the array
        assert (sky != NULL && sky->values != NULL && sky->beam != NULL && sky->beam->ncomp > 0);
        const actWCS *wcs = sky->values[0][0]->wcs;
        assert (wcs != NULL && wcs->type == ACT_COE && wcs->unit == ACT_RADEC);
        const float C = sqrt(atmos->CP2[band]); // atmos->values assumes that CT2 is 1 mK^2 m^(-5/3)
Elements of style (IIb)
- Write comments as you go, documenting what the code is supposed to do

    /*
     * Find which pixel that peak lies in, so (40.6, 40.5) --> (40, 40)
     * (adding 0.5 and truncating is the wrong thing to do)
     */
    rowc = objc->color[c]->rowc;
    colc = objc->color[c]->colc;

    if(objc->color[c]->flags & OBJECT1_DEBLENDED_AS_PSF) {
        int rad;                        // radius to mask
        i = rowc - rmin;
        if(rowc < rmin) {               // something's rotten in the state of the astrometry
            i = 0;
        }
        row = sym->ROWS[i];
        rad = 0.5*((rsize < csize) ? rsize : csize);

    float x;                            // a real variable called x
    y = my_function(x);                 // call my_function with argument x
    j += 10;                            // add 10 to j
Elements of style (IIc)
- Good variable/function names are more useful than comments.
- Protect your namespace
- Use consistent formatting (whitespace)
- It doesn't matter which editor you use, providing
  - It supports syntax colouring
  - It supports proper code indentation
3. Software engineering
- We expect you to demonstrate knowledge of just two tools
  - make
  - cvs (or svn)
Make
- You've probably typed
      cc foo.c
  or
      f77 foo.f
  and been surprised to see a file named a.out.
- So you wrote a shell script
      cc -o foo foo.c
  or
      #!/bin/sh
      cc -o $1 $1.c
- Then, after a while, you have make_foo
      #!/bin/sh
      cc -c -g -O2 foo.c
      cc -c -g -O2 goo.c
      cc -o foo foo.o goo.o -lm
  Used as
      /bin/rm -f *.o; make_foo
- We expect you to use make instead. Write a Makefile that looks something like
      .c.o :
              $(CC) -c $(CFLAGS) $*.c
      CC = cc
      CFLAGS = -g -O2
      LIBS = -lm
      OBJS = foo.o goo.o
      foo : $(OBJS)
              $(CC) -o foo $(OBJS) $(LIBS)
More Makefile Boilerplate
    PROGS = foo
    #
    # Update Makefile dependencies
    #
    -include .Makefile.depend
    .PHONY : depend
    depend :
            @echo Rebuilding make dependencies
            $(CC) $(CFLAGS) -MM $(OBJS:.o=.c) > .Makefile.depend
    #
    clean :
            $(RM) $(PROGS) *.o core
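What make adds over the shell script is a timestamp test: a target is rebuilt only if it is missing or older than one of its prerequisites. The sketch below illustrates that rule in Python. The function names needs_rebuild and demo are invented here, and this is a simplified picture rather than a description of how make is actually implemented.

```python
import os
import tempfile


def needs_rebuild(target, prerequisites):
    """Return True if target is missing or older than any prerequisite."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(p) > target_mtime for p in prerequisites)


def demo():
    """Create foo.c and foo.o in a scratch directory with chosen timestamps."""
    with tempfile.TemporaryDirectory() as scratch:
        source = os.path.join(scratch, "foo.c")
        obj = os.path.join(scratch, "foo.o")
        open(source, "w").close()
        missing = needs_rebuild(obj, [source])   # foo.o does not exist yet
        open(obj, "w").close()
        os.utime(source, (1000, 1000))           # source older than object
        os.utime(obj, (2000, 2000))
        fresh = needs_rebuild(obj, [source])
        os.utime(source, (3000, 3000))           # source edited after the build
        stale = needs_rebuild(obj, [source])
    return missing, fresh, stale


print(demo())
```

The dependency file written by the depend target supplies the prerequisite lists (including header files) that this test is applied to.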
Source Code Managers
- There are three popular unix source code managers
  - cp/rsync
  - cvs
  - svn
- We expect that you'll use cvs or svn
Naïve Source Code Managers
- People have various strategies to avoid catastrophes when working on code
  - Pray, and rely on system backups
  - Rsync copies of your code to geographically distributed localities
  - Make snapshots every day/week/month/year and apply one of the two previous methods.
The latter helps with "I didn't change anything, but my code stopped working"
CVS
- rsync users type
      rsync -r musebackups
- cvs users type
      cvs ci
In the interests of full disclosure, once upon a time they also had to type
      export CVS_RSH=ssh
      export CVSROOT=jeeves.astro.princeton.edu:/u/cvs/src
      cvs import my_project v1_0 v1_0
      cvs checkout my_project
There are many introductions to cvs on the web; the one that I recommend is http://www.astro.princeton.edu/~rhl/cvs/cvs-cookbook.html
Useful CVS commands
4. Debugging
- There are two schools of thought on debugging
  - Sprinkle the code with print statements
  - Use a debugger
- Sometimes the former approach is unavoidable, usually when you've violated the rules of the language, or if your favourite debugger is buggy.
GDB
My preferred debugger is gdb, with a nice, powerful, command line interface and a limited macro facility
- Many people have implemented a graphical user interface on top of GDB, e.g.
  - Insight is a GUI for GDB written in Tcl/Tk.
  - DDD is a popular GUI for GDB and dbx (also xxgdb)
  - Code Medic is another GUI written for GDB.
  - kdbg is another GUI written for GDB, designed for KDE
  - HP Wildebeest (WDB) is a version of GDB which is included in new versions of HP-UX.
  - GNU Visual Debugger is written in Ada and uses the GtkAda graphical toolkit.
  - Jessie is written in Java. Includes multi-thread and multi-process features.
  - RHIDE is yet another IDE, this one with a look and feel similar to the Borland 3.1 toolset.
- (AST523 doesn't vouch for any of these)
Notes on using Debuggers
- You'll have to compile with the -g flag, to include debugging information in the .o file
- If you compile with -On, you'll find it harder to see what's going on as the compiler works harder
  - Code is moved around, e.g. out of loops
  - Flow-of-control may be confusing
  - Variables may not exist, or their values may be wrong (due to using registers)
  - Functions may not exist if they've been inlined
Why do I use a Debugger?
Because it allows me to explore hypotheses about what went wrong as the fancy strikes me.
- I can look at anything, not just read the output from compiled-in print statements
      print object->child->psfCounts
- I can tell the program to stop wherever I think that it might be interesting
      stop in estimate_entropy when S_in > S_out
Debugging Memory Problems
Memory (stack or heap) brings its own problems
- Corruption - you wrote where you shouldn't
- Leaks - you failed to free memory that you were finished with
  - Not a problem for languages with Garbage Collection (e.g. python), but be careful when you mix languages with e.g. SWIG
Debugging Memory Problems
- Most unix versions have debugging versions of the heap-management libraries that you can link to (or enable via environment variables).
- There are tools such as purify (commercial) and valgrind that can be used to find leaks
- I've found specialised wrappers around malloc very useful in SDSS and Pan-STARRS; they provide e.g. unique IDs for every memory transaction
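To make the idea of a tracking wrapper concrete, here is a toy Python model of one. It is not the SDSS or Pan-STARRS code. The class TrackedAllocator and its methods are invented for this sketch, and it only records bookkeeping rather than managing real memory. Every allocation receives a serial ID, so a leak or double free can be tied back to a specific request.

```python
import itertools


class TrackedAllocator:
    """Toy stand-in for a malloc wrapper that tags each block with a serial ID."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._live = {}                 # serial ID -> (size, label)

    def alloc(self, size, label=""):
        """Record an allocation and return its unique serial ID."""
        serial = next(self._next_id)
        self._live[serial] = (size, label)
        return serial

    def free(self, serial):
        """Release a block; complain if it is not currently allocated."""
        if serial not in self._live:
            raise ValueError(f"block {serial} freed twice or never allocated")
        del self._live[serial]

    def leaks(self):
        """Serial IDs of blocks that were allocated but never freed."""
        return sorted(self._live)


heap = TrackedAllocator()
image = heap.alloc(4096, "image")
mask = heap.alloc(512, "mask")
heap.free(image)
print(heap.leaks())
```

With serial IDs in hand, one can rerun the program and stop in the debugger when the leaking ID is handed out.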
Optimization
- Only optimize code that needs to be optimized
- Profile, don't guess, to find the bottlenecks
- Improve algorithms before fiddling with code (but don't be lazy)
- Trust the compiler to do a lot of the grunt work
- Moore's Law trumps writing assembler
Profiling
- The standard unix profiler is gprof (there are also hardware specific profilers of which more anon)
  - Compile and link all your code with -pg
  - Run your masterpiece, which will produce a file called gmon.out
  - Run gprof masterpiece to generate the desired profile
Profiling with gprof
- gprof produces two types of information
  - Statistics on the call stack for every function call (this is done by special code inserted in all function prologues to save sp, which is why you compile with -pg)
  - What is happening every tick (0.01s) (this is done by the CRTL code generating SIGPROF interrupts every tick to save sp, which is why you link with -pg)
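The same "profile, don't guess" workflow exists for scripting languages. As a loose analogue of gprof (a different tool, swapped in here purely for illustration), the sketch below uses Python's standard cProfile and pstats modules. The functions slow_sum_of_squares and profile_call are made up for the example.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive loop so that it shows up in the profile."""
    total = 0.0
    for i in range(n):
        total += i * i
    return total


def profile_call(func, *args):
    """Run func(*args) under the profiler; return (result, report text)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args)
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("tottime").print_stats(5)   # top 5 entries by self time
    return result, stream.getvalue()


result, report = profile_call(slow_sum_of_squares, 200000)
print(report)
```

The tottime and cumtime columns of the report play roughly the roles of gprof's self and total seconds.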
Sample gprof Output
Flat profile:
Each sample counts as 0.01 seconds.
      %   cumulative   self               self     total
     time   seconds   seconds    calls   s/call   s/call  name
    64.91    484.05    484.05        1   484.05   536.54  spatial_convolve
    13.21    582.54     98.49        1    98.49    98.49  spreadMask
     7.04    635.04     52.50    19213     0.00     0.00  make_kernel
     6.75    685.38     50.34      200     0.25     0.27  getPsfCenters
     1.94    699.84     14.46      204     0.07     0.10  getStampStats3
     1.49    710.95     11.11     1372     0.01     0.01  xy_conv_stamp
     0.99    718.36      7.41                             main
     0.59    722.75      4.39  5906776     0.00     0.00  ran1
     0.50    726.46      3.71  8266178     0.00     0.00  get_background
     0.46    729.88      3.42        7     0.49     0.49  fset
     0.45    733.26      3.38   238835     0.00     0.00  checkPsfCenter
     0.32    735.66      2.40        4     0.60     0.60  makeNoiseImage4
     0.29    737.80      2.14        1     2.14     2.14  makeInputMask
     0.25    739.65      1.85      215     0.01     0.01  sigma_clip
Sample gprof Output (II)
    index % time    self  children    called           name
                                                           <spontaneous>
    [1]     99.8    7.41  736.79                       main [1]
                  484.05   52.49       1/1                 spatial_convolve [2]
                   98.49    0.00       1/1                 spreadMask [3]
                    0.00   74.26     100/100               buildStamps [4]
                    0.02   11.22      28/30                fillStamp [8]
                    3.71    0.00 8266128/8266178           get_background [11]
                    3.42    0.00       7/7                 fset [12]
                    2.40    0.00       4/4                 makeNoiseImage4 [14]
                    0.02    2.26       2/2                 check_stamps [15]
                    2.14    0.00       1/1                 makeInputMask [16]
                    0.00    1.43       1/1                 fitKernel [19]
                    0.42    0.00       1/1                 getNoiseStats3 [24]
                    0.28    0.12       4/204               getStampStats3 [7]
                    0.05    0.00       1/1                 hp_fits_write_subset [32]
                    0.01    0.00       1/215               sigma_clip [17]
                    0.01    0.00       3/19213             make_kernel [6]
                    0.00    0.00     209/209               imin [45]
(To Be Continued)
(continued)
    index % time    self  children    called           name
                  484.05   52.49       1/1                 main [1]
    [2]     71.9  484.05   52.49       1               spatial_convolve [2]
                   52.49    0.00   19208/19213             make_kernel [6]
    -----------------------------------------------
                   98.49    0.00       1/1                 main [1]
    [3]     13.2   98.49    0.00       1               spreadMask [3]
    -----------------------------------------------
                    0.00   74.26     100/100               main [1]
    [4]     10.0    0.00   74.26     100               buildStamps [4]
                   50.34    3.38     200/200               getPsfCenters [5]
                   14.18    6.02     200/204               getStampStats3 [7]
                    0.34    0.00     200/200               cutStamp [25]
    -----------------------------------------------
                   50.34    3.38     200/200               buildStamps [4]
    [5]      7.2   50.34    3.38     200               getPsfCenters [5]
                    3.38    0.00  238835/238835            checkPsfCenter [13]
                    0.00    0.00      28/28                quick_sort [47]
    -----------------------------------------------
Profiling with vendor tools
Most modern processors have hardware counters that keep track of instructions executed, floating point operations completed, cache hits, etc. Accessing this information requires profiling software provided (sold!) by the chip manufacturer.
Advantage: can provide much more detailed information about performance
Example: SpeedShop on SGI Origin and Altix machines
Summary for execution of athena -i ../tst/2D-mhd/athinput.linear-wave time/tlim=0.2
Based on 400 MHz IP35 MIPS R12000/R14000 CPU

                                                                                  Typical     Minimum     Maximum
    Event Counter Name                                           Counter Value  Time (sec)  Time (sec)  Time (sec)
     0 Cycles                                                       1332354800    3.330887    3.330887    3.330887
    16 Executed prefetch instructions                                   676432    0.000000    0.000000    0.000000
    21 Graduated floating point instructions                         502607856    1.256520    0.628260   65.339021
     2 Decoded loads                                                 448868464    1.122171    1.122171    1.122171
    18 Graduated loads                                               443549696    1.108874    1.108874    1.108874
     3 Decoded stores                                                252710176    0.631775    0.631775    0.631775
    19 Graduated stores                                              252191312    0.630478    0.630478    0.630478
     4 Miss handling table occupancy                                 135901872    0.339755    0.339755    0.339755
    25 Primary data cache misses                                      10219840    0.217172    0.055443    0.217172
    24 Mispredicted branches                                           8901440    0.162006    0.133522    0.196054
     6 Resolved conditional branches                                  63185632    0.157964    0.157964    0.157964
    22 Quadwords written back from primary data cache                 11542880    0.114852    0.090612    0.114852
    26 Secondary data cache misses                                       44032    0.010996    0.006938    0.010996
     9 Primary instruction cache misses                                  56768    0.002414    0.000616    0.002414
     7 Quadwords written back from scache                                46080    0.000978    0.000680    0.001010
    23 TLB misses                                                         3264    0.000635    0.000635    0.000635
    10 Secondary instruction cache misses                                  240    0.000060    0.000038    0.000060
    31 Store/prefetch exclusive to shared block in scache                12192    0.000030    0.000030    0.000030
    30 Store/prefetch exclusive to clean block in scache                   288    0.000001    0.000001    0.000001
     1 Decoded instructions                                         1708899584    0.000000    0.000000    4.272249
     5 Failed store conditionals                                             0    0.000000    0.000000    0.000000
     8 Correctable scache data array ECC errors                              0    0.000000    0.000000    0.000000
    11 Instruction misprediction from scache way prediction table         3824    0.000000    0.000000    0.000010
    12 External interventions                                             5616    0.000000    0.000000    0.000000
    13 External invalidations                                            21712    0.000000    0.000000    0.000000
    14 ALU/FPU progress cycles                                               0    0.000000    0.000000    0.000000
    15 Graduated instructions                                       1605826832    0.000000    0.000000    4.014567
    17 Prefetch primary data cache misses                                88256    0.000000    0.000000    0.000221
    20 Graduated store conditionals                                          0    0.000000    0.000000    0.000000
    27 Data misprediction from scache way prediction table               77648    0.000000    0.000000    0.000194
    28 State of intervention hits in scache                               5520    0.000000    0.000000    0.000000
    29 State of invalidation hits in scache                               5008    0.000000    0.000000    0.000000
Statistics
    Graduated instructions/cycle                                            1.205255
    Graduated floating point instructions/cycle                             0.377233
    Graduated loads & stores/cycle                                          0.522189
    Graduated loads & stores/floating point instruction                     1.384262
    Mispredicted branches/Resolved conditional branches                     0.140878
    Graduated loads/Decoded loads (and prefetches)                          0.986664
    Graduated stores/Decoded stores                                         0.997947
    Data mispredict/Data scache hits                                        0.007631
    Instruction mispredict/Instruction scache hits                          0.067648
    L1 Cache Line Reuse                                                    67.077485
    L2 Cache Line Reuse                                                   231.100291
    L1 Data Cache Hit Rate                                                  0.985311
    L2 Data Cache Hit Rate                                                  0.995692
    Time accessing memory/Total time                                        0.590880
    Time not making progress (probably waiting on memory) / Total time      1.000000
    L1--L2 bandwidth used (MB/s, average per process)                     153.629036
    Memory bandwidth used (MB/s, average per process)                       1.913417
    MFLOPS (average per process)                                          150.893097
    Cache misses in flight per cycle (average)                              0.102001
    Prefetch cache miss rate                                                0.130473
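Several of these statistics are simple ratios of the raw counters, and it can be instructive to rebuild them by hand. The sketch below does so in Python using counter values copied from the summary. The dictionary keys and the function derived_statistics are names chosen for this example, and the L1 hit-rate formula (1 - misses / (loads + stores)) is an assumption that happens to reproduce the listed value rather than a documented definition.

```python
CLOCK_HZ = 400e6   # 400 MHz CPU, from the summary header

# Counter values copied from the event summary
counters = {
    "cycles": 1332354800,
    "graduated_instructions": 1605826832,
    "graduated_fp_instructions": 502607856,
    "graduated_loads": 443549696,
    "graduated_stores": 252191312,
    "primary_data_cache_misses": 10219840,
}


def derived_statistics(c, clock_hz=CLOCK_HZ):
    """Recompute a few of the ratios from the raw counter values."""
    seconds = c["cycles"] / clock_hz
    memory_ops = c["graduated_loads"] + c["graduated_stores"]
    fp = c["graduated_fp_instructions"]
    return {
        "seconds": seconds,
        "instructions_per_cycle": c["graduated_instructions"] / c["cycles"],
        "fp_per_cycle": fp / c["cycles"],
        "loads_stores_per_cycle": memory_ops / c["cycles"],
        "loads_stores_per_fp": memory_ops / fp,
        # Assumed definition: fraction of loads and stores that hit in L1
        "l1_data_hit_rate": 1.0 - c["primary_data_cache_misses"] / memory_ops,
        "mflops": fp / seconds / 1e6,
    }


for name, value in derived_statistics(counters).items():
    print(f"{name:28s} {value:12.6f}")
```

The ratio of about 1.4 loads and stores per floating point instruction is the number to watch: the code is moving more data than it is computing with.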
Improving Algorithms: Sorting
- Bubble sort: n^2
- Insertion sort: n^2
- Shell sort: n^(3/2)
- Quick sort: n ln(n)
- Heap sort: n ln(n)
- Radix sort: n
- Stupid sort: n!
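The difference between n^2 and n ln(n) is easy to see by counting comparisons. The sketch below instruments an insertion sort and, as a stand-in for the n ln(n) class, a merge sort (not one of the algorithms listed above, but simpler to instrument than quick sort or heap sort). Both function names are chosen for this example.

```python
def insertion_sort(values):
    """Return (sorted copy, number of comparisons); O(n^2) in the worst case."""
    a = list(values)
    comparisons = 0
    for i in range(1, len(a)):
        item = a[i]
        j = i - 1
        while j >= 0:
            comparisons += 1
            if a[j] <= item:
                break
            a[j + 1] = a[j]        # shift the larger element up
            j -= 1
        a[j + 1] = item
    return a, comparisons


def merge_sort(values):
    """Return (sorted copy, number of comparisons); O(n log n)."""
    a = list(values)
    if len(a) <= 1:
        return a, 0
    mid = len(a) // 2
    left, n_left = merge_sort(a[:mid])
    right, n_right = merge_sort(a[mid:])
    merged = []
    comparisons = n_left + n_right
    i = j = 0
    while i < len(left) and j < len(right):
        comparisons += 1
        if left[i] <= right[j]:
            merged.append(left[i])
            i += 1
        else:
            merged.append(right[j])
            j += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged, comparisons


worst_case = list(range(1000, 0, -1))      # reverse-sorted input
print(insertion_sort(worst_case)[1], merge_sort(worst_case)[1])
```

For 1000 reverse-sorted values the insertion sort needs n(n-1)/2 = 499500 comparisons, while the merge sort stays below n log2(n), roughly 10000.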
Improving Algorithms: Astronomy
- Consider a CCD with a few bad pixels.
If I want to ask "Is this pixel bad?", an unsigned char mask (one entry per pixel) might be a good representation. If I want to return all of the bad pixels, an array of struct { int x, y; } badpixels might be a good representation.
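The trade-off between the two representations can be sketched in a few lines of Python. The function names below are invented for the example, and a list of bytearray rows stands in for the unsigned char mask. Looking up one pixel in the mask is a single index operation, whereas recovering every bad pixel from the mask means scanning the whole detector, which is exactly the work the coordinate list avoids.

```python
def make_mask(nrow, ncol, bad_pixels):
    """Build a per-pixel mask (1 = bad) from a list of (x, y) coordinates."""
    mask = [bytearray(ncol) for _ in range(nrow)]
    for x, y in bad_pixels:
        mask[y][x] = 1
    return mask


def is_bad(mask, x, y):
    """Constant-time answer to "is this pixel bad?"."""
    return mask[y][x] != 0


def list_bad_pixels(mask):
    """Recover the coordinate list; this has to visit every pixel."""
    return [(x, y)
            for y, row in enumerate(mask)
            for x, flag in enumerate(row) if flag]


bad_pixels = [(3, 1), (0, 2)]              # the compact representation
mask = make_mask(4, 5, bad_pixels)         # 4 rows by 5 columns
print(is_bad(mask, 3, 1), is_bad(mask, 0, 0), list_bad_pixels(mask))
```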
Let's look at another example
    #include <stdlib.h>
    #include <stdio.h>
    #include <math.h>
    #include "alias.h"

    bool AST523_calc_trajectory(
        AST523_TRAJECTORY *traj,        // object's trajectory
        float height0,                  // height above ground where egg was released [m]
        float vel0)                     // initial velocity of egg [m/s]
    {
        const float g = 9.81;           // acceleration due to gravity [m/s^2]
        for (int i = 0; i < traj->npt; i++) {
            float t = traj->time[i];
            traj->height[i] = height0 + vel0*t - g*pow(t,2)/2;
        }
        return (traj->height[traj->npt - 1] < 0) ? true : false;
    }
alias.h
    #if !defined(ALIAS_H)
    #define ALIAS_H
    #include <stdbool.h>

    typedef struct {
        float *time;                    // time since egg was thrown
        float *height;                  // height of egg at time time
        int npt;                        // dimension of height, time
    } AST523_TRAJECTORY;

    bool AST523_calc_trajectory(
        AST523_TRAJECTORY *traj,        // object's trajectory
        float height0,                  // initial height above ground [m]
        float vel0);                    // object's initial velocity [m/s]
    #endif
What does gprof tell us?
    gprof --line egg_toss

Each sample counts as 0.01 seconds.
      %   cumulative   self               self     total
     time   seconds   seconds    calls  Ts/call  Ts/call  name
    46.01      2.77      2.77                             AST523_calc_trajectory (alias.c:16)
    27.79      4.43      1.67                             main (main.c:39)
    20.05      5.64      1.21                             AST523_calc_trajectory (alias.c:14)
     5.32      5.96      0.32                             main (main.c:38)
     0.83      6.01      0.05                             AST523_calc_trajectory (alias.c:15)
     0.00      6.01      0.00        1     0.00     0.00  AST523_calc_trajectory (alias.c:11)
     0.00      6.01      0.00        1     0.00     0.00  trajDel (main.c:24)
     0.00      6.01      0.00        1     0.00     0.00  trajNew (main.c:12)

    alias.c:14    for (int i = 0; i < traj->npt; i++) {
    alias.c:15        float t = traj->time[i];
    alias.c:16        traj->height[i] = height0 + vel0*t - g*pow(t,2)/2;
    alias.c:17    }
    0x804869d <alias.c:14>:  mov    (%edi),%edx
    0x804869f <alias.c:14>:  xor    %esi,%esi
    0x80486a1 <alias.c:14>:  cmp    $0x0,%edx
    0x80486a4 <alias.c:14>:  jmp    0x80486e8 <alias.c:14>
    0x80486a6 <alias.c:14>:  mov    %esi,%esi
    0x80486a8 <alias.c:15>:  mov    0x4(%edi),%eax
    0x80486ab <alias.c:15>:  flds   (%eax,%esi,4)
    0x80486ae <alias.c:16>:  flds   0x10(%ebp)
    0x80486b1 <alias.c:16>:  mov    0x8(%edi),%ebx
    0x80486b4 <alias.c:16>:  fmul   %st(1),%st
    0x80486b6 <alias.c:16>:  push   $0x40000000
    0x80486bb <alias.c:16>:  fadds  0xc(%ebp)
    0x80486be <alias.c:16>:  push   $0x0
    0x80486c0 <alias.c:16>:  sub    $0x8,%esp
    0x80486c3 <alias.c:16>:  fstpl  0xffffffe0(%ebp)
    0x80486c6 <alias.c:16>:  fstpl  (%esp)
    0x80486c9 <alias.c:16>:  call   0x8048450 <_init+88>
Each sample counts as 0.01 seconds.
      %   cumulative   self               self     total
     time   seconds   seconds    calls  Ts/call  Ts/call  name
    53.60      2.78      2.78                             AST523_calc_trajectory (alias.c:16)
    32.67      4.47      1.69                             main (main.c:39)
     5.91      4.78      0.31                             AST523_calc_trajectory (alias.c:14)
     3.68      4.97      0.19                             AST523_calc_trajectory (alias.c:15)
     3.10      5.13      0.16                             main (main.c:38 @ 804868c)
     1.45      5.20      0.08                             main (main.c:38 @ 8048669)
     0.00      5.20      0.00        1     0.00     0.00  AST523_calc_trajectory (alias.c:11)

Make the obvious change

Loop test before the change (traj->npt is reloaded from memory on every pass):
    mov    (%edi),%edx
    inc    %esi
    add    $0x10,%esp
    cmp    %esi,%edx
    jg     0x80486a8 <alias.c:15>
and after:
    inc    %esi
    cmp    0xffffffe8(%ebp),%esi
    jl     0x8048704 <alias.c:15>

    alias.c:13    const int npt = traj->npt;      // dealias traj->npt
    alias.c:14    for (int i = 0; i < npt; i++) {
    alias.c:15        float t = traj->time[i];
    alias.c:16        traj->height[i] = height0 + vel0*t - g*pow(t,2)/2;
    alias.c:17    }
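The same transformation can be written in Python, where the cost being avoided is a repeated attribute lookup rather than a forced reload from memory. The sketch below is only an analogue of the C change. The Trajectory class and both function names are invented for this example. In C the compiler cannot make this change itself because, as far as it can tell, the store to traj->height[i] might overwrite traj->npt.

```python
G = 9.81   # acceleration due to gravity [m/s^2]


class Trajectory:
    """Python stand-in for the AST523_TRAJECTORY struct."""

    def __init__(self, times):
        self.time = list(times)
        self.height = [0.0] * len(self.time)
        self.npt = len(self.time)


def calc_trajectory_reload(traj, height0, vel0):
    """Re-read traj.npt on every pass, like the original loop."""
    i = 0
    while i < traj.npt:                 # attribute looked up each iteration
        t = traj.time[i]
        traj.height[i] = height0 + vel0 * t - G * t ** 2 / 2
        i += 1
    return traj.height[traj.npt - 1] < 0


def calc_trajectory_dealiased(traj, height0, vel0):
    """Copy traj.npt to a local once, like the modified loop."""
    npt = traj.npt                      # read once, outside the loop
    for i in range(npt):
        t = traj.time[i]
        traj.height[i] = height0 + vel0 * t - G * t ** 2 / 2
    return traj.height[npt - 1] < 0


times = [0.1 * i for i in range(101)]   # 0 to 10 seconds
print(calc_trajectory_dealiased(Trajectory(times), 10.0, 0.0))
```

Both versions give identical heights. The only difference is how often the loop bound is fetched.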
Back to the previous example
    gprof --line hotpants

Flat profile:
Each sample counts as 0.01 seconds.
      %   cumulative   self               self     total
     time   seconds   seconds    calls  ns/call  ns/call  name
    10.95     81.64     81.64                             spreadMask (functions.c:1372)
     6.39    129.29     47.65                             spatial_convolve (alard.c:1287)
     5.67    171.58     42.29                             spatial_convolve (alard.c:1295)
     5.26    210.81     39.23                             spatial_convolve (alard.c:1298)
     4.97    247.84     37.03                             spatial_convolve (alard.c:1301)
     4.93    284.61     36.77                             spatial_convolve (alard.c:1294)
     4.93    321.38     36.77                             spatial_convolve (alard.c:1293)
     4.29    353.38     32.01                             spatial_convolve (alard.c:1291)
functions.c
    functions.c:1372
        mData[ii+rPixX*jj] |= FLAG_OK_CONV*(!(mData[ii+rPixX*jj] & FLAG_INPUT_ISBAD));

    (gdb) b functions.c:1372
    (gdb) continue
    (gdb) x/16i $pc
    0x8059d93:  imul   %eax,%ecx
    0x8059d96:  add    0xffffffec(%ebp),%ecx
    0x8059d99:  mov    0x8(%ebp),%eax
    0x8059d9c:  mov    (%eax,%ecx,4),%eax
    0x8059d9f:  test   %al,%al
    0x8059da1:  mov    %ecx,0xffffffe4(%ebp)
    0x8059da4:  mov    %eax,0xffffffe0(%ebp)
    0x8059da7:  js     0x8059dad <functions.c:1372>
    0x8059da9:  orl    $0x40,0xffffffe0(%ebp)
    0x8059dad:  mov    0xffffffe0(%ebp),%ecx
    0x8059db0:  mov    0xffffffe4(%ebp),%esi
    0x8059db3:  mov    0x8(%ebp),%eax
    0x8059db6:  mov    %ecx,(%eax,%esi,4)
    0x8059db9:  mov    0x818a6d4,%ecx
    0x8059dbf:  mov    %ecx,0xffffffe8(%ebp)
    0x8059dc2:  mov    0x818a6e8,%ecx

    globals.h:  int rPixX;
Let's help the compiler
- Make those globals local, and
    0x8059da8:  imul   0xffffffe4(%ebp),%eax
    0x8059dac:  add    0xffffffe8(%ebp),%eax
    0x8059daf:  mov    0xfffffff0(%ebp),%ecx
    0x8059db2:  mov    %eax,0xffffffd8(%ebp)
    0x8059db5:  mov    (%ecx,%eax,4),%eax
alard.c
    alard.c:1287
        for (ic = i - hwKernel; ic <= i + hwKernel; ic++) {

    globals.h:  int hwKernel;
Rerun gprof
Flat profile:
Each sample counts as 0.01 seconds.
      %   cumulative   self               self     total
     time   seconds   seconds    calls   s/call   s/call  name
    62.42    432.09    432.09        1   432.09   477.76  spatial_convolve
    14.88    535.07    102.98        1   102.98   102.98  spreadMask

Cf. the old results:
Each sample counts as 0.01 seconds.
      %   cumulative   self               self     total
     time   seconds   seconds    calls   s/call   s/call  name
    64.91    484.05    484.05        1   484.05   536.54  spatial_convolve
    13.21    582.54     98.49        1    98.49    98.49  spreadMask
Why didn't that help with spreadMask?
    for (l = -w2; l <= w2; l++) {
        jj = j + l;
        if (jj < 0 || jj >= rPixY_l)
            continue;

        mData[ii+rPixX_l*jj] |= FLAG_OK_CONV*(!(mData[ii+rPixX_l*jj] & FLAG_INPUT_ISBAD));
    }

Memory's being addressed in the wrong order
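To see what "the wrong order" means, one can list the flat offsets ii + nx*jj that a double loop visits. The sketch below does this in Python for a tiny array. The functions flat_indices and strides are names made up for this illustration. When ii varies fastest, successive accesses are adjacent in memory. When jj varies fastest, as in the loop above, each access jumps a whole row ahead.

```python
def flat_indices(nx, ny, inner="x"):
    """Offsets ii + nx*jj in visiting order; `inner` is the fastest-varying index."""
    if inner == "x":
        return [ii + nx * jj for jj in range(ny) for ii in range(nx)]
    return [ii + nx * jj for ii in range(nx) for jj in range(ny)]


def strides(indices):
    """Distance in elements between successive accesses."""
    return [b - a for a, b in zip(indices, indices[1:])]


nx, ny = 4, 3
print(strides(flat_indices(nx, ny, inner="x")))   # unit stride throughout
print(strides(flat_indices(nx, ny, inner="y")))   # jumps of nx elements
```

On a real image nx is in the thousands, so nearly every access in the second ordering lands on a different cache line.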
Moral Lessons
- Understand your computer and your languages
- Don't be sloppy: think!
- Go forth and multiply