1
FFTW and the SiCortex Architecture
  • Po-Ru Loh
  • MIT 18.337
  • May 8, 2008

2
The email that launched a thousand processors
  • Part of my work here at SiCortex is to work with
    the FFT libraries we provide to our customers....
    When I do a comparison of performance of the
    serial version of FFTW 2.1.5 and 3.2 alpha, FFTW
    3.2 alpha is significantly slower.
  • Both codes are compiled with the same compiler
    flags. I am running the 64-bit version.
  • CC=scpathcc LD=scld AR=scar RANLIB=scranlib
    F77=scpathf95
  • CFLAGS="-g -O3 -OPT:Ofast -mips64"
    MPILIBS="-lscmpi"
  • ./configure --build=x86_64-pc-linux-gnu
  • --host=mips64el-gentoo-linux-gnu --enable-fma
  • FFTW 3.2alpha
  • bench -opatient -s icf2048x2048
  • Problem: icf2048x2048, setup: 3.79 s, time: 4.21
    s, "mflops": 109.7
  • FFTW 2.1.5
  • Please wait (and dream of faster computers).
  • SPEED TEST: 2048x2048, FFTW_FORWARD, in place,
    specific
  • time for one fft: 2.914490 s (694.868565
    ns/point)
  • "mflops" = 5 (N log2 N) / (t in microseconds) =
    158.303319
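
For reference, both "mflops" figures follow from the formula at the end of the FFTW 2.1.5 output above; a quick check in plain C (the helper name and layout are mine, not FFTW's):

    #include <math.h>
    #include <stdio.h>

    /* FFTW's benchmark metric: "mflops" = 5 N log2(N) / (time in microseconds),
     * the textbook flop-count estimate for a complex FFT of N points.
     * Build (roughly): gcc mflops.c -lm                                        */
    static double mflops(double n_points, double seconds)
    {
        return 5.0 * n_points * log2(n_points) / (seconds * 1e6);
    }

    int main(void)
    {
        double n = 2048.0 * 2048.0;                   /* 2048x2048 complex transform */
        printf("fftw-3.2alpha: %.1f\n", mflops(n, 4.21));     /* ~110   */
        printf("fftw-2.1.5:    %.1f\n", mflops(n, 2.914490)); /* ~158.3 */
        return 0;
    }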

3
Where to start?
  • Maybe try benchmarking FFTW for myself
  • But I'm too lazy to write benchmarking code,
    especially if it already exists somewhere
  • Unfortunately, that somewhere happens to be
    inside the FFTW package
  • and it has its own dependencies
  • so I can't simply grab a file and link to the
    FFTW library on our SC648... ugh
  • Fine, I'll just download the whole package
  • compile the whole thing (with the default
    config, since I don't know how to do anything
    smarter)
  • and then hijack the last link step to relink
    against the "real" SC-optimized library
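
For reference, a hand-rolled harness along these lines would look roughly like the sketch below. It is not what FFTW's bench program actually does; the problem size, the FFTW_PATIENT flag (matching bench -opatient), and the gettimeofday timing are illustrative choices.

    /* Minimal stand-in for FFTW's bench tool: plan and time one in-place
     * 2048x2048 complex forward transform with FFTW 3.
     * Build (roughly): gcc bench2d.c -lfftw3 -lm                          */
    #include <stdio.h>
    #include <math.h>
    #include <sys/time.h>
    #include <fftw3.h>

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1e-6 * tv.tv_usec;
    }

    int main(void)
    {
        int n = 2048;
        fftw_complex *a = fftw_malloc(sizeof(fftw_complex) * n * n);

        double t0 = now();
        fftw_plan p = fftw_plan_dft_2d(n, n, a, a, FFTW_FORWARD, FFTW_PATIENT);
        printf("setup: %.2f s\n", now() - t0);

        /* PATIENT planning may scribble on the array, so fill it afterwards. */
        for (int i = 0; i < n * n; ++i) { a[i][0] = 1.0; a[i][1] = 0.0; }

        t0 = now();
        fftw_execute(p);
        double t = now() - t0;
        double npts = (double)n * n;
        printf("time: %.3f s, \"mflops\": %.1f\n",
               t, 5.0 * npts * log2(npts) / (t * 1e6));

        fftw_destroy_plan(p);
        fftw_free(a);
        return 0;
    }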

4
Some surprises
  • The numbers you sent previously (for a 2048x2048
    complex transform) were:
  • fftw-2.1.5: 155 mflops
  • fftw-3.2alpha: 110 mflops
  • My initial tests didn't find a discrepancy nearly
    that big
  • fftw-2.1.5: 140 mflops
  • fftw-3.1.2: 120 mflops
  • fftw-3.2alpha3: 130 mflops
  • Of course, these results are simply using the
    downloaded version of the libraries without the
    SiCortex tweaks you mentioned before (and also
    with gcc and not PathScale).

5
More surprises
  • Strangely, I then tried re-linking fftw_test and
    bench using the precompiled fftw libraries that
    came on the machine and got a substantial
    slowdown!
  • /usr/lib/libsfftw.a (2.1.5): 45 mflops
  • /usr/lib/libfftw3.la (3.2alpha): 90 mflops
  • One other thing I noticed was that the configure
    script for fftw3 complained about not finding a
    hardware cycle counter
  • But fftw2 didn't produce this warning -- perhaps
    it doesn't use it.
  • Does a cycle counter actually exist (and do I
    just need to tell configure where it is)?
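
For context, what configure is probing for is a getticks()-style cycle timer (FFTW's cycle.h). A crude stand-in with only microsecond resolution is shown below, just to illustrate the shape of the hook; this is not FFTW's actual cycle.h, and it is far too coarse to time the small codelets the planner compares:

    /* Rough stand-in for a getticks()/elapsed() cycle-timer interface,
     * built on gettimeofday. Microsecond granularity only; a real
     * hardware cycle counter is what the planner actually wants.       */
    #include <sys/time.h>

    typedef long long ticks;

    static ticks getticks(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return (ticks)tv.tv_sec * 1000000LL + tv.tv_usec;
    }

    static double elapsed(ticks t1, ticks t0)
    {
        return (double)(t1 - t0);
    }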

6
Meanwhile, let's play with FFTW and our machine
  • In the end, the algorithms make the difference,
    but we need to know where to look
  • Previously, I thought about algorithms in terms
    of O(n log n) flops
  • Need to get a feel for what's really important

7
How fast is FFTW anyway? Is there any hope of
doing better?
  • Grab a competitor from Jörg's useful and ugly
    FFT page
  • 2-dim FFT modernized and cleaned up by Stefan
    Gustavson
  • Simple but decently optimized radix-8/4/2
    transformation that does rows, then cols (a
    timing sketch in the same spirit follows this
    output)
  • Output: time taken for just rows, then total time
    taken
  • ploh@sc1-m3n6 /kube-gustavson ./a.out
  • Enter number of rows: 1024
  • Enter number of cols: 1024
  • Time elapsed: 2.040359 sec
  • Time elapsed: 4.149302 sec
  • ploh@sc1-m3n6 /kube-gustavson ./a.out
  • Enter number of rows: 2048
  • Enter number of cols: 2048
  • Time elapsed: 8.482921 sec
  • Time elapsed: 17.633655 sec
  • ploh@sc1-m3n6 /kube-gustavson ./a.out
  • Enter number of rows: 4096
  • Enter number of cols: 4096
  • Time elapsed: 37.851723 sec
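
A sketch in the same spirit, using FFTW's batched 1D plans rather than the Kube-Gustavson radix-8/4/2 kernel, printing the same two numbers (row pass alone, then total). The sizes and the now() helper are illustrative choices:

    /* Row-column 2D complex FFT with the same two printouts as the test
     * driver above: time after the row pass, then total time. Uses FFTW
     * batched 1D plans instead of the radix-8/4/2 kernel.
     * Build (roughly): gcc rowcol.c -lfftw3 -lm                          */
    #include <stdio.h>
    #include <sys/time.h>
    #include <fftw3.h>

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1e-6 * tv.tv_usec;
    }

    int main(void)
    {
        int rows = 1024, cols = 1024;
        fftw_complex *a = fftw_malloc(sizeof(fftw_complex) * rows * cols);

        /* One length-cols FFT per row (contiguous), then one length-rows
         * FFT per column (stride = cols).                                */
        fftw_plan prow = fftw_plan_many_dft(1, &cols, rows,
                                            a, NULL, 1, cols, a, NULL, 1, cols,
                                            FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_plan pcol = fftw_plan_many_dft(1, &rows, cols,
                                            a, NULL, cols, 1, a, NULL, cols, 1,
                                            FFTW_FORWARD, FFTW_ESTIMATE);

        for (int i = 0; i < rows * cols; ++i) { a[i][0] = 1.0; a[i][1] = 0.0; }

        double t0 = now();
        fftw_execute(prow);
        printf("Time elapsed: %f sec\n", now() - t0);   /* rows only   */
        fftw_execute(pcol);
        printf("Time elapsed: %f sec\n", now() - t0);   /* rows + cols */

        fftw_destroy_plan(prow);
        fftw_destroy_plan(pcol);
        fftw_free(a);
        return 0;
    }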

8
How fast is FFTW anyway? Is there any hope of
doing better?
  • How about this -O3 business?
  • ploh@sc1-m3n6 /kube-gustavson gcc -c
    timed-kube-gustavson-fft.c -O3
  • ploh@sc1-m3n6 /kube-gustavson gcc test_kube.c
    timed-kube-gustavson-fft.o -lm
  • ploh@sc1-m3n6 /kube-gustavson ./a.out
  • Enter number of rows: 1024
  • Enter number of cols: 1024
  • Time elapsed: 0.555099 sec
  • Time elapsed: 1.172668 sec
  • ploh@sc1-m3n6 /kube-gustavson ./a.out
  • Enter number of rows: 2048
  • Enter number of cols: 2048
  • Time elapsed: 2.321096 sec
  • Time elapsed: 5.325778 sec
  • ploh@sc1-m3n6 /kube-gustavson ./a.out
  • Enter number of rows: 4096
  • Enter number of cols: 4096
  • Time elapsed: 11.917786 sec
  • Time elapsed: 28.651167 sec

9
How fast is FFTW anyway? Is there any hope of
doing better?
  • And how about SiCortex's optimized math library?
  • ploh@sc1-m3n6 /kube-gustavson gcc test_kube.c
    timed-kube-gustavson-fft.o -O3 -lscm -lm
  • ploh@sc1-m3n6 /kube-gustavson ./a.out
  • Enter number of rows: 1024
  • Enter number of cols: 1024
  • Time elapsed: 0.540417 sec
  • Time elapsed: 1.143589 sec
  • ploh@sc1-m3n6 /kube-gustavson ./a.out
  • Enter number of rows: 2048
  • Enter number of cols: 2048
  • Time elapsed: 2.260734 sec
  • Time elapsed: 5.208607 sec
  • ploh@sc1-m3n6 /kube-gustavson ./a.out
  • Enter number of rows: 4096
  • Enter number of cols: 4096
  • Time elapsed: 11.729611 sec
  • Time elapsed: 28.277497 sec

10
How fast is the machine anyway? How speedy can
we hope to get?
  • 150 mflops sounds really slow compared to
    benchFFT graphs!
  • http://www.fftw.org/speed/CoreDuo-3.0GHz-icc64/
  • Why? 500MHz MIPS processors
  • Capable of performing two instructions per
    cycle, but still just 500MHz
  • (On the other hand, the machine is cool.
  • Very cool: 1W per processor!
  • More like 3W counting interconnect and everything
    else needed to keep a processor happy, but still
    just 16 kW for 5832 procs.)
  • So if you want to run a serial FFT, grab a
    desktop -- even a five-year-old desktop.

11
Let's talk parallel
  • Meanwhile, I've been trying to get a feel for
    what our SC648 has to offer in terms of relative
    speeds of communication and computation. (Mainly
    I wanted a clearer picture of where to look for
    parallel FFT speedup.) A couple numbers I got
    were a little surprising -- please comment if you
    have insight.
  • Point-to-point communication speed for MPI
    Isend/Irecv nears 2GB/sec as advertised for
    message sizes in the MB range. Latency is such
    that a 32KB message achieves half the max.
  • The above speeds seem to be mostly independent of
    the particular pair of processors which are
    trying to talk, except that in a few cases --
    apparently when the processors belong to the same
    6-CPU node -- top speed is only 1GB/sec. This is
    the opposite of what I (perhaps naively) expected
    -- any explanation?
  • Where's the Kautz graph???
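
A stripped-down version of the kind of probe behind those point-to-point numbers is sketched below; the 1 MB message size, the repetition count, and the rank 0/rank 1 pairing are arbitrary choices, not the exact test that produced the figures above.

    /* Rough point-to-point bandwidth probe: ranks 0 and 1 exchange a buffer
     * with MPI_Isend/MPI_Irecv and report MB/s per direction.
     * Build/run (roughly): mpicc p2p.c && mpirun -np 2 ./a.out             */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <mpi.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        const int bytes = 1 << 20;     /* 1 MB messages */
        const int reps  = 100;
        char *sendbuf = malloc(bytes), *recvbuf = malloc(bytes);
        memset(sendbuf, 0, bytes);
        int peer = (rank == 0) ? 1 : 0;

        MPI_Barrier(MPI_COMM_WORLD);
        if (rank < 2) {
            double t0 = MPI_Wtime();
            for (int r = 0; r < reps; ++r) {
                MPI_Request req[2];
                MPI_Irecv(recvbuf, bytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[0]);
                MPI_Isend(sendbuf, bytes, MPI_BYTE, peer, 0, MPI_COMM_WORLD, &req[1]);
                MPI_Waitall(2, req, MPI_STATUSES_IGNORE);
            }
            double t = MPI_Wtime() - t0;
            if (rank == 0)
                printf("%.0f MB/s per direction\n", reps * (double)bytes / t / 1e6);
        }

        free(sendbuf);
        free(recvbuf);
        MPI_Finalize();
        return 0;
    }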

12
Let's talk parallel
  • Upon introducing network traffic by asking 128
    processors to simultaneously perform 64 pairwise
    communications, speed drops to around 0.5GB/sec
    with substantial fluctuations.
  • For comparison, the speed of memcpy on a single
    CPU is as follows
  • Size (bytes) Time (sec) Rate (MB/sec)
  • 256 0.000000 1152.369380
  • 512 0.000000 1367.888343
  • 1024 0.000001 1509.023977
  • 2048 0.000001 1591.101956
  • 4096 0.000003 1421.082913
  • 8192 0.000006 1436.982715
  • 16384 0.000010 1638.373220
  • 32768 0.000097 337.867886
  • 65536 0.000194 337.250067
  • 131072 0.000413 317.622002
  • 262144 0.001054 248.758781
  • 524288 0.002281 229.862557
  • 1048576 0.004371 239.917118
  • 2097152 0.008626 243.109613
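
A sketch of this kind of measurement (block sizes, repetition counts, and the gettimeofday timing are illustrative, not necessarily the loop that produced the table above):

    /* memcpy bandwidth at various block sizes, in the spirit of the table
     * above: copy each block repeatedly and report MB/s.
     * Build (roughly): gcc -O3 memcpy_bw.c                                */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/time.h>

    static double now(void)
    {
        struct timeval tv;
        gettimeofday(&tv, NULL);
        return tv.tv_sec + 1e-6 * tv.tv_usec;
    }

    int main(void)
    {
        size_t max = 1 << 21;                    /* 2 MB, as in the table */
        char *src = malloc(max), *dst = malloc(max);
        memset(src, 1, max);

        printf("%-14s %-12s %s\n", "Size (bytes)", "Time (sec)", "Rate (MB/sec)");
        for (size_t size = 256; size <= max; size *= 2) {
            int reps = (int)((16u << 20) / size);   /* ~16 MB of traffic per size */
            double t0 = now();
            for (int r = 0; r < reps; ++r)
                memcpy(dst, src, size);
            double t = (now() - t0) / reps;         /* time per copy */
            printf("%-14zu %-12.6f %.2f\n", size, t, size / t / 1e6);
        }

        free(src);
        free(dst);
        return 0;
    }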

13
A trip to Maynard
  • Sample output from FFTW 3.1.2
  • 3 sec planning, 30 sec for the transform, 3
    mflops?!
  • But I got 100 mflops on my unoptimized version!
  • Try rebuilding with a couple different config
    options
  • Aha! Blame --enable-long-double
  • Still off by a factor of 2 from FFTW 2.1.5; the
    missing cycle.h probably accounts for some of
    that, but probably not all of it

14
Back to algorithms What plans does FFTW actually
choose?
  • I found out that turning on the "verbose" option
    (fftw_test -v for 2.x and bench -v5, say, for
    3.x) outputs the plan; a sketch of asking FFTW 3
    for its plan from code follows this slide. Based
    on a handful of test runs on power-of-2
    transforms, FFTW 2.1.5 seems to prefer
  • radix 16 and radix 8, for transform lengths up to
    about 2^15
  • primarily radix 4 but finishing off with three
    radix 16 steps, for larger transforms.
  • There are also occasional radix 2 and 32 steps,
    but these seem to be rare.
  • In practice, FFTW 2 for 1D power-of-2 transforms
    seems to just try the various possible radices up
    to 64 -- nothing fancier than that. FFTW 3 tries
    a lot more things, including a radix-√n first
    step.
  • Last week we weren't totally satisfied with the
    258 "mflops" we were getting from Andy's
    optimized FFTW 2.1.5; however, the transform we
    were running was fairly large, and in fact Andy's
    FFTW 2 gets 500 "mflops" on peak sizes (around
    1024). Basically, there's a falloff in
    performance as soon as the L1 cache is exceeded
    and another one once we start reaching into main
    memory. All of the graphs on the benchFFT page
    show these falloffs; indeed, getting a peak
    "mflops" number slightly better than the clock
    rate is about the best anyone can do (on non-SIMD
    processors) -- congrats.
  • On the other hand, we still don't really have a
    working FFTW 3; without a cycle counter, the code
    currently does no timing whatsoever and instead
    chooses plans based on a heuristic. Hopefully we
    can get that straightened out soon, because at
    the moment its planner information is pretty
    useless to us!
  • Once that's fixed we'll see how its speed
    compares to FFTW 2. Either way, I'd say FFTW 2
    is pretty well-optimized (although the library
    you're currently distributing with your machines
    is not!) and our focus should really be on FFTW
    3. In particular, the algorithms that FFTW 2
    uses for parallel transforms are just plain slow
    (I can elaborate on that if you wish). On the
    bright side, 3.2alpha already implements several
    improvements -- the first three or four
    possibilities that came to mind for me are all
    already there -- and despite being in alpha, it's
    in good enough shape to start playing with, once
    we have a cycle counter.
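
Besides the -v flags on the test drivers, FFTW 3 can report its chosen plan from code. A rough sketch follows; fftw_print_plan and fftw_flops are standard FFTW 3 calls, though their exact behavior in the 3.2alpha snapshot may differ, and the 2^15 size is just an example:

    /* Ask FFTW 3 which plan it chose for a 1D transform and how many
     * floating-point operations that plan needs.
     * Build (roughly): gcc showplan.c -lfftw3 -lm                     */
    #include <stdio.h>
    #include <fftw3.h>

    int main(void)
    {
        int n = 1 << 15;
        fftw_complex *a = fftw_malloc(sizeof(fftw_complex) * n);

        fftw_plan p = fftw_plan_dft_1d(n, a, a, FFTW_FORWARD, FFTW_PATIENT);

        fftw_print_plan(p);               /* dump the chosen decomposition  */
        printf("\n");

        double add, mul, fma;
        fftw_flops(p, &add, &mul, &fma);  /* estimated flop counts for plan */
        printf("adds %.0f, muls %.0f, fmas %.0f\n", add, mul, fma);

        fftw_destroy_plan(p);
        fftw_free(a);
        return 0;
    }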

15
What does parallel FFTW actually do?
  • FFTW 2: Transpose, 1D FFTs, transpose, 1D FFTs,
    transpose -- straight from the book
  • Choose radix as close to √n as possible
  • Example: 1D complex double forward transform of
    size 2^24 (with workspace)
  • "Parallel" algorithm on 1 proc: 25% slowdown
  • 2 procs: 1.5x speedup
  • 64 procs: 29.900830x
  • 128 procs: 67.845401x
  • 256 procs: 119.773536x (12 GFlops)
  • 512 procs: 143.338615x (15 GFlops)
  • FFTW 3.2alpha: Same general "six-step" algorithm,
    but with twists (a serial sketch of the six-step
    decomposition follows this slide)
  • Choose radix equal to np (the number of
    processors) if possible
  • Skip local transposes -- let FFTW decide whether
    or not to do them
  • Consider doing transposes in stages to avoid
    all-to-all communication
  • Example (sans cycle counter!):
  • Problem: ocf16777216, setup: 18.27 s, time:
    308.69 ms, "mflops": 6522
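
To make the "six-step" structure concrete, here is a serial sketch of the radix-√n decomposition using FFTW's batched 1D plans and an explicit twiddle multiplication, checked against a direct full-length plan. The parallel codes distribute the two batched FFT passes across processors and turn the data reshuffles into all-to-all transposes; the sizes, variable names, and the verification step here are mine.

    /* Serial illustration of the "six-step" / radix-sqrt(n) idea: a length
     * n = n1*n2 FFT done as n1 FFTs of length n2, a twiddle multiplication,
     * and n2 FFTs of length n1. Build (roughly): gcc sixstep.c -lfftw3 -lm */
    #include <stdio.h>
    #include <math.h>
    #include <fftw3.h>

    int main(void)
    {
        int n1 = 32, n2 = 64, n = n1 * n2;
        fftw_complex *x   = fftw_malloc(sizeof(fftw_complex) * n);
        fftw_complex *ref = fftw_malloc(sizeof(fftw_complex) * n);

        /* Pass 1: for each j1, an FFT of length n2 over x[j1 + n1*j2]. */
        fftw_plan inner = fftw_plan_many_dft(1, &n2, n1,
                                             x, NULL, n1, 1, x, NULL, n1, 1,
                                             FFTW_FORWARD, FFTW_ESTIMATE);
        /* Pass 2: for each k2, an FFT of length n1 over x[j1 + n1*k2]. */
        fftw_plan outer = fftw_plan_many_dft(1, &n1, n2,
                                             x, NULL, 1, n1, x, NULL, 1, n1,
                                             FFTW_FORWARD, FFTW_ESTIMATE);
        fftw_plan direct = fftw_plan_dft_1d(n, ref, ref,
                                            FFTW_FORWARD, FFTW_ESTIMATE);

        for (int j = 0; j < n; ++j) {                 /* arbitrary test data */
            x[j][0] = ref[j][0] = cos(0.1 * j);
            x[j][1] = ref[j][1] = sin(0.3 * j);
        }

        fftw_execute(inner);                          /* batched pass 1 */
        for (int k2 = 0; k2 < n2; ++k2)               /* twiddle factors */
            for (int j1 = 0; j1 < n1; ++j1) {
                double ang = -2.0 * M_PI * (double)(j1 * k2) / n;
                double wr = cos(ang), wi = sin(ang);
                double xr = x[j1 + n1*k2][0], xi = x[j1 + n1*k2][1];
                x[j1 + n1*k2][0] = xr * wr - xi * wi;
                x[j1 + n1*k2][1] = xr * wi + xi * wr;
            }
        fftw_execute(outer);                          /* batched pass 2 */

        /* Result is "transposed": X[k2 + n2*k1] lives at x[k1 + n1*k2]; in
         * the parallel code that final transpose is the last all-to-all.  */
        fftw_execute(direct);
        double err = 0.0;
        for (int k1 = 0; k1 < n1; ++k1)
            for (int k2 = 0; k2 < n2; ++k2) {
                double dr = x[k1 + n1*k2][0] - ref[k2 + n2*k1][0];
                double di = x[k1 + n1*k2][1] - ref[k2 + n2*k1][1];
                if (fabs(dr) + fabs(di) > err) err = fabs(dr) + fabs(di);
            }
        printf("max abs error vs direct FFT: %g\n", err);

        fftw_destroy_plan(inner);
        fftw_destroy_plan(outer);
        fftw_destroy_plan(direct);
        fftw_free(x);
        fftw_free(ref);
        return 0;
    }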

16
Ideas for improvement
  • Serial
  • Micro level
  • Low-level vectorization for cache line efficiency
    and SIMD
  • Macro level
  • Investigate large-radix steps (and include them
    in tuning?)
  • Stop wasting time waiting for memory
  • Multithreading -- even on one processor -- to
    compute while waiting?
  • Prefetching? (see the sketch after this list)
  • Parallel
  • What's the best initial radix?
  • What's the best number of communication steps to
    use in a transpose?
  • Can we do FFTW-esque self-tuning?
  • Latency-hiding by reordering work
  • Do half the computation in row-col order, other
    half in col-row order so that processors always
    have work to do
  • Exploit the fact that the DMA Engine/Fabric
    Switch is a "processor" that takes care of
    sending and receiving messages
  • Load-balancing by having pairs of processors work
    together at each step?
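
On the prefetching question above, GCC does expose an explicit prefetch hint; whether it buys anything on these cores would have to be measured. The function below only illustrates the mechanics, with an arbitrary look-ahead distance of 8 elements:

    /* Software prefetch in a strided pass over complex data (the access
     * pattern a column FFT sees). __builtin_prefetch is a GCC hint; the
     * look-ahead distance would need tuning per machine.                 */
    void scale_column(double (*a)[2], int n_rows, int stride, double s)
    {
        for (int i = 0; i < n_rows; ++i) {
            if (i + 8 < n_rows)
                __builtin_prefetch(&a[(i + 8) * stride], 1, 0); /* write, low reuse */
            a[i * stride][0] *= s;
            a[i * stride][1] *= s;
        }
    }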

17
How we know there's more to do
  • Serial
  • The edge-of-cache cliff:
    http://www.fftw.org/speed/CoreDuo-3.0GHz-icc64/
  • Parallel
  • HPCC has adopted FFTE, which claims it's faster
    on large transforms
  • Whether or not that's true, everything is slow in
    parallel
  • SiCortex talk: 174 GFlops for HPCC FFTE(?) on
    5832 500MHz procs
  • University of Tsukuba Center for Computational
    Sciences poster (2006): 13 GFlops for FFTE on 32
    3.0GHz procs
  • http://www.rccp.tsukuba.ac.jp/SC/sc2006/Posters/16_hpcs-ol.pdf
  • Intel Math Kernel Library: 25 GFlops on 64
    dual-core 3.0GHz procs
  • Intel claims they're winning on all fronts
  • http://www.intel.com/cd/software/products/asmo-na/eng/266852.htm

18
Anyone want to help write a faster parallel FFT?
  • Let me know!
  • Joke of the day: You know you've been spending
    too much time thinking about parallel computing
    when...
  • You cite "multithreading between different class
    projects" as the reason for slow progress on each
    (and lament the fact that you have but one
    processor to apply to an embarrassingly parallel
    problem)
  • You start thinking about time wasted switching
    between tasks in terms of memory reads and
    writes, and it dawns on you that a
    cache-oblivious algorithm could help
  • You realize that LRU is a pretty good model for
    cramming (and wish you could achieve nanosecond
    latencies)