Fast Power Spectrum Estimation - PowerPoint PPT Presentation

1
Fast Power Spectrum Estimation
  • Brett Hayes
  • and
  • Fatih Boyaci

2
Constraining Cosmology
  • In the beginning, we start with an initial
    Gaussian distribution of matter.
  • Overdense regions gravitationally attract more
    matter and perturbations grow.
  • How these structures grow gives us information
    about the underlying cosmological parameters that
    govern this growth.

3
Calculating an Angular Power Spectrum
  • One of the favored methods of measuring the
    distribution of matter is the power spectrum.
  • A famous power spectrum measurement from the
    Cosmic Microwave Background helped determine the
    geometry of the universe and gives information
    about the normal and dark matter densities.
  • However, for the CMB the observations are all at
    approximately the same distance.

4
Calculation of the Angular Power Spectrum
  • The number of galaxies per area of the sky
    defines the covariance matrix.
  • The covariance matrix is also related to the
    signal matrix which is defined in terms of the
    power spectrum.
  • The idea is to find the right values of Cl that
    produce the observed overdensities.
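In conventional notation, the relation the bullets describe can be written out as follows (an assumed reconstruction — the slide itself shows no equation, and the smoothing factor in the code on slide 7 indicates the authors weight the sum slightly differently):

```latex
C_{ij} \equiv \langle \delta_i \delta_j \rangle = S_{ij} + N_{ij},
\qquad
S_{ij} = \sum_{\ell} \frac{2\ell+1}{4\pi}\, C_\ell\, P_\ell(\cos\theta_{ij})
```

where \(\delta_i\) is the galaxy overdensity in pixel \(i\), \(N_{ij}\) the noise covariance, \(P_\ell\) the Legendre polynomials, and \(\theta_{ij}\) the angular separation between pixels \(i\) and \(j\).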

5
More Calculation
  • To accomplish this mathematically, we construct a likelihood function and maximize it.
  • The update step involves F_ll', the Fisher information matrix.
  • This gives us an updated value of C_l, which we then use to construct a new signal matrix, and we repeat the process to iteratively converge on a solution.
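The slide's own equation was an image and did not survive transcription; the iteration it describes has the standard quadratic-estimator form (a reconstruction in conventional notation, not necessarily the authors' exact expression):

```latex
\delta C_\ell = \frac{1}{2} \sum_{\ell'} F^{-1}_{\ell\ell'}\,
\operatorname{tr}\!\left[ \left( \mathbf{x}\mathbf{x}^{T} - \mathbf{C} \right)
\mathbf{C}^{-1} \frac{\partial \mathbf{C}}{\partial C_{\ell'}} \mathbf{C}^{-1} \right],
\qquad
F_{\ell\ell'} = \frac{1}{2}\operatorname{tr}\!\left[
\mathbf{C}^{-1} \frac{\partial \mathbf{C}}{\partial C_\ell}\,
\mathbf{C}^{-1} \frac{\partial \mathbf{C}}{\partial C_{\ell'}} \right]
```

Here x is the vector of observed overdensities and C the covariance matrix built from the current C_l estimate.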

6
Overview of Code
7
Signal Matrix
    for (l = 0; l < l_max; l++)
      for (j = 0; j < height; j++)
        for (i = 0; i < width; i++) {
          signal[i + j*width + l*width*height] = 0.0;
          for (k = C_start[l]; k < C_stop[l]; k++)
            signal[i + j*width + l*width*height] +=
              (2.0*k + 1.0)/(2.0*k*(k + 1)) * exp(-k*k*s*s)
              * Legendre(cos_angle[i + j*width], k);
        }
  • For small data sets, the great majority of computation time is spent calculating this matrix. However, since the computational complexity of matrix multiplication and inversion scales as O(N^3), this O(N^2) matrix construction becomes a comparatively smaller share of the computation time as the data set grows.
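The Legendre() helper called above is not shown in the transcript. A minimal stand-in, assuming it evaluates the Legendre polynomial P_k(x) by the standard Bonnet recurrence:

```c
/* Hypothetical stand-in for the Legendre() helper the slides call.
   Evaluates the Legendre polynomial P_k(x) via the Bonnet recurrence:
   (n+1) P_{n+1}(x) = (2n+1) x P_n(x) - n P_{n-1}(x). */
float Legendre(float x, int k)
{
    float pm1 = 1.0f;   /* P_0(x) */
    float p   = x;      /* P_1(x) */
    if (k == 0)
        return pm1;
    for (int n = 1; n < k; n++) {
        float pnext = ((2.0f * n + 1.0f) * x * p - n * pm1) / (n + 1.0f);
        pm1 = p;
        p   = pnext;
    }
    return p;
}
```

With per-k upper bounds C_stop[l] of a few hundred, the O(k) recurrence dominates each inner-loop iteration, which is consistent with the signal matrix dominating the runtime for small data sets.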

8
G80 Signal
    __global__ void signal_matrix(float *signal, int *C_start,
                                  int *C_stop, float *cos_angle, int bins)
    {
      int k;
      float accum;
      int index = (blockIdx.y*MAX_GRID_SIZE + blockIdx.x)*BLOCK_SIZE
                  + threadIdx.x;
      int i = index % bins;
      int j = (index / bins) % bins;
      int l = index / (bins*bins);   /* index runs over l_max*bins*bins */
      __shared__ int Cstart[l_max];
      __shared__ int Cstop[l_max];
      Cstart[l] = C_start[l];
      Cstop[l] = C_stop[l];
      accum = 0.0;
      for (k = Cstart[l]; k < Cstop[l]; k++)
        accum += ((2.0*k + 1.0)/(2.0*k*(k + 1))) * exp(-k*k*s*s)
                 * Legendre(cos_angle[i + j*bins], k);
      /* ... (the final store of accum into signal[] is cut off on the slide) */
    }
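The index arithmetic at the top of the kernel maps each thread's flat index to an (i, j, l) triple. A host-side sketch of the same decomposition (names taken from the slide):

```c
/* Sketch of the kernel's flat-index decomposition: thread `index`
   handles matrix entry (i, j) of the l-th slice of the signal array,
   with bins pixels per side. */
void decompose(int index, int bins, int *i, int *j, int *l)
{
    *i = index % bins;            /* column within one row          */
    *j = (index / bins) % bins;   /* row within one l-slice         */
    *l = index / (bins * bins);   /* which l-slice of the 3D array  */
}
```

Round-tripping index = l*bins*bins + j*bins + i recovers the triple, which is how one thread per signal-matrix entry is laid out across the grid.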

9
Matrix Multiplication and Inversion
  • Multiplication
  • Matrix multiplication was implemented using
    cublasSgemm() from the CUDA BLAS library. The
    serial code used sgemm().
  • Inversion
  • Our best optimized serial code used CLAPACK
    functions for matrix inversion.
  • We also wrote a serial matrix inversion code
    using Gaussian elimination that was used for the
    CUDA code and the serial version of the
    algorithm.
  • Our parallel code for matrix inversion has not yet been implemented.
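The serial Gaussian-elimination inverter mentioned above is not shown in the transcript. A minimal sketch of such an inverter (Gauss-Jordan elimination with partial pivoting; an illustration, not the authors' code):

```c
#include <math.h>
#include <stdlib.h>

/* Invert an n*n row-major matrix a into inv by Gauss-Jordan elimination
   with partial pivoting. Returns 0 on success, -1 if (numerically)
   singular. A sketch in the spirit of the serial inverter the slides
   describe. */
int invert(const float *a, float *inv, int n)
{
    float *m = malloc(sizeof(float) * n * n);
    if (!m) return -1;
    for (int i = 0; i < n * n; i++) m[i] = a[i];
    /* Start inv as the identity; every row operation is applied to both. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            inv[i*n + j] = (i == j) ? 1.0f : 0.0f;

    for (int col = 0; col < n; col++) {
        /* Partial pivoting: pick the largest remaining entry in col. */
        int piv = col;
        for (int r = col + 1; r < n; r++)
            if (fabsf(m[r*n + col]) > fabsf(m[piv*n + col])) piv = r;
        if (fabsf(m[piv*n + col]) < 1e-12f) { free(m); return -1; }
        if (piv != col)
            for (int j = 0; j < n; j++) {
                float t = m[col*n + j];
                m[col*n + j] = m[piv*n + j];  m[piv*n + j] = t;
                t = inv[col*n + j];
                inv[col*n + j] = inv[piv*n + j];  inv[piv*n + j] = t;
            }
        /* Scale the pivot row, then eliminate col from all other rows. */
        float d = m[col*n + col];
        for (int j = 0; j < n; j++) { m[col*n + j] /= d; inv[col*n + j] /= d; }
        for (int r = 0; r < n; r++) {
            if (r == col) continue;
            float f = m[r*n + col];
            for (int j = 0; j < n; j++) {
                m[r*n + j]   -= f * m[col*n + j];
                inv[r*n + j] -= f * inv[col*n + j];
            }
        }
    }
    free(m);
    return 0;
}
```

Like the LU-based LAPACK routines, this is O(N^3), so it competes directly with the multiplications for runtime at large N.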

10
Limitations
  • Memory
  • This code allows us to run larger data sets, which dramatically increases the calculation time.
  • However, contrary to our initial speculation, we
    ran out of host memory before filling the G80
    when running large data sets.
  • Compiling
  • We also had difficulty linking other libraries
    (in our case, CLAPACK) with the NVCC compiler.
  • Timing
  • Due to the above, we used gcc to compile the serial versions and the clock() function for timing, which introduced some precision loss since it is only accurate to the nearest second. This prevented us from measuring the speedup between sgemm and cublasSgemm, as well as that of our matrix inversion code.
  • However, as the smallest data set took the
    application several minutes to run on the CPU,
    this was accurate enough for our purposes.

11
Results
  • Our largest absolute error was 0.000556, which is 0.6% of the value.
  • Our largest relative error was 0.8%.
  • The smallest scientific error for this data set is 0.002, with typical errors of about 0.01.
  • All differences between the versions were smaller than 5% of the scientific error.

12
Speedup
  • With a block size of 128, the largest data set
    that we could time reached a speedup of 337x.
  • Note: we were able to run the CUDA version for a larger data set, but timing the serial version on the TIP machines would have taken most of a day. The CUDA version ran in 2.4 minutes.
  • The kernel in this case only includes the signal
    matrix calculation, not matrix inversion or
    multiplication.
  • This is highly suggestive that our serial version
    could be better optimized.

13
Going Further
  • Memory
  • The great limitation has switched from calculation time to memory requirements.
  • Our largest calculation used 800MB of host memory; the next two resolutions require 10GB and 160GB.
  • More GPUs
  • The signal matrix calculation, which is our main
    source of speedup, is almost pure calculation and
    each loop iteration is completely independent.
    We could easily expand this across multiple GPUs
    with very little alteration.
  • Scaling
  • At very large resolutions, matrices won't fit
    into global memory for multiplications and
    inversions. However, using block matrix
    multiplication and inversion would allow the code
    to circumvent this limitation.
  • Inversion Algorithms
  • Look into using LU or QR decomposition for matrix
    inversion, and implement the parallelized version
    of matrix inversion.
  • Mix/match
  • As this code has multiple parts to the kernel,
    perhaps running some function on the CPU and some
    on the GPU would produce a faster overall code.
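The blocked scheme suggested under "Scaling" can be sketched on the host as follows (the tile size bs and function name are illustrative; the point is that only one tile of each operand needs to be resident at a time):

```c
#include <string.h>

static int imin(int a, int b) { return a < b ? a : b; }

/* Blocked (tiled) matrix multiplication C = A*B for n*n row-major
   matrices, the scheme slide 13 suggests for matrices too large for
   GPU global memory: each (ii,kk)x(kk,jj) tile product touches only
   bs*bs-sized pieces of A, B, and C at a time. */
void blocked_matmul(const float *A, const float *B, float *C, int n, int bs)
{
    memset(C, 0, sizeof(float) * n * n);
    for (int ii = 0; ii < n; ii += bs)
        for (int kk = 0; kk < n; kk += bs)
            for (int jj = 0; jj < n; jj += bs)
                /* Accumulate one tile product into the (ii,jj) tile of C;
                   imin() handles ragged edges when bs does not divide n. */
                for (int i = ii; i < imin(ii + bs, n); i++)
                    for (int k = kk; k < imin(kk + bs, n); k++) {
                        float a = A[i*n + k];
                        for (int j = jj; j < imin(jj + bs, n); j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
}
```

Block inversion works analogously (e.g. via a blocked LU factorization), so neither operation needs the full matrix in device memory at once.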

14
New Science
  • The primary limitation for many power spectrum
    calculations is computing time.
  • Producing better results can be accomplished by either increasing the area surveyed or increasing the resolution.
  • Doubling either the area or the resolution
    results in 16 times the memory requirements and
    calculation time.
  • A speedup of 300x would allow a quadrupling of the area or resolution (or a combination of the two), with some time to spare.
  • There is still a ways to go, however; it is a very big sky.

15
Summary
  • The lack of memory accesses and abundance of independent calculations in the signal matrix computation allowed for a 357x kernel speedup; however, that may be an overestimate due to the lack of SSE2 instructions in the CPU code.
  • For smaller data sizes, the vast majority of the time is spent in the signal matrix calculation, allowing for a 337x overall speedup. As the data size increases, the runtime slowly becomes more dominated by matrix operations.
  • The results were more accurate than expected, with the largest deviation from the double-precision results being less than 5% of the scientific error.
  • Questions?