Fast Power Spectrum Estimation - PowerPoint PPT Presentation

1
Fast Power Spectrum Estimation
  • Brett Hayes
  • and
  • Fatih Boyaci

2
Constraining Cosmology
  • In the beginning, we start with an initial
    Gaussian distribution of matter.
  • Overdense regions gravitationally attract more
    matter and perturbations grow.
  • How these structures grow gives us information
    about the underlying cosmological parameters that
    govern this growth.

3
Calculating an Angular Power Spectrum
  • One of the favored methods of measuring the
    distribution of matter is the power spectrum.
  • A famous power spectrum measurement from the
    Cosmic Microwave Background helped determine the
    geometry of the universe and gives information
    about the normal and dark matter densities.
  • However, for the CMB the observations are all at
    approximately the same distance.

4
Calculation of the Angular Power Spectrum
  • The number of galaxies per area of the sky
    defines the covariance matrix.
  • The covariance matrix is also related to the
    signal matrix which is defined in terms of the
    power spectrum.
  • The idea is to find the right values of Cl that
    produce the observed overdensities.
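In conventional notation, the relation the bullets describe can be written out as follows (an assumed reconstruction — the slide itself shows no equation, and the smoothing factor in the code on slide 7 indicates the authors weight the sum slightly differently):

```latex
C_{ij} \equiv \langle \delta_i \delta_j \rangle = S_{ij} + N_{ij},
\qquad
S_{ij} = \sum_{\ell} \frac{2\ell+1}{4\pi}\, C_\ell\, P_\ell(\cos\theta_{ij})
```

where \(\delta_i\) is the galaxy overdensity in pixel \(i\), \(N_{ij}\) the noise covariance, \(P_\ell\) the Legendre polynomials, and \(\theta_{ij}\) the angular separation between pixels \(i\) and \(j\).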

5
More Calculation
  • To accomplish this mathematically, we construct a likelihood function and maximize it.
  • The update step involves F_ll', the Fisher information matrix.
  • This gives us an updated value of C_l, which we then use to construct a new signal matrix, and we repeat the process to iteratively converge on a solution.
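The slide's own equation was an image and did not survive transcription; the iteration it describes has the standard quadratic-estimator form (a reconstruction in conventional notation, not necessarily the authors' exact expression):

```latex
\delta C_\ell = \frac{1}{2} \sum_{\ell'} F^{-1}_{\ell\ell'}\,
\operatorname{tr}\!\left[ \left( \mathbf{x}\mathbf{x}^{T} - \mathbf{C} \right)
\mathbf{C}^{-1} \frac{\partial \mathbf{C}}{\partial C_{\ell'}} \mathbf{C}^{-1} \right],
\qquad
F_{\ell\ell'} = \frac{1}{2}\operatorname{tr}\!\left[
\mathbf{C}^{-1} \frac{\partial \mathbf{C}}{\partial C_\ell}\,
\mathbf{C}^{-1} \frac{\partial \mathbf{C}}{\partial C_{\ell'}} \right]
```

Here x is the vector of observed overdensities and C the covariance matrix built from the current C_l estimate.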

6
Overview of Code
7
Signal Matrix
    for (l = 0; l < l_max; l++)
      for (j = 0; j < height; j++)
        for (i = 0; i < width; i++) {
          signal[i + j*width + l*width*height] = 0.0;
          for (k = C_start[l]; k < C_stop[l]; k++)
            signal[i + j*width + l*width*height] +=
              (2.0*k + 1.0)/(2.0*k*(k + 1)) * exp(-k*k*s*s)
              * Legendre(cos_angle[i + j*width], k);
        }
  • For small data sets, the great majority of computation time is spent calculating this matrix. However, since the computational complexity of matrix multiplication and inversion scales as O(N^3), this O(N^2) matrix construction becomes a comparatively smaller share of the computation time as the data set grows.
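The Legendre() helper called above is not shown in the transcript. A minimal stand-in, assuming it evaluates the Legendre polynomial P_k(x) by the standard Bonnet recurrence:

```c
/* Hypothetical stand-in for the Legendre() helper the slides call.
   Evaluates the Legendre polynomial P_k(x) via the Bonnet recurrence:
   (n+1) P_{n+1}(x) = (2n+1) x P_n(x) - n P_{n-1}(x). */
float Legendre(float x, int k)
{
    float pm1 = 1.0f;   /* P_0(x) */
    float p   = x;      /* P_1(x) */
    if (k == 0)
        return pm1;
    for (int n = 1; n < k; n++) {
        float pnext = ((2.0f * n + 1.0f) * x * p - n * pm1) / (n + 1.0f);
        pm1 = p;
        p   = pnext;
    }
    return p;
}
```

With per-k upper bounds C_stop[l] of a few hundred, the O(k) recurrence dominates each inner-loop iteration, which is consistent with the signal matrix dominating the runtime for small data sets.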

8
G80 Signal
    __global__ void signal_matrix(float *signal, int *C_start,
                                  int *C_stop, float *cos_angle, int bins)
    {
      int k;
      float accum;
      int index = (blockIdx.y*MAX_GRID_SIZE + blockIdx.x)*BLOCK_SIZE
                  + threadIdx.x;
      int i = index % bins;
      int j = (index / bins) % bins;
      int l = index / (bins*bins);   /* index runs over l_max*bins*bins */
      __shared__ int Cstart[l_max];
      __shared__ int Cstop[l_max];
      Cstart[l] = C_start[l];
      Cstop[l] = C_stop[l];
      accum = 0.0;
      for (k = Cstart[l]; k < Cstop[l]; k++)
        accum += ((2.0*k + 1.0)/(2.0*k*(k + 1))) * exp(-k*k*s*s)
                 * Legendre(cos_angle[i + j*bins], k);
      /* ... (the final store of accum into signal[] is cut off on the slide) */
    }
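The index arithmetic at the top of the kernel maps each thread's flat index to an (i, j, l) triple. A host-side sketch of the same decomposition (names taken from the slide):

```c
/* Sketch of the kernel's flat-index decomposition: thread `index`
   handles matrix entry (i, j) of the l-th slice of the signal array,
   with bins pixels per side. */
void decompose(int index, int bins, int *i, int *j, int *l)
{
    *i = index % bins;            /* column within one row          */
    *j = (index / bins) % bins;   /* row within one l-slice         */
    *l = index / (bins * bins);   /* which l-slice of the 3D array  */
}
```

Round-tripping index = l*bins*bins + j*bins + i recovers the triple, which is how one thread per signal-matrix entry is laid out across the grid.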

9
Matrix Multiplication and Inversion
  • Multiplication
  • Matrix multiplication was implemented using
    cublasSgemm() from the CUDA BLAS library. The
    serial code used sgemm().
  • Inversion
  • Our best optimized serial code used CLAPACK
    functions for matrix inversion.
  • We also wrote a serial matrix inversion code
    using Gaussian elimination that was used for the
    CUDA code and the serial version of the
    algorithm.
  • Our parallel code for matrix inversion has not yet been implemented.
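The serial Gaussian-elimination inverter mentioned above is not shown in the transcript. A minimal sketch of such an inverter (Gauss-Jordan elimination with partial pivoting; an illustration, not the authors' code):

```c
#include <math.h>
#include <stdlib.h>

/* Invert an n*n row-major matrix a into inv by Gauss-Jordan elimination
   with partial pivoting. Returns 0 on success, -1 if (numerically)
   singular. A sketch in the spirit of the serial inverter the slides
   describe. */
int invert(const float *a, float *inv, int n)
{
    float *m = malloc(sizeof(float) * n * n);
    if (!m) return -1;
    for (int i = 0; i < n * n; i++) m[i] = a[i];
    /* Start inv as the identity; every row operation is applied to both. */
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++)
            inv[i*n + j] = (i == j) ? 1.0f : 0.0f;

    for (int col = 0; col < n; col++) {
        /* Partial pivoting: pick the largest remaining entry in col. */
        int piv = col;
        for (int r = col + 1; r < n; r++)
            if (fabsf(m[r*n + col]) > fabsf(m[piv*n + col])) piv = r;
        if (fabsf(m[piv*n + col]) < 1e-12f) { free(m); return -1; }
        if (piv != col)
            for (int j = 0; j < n; j++) {
                float t = m[col*n + j];
                m[col*n + j] = m[piv*n + j];  m[piv*n + j] = t;
                t = inv[col*n + j];
                inv[col*n + j] = inv[piv*n + j];  inv[piv*n + j] = t;
            }
        /* Scale the pivot row, then eliminate col from all other rows. */
        float d = m[col*n + col];
        for (int j = 0; j < n; j++) { m[col*n + j] /= d; inv[col*n + j] /= d; }
        for (int r = 0; r < n; r++) {
            if (r == col) continue;
            float f = m[r*n + col];
            for (int j = 0; j < n; j++) {
                m[r*n + j]   -= f * m[col*n + j];
                inv[r*n + j] -= f * inv[col*n + j];
            }
        }
    }
    free(m);
    return 0;
}
```

Like the LU-based LAPACK routines, this is O(N^3), so it competes directly with the multiplications for runtime at large N.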

10
Limitations
  • Memory
  • This code allows us to run larger data sets, which dramatically increases the calculation time.
  • However, contrary to our initial speculation, we
    ran out of host memory before filling the G80
    when running large data sets.
  • Compiling
  • We also had difficulty linking other libraries
    (in our case, CLAPACK) with the NVCC compiler.
  • Timing
  • Due to the above, we used gcc to compile the serial versions and the clock() function for timing, which introduced some precision loss since it is only accurate to the nearest second. This prevented us from measuring the speedup between sgemm and cublasSgemm, as well as that of our matrix inversion code.
  • However, as the smallest data set took the
    application several minutes to run on the CPU,
    this was accurate enough for our purposes.

11
Results
  • Our largest absolute error was 0.000556, which is 0.6% of the value.
  • Our largest relative error was 0.8%.
  • The smallest scientific error for this data set is 0.002, with typical errors of about 0.01.
  • All differences between the versions were smaller than 5% of the scientific error.

12
Speedup
  • With a block size of 128, the largest data set
    that we could time reached a speedup of 337x.
  • Note: we were able to run the CUDA version for a larger data set, but timing the serial version on the TIP machines would have taken most of a day. The CUDA version ran in 2.4 minutes.
  • The kernel in this case only includes the signal
    matrix calculation, not matrix inversion or
    multiplication.
  • This is highly suggestive that our serial version
    could be better optimized.

13
Going Further
  • Memory
  • The great limitation has switched from calculation time to memory requirements.
  • Our largest calculation used 800MB of host memory; the next two resolutions require 10GB and 160GB.
  • More GPUs
  • The signal matrix calculation, which is our main
    source of speedup, is almost pure calculation and
    each loop iteration is completely independent.
    We could easily expand this across multiple GPUs
    with very little alteration.
  • Scaling
  • At very large resolutions, matrices won't fit
    into global memory for multiplications and
    inversions. However, using block matrix
    multiplication and inversion would allow the code
    to circumvent this limitation.
  • Inversion Algorithms
  • Look into using LU or QR decomposition for matrix
    inversion, and implement the parallelized version
    of matrix inversion.
  • Mix/match
  • As this code has multiple parts to the kernel,
    perhaps running some function on the CPU and some
    on the GPU would produce a faster overall code.
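The blocked scheme suggested under "Scaling" can be sketched on the host as follows (the tile size bs and function name are illustrative; the point is that only one tile of each operand needs to be resident at a time):

```c
#include <string.h>

static int imin(int a, int b) { return a < b ? a : b; }

/* Blocked (tiled) matrix multiplication C = A*B for n*n row-major
   matrices, the scheme slide 13 suggests for matrices too large for
   GPU global memory: each (ii,kk)x(kk,jj) tile product touches only
   bs*bs-sized pieces of A, B, and C at a time. */
void blocked_matmul(const float *A, const float *B, float *C, int n, int bs)
{
    memset(C, 0, sizeof(float) * n * n);
    for (int ii = 0; ii < n; ii += bs)
        for (int kk = 0; kk < n; kk += bs)
            for (int jj = 0; jj < n; jj += bs)
                /* Accumulate one tile product into the (ii,jj) tile of C;
                   imin() handles ragged edges when bs does not divide n. */
                for (int i = ii; i < imin(ii + bs, n); i++)
                    for (int k = kk; k < imin(kk + bs, n); k++) {
                        float a = A[i*n + k];
                        for (int j = jj; j < imin(jj + bs, n); j++)
                            C[i*n + j] += a * B[k*n + j];
                    }
}
```

Block inversion works analogously (e.g. via a blocked LU factorization), so neither operation needs the full matrix in device memory at once.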

14
New Science
  • The primary limitation for many power spectrum
    calculations is computing time.
  • Producing better results can be accomplished by either increasing the area surveyed or increasing the resolution.
  • Doubling either the area or the resolution
    results in 16 times the memory requirements and
    calculation time.
  • A speedup of 300x would allow a quadrupling of the area or resolution (or a combination of the two), with some time to spare.
  • There is still a ways to go, however; it is a very big sky.

15
Summary
  • The lack of memory accesses and abundance of independent calculations in the signal matrix computation allowed for a 357x kernel speedup; however, that may be an overestimate due to the lack of SSE2 instructions in the CPU code.
  • For smaller data sizes, the vast majority of the time is spent in the signal matrix calculation, allowing for a 337x overall speedup. As the data size increases, the runtime slowly becomes more dominated by matrix operations.
  • The results were more accurate than expected, with the largest deviation from the double-precision results being less than 5% of the scientific error.
  • Questions?