Reducing number of operations: The joy of algebraic transformations

Transcript and Presenter's Notes
1
Reducing number of operations: The joy of
algebraic transformations
  • CS498DHP Program Optimization

2
Number of operations and execution time
  • A smaller number of operations does not
    necessarily mean a shorter execution time:
  • because of scheduling in a parallel environment,
  • because of locality,
  • because of communication in a parallel program.
  • Nevertheless, although it has to be applied
    carefully, reducing the number of operations is
    one of the important optimizations.
  • In this presentation, we discuss transformations
    that reduce the number of operations or reduce the
    length of the schedule in an idealized parallel
    environment where communication costs are zero.

3
Scheduling
  • Consider the expression tree (figure omitted in
    the transcript).
  • It can be shortened by applying
  • associativity and commutativity:
    a + h + b*(c + g + d*e*f), or
  • associativity, commutativity, and distributivity:
    a + h + b*c + b*g + b*d*e*f.
  • The last expression has the shortest tree of the
    three. This means that, with enough resources, it
    is the fastest to evaluate although it has the
    most operations; a worked count follows below.
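Assuming the operators reconstructed above, a quick count shows the trade-off. The factored form a + h + b*(c + g + d*e*f) uses 7 operations (3 multiplications, 4 additions), but its best balanced tree has height 5. The distributed form a + h + b*c + b*g + b*d*e*f uses 9 operations (5 multiplications, 4 additions), yet it can be evaluated in 4 parallel steps: compute a+h, b*c, b*g, b*d, and e*f in step one, then combine the partial results pairwise.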

4
Locality
  • Consider
  •   do i=1,n
  •     c(i) = a(i)*b(i) + a(i)/b(i)
  •   end do
  •   do i=1,n
  •     x(i) = (a(i)+b(i))*t(i) + a(i)/b(i)
  •   end do
  • versus
  •   do i=1,n
  •     d(i) = a(i)/b(i)
  •     c(i) = a(i)*b(i) + d(i)
  •   end do
  •   do i=1,n
  •     x(i) = (a(i)+b(i))*t(i) + d(i)
  •   end do
  • The second sequence executes fewer operations,
    since a(i)/b(i) is computed only once, but, if n
    is large enough, it also incurs more cache misses.
    (We assume that t is computed between the two
    loops so that they cannot be fused.)

5
Communication in parallel programs
  • Consider
  •   cobegin
  •     do i=1,n
  •       a(i) = ...
  •     end do
  •     send a(1:n)
  •   //
  •     receive a(1:n)
  •   coend
  • versus
  •   cobegin
  •     do i=1,n
  •       a(i) = ...
  •     end do
  •   //
  •     do i=1,n
  •       a(i) = ...
  •     end do
  •   coend
  • The second version executes more operations,
    since a(1:n) is computed twice, but it executes
    faster if the send operation is expensive.

6
Approaches to reducing cost of computation
  • Eliminate (syntactically) redundant computations.
  • Apply algebraic transformations to reduce the
    number of operations.
  • Decompose sequential computations for parallel
    execution.
  • Apply algebraic transformations to reduce the
    height of expression trees and thus reduce
    execution time in a parallel environment.

7
Elimination of redundant computations
  • Many of the transformations were discussed in the
    context of compiler transformations.
  • Common subexpression elimination
  • Loop invariant removal
  • Elimination of redundant counters
  • Loop unrolling (not discussed earlier, though it
    should have been). It eliminates bookkeeping
    operations.
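As an illustration of the first item, here is a minimal C sketch of common subexpression elimination applied by hand (the function and variable names are illustrative):

/* Before: d = (a + b) * c + (a + b);  the sum a + b appears twice.
   After: the common subexpression is computed once and reused. */
double eliminate_cse(double a, double b, double c)
{
    double t = a + b;   /* common subexpression, computed once */
    return t * c + t;
}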

8
  • However, compilers will not eliminate all
    redundant computations. Here is an example where
    user intervention is needed.
  • The following sequence
  •   do i=1,n
  •     s = a(i) + s
  •   end do
  •   do i=1,n-1
  •     t = a(i) + t
  •   end do

9
  • may be replaced by
  •   do i=1,n-1
  •     t = a(i) + t
  •   end do
  •   s = t + a(n)
  • This transformation, which saves n-1 additions, is
    not usually done by compilers. A C sketch follows.
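A minimal C rendering of the same rewrite (names are illustrative; zero-based indexing, and n >= 1 is assumed):

#include <stddef.h>

/* Compute t = a[0]+...+a[n-2] and s = a[0]+...+a[n-1] with n
   additions instead of 2n-1: the shorter sum is computed once
   and reused for the longer one. Assumes n >= 1. */
void two_sums(const double *a, size_t n, double *s, double *t)
{
    double acc = 0.0;
    for (size_t i = 0; i + 1 < n; i++)
        acc += a[i];      /* acc = a[0] + ... + a[n-2] */
    *t = acc;
    *s = acc + a[n - 1];  /* reuse t instead of re-summing */
}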

10
  • Another example, from C, is the loop
  •   for (i = 0; i < n; i++)
  •     for (j = 0; j < n; j++)
  •       a[i][j] = 0;
  • which, if a is n x n, can be transformed into
    the loop below that has fewer bookkeeping
    operations.
  •   b = &a[0][0];
  •   for (i = 0; i < n*n; i++)
  •     *b++ = 0;
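A compilable rendering of this transformation (the element type, the size N, and the names are assumptions):

#define N 512
float a[N][N];

/* Zero a[][] with a single flat loop: one induction variable and
   one loop test per element instead of the nested version's two. */
void zero_flat(void)
{
    float *b = &a[0][0];
    for (int i = 0; i < N * N; i++)
        *b++ = 0.0f;
}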

11
Applying algebraic transformations to reduce the
number of operations
  • For example, the expression a*(b*c) + (b*a)*d + a*e
    can be transformed into (a*b)*(c+d) + a*e by
    commutativity, associativity, and distributivity,
    and then by associativity and distributivity into
    a*(b*(c+d) + e).
  • Notice that associativity has to be applied with
    care. For example, suppose we are operating on
    floating-point values, that x is very much larger
    than y, and that z = -x. Then (y+x)+z may give 0
    as a result, while y+(x+z) gives y as an answer.
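A short C demonstration of this effect (the constants are illustrative):

#include <stdio.h>

int main(void)
{
    /* x much larger than y, and z = -x */
    double x = 1.0e20, y = 1.0, z = -1.0e20;

    printf("%g\n", (y + x) + z);  /* prints 0: y is absorbed into x */
    printf("%g\n", y + (x + z)); /* prints 1: x and z cancel first  */
    return 0;
}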

12
  • The application of algebraic rules can be very
    sophisticated. Consider the computation of x^n. A
    naïve implementation would require n-1
    multiplications.
  • However, if we represent n in binary as
    n = b0 + 2*(b1 + 2*(b2 + ...)) and notice that
    x^n = x^b0 * (x^(b1 + 2*(b2 + ...)))^2, the
    number of multiplications can be reduced to
    O(log n).

13
  • function power(x,n)  (assume n > 0)
  •   if n = 1 then return x
  •   if n mod 2 = 1 then return x*power(x,n-1)
  •   else x = power(x,n/2); return x*x
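The same algorithm as a self-contained C function (the integer types are an assumption):

/* Exponentiation by squaring: O(log n) multiplications
   instead of the naive n-1. Assumes n > 0. */
long power(long x, unsigned long n)
{
    if (n == 1) return x;
    if (n % 2 == 1) return x * power(x, n - 1);
    long h = power(x, n / 2);  /* recurse on the halved exponent */
    return h * h;
}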

14
Horner's rule
  • A polynomial
  • A(x) = a0 + a1x + a2x² + a3x³ + ...
  • may be written as
  • A(x) = a0 + x(a1 + x(a2 + x(a3 + ...))).
  • As a result, a polynomial may be evaluated at a
    point x', that is, A(x') computed, in Θ(n) time
    using Horner's rule: repeated multiplications and
    additions, rather than the naive method of raising
    x to powers, multiplying by the coefficient, and
    accumulating.
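A minimal C sketch of Horner evaluation (the coefficient layout is an assumption: a[i] holds the coefficient of x^i, and n >= 1):

/* Evaluate a[0] + a[1]*x + ... + a[n-1]*x^(n-1) by Horner's rule:
   n-1 multiplications and n-1 additions. */
double horner(const double *a, int n, double x)
{
    double r = a[n - 1];
    for (int i = n - 2; i >= 0; i--)
        r = a[i] + x * r;   /* one multiply-add per coefficient */
    return r;
}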

15
Conventional matrix multiplication
  • Asymptotic complexity: 2n³ operations.
  • Each recursion step (blocked version): 8
    multiplications, 4 additions.
  • (Block formulas omitted in transcript.)
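For reference, the conventional algorithm in C; the triple loop performs n³ multiplications and n³ additions, i.e., 2n³ operations (row-major square matrices are an assumption):

/* C = A * B for n x n row-major matrices: 2n^3 operations. */
void matmul(int n, const double *A, const double *B, double *C)
{
    for (int i = 0; i < n; i++)
        for (int j = 0; j < n; j++) {
            double s = 0.0;
            for (int k = 0; k < n; k++)
                s += A[i * n + k] * B[k * n + j];
            C[i * n + j] = s;
        }
}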
16
Strassen's Algorithm
  • Asymptotic complexity: O(n^(log2 7)) = O(n^2.8...)
    operations.
  • Each recursion step: 7 multiplications, 18
    additions/subtractions.
  • (Block formulas omitted in transcript.)
  • Asymptotic complexity is the solution of
    T(n) = 7T(n/2) + 18(n/2)².
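The slide's block formulas were lost in transcription, but the seven Strassen products are standard. A sketch of one recursion step in C, with scalars standing in for the 2x2 submatrix blocks (7 multiplications, 18 additions/subtractions):

/* One Strassen step on a 2x2 block partition. */
void strassen2x2(const double A[2][2], const double B[2][2], double C[2][2])
{
    double m1 = (A[0][0] + A[1][1]) * (B[0][0] + B[1][1]);
    double m2 = (A[1][0] + A[1][1]) * B[0][0];
    double m3 = A[0][0] * (B[0][1] - B[1][1]);
    double m4 = A[1][1] * (B[1][0] - B[0][0]);
    double m5 = (A[0][0] + A[0][1]) * B[1][1];
    double m6 = (A[1][0] - A[0][0]) * (B[0][0] + B[0][1]);
    double m7 = (A[0][1] - A[1][1]) * (B[1][0] + B[1][1]);
    C[0][0] = m1 + m4 - m5 + m7;
    C[0][1] = m3 + m5;
    C[1][0] = m2 + m4;
    C[1][1] = m1 - m2 + m3 + m6;
}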
17
Winograd
  • Asymptotic complexity: O(n^2.8...) operations.
  • Each recursion step: 7 multiplications, 15
    additions/subtractions.
  • (Block formulas omitted in transcript.)
18
Parallel matrix multiplication
  • Parallel matrix multiplication can be
    accomplished without redundant operations.
  • First observe that the time to compute a sum of n
    elements, given enough resources, is ⌈log2 n⌉
    steps.
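A sequential C simulation of this logarithmic-depth summation (a sketch; in a real parallel version, every iteration of the inner loop would run on its own processor, and overwriting x[] is an assumption):

/* Pairwise (tree) summation: each round halves the number of
   partial sums, so ceil(log2 n) rounds suffice. Assumes n >= 1. */
double tree_sum(double *x, int n)
{
    for (int m = n; m > 1; m = (m + 1) / 2) {
        /* one parallel round: these additions are independent */
        for (int i = 0; i < m / 2; i++)
            x[i] = x[2 * i] + x[2 * i + 1];
        if (m % 2 == 1)
            x[m / 2] = x[m - 1];  /* odd element carried forward */
    }
    return x[0];
}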

19
Time (figure omitted in transcript)
20
Time (figure omitted in transcript)
21
  • With sufficient replication and computational
    resources, matrix multiplication can take just one
    multiplication step and ⌈log2 n⌉ addition steps.

22
Copying can also be done in logarithmic steps
23
Parallelism and redundancy
  • Algebraic rules can be applied to reduce tree
    height.
  • In some cases, the height of the tree is reduced
    at the expense of an increase in the number of
    operations.

24-28
(Figure-only slides; no transcript.)
29
Parallel Prefix
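The following slides are figure-only; as a sketch of the idea, here is one classic log-depth prefix-sum scheme (Hillis-Steele style) in C. It performs O(n log n) additions, more than the n-1 of a sequential scan, but needs only ⌈log2 n⌉ dependent rounds:

/* Log-depth prefix sums: round d adds the element d positions to
   the left. After the last round, x[i] = x[0] + ... + x[i]. */
void prefix_sums(double *x, int n)
{
    for (int d = 1; d < n; d *= 2)
        /* within a round, all updates are independent; iterating
           downward lets this sequential simulation run in place */
        for (int i = n - 1; i >= d; i--)
            x[i] += x[i - d];
}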
30-37
(Figure-only slides; no transcript.)
38
Redundancy in parallel sorting: sorting networks
39
Comparator (2-sorter)
  • inputs: x, y
  • outputs: min(x, y), max(x, y)
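A minimal C rendering of a comparator, the building block of all the networks below (the function name is illustrative):

/* Compare-exchange: puts the smaller value in *x, the larger in *y. */
void compare_exchange(int *x, int *y)
{
    if (*x > *y) {
        int tmp = *x;
        *x = *y;
        *y = tmp;
    }
}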
40
Comparison Network
  • At most n/2 comparisons per stage.
41
Sorting Networks
  • (Figure: a sequence of 0s and 1s enters the
    network as inputs and emerges as outputs sorted,
    with all 0s preceding all 1s.)
42
Insertion Sort Network
  • (Figure: the network's inputs and outputs omitted
    in transcript.)
  • depth: 2n - 3
43
Network                       Comparator stages   Comparators
Odd-even transposition sort   O(n)                O(n²)
Bubblesort                    O(n)                O(n²)
Bitonic sort                  O(log(n)²)          O(n log(n)²)
Odd-even mergesort            O(log(n)²)          O(n log(n)²)
Shellsort                     O(log(n)²)          O(n log(n)²)
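As a sketch of the table's first row, odd-even transposition sort in C: n stages, each applying up to n/2 independent comparators, for O(n²) comparators in total (this is a sequential simulation of the network):

void odd_even_transposition_sort(int *a, int n)
{
    for (int stage = 0; stage < n; stage++)
        /* the comparators of one stage touch disjoint pairs,
           so they could all fire in parallel */
        for (int i = stage % 2; i + 1 < n; i += 2)
            if (a[i] > a[i + 1]) {
                int tmp = a[i];
                a[i] = a[i + 1];
                a[i + 1] = tmp;
            }
}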