Title: Multi Processing
1Multi Processing Matrix Math Linear
Algebra Or Why Work Twice as Hard to Write Ten
Times as Much Code That No One Can Even
Follow by Brad Morantz PhD
2Goals
- Solve the problem clearly and efficiently
- Focus is on the problem
- Generate code that
- Runs fast
- Parallel processing code
- Is easy to understand
- Written in math
- Understandable by mathematicians and SME (subject
matter experts)? - Will be able to understand it in six months
- Is standard and portable
- Taken anywhere
- Run on any platform
- That is close to the mathematical formulation
- Code that is clear and concise
3Typical Serial Processing
It can take a while until everyone has had a drink
4Parallel Processing
Many drinking fountains
This will process three times as fast
Queuing would be even faster If the processes
were of varying time
The compiler does this for you automatically
5Example
- I want to declare some matrices, fill them
- with values calculated in long equations, and
then multiply them. - And then test this program on various machines
and with varying number of processors. - I did this and got interesting results (see
- pg 7)?
6The code
- program tryitout
- implicit none
- ! written by brad M to see if this works
- real16, dimension(,), allocatable A, B, C
! 16 byte floating point matrix, 2D - integer4 row, i, j
- real4 starttime, donetime
- print, 'What size array? Start with square
array' - read, row
- allocate(A(row,row), B(row,row), C(row,row))?
- call cpu_time(starttime)?
- blockit forall (i 1row1, j 1row1)?
- A(i, j) ((real(i)) 2) sin(real(i)/real(j))?
- B(i, j) (real(i))/(real(j)) cos (real(i/j))?
- end forall blockit
7Results
- For 1000 x 1000 matrix
- On my old Athlon 2000 (Linux)?
- 142 seconds
- My dual Athlon 1800 (Linux)?
- One processor 152 seconds
- Two processors 76.5 seconds
- My wifes Athlon 3000-64 (Linux)?
- 5 seconds
- Old Job on Dual Quad Core Xeon (Linux)?
- One processor 2.8 seconds
- Eight processors 0.24 seconds
- My desk machine here (Core duo Windows)?
- 126.25 seconds
8The Solution
- Design and develop parallel algorithm
- Test out algorithm
- Make flowchart
- Develop the algorithm in Matlab or Mathematica
- Then put it into Fortran 95/2003
- VV
- Mathematician to validate algorithm
- Test matrix and verify output
9Think Parallel(the hardest part)?
- Programmer must think parallel
- Think matrices not scalars
- Think parallel not serial
- Think wide not narrow
- Algorithm must be parallel
- Think in parallel mode
- Will this go into a matrix
- How can I put this into a vector
- How can this be made more parallel
- How can this be done in parallel mode
10Please . . .
- Do not confuse this with the old Fortran
- Keep an open mind
- Realize that this is for mathematics
- Think in matrices
- Think about parallel operation
- Think What if the J-3 committee knew Matlab when
they wrote the spec for F95 - My focus
- solve a math problem
- get it done quickly
- with a highly portable code
11Where is parallelization?
- The old way (uses 1 cpu)?
- Do 100 J1, 1000, 1
- X(J) J2
- 100 continue
- New way (uses all CPUs)?
- Forall (j 110001) X(j) j2
- What is really happening? (Compiler_at_work)?
- The compiler first checks for dependence
- Then it divides the 1000 by the of CPUs, e.g.4
- Then it puts the code across the processors
12Suppose . .
- You had a matrix of numbers (a whole bunch of
numbers) and wanted a slice of it - And there were some missing numbers (signified by
the value 9999.0)? - And you wanted to get the mean of columns in this
slice, creating a new matrix of 1 less dimension - And you wanted to utilize all of your processors.
- And you want the code to be maintainable, easy to
understand, portable
13Matrix
1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2 3 4 5 6 7 8 9
8 7 6 5 4 3 2 1
1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2 3 4 5 6 7 8 9
8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1
2 3 4 5 6 7 8 9 0 8 7 6 5 4 3 2
1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2 3 4 5 6 7 8 9
8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1
2 3 4 5 6 7 8 9 0 8 7 6 5 4 3 2
1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2 3 4 5 6 7 8 9
8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1
2 3 4 5 6 7 8 9 0 8 7 6 5 4 3 2
1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2 3 4 5 6 7 8 9
8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1
2 3 4 5 6 7 8 9 0 8 7 6 5 4 3 2
slice of the matrix
1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2 3 4 5 6 7 8 9
8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1
2 3 4 5 6 7 8 9 0 8 7 6 5 4 3 2
1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2 3 4 5 6 7 8 9
8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1
2 3 4 5 6 7 8 9 0 8 7 6 5 4 3 2
1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2 3 4 5 6 7 8 9
8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1
2 3 4 5 6 7 8 9 0 8 7 6 5 4 3 2
1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2 3 4 5 6 7 8 9
8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1
2 3 4 5 6 7 8 9 0 8 7 6 5 4 3 2
. . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . . . .
. . . . . . . . . . . . . . . . . . . . . . .
It could be more than the 2 dimensions shown here
1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2 3 4 5 6 7 8 9
8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1
2 3 4 5 6 7 8 9 0 8 7 6 5 4 3 2
1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2 3 4 5 6 7 8 9
8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1
2 3 4 5 6 7 8 9 0 8 7 6 5 4 3 2
1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2 3 4 5 6 7 8 9
8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1
2 3 4 5 6 7 8 9 0 8 7 6 5 4 3 2
1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2 3 4 5 6 7 8 9
8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1
2 3 4 5 6 7 8 9 0 8 7 6 5 4 3 2
1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1 2 3 4 5 6 7 8 9
8 7 6 5 4 3 2 1 1 2 3 4 5 6 7 8 9 8 7 6 5 4 3 2 1
2 3 4 5 6 7 8 9 0 8 7 6 5 4 3 2
14In C or C
- How many LOCs (lines of code)?
- How long would it take?
- How much other stuff would you need?
- How long would it take to have working code?
- How easy would it be to maintain the code?
- Could people reading the code figure it out
easily? - Would it be standard?
- How portable?
- Would the code run in 5 or 10 years?
15In Fortran 95
- mean(row,) sum(inarray(firstlast,), dim1,
maskinarray(firstlast,) .NE. 9999.0) - ! note arrays must be conformable
- adjust() count(inarray(firstlast,) .EQ.
9999.0, dim1) - mean(row, ) mean(row,)/(last- first 1
adjust()) - ! this is code out of a program
-
- This does it all in these 3 lines.
- And it uses all of the processors.
- It is even doing it on slices out of a big
matrix. It creates a matrix of means. - Actual working code out of one of my programs
- Could have been in 1 line, but would have been
hard to read
16Compile Line
- g95 mean.f90 -o means -fbounds-check
- the last option will look for any out of bounds
conditions in any array and report what line of
code, which array, which dimension, limit, and
attained value. - Multiprocessing
- ifort mean.f90 -o means parallel
- In Windows use /Qparallel
- It can be set that it only parallelizes loops
more than some number N - it is this simple
- P.S G95 is promising to go to multiprocessing,
and it is free. - ifort is from Intel and is available on Raytheon
STARS
17Parallel Processing Matrix Functions
- ALL are all values in mask true?
- ANY are any values in mask true?
- COUNT dimensional reduction, mask ok
- MAXVAL reduction or scalar, mask OK
- MINVAL reduction or scalar, mask OK
- PRODUCT of element of array along DIM
- SUM reduction or scalar, mask OK
- MATMUL - matrix multiply
- MERGE- merge 2 array with mask
- SPREAD (Source, Dim, Ncopies)?
18More
- PACK(Array, Mask ,Vector)?
- UNPACK(Vector, Mask, Field)?
- TRANSPOSE
- CSHIFT -circular shift
- EOSHIFT end off shift
- MAXLOC first location of max value
- MINLOC first location of min value
- FORALL parallel looping
- WHERE masking function
- ELSEWHERE masking function
19More Array Functions
- LBOUND - inquiry
- UBOUND - inquiry
- SIZE - inquiry
- SHAPE - inquiry
- RESHAPE - transformational
- DOT_PRODUCT- dot product multiplication
- ALLOCATE assign storage
- ALLOCATED- inquiry
20Elemental Matrix Math
- A, B, C are matrices, same size
- C AB multiplication
- C AB addition
- C A/B division
- C A-B subtraction
- C 0.0 initialization
- Automatic multiprocessing
- If enabled
- Depending on compiler and machine
21Independence
- Compiler checks for independence
- One cell depends on value of another
- e.g A(x, y) A(x1, y)?
- Order of operation must be independent
- If compiler finds dependence in values or timing,
it will not parallelize
22Matrix Multiplication
- A,B,C are matrices
- C matmul(A,B)?
- Arrays must be of same type
- integer, real, complex, or logical
- n,m x m,k n,k
- m x m,k k
- Automatic multiprocessing
- depending on compiler and machine
- if enabled
23Addressing Matrices
- Address a row
- A(row,)?
- Address a column
- A(, col)?
- Address a slice
- A(df, h12)?
- Address a matrix
- A
- print, A
- Does this look like Matlab?
24Assignments
- A, B same size matrices
- A B
- A(fg, 3) B(fg, 5)?
- c A(x,y) B(p,q)?
- D A(13, 24) B(2325, 1618)?
- Automatic multiprocessing
- If enabled
- Target must be same shape and size as source
- Or source can be a scalar
25MAXVAL
- MAXVAL(ARRAY, DIM, MASK)?
- creates a new array or assigns
- 1 less dimension
- max value along dimension DIM
- corresponding to TRUE elements of MASK
- Ex maxivals MAXVAL(A, DIM2, MASKA.GE.0.0)
- MINVAL is the same, but gets the minimum
26MINLOC
- MINLOC(ARRAY, DIM, MASK)?
- Creates or assigns array of 1 less dimension
- Finds location of first element of ARRAY having
minimum value of elements as defined by MASK - DIM is optional
- Ex MINLOC(A, DIM1, A.GT.12)?
27FORALL
- Like the do loop, but uses all available CPUs
- FORALL(I1N, J1M) A(I,J) IJ
- Loop1 FORALL (I1N0.50)?
- WHERE (A(2I).GE.43), A(INT(I)) I2
- end loop1
- (Naming loops is optional, but when used,
compiler will match names)? - Part of HPF specification (Rice/MIT)?
- automatic multiprocessing
28WHERE
- Masked array assignment
- hot WHERE(A gt 27.32)?
- temp temp degree
- ELSEWHERE
- cold cold degree
- END WHERE hot
- WHERE (A lt 0.0) A -A
- Single line, creates non-negative matrix
- A could be N dimension matrix
- Can use multiprocessors
- No need for indices
29Impact
- Intel has 8 core processors
- Expect available in 2008
- Announced 80 core processor
- AMD will be try to match or exceed
- This is a preview of what is coming
- Neither makes a single core CPU any more
- New AMD has 4 cores on 1 die and has native 128
bit FP processor ( I just built a machine using
this new Phenom chip)?
30Overview Fortran 95/2003
- Automatic Parallelization
- Matrix functions
- Object oriented
- Complex math built in
- 16 byte floating point (32 byte complex)?
- Bit and string functions
- Structures and arrays
- ISO standard
- Backward compatibility
- Dynamic allocation
- From matlab code to Fortran in minutes
31Other Mathematical Features
- 128 bit (16 byte) variables/ 32 byte complex
- Compiler takes care of the work
- Complex math (on all complex numbers)?
- Arrays of structures
- Structures of arrays (including allocatable)?
- RECURSIVE
- Operator function overloading
- Modules
- PURE
- Floating point exception handling
- IEEE floating point math
32Resources
- www.g95.org
- free compiler, no multi-processing
- promises multiprocessing in future
- lots of links
- Intel (lots of info on optimization)?
- www.intel.com/software/products/compilers
- Free compiler for home use only
- With automatic multi-processing (APO)?
- Available on STARS
- www.fortran.com (here in Tucson)?
- Amazon.com
- Half.com
33Intel Notes
- It is Sometimes easier for the Fortan compiler to
optimize matrix code than for the C compiler
because the C standard permits more hidden
aliasing of pointers than does Fortran - I have the Intel Notes on this
34Summary
- Fortran 95/2003
- is for computationally intensive applications
- Provides fast processing
- Inherent parallelism
- Compiles to assembly code
- Runs efficiently
- Converts quickly and easily from Matlab etc.
- Deliverable tool
- Multiprocessing with little effort
35More Sample Code
- mean sum(array,dim 2)/cols
- ans A(,36, 2n-1.4x) B
- divisor real(top-bottom 1)?
- xbar(,spot) real(sum(numbers(bottomtop,
),dim1))/divisor - forall(ibottomtop,j1numobs)
- diff(j,i) real(numbers(i,j))-xbar(j,spot)?
- diffsq diff diff
- divisor real(top-bottom)?
- stddev(,spot) sqrt(sum(diffsq(,bottom
top),dim2)/divisor)? - ! This is real code out of a program
36Sample Math Program
- Program calculate
- ! declare variables
- real16 answer
- real8, dimension(,), allocatable arrayA,
arrayB - integer4 row, col, numvars
- logical, dimension(), allocatable exists
- ! do some work
- print, 'How many variables are in the data set?
integer only' - read(,) numvars
- allocate(exists(numvars))?
- exists .FALSE.
- !allocate array and then make sure empty
- loop1 do row 1, 25, 0.5
- answer answer row
- end do loop1
- col int(answer)?
- allocate (arrayA(numvars,col),arrayB(numvars,col))
? - arrayB sqrt(answer) arrayA
- ! this program is really nonsense
37Neural Net
- How many LOC's in C/C?
- Wait until you see it in Fortran 95/2003
- 4 lines!!
- Maintainable
- Mathematical
- Easy to understand
- Function linear . . . .
- Function logistic . .(sigmoidal or hyperbolic
tan)? - Mid logistic(matmul(input, weight1))?
- Out linear(matmul(mid, weight2))?
38Contact Information
- Brad Morantz
- Tel 520-665-7026
- In directory
- Web page www.machine-cognition.com
- Personal e-mail bradscientist_at_ieee.org
- As a decision scientist, my goal is to solve
- problems and find the best possible solution.
- The best answers are based upon knowledge
- and information.