Title: Introduction to PETSc
Introduction to PETSc
- VIGRE Seminar, Wednesday, November 8, 2006
Parallel Computing
- How (basically) does it work?
- Assign each processor a number
- The same program goes to all
- Each uses separate memory
- They pass information back and forth as necessary
Parallel Computing
- Example 1: Matrix-Vector Product
- The matrix and the vector are inputs into the program.
- The control node (0) reads in the matrix and distributes the rows amongst the processors:
  0: (a, b, c)   1: (d, e, f)   2: (g, h, i)
- The control node also sends the vector (j, k, l) to each processor's memory.
- Each processor computes its own dot product:
  0: aj + bk + cl   1: dj + ek + fl   2: gj + hk + il
- The processors send their results to the control node, which outputs the product vector (a rough MPI sketch follows below).
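A rough sketch of this example in plain MPI (which PETSc builds on). The 3x3 values and the assumption of exactly three processes are illustrative; the hard-coded matrix stands in for data read on the control node.

    /* Example 1 in plain MPI: scatter rows, broadcast the vector,
       compute local dot products, gather results on rank 0.        */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      double A[3][3] = {{1,2,3},{4,5,6},{7,8,9}};  /* matrix, meaningful on rank 0 only */
      double x[3]    = {1,1,1};                    /* the vector (j, k, l)              */
      double row[3], y[3], yi;
      int    rank, size;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);
      if (size != 3) {                             /* this sketch assumes 3 processes   */
        if (rank == 0) printf("run with -np 3\n");
        MPI_Finalize();
        return 1;
      }

      /* control node (0) distributes one row to each processor */
      MPI_Scatter(A, 3, MPI_DOUBLE, row, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);
      /* control node sends the whole vector to every processor */
      MPI_Bcast(x, 3, MPI_DOUBLE, 0, MPI_COMM_WORLD);

      /* each processor computes its own dot product */
      yi = row[0]*x[0] + row[1]*x[1] + row[2]*x[2];

      /* results return to the control node, which outputs the product */
      MPI_Gather(&yi, 1, MPI_DOUBLE, y, 1, MPI_DOUBLE, 0, MPI_COMM_WORLD);
      if (rank == 0) printf("y = (%g, %g, %g)\n", y[0], y[1], y[2]);

      MPI_Finalize();
      return 0;
    }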
Parallel Computing
- Example 2: Matrix-Vector Product
- Suppose for memory reasons each processor only has part of the vector:
  0: (a, b, c), j   1: (d, e, f), k   2: (g, h, i), l
- Before the multiply, each processor sends the necessary entries elsewhere:
  0 gets k from 1 and l from 2; 1 gets j from 0 and l from 2; 2 gets j from 0 and k from 1.
- After the multiply, the space is freed again for other uses.
Parallel Computing
- Example 3: Matrix-Matrix Product
- The previous case illustrates how to multiply matrices stored across multiple processors:
  0: (a, b, c), (j, k, l)   1: (d, e, f), (m, n, o)   2: (g, h, i), (p, q, r)
- Each column of the second matrix is distributed for processing in turn:
  0: (a, b, c)·(j, m, p) = α,  (a, b, c)·(k, n, q) = β,  (a, b, c)·(l, o, r) = γ
  1: (d, e, f)·(j, m, p) = δ,  (d, e, f)·(k, n, q) = ε,  (d, e, f)·(l, o, r) = ζ
  2: (g, h, i)·(j, m, p) = η,  (g, h, i)·(k, n, q) = θ,  (g, h, i)·(l, o, r) = ι
- The result is a matrix with the same parallel row structure as the first matrix and column structure as the second:
  0: (α, β, γ)   1: (δ, ε, ζ)   2: (η, θ, ι)
- The original entries could also have been sub-matrices, as long as they were compatible.
Parallel Computing
- Example 4: Block Diagonal Product
- Suppose the second matrix is block diagonal:
  0: (A, B, C), (J, 0, 0)   1: (D, E, F), (0, K, 0)   2: (G, H, I), (0, 0, L)
- Much less information needs to be passed between the processors:
  0: AJ = α, BK = β, CL = γ   1: DJ = δ, EK = ε, FL = ζ   2: GJ = η, HK = θ, IL = ι
Parallel Computing
- When is it worth it to parallelize?
- There is a time cost associated with passing messages
- The amount of message passing is dependent on the problem and the program (algorithm)
- Therefore, the benefits depend more on the structure of the problem and the program than on the size/speed of the parallel network (diminishing returns).
Parallel Networks
- How do I use multiple processors?
- This depends on the network, but
- Most networks use some variation of PBS, a job scheduler, and mpirun or mpiexec.
- A parallel program needs to be submitted as a batch job.
Parallel Networks
- Suppose I have a program myprog, which gets data from data.dat, and which I call in the following fashion when only using one processor:
- ./myprog -f data.dat
- I would write a file myprog.pbs that looks like the following
Parallel Networks
- #PBS -q compute (name of the processing queue; not necessary on all networks)
- #PBS -N myprog (the name of the job)
- #PBS -l nodes=2:ppn=1,walltime=00:10:00 (number of nodes and number of processes per node, maximum time to allow the program to run)
- #PBS -o /home/me/mydir/myprog.out (where the output of the program should be written)
- #PBS -e /home/me/mydir/myprog.err (where the error stream should be written)
- These are the headers that tell the job scheduler how to handle your job.
Parallel Networks
- Although what follows depends on the MPI software that the network runs, it should look something like this (a full script is assembled below):
- cd $PBS_O_WORKDIR (makes the processors run the program in the directory where myprog.pbs is saved)
- mpirun -machinefile $PBS_NODEFILE -np 2 myprog -f mydata.dat (tells the MPI software which processors to use and how many processes to start; notice that command line arguments follow as usual)
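Putting the pieces from the last two slides together, a complete myprog.pbs might look like the following. This is a sketch: the queue name, paths, and exact mpirun flags depend on the network and its MPI installation.

    #PBS -q compute
    #PBS -N myprog
    #PBS -l nodes=2:ppn=1,walltime=00:10:00
    #PBS -o /home/me/mydir/myprog.out
    #PBS -e /home/me/mydir/myprog.err

    # run in the directory where myprog.pbs was submitted from
    cd $PBS_O_WORKDIR
    mpirun -machinefile $PBS_NODEFILE -np 2 myprog -f mydata.dat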
Parallel Networks
- Once the .pbs file is written, it can be submitted to the job scheduler with qsub:
- qsub myprog.pbs
- You can check to see if your job is running with the command qstat.
Parallel Networks
- Some systems (but not all) will allow you to simulate running your program in parallel on one processor, which is useful for debugging:
- mpirun -np 3 myprog -f mydata.dat
Parallel Networks
- What parallel systems are available?
- RTC: Rice Terascale Cluster, 244 processors.
- ADA: Cray XD1, 632 processors.
- caamster: CAAM department exclusive, 8(?) processors.
PETSc
- What do I use PETSc for?
- File I/O with minimal understanding of MPI
- Vector and matrix based data management (in particular sparse)
- Linear algebra routines familiar from the famous serial packages
PETSc
- At the moment, ada and caamster (and harvey) have PETSc installed
- You can download and install PETSc on your own machine (requires cygwin for Windows), for educational and debugging purposes
PETSc
- PETSc builds on existing software (BLAS and LAPACK); which implementations to use can be specified at configuration
- Has a (slower) debugging configuration and a (faster, tacit) optimized configuration
PETSc
- Installation comes with documentation, examples, and manual pages.
- The biggest part of learning how to use PETSc is learning how to use the manual pages.
PETSc
- It is extremely useful to have an environment variable PETSC_DIR in your shell of choice, which gives the path to the installation of PETSc, e.g.
- PETSC_DIR=/usr/local/src/petsc-2.3.1-p13/
- export PETSC_DIR
PETSc
- Makefile
- You can pretty much copy/paste/modify the makefiles in the examples, but here is the basic setup
PETSc
- Makefile
- (...) (Other definitions for CFLAGS, etc.)
- LOCDIR = /mydir
- include ${PETSC_DIR}/bmake/common/base
- (This is why it is useful to have this variable saved)
- myprog: myprog.o chkopts
-         ${CLINKER} -o myprog myprog.o ${PETSC_LIB}
-         ${RM} myprog.o
PETSc
- Headers
- #include "petsc.h" in all files, unless the routines that you use need more specific headers.
- How do you know? Consult the manual pages!
PETSc
- Data Types
- PETSc has a slew of its own data types: PetscInt, PetscReal, PetscScalar, etc.
- Usually aliases of normal data types: PetscInt = int, PetscReal = double
- Safer to use for compatibility
PETSc
- Usage in C/C++
- The top of the program should begin:
- static char help[] = "Your message here.";
- int main(int argc, char **argv)
- (... declarations ...)
- PetscInitialize(&argc, &argv, PETSC_NULL, help);
PETSc
- Usage in C/C++
- The program should end:
- (...)
- PetscFinalize();
- return 0;
PETSc
- Usage in C/C++
- When first programming, include the following variable:
- PetscErrorCode ierr;
- Where you'd call a PETSc routine,
- Routine(arg);
- write instead
- ierr = Routine(arg); CHKERRQ(ierr);
PETSc
- Usage in C/C++
- When you try to run your program, you will be informed of any problems with incompatible data types/dimensions/etc. (a minimal skeleton follows below).
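Putting the last few slides together, a minimal skeleton might look like this. The help message and the PetscPrintf call are placeholders; every real PETSc call would be wrapped with ierr and CHKERRQ as described above.

    static char help[] = "Your message here.\n";

    #include "petsc.h"

    int main(int argc, char **argv)
    {
      PetscErrorCode ierr;

      PetscInitialize(&argc, &argv, PETSC_NULL, help);

      /* ... declarations and PETSc calls, each written as
         ierr = Routine(args); CHKERRQ(ierr);               */
      ierr = PetscPrintf(PETSC_COMM_WORLD, "%s", help); CHKERRQ(ierr);

      PetscFinalize();
      return 0;
    }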
PETSc
- Data
- Any data type larger than a scalar has a Create and a Destroy routine.
- If you run ./myprog -log_summary, you get a count of objects created and destroyed for each data type, to find memory leaks.
PETSc
- Example: Vec
- Two types: global and local
- Dependent on function: do other processors need to see this data?
- Basic usage:
- Vec X;
- VecCreate(PETSC_COMM_WORLD or PETSC_COMM_SELF, &X);
PETSc
- Example: Vec
- Advanced usage (a short sketch follows the list):
- VecCreateSeq(PETSC_COMM_SELF, n, &X)
- VecCreateSeqWithArray(PETSC_COMM_SELF, n, vals, &X)
- VecLoad(instream, VECSEQ, &X)
- VecCreateMPI(PETSC_COMM_WORLD, n, PETSC_DETERMINE, &X)
- VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, N, &X)
- VecCreateMPIWithArray(PETSC_COMM_WORLD, n, N, vals, &X)
- VecLoad(instream, VECMPI, &X)
- VecDuplicate(Y, &X)
- MatGetVecs(M, &X, PETSC_NULL)
- MatGetVecs(M, PETSC_NULL, &X)
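As a small illustration of the second VecCreateMPI form above, a fragment (inside a program set up as in the earlier skeleton). VecGetOwnershipRange and VecGetLocalSize are not on the slide; they are standard Vec manual-page routines used here to show what the parallel layout looks like.

    Vec      X;
    PetscInt N = 100, low, high, nlocal;

    /* global vector of global size N; PETSc decides the local sizes */
    ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, N, &X); CHKERRQ(ierr);
    /* which global entries does this process own? the range [low, high) */
    ierr = VecGetOwnershipRange(X, &low, &high); CHKERRQ(ierr);
    ierr = VecGetLocalSize(X, &nlocal); CHKERRQ(ierr);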
PETSc
- Example: Vec
- If not created with an array or loaded from a file, values still need to be set
- To copy the values of another Vec with the same parallel structure, use VecCopy(Y, X).
- To set all values to a single scalar value, use VecSet(X, alpha).
PETSc
- Example: Vec
- There are other routines for more complicated ways to set values
- PETSc guards the block of data where the actual values are stored very closely
- An assembly routine must be called after these other routines
PETSc
- Example: Vec
- Other routines (a small assembly sketch follows the list):
- VecSetValue
- VecSetValueLocal (different indexing used)
- VecSetValues
- VecSetValuesLocal
- VecSetValuesBlocked
- VecSetValuesBlockedLocal
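A minimal sketch of how these routines fit with the assembly requirement mentioned above. X is an already created parallel Vec; VecGetOwnershipRange (from the manual pages) is used so each process sets only the entries it owns, and the entry values are illustrative.

    PetscInt    i, low, high;
    PetscScalar v;

    ierr = VecGetOwnershipRange(X, &low, &high); CHKERRQ(ierr);
    for (i = low; i < high; i++) {
      v = (PetscScalar) i;                                   /* arbitrary value */
      ierr = VecSetValue(X, i, v, INSERT_VALUES); CHKERRQ(ierr);
    }
    /* the assembly routines that must follow the VecSetValue* calls */
    ierr = VecAssemblyBegin(X); CHKERRQ(ierr);
    ierr = VecAssemblyEnd(X); CHKERRQ(ierr);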
PETSc
- Example: Vec
- Once a vector is assembled, there are routines for (almost) every function we could want from a vector: AXPY, dot product, absolute value, pointwise multiplication, etc. (a short sketch follows below).
- Call VecDestroy(X) to free its array when it isn't needed anymore.
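A sketch of a few of those routines. X, Y, and W are assumed to be assembled Vecs with the same parallel layout, and the calling sequences should be checked against the manual pages of the installed PETSc version.

    PetscScalar alpha = 2.0, dot;
    PetscReal   nrm;

    ierr = VecAXPY(Y, alpha, X); CHKERRQ(ierr);       /* Y <- alpha*X + Y     */
    ierr = VecDot(X, Y, &dot); CHKERRQ(ierr);         /* dot product          */
    ierr = VecNorm(X, NORM_2, &nrm); CHKERRQ(ierr);   /* 2-norm               */
    ierr = VecAbs(X); CHKERRQ(ierr);                  /* absolute value       */
    ierr = VecPointwiseMult(W, X, Y); CHKERRQ(ierr);  /* W_i <- X_i * Y_i     */

    ierr = VecDestroy(X); CHKERRQ(ierr);              /* free when done       */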
PETSc
- Example: Mat
- Like Vec, a Mat can be global or local (MPI/Seq)
- A Mat can take on a large number of data structures to optimize products (*) and solves (\), even though the same routines are used on all structures (a creation/assembly sketch follows the list of formats below).
PETSc
- Example: Mat
- Row compressed
- Block row compressed
- Symmetric block row compressed
- Block diagonal
- And even dense
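A sketch of creating, assembling, and applying a sparse (row compressed) parallel matrix. MatCreateMPIAIJ, MatGetOwnershipRange, and MatSetValues come from the Mat manual pages rather than the slides, and the size and preallocation numbers are placeholders; MatMult is the same call regardless of the storage format.

    Mat      A;
    Vec      X, Y;
    PetscInt N = 100, i, Istart, Iend;

    /* parallel AIJ (row compressed) matrix, PETSc chooses the row split */
    ierr = MatCreateMPIAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE,
                           N, N, 3, PETSC_NULL, 1, PETSC_NULL, &A); CHKERRQ(ierr);
    ierr = MatGetOwnershipRange(A, &Istart, &Iend); CHKERRQ(ierr);
    for (i = Istart; i < Iend; i++) {
      PetscScalar v = 2.0;                               /* diagonal entry */
      ierr = MatSetValues(A, 1, &i, 1, &i, &v, INSERT_VALUES); CHKERRQ(ierr);
    }
    ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
    ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

    /* compatible vectors, as on the Vec slides, then Y <- A*X */
    ierr = MatGetVecs(A, &X, &Y); CHKERRQ(ierr);
    ierr = VecSet(X, 1.0); CHKERRQ(ierr);
    ierr = MatMult(A, X, Y); CHKERRQ(ierr);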
PETSc
- File I/O
- The equivalent to a stream is a viewer (more on this below).
- PETSc has equivalent routines to printf, but you must decide if you want every node to print or just the control node
- To ensure clarity when multiple nodes print, use PetscSynchronizedPrintf followed by PetscSynchronizedFlush (see the sketch below).
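A small sketch of the difference; the rank variable and the message text are illustrative, and the calling sequences follow the 2.3-era manual pages.

    PetscMPIInt rank;

    ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank); CHKERRQ(ierr);

    /* only the control node (rank 0) prints this */
    ierr = PetscPrintf(PETSC_COMM_WORLD, "Hello from the control node\n"); CHKERRQ(ierr);

    /* every node prints, in rank order, without interleaving */
    ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD, "Hello from process %d\n", rank); CHKERRQ(ierr);
    ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD); CHKERRQ(ierr);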
PETSc
- File I/O
- The equivalent to a stream is a viewer, but a viewer organizes data across multiple processors.
- A viewer combines an output location (file/stdout/stderr) with a format.
- Most data types have a View routine, such as MatView(M, viewer) (a short sketch follows below).
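For example, assuming a Mat M and a Vec X that are already assembled; the binary file name is illustrative, and the Destroy calling sequence follows the 2.3-era manual pages.

    PetscViewer viewer;

    /* ASCII view of the matrix to standard output */
    ierr = MatView(M, PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr);

    /* binary view of the vector to a file, collected across processors */
    ierr = PetscViewerBinaryOpen(PETSC_COMM_WORLD, "x.dat", FILE_MODE_WRITE, &viewer); CHKERRQ(ierr);
    ierr = VecView(X, viewer); CHKERRQ(ierr);
    ierr = PetscViewerDestroy(viewer); CHKERRQ(ierr);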
PETSc
- File I/O
- On a batch server, ASCII I/O can be horrendously slow.
- PETSc only reads data into a parallel format when it is stored in binary form.
- Lots of output data is likely; binary is more compressed than ASCII.
PETSc
- I have ASCII input data. Solution?
- Write a wrapper program (sketched below):
- Runs on one processor
- Creates the data to be used in parallel, and views it to a binary input file
- In parallel, it will be automatically distributed
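A sketch of such a wrapper, assuming one value per line in an ASCII file data.txt and a vector of known length n; the file names, the length, and the ASCII layout are illustrative, and the Destroy calling sequences follow the 2.3-era manual pages.

    static char help[] = "Converts ASCII input data to PETSc binary format.\n";

    #include "petsc.h"
    #include <stdio.h>

    int main(int argc, char **argv)
    {
      Vec            X;
      PetscViewer    viewer;
      PetscErrorCode ierr;
      PetscInt       i, n = 100;
      double         val;
      FILE           *fp;

      PetscInitialize(&argc, &argv, PETSC_NULL, help);   /* run on one processor */

      /* read the ASCII data into a sequential Vec */
      fp = fopen("data.txt", "r");
      ierr = VecCreateSeq(PETSC_COMM_SELF, n, &X); CHKERRQ(ierr);
      for (i = 0; i < n; i++) {
        fscanf(fp, "%lf", &val);
        ierr = VecSetValue(X, i, val, INSERT_VALUES); CHKERRQ(ierr);
      }
      fclose(fp);
      ierr = VecAssemblyBegin(X); CHKERRQ(ierr);
      ierr = VecAssemblyEnd(X); CHKERRQ(ierr);

      /* view it to a binary file; a parallel run can then load it with VecLoad */
      ierr = PetscViewerBinaryOpen(PETSC_COMM_SELF, "data.bin", FILE_MODE_WRITE, &viewer); CHKERRQ(ierr);
      ierr = VecView(X, viewer); CHKERRQ(ierr);
      ierr = PetscViewerDestroy(viewer); CHKERRQ(ierr);

      ierr = VecDestroy(X); CHKERRQ(ierr);
      PetscFinalize();
      return 0;
    }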
PETSc
- Next Time: Issues for Large Dynamical Systems
- Time Stepping
- Updating algebraically
- Managing lots of similar equations (Scattering/Gathering)