Title: High-Performance Computations and the Technologies of Microsoft
1. Nizhni Novgorod State University
High-Performance Computations and the Technologies of Microsoft
Prof. Gergel V.P., D.Sc., Software Department, NNSU
www.unn.ru  www.software.unn.ac.ru
2.
- Needs of High-Performance Computations (HPC)
- Windows-based Clusters: Microsoft Compute Cluster Server
- HPC Simplicity vs. Complexity: Brief Introduction to MPI
- How to Overcome the Complexity: HPC Curriculum
3. Needs of High-Performance Computations
- Time-consuming nature of many scientific and engineering problems ("Grand Challenge" problems)
- Increase of serial computer performance is limited
- Cost of parallel computational systems is decreasing (clusters, ...)
- Parallelism at the processor level: HyperThreading, multi-core (70% of the market in 2006)
4. Needs of High-Performance Computations
The price has been reduced more than 10,000 times!
Supercomputing goes personal
5. Needs of High-Performance Computations
MIMD systems:
- Multiprocessors (shared memory):
  - UMA: PVP, SMP (incl. multi-core),
  - NUMA: COMA, CC-NUMA, NCC-NUMA
- Multicomputers (distributed memory):
  - NORMA: MPP, Cluster
6. Needs of High-Performance Computations
- Cluster:
  - a group of computers (a local network) capable of working as a unified computational unit,
  - higher reliability and efficiency than a local network,
  - essentially lower cost compared to other types of parallel computational systems (by using commodity off-the-shelf hardware and software)
7.
- Needs of High-Performance Computations (HPC)
- Windows-based Clusters: Microsoft Compute Cluster Server
- HPC Simplicity vs. Complexity: Brief Introduction to MPI
- How to Overcome the Complexity: HPC Curriculum
8. Windows-based Clusters: Microsoft Compute Cluster Server
- Microsoft vision in the HPC area
- Compute Cluster Server (CCS) consists of:
  - a dedicated release of OS Windows Server 2003 (Cluster Edition),
  - Compute Cluster Pack (CCP):
    - MS MPI, an implementation of the MPI-2 standard,
    - a cluster management system,
    - GUI, CUI, COM and other interfaces for job submission
- Current release: Community Preview Release 3
- The first release of CCS became available in November 2005
- Download: http://www.connect.microsoft.com
9. Microsoft Compute Cluster Server
- Computational nodes:
  - 64-bit processors of the x86 family,
  - 512 Mb RAM,
  - 4 Gb HDD,
  - 64-bit Microsoft Windows Server 2003
- Parallel software development:
  - PC under MS Windows XP, 2003, Vista,
  - MS Compute Cluster Pack SDK,
  - recommended IDE: MS Visual Studio 2005
10. Microsoft Compute Cluster Server
- Job management:
  - CCS provides job management and efficient use of the resources on the compute cluster,
  - the interfaces for scheduling jobs include:
    - Command Line Interface (CLI),
    - GUI,
    - Web UI,
    - Web services, COM, ...
11. Microsoft Compute Cluster Server
- Job management provides the ability to:
  - schedule job execution,
  - inspect the current states of jobs,
  - terminate jobs,
  - etc.
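As an illustration (the syntax follows the CCS 2003 scheduler documentation, but exact option names may vary by release, and the executable name is hypothetical), a parallel job can be submitted from the CLI:

  job submit /numprocessors:4 mpiexec pi.exe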
12. Microsoft Compute Cluster Server
- Development and execution of MPI programs:
  - IDE: VS 6.0, VS 2003, VS 2005,
  - language: C,
  - MS MPI is compatible with MPICH-2 (at the source code level),
  - mpiexec is used to run MPI programs in the same way as for MPICH-2
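For example (an illustrative invocation; the executable name is hypothetical), a program can be started on four processes in the usual MPICH-2 style:

  mpiexec -n 4 pi.exe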
13. Microsoft Compute Cluster Server
- Debugging MPI programs:
  - Visual Studio 2005 and CCP have a built-in MPI debugger!
14. Microsoft Compute Cluster Server
15. Microsoft Compute Cluster Server
16.
- Needs of High-Performance Computations (HPC)
- Windows-based Clusters: Microsoft Compute Cluster Server
- HPC Simplicity vs. Complexity: Brief Introduction to MPI
- How to Overcome the Complexity: HPC Curriculum
17. HPC Simplicity vs. Complexity: Brief Introduction to MPI
The processors in computer systems with distributed memory operate independently. It is therefore necessary to be able:
- to distribute the computational load,
- to organize information communication (data transmission) among the processors.
The solution of both of these problems is provided by MPI (the Message Passing Interface).
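A minimal sketch (illustrative, not from the slides) of these two capabilities: each process takes a share of the work, and point-to-point messages carry the partial results to process 0:

#include "mpi.h"
#include <stdio.h>

int main(int argc, char *argv[]) {
  int rank, size;
  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);  /* this process's number */
  MPI_Comm_size(MPI_COMM_WORLD, &size);  /* total number of processes */
  double part = 1.0 * rank;              /* stand-in for a local computation */
  if (rank != 0) {                       /* workers send their results ... */
    MPI_Send(&part, 1, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD);
  } else {                               /* ... and process 0 collects them */
    double total = part, recv;
    for (int i = 1; i < size; i++) {
      MPI_Recv(&recv, 1, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
      total += recv;
    }
    printf("collected from %d processes: total = %f\n", size, total);
  }
  MPI_Finalize();
  return 0;
}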
18. Brief Introduction to MPI
- Example: computing the constant π
  - The value of the constant π can be computed by means of the integral
    π = ∫₀¹ 4 / (1 + x²) dx
  - To compute this integral, the method of rectangles can be used for numerical integration
19. Brief Introduction to MPI
// Serial program
#include <stdio.h>
#include <math.h>
double f(double a) { return (4.0 / (1.0 + a*a)); }
int main(int argc, char *argv[]) {
  int n, i;
  double PI25DT = 3.141592653589793238462643;
  double mypi, h, sum, x;
  printf("Enter the number of intervals: ");
  scanf("%d", &n);
  // calculating
  h = 1.0 / (double) n;
  sum = 0.0;
  for (i = 0; i < n; i++) {
    x = h * ((double)i + 0.5);
    sum += f(x);
  }
  mypi = h * sum;
  printf("pi is approximately %.16f, error is %.16f\n", mypi, fabs(mypi - PI25DT));
  return 0;
}
20. Brief Introduction to MPI
- Parallel method:
  - a cyclic scheme can be used to distribute the calculations among the processors,
  - the partial sums calculated on the different processors then have to be added together
(Figure: the rectangles distributed cyclically among Processor 0, Processor 1 and Processor 2)
21. Brief Introduction to MPI
#include "mpi.h"
#include <stdio.h>
#include <math.h>
double f(double a) { return (4.0 / (1.0 + a*a)); }
int main(int argc, char *argv[]) {
  int ProcRank, ProcNum, n, i;
  double PI25DT = 3.141592653589793238462643;
  double mypi, pi, h, sum, x, t1, t2;
  MPI_Init(&argc, &argv);
  MPI_Comm_size(MPI_COMM_WORLD, &ProcNum);
  MPI_Comm_rank(MPI_COMM_WORLD, &ProcRank);
  if (ProcRank == 0) {
    printf("Enter the number of intervals: ");
    scanf("%d", &n);
    t1 = MPI_Wtime();
  }
22. Brief Introduction to MPI
  MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);
  // calculating the local sums
  h = 1.0 / (double) n;
  sum = 0.0;
  for (i = ProcRank + 1; i <= n; i += ProcNum) {
    x = h * ((double)i - 0.5);
    sum += f(x);
  }
  mypi = h * sum;
  // reduction
  MPI_Reduce(&mypi, &pi, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
  if (ProcRank == 0) {
    // printing results
    t2 = MPI_Wtime();
    printf("pi is approximately %.16f, Error is %.16f\n", pi, fabs(pi - PI25DT));
    printf("wall clock time = %f\n", t2 - t1);
  }
  MPI_Finalize();
  return 0;
}
23.
- Needs of High-Performance Computations (HPC)
- Windows-based Clusters: Microsoft Compute Cluster Server
- HPC Simplicity vs. Complexity: Brief Introduction to MPI
- How to Overcome the Complexity: HPC Curriculum
24. How to Overcome the Complexity: HPC Curriculum
- HPC required skills and knowledge:
  - architecture of parallel computer systems,
  - computation models and methods for analyzing the complexity of calculations,
  - parallel computation methods,
  - parallel programming (languages, development environments, libraries).
It is important to have an integrated teaching course on parallel programming.
25. HPC Curriculum
- The essential part of the curriculum is the integrated course "HPC and Parallel Programming", which provides:
  - studying the models of parallel computations,
  - mastering parallel algorithms, and
  - getting practical experience in parallel programming.
- The course gives students solid knowledge in many areas of parallel programming (models, methods, technologies, programs). Learning combines theoretical classes and laboratory work.
26. HPC Curriculum
- Components:
  - course syllabus,
  - syllabus of laboratory works,
  - e-textbook,
  - program system for supporting laboratory works,
  - user manual for the program system,
  - function library,
  - function library reference guide,
  - PowerPoint presentations for all lectures
- http://www.software.unn.ac.ru/ccam
- Development of the HPC curriculum has been supported by Microsoft
27. HPC Curriculum
- Highlights of the course:
  - comprehensive coverage of the spectrum of parallel programming issues (models, methods, technologies, programs),
  - organic combination of theoretical classes and laboratory training,
  - intensive use of research and educational software systems for carrying out computational experiments
28. HPC Curriculum
- Syllabus:
  - architecture of parallel computers and their classification,
  - modeling and analysis of parallel computations,
  - analysis of communication complexity of parallel programs,
  - technology for developing parallel programs:
    - parallel extensions for industrial algorithmic languages (OpenMP),
    - developers' libraries for parallel programming (MPI),
  - principles of parallel algorithm design,
  - parallel computation methods
29. HPC Curriculum
- Modeling and analysis of parallel computations:
  - a computation model in the form of an "operations-operands" graph,
  - description of the scheme for parallel execution of an algorithm,
  - predicting the execution time of a parallel algorithm,
  - efficiency criteria of a parallel algorithm
30. HPC Curriculum: Modeling and analysis of parallel computations
- Characteristics of parallel algorithm efficiency:
  - speedup: Sp(n) = T1(n) / Tp(n),
  - efficiency: Ep(n) = Sp(n) / p = T1(n) / (p · Tp(n)),
- Very often these criteria are antagonistic!
31. HPC Curriculum: Modeling and analysis of parallel computations
- Example: total sum computation
  - The computation of the total sum of an available set of values,
    S = x1 + x2 + ... + xn
    (a particular case of the general reduction problem)
32. HPC Curriculum: Modeling and analysis of parallel computations
- Example: total sum computation
  - Sequential summation of the elements of a series of values:
    S = ((...((x1 + x2) + x3) + ...) + xn)
  - This standard sequential summation algorithm allows only strictly serial execution and cannot be parallelized
33. HPC Curriculum: Modeling and analysis of parallel computations
- Example: total sum computation
  - Cascade summation scheme: values are summed pairwise at each step, so n values are summed in ceil(log2 n) parallel steps!
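For concreteness, a serial sketch of the same pairwise pattern (illustrative code, not from the course materials): at step t, elements 2^t apart are added, so the inner loop of each step consists of independent pairs that could run in parallel:

#include <stdio.h>

double cascade_sum(double x[], int n) {
  /* step doubles each round: (x0+x1), (x2+x3), ... then (x01+x23), ... */
  for (int step = 1; step < n; step *= 2)
    for (int i = 0; i + step < n; i += 2 * step)  /* independent pairs */
      x[i] += x[i + step];
  return x[0];
}

int main(void) {
  double v[8] = {1, 2, 3, 4, 5, 6, 7, 8};
  printf("sum = %g\n", cascade_sum(v, 8));  /* prints 36, computed in 3 steps */
  return 0;
}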
34. HPC Curriculum: Modeling and analysis of parallel computations
- Example: total sum computation
  - Modified cascade scheme
35. HPC Curriculum: Analysis of communication complexity
- Characteristics of the topology of the data communication network,
- General description of data communication techniques,
- Analysis of time complexity for data communication operations,
- Methods of logical representation of communication topology
36. HPC Curriculum: Principles of parallel algorithm design
The general scheme of parallel algorithm design (proposed by I. Foster): partitioning the computations, analyzing the information dependencies (communication), agglomerating the subtasks, and mapping them onto the processors.
37. HPC Curriculum: Parallel algorithms
- Matrix-vector multiplication,
- Matrix multiplication,
- Sorting,
- Graph processing,
- Partial differential equations,
- Optimization
38. HPC Curriculum: Parallel algorithms
- Example: matrix multiplication by Cannon's method
  - Data distribution: checkerboard scheme,
  - The basic subtask is a procedure that calculates all elements of one block of matrix C
39. HPC Curriculum: Parallel algorithms
- Example: matrix multiplication by Cannon's method
  - Analysis of information dependencies:
    - The subtask with number (i,j) calculates the block Cij of the result matrix C; as a result, the subtasks form a q×q two-dimensional grid,
    - The initial distribution of the matrix blocks in Cannon's algorithm is selected in such a way that the first block multiplication can be performed without additional data transmission:
      - at the beginning, each subtask (i,j) holds the blocks Aij and Bij,
      - for the i-th row of the subtask grid, the matrix A blocks are shifted (i-1) positions to the left,
      - for the j-th column of the subtask grid, the matrix B blocks are shifted (j-1) positions upward,
    - These data transmission operations are an example of circular shift communication
40. HPC Curriculum: Parallel algorithms
- Example: matrix multiplication by Cannon's method
  - Analysis of information dependencies:
    - After the redistribution performed at the first stage, the matrix blocks can be multiplied without additional data transmission operations,
    - To obtain all of the remaining blocks, after each block multiplication operation:
      - the matrix A blocks are shifted one position left along the grid row,
      - the matrix B blocks are shifted one position upward along the grid column (see the sketch below).
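To make this communication pattern concrete, here is a compact sketch (illustrative names and structure, not the course code; it assumes a periodic q×q Cartesian communicator built with MPI_Cart_create, with one bs×bs block of A, B and C per process, stored row-major):

#include "mpi.h"

void cannon(double *A, double *B, double *C, int bs, int q, MPI_Comm grid) {
  int rank, coords[2], src, dst;
  MPI_Comm_rank(grid, &rank);
  MPI_Cart_coords(grid, rank, 2, coords);
  /* initial skew: row i of A moves i positions left,
     column j of B moves j positions up (circular shifts) */
  MPI_Cart_shift(grid, 1, -coords[0], &src, &dst);
  MPI_Sendrecv_replace(A, bs*bs, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
  MPI_Cart_shift(grid, 0, -coords[1], &src, &dst);
  MPI_Sendrecv_replace(B, bs*bs, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
  for (int s = 0; s < q; s++) {
    for (int i = 0; i < bs; i++)        /* local block multiply: C += A*B */
      for (int k = 0; k < bs; k++)
        for (int j = 0; j < bs; j++)
          C[i*bs + j] += A[i*bs + k] * B[k*bs + j];
    MPI_Cart_shift(grid, 1, -1, &src, &dst);   /* A: one position left */
    MPI_Sendrecv_replace(A, bs*bs, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
    MPI_Cart_shift(grid, 0, -1, &src, &dst);   /* B: one position up */
    MPI_Sendrecv_replace(B, bs*bs, MPI_DOUBLE, dst, 0, src, 0, grid, MPI_STATUS_IGNORE);
  }
}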
41. HPC Curriculum: Parallel algorithms
- Example: matrix multiplication by Cannon's method
  - Scaling and distributing the subtasks among the processors:
    - The sizes of the matrix blocks can be selected so that the number of subtasks coincides with the number of available processors p,
    - The most efficient execution of the parallel Cannon's algorithm is achieved when the communication network topology is a two-dimensional grid,
    - In this case the subtasks can be distributed among the processors in a natural way: subtask (i,j) is placed on processor pi,j
42. HPC Curriculum: Parallel algorithms
- Example: matrix multiplication by Cannon's method
  - Efficiency analysis:
    - Generalized speedup and efficiency estimates: with the computations divided evenly among the p processors and communication costs neglected, Sp = p and Ep = 1,
    - The developed method of parallel computation thus achieves ideal speedup and efficiency characteristics
43. HPC Curriculum: Parallel algorithms
- Example: matrix multiplication by Cannon's method
  - Efficiency analysis (detailed estimates):
    - Cannon's algorithm differs from Fox's algorithm only in the types of communication operations, namely:
      - the time of the initial redistribution of the matrix blocks,
      - the shifts of the matrix blocks after every block multiplication;
    - the total execution time of the parallel algorithm is the sum of the computation time and these communication times
44. HPC Curriculum: Parallel algorithms
- Example: matrix multiplication by Cannon's method
  - Results of computational experiments: comparison of theoretical estimations with experimental results
45. HPC Curriculum: Parallel algorithms
- Example: matrix multiplication by Cannon's method
  - Results of computational experiments: speedup
46. HPC Curriculum: Laboratory classes
- Methods of parallel program development for multiprocessor systems with shared and distributed memory, using the OpenMP and MPI technologies (a shared-memory sketch follows this list),
- Training in developing parallel algorithms and programs for solving computational problems,
- Training in using parallel method libraries for solving complex scientific and engineering problems
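On the shared-memory side, a minimal OpenMP sketch (illustrative, not from the course materials) of the same rectangle-rule computation of π; compile with /openmp (MSVC) or -fopenmp (gcc):

#include <stdio.h>

int main(void) {
  const int n = 1000000;
  const double h = 1.0 / n;
  double sum = 0.0;
  int i;
  /* the iterations are divided among the threads;
     reduction(+:sum) combines the per-thread partial sums */
  #pragma omp parallel for reduction(+:sum)
  for (i = 0; i < n; i++) {
    double x = h * (i + 0.5);
    sum += 4.0 / (1.0 + x * x);
  }
  printf("pi is approximately %.16f\n", h * sum);
  return 0;
}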
47. HPC Curriculum: Laboratory classes
- Computational experiments on parallel systems
48. HPC Curriculum: Laboratory classes
- Intensive use of research and education software systems for modeling computations on various multiprocessor systems and for visualizing parallel computation processes,
- The Parallel Laboratory (ParaLab) system: a software system for studying and investigating parallel methods for solving time-consuming problems
49. HPC Curriculum: Laboratory classes with ParaLab
- Modelling a parallel computing system,
- Choosing computing problems and methods to solve them,
- Carrying out computational experiments,
- Visualizing parallel computations,
- Gathering information and analyzing results (the "experiment log"),
- Data archive
50. HPC Curriculum: Laboratory classes with ParaLab
(Screenshot: the experiment results, the area for experiment data visualization, and the visualization of each processor's operations)
51. HPC Curriculum: Laboratory classes with ParaLab
- Modelling a parallel computing system
52. HPC Curriculum: Laboratory classes with ParaLab
- Choosing computing problems and methods to solve them
53. HPC Curriculum: Laboratory classes with ParaLab
- Computational experiments and parallel computation visualization:
  - matrix computations,
  - sorting
54. HPC Curriculum: Laboratory classes with ParaLab
- Gathering information and analyzing results (the "experiment log")
55. HPC Curriculum: Laboratory classes with ParaLab
- Experience with the system shows that ParaLab can be useful both for novices who are just starting to learn parallel computing and, sometimes, even for experts in this promising sphere of strategic computer technology
56. Winter School on Parallel Computing: 2004, 2005, 2006
- January 25 - February 7, 2004,
- 39 participants from 11 cities in the CIS,
- 6 lecture courses given by leading specialists in parallel computing,
- a scientific seminar
57. Winter School on Parallel Computing: 2004, 2005, 2006
- School syllabus:
  - Technologies of parallel programming (Gergel V., NNSU; Popova N., MSU),
  - Parallel databases (Sokolinsky L., ChelSU),
  - Parallel computation models (on the basis of the DVM system): Krukov V. (IPM RAN),
  - Parallel computational algorithms (Yakobovski M., IMM RAN)
58. Winter School on Parallel Computing: 2004, 2005, 2006
- School highlights:
  - intensive form of classes (9:00-18:00 daily, with self-instruction work until 21:00),
  - predominance of practical classes and laboratory works,
  - remote access to many Russian high-performance resources (clusters of NNSU, MSU, RCC MSU, ICC RAN, SPbSU, IAP RAN),
  - training on parallel software development tools (Intel),
  - a research and educational seminar for students and scientists
- The Winter School has been supported by Intel
59. Conclusions
- High-performance computing: a challenge for CS and IT,
- The Microsoft vision: clusters under Compute Cluster Server,
- The UNN HPC Curriculum provides the easiest entry into the HPC world
60. Contacts
University of Nizhni Novgorod
23 Gagarin Avenue, 603950, Nizhni Novgorod
Tel: +7 (8312) 65-48-59
E-mail: gergel@unn.ac.ru
61. Questions, remarks, something to add?