Title: Cluster OpenMP Benchmark of 64-bit PC Cluster
1Cluster OpenMP Benchmark of 64-bit PC Cluster
Yao-Huan Tseng Institute of Astronomy and
Astrophysics, Academia Sinica, ROC
Abstract Although the MPI has become
accepted as a portable style of parallel
programming, but has several significant
weaknesses that limit its effectiveness and
scalability. MPI in general is difficult to
program and doesn't support incremental
parallelization of an existing sequential
program. With the current trend in parallel
computer architectures towards clusters of shared
memory symmetric multi-processors (SMP), e.g.
dual-core, quad-core, dual dual-core. Clusters of
SMP (Symmetric Multi-Processors) nodes provide
support for a wide range of parallel programming
paradigms. The shared address space within each
node is suitable for OpenMP parallelization. I
have implemented a direct nbody program with
OpenMP and test on IAA PC Cluster to demonstrate
the feasibility and efficiency. The most benefit
is easy to use.
- Hardware
- 8 Intel single-cpu PCs (3GHz P4, 2GB DDR, 80GB
IDE, Gbps NIC) - 4 AMD2 PCs (1.0GHz AMD dual core 5000, 2GB
DDR2, 120GB IDE, Gbps NIC) - 4 dual AMD PCs (1.8GHz AMD Opetron244, 2GB DDR,
120GB IDE, Gbps NIC) - 2 dual AMD2 PCs (2GHz Opteron270 dual core, 8GB
DDR2, 120GB IDE, Gbps NIC) - 2 Intel dual-core PCs (3GHz Pentium, 2GB DDR,
80GB IDE, Gbps NIC)
- Software
- a simple direct nbody program (particle number
100,000) - OpenMP directives
- OpenMP is a specification for a set of compiler
directives, library routines, and environment
variables that can be used to specify shared
memory parallelism in Fortran and C/C programs.
- Intel fortran compiler v9.1 with Cluster OpenMP
- Cluster OpenMP is an implementation of OpenMP
that can make use of multiple SMP machines
without resorting to MPI.
node x thread 1 x 2 1 node, each one with 2
threads 2 x 1 2 nodes, each one with 1
threads 2 x 2 2 nodes, each one with 2
threads 4 x 1 4 nodes, each one with 1 thread 1
x 4 1 nodes, each one with 4 threads 4 x 2 4
nodes, each one with 2 threads 2 x 4 2 nodes,
each one with 4 threads
Fig6
- Summary
- Only six OpenMP directives added in the code, the
efficiency of a direct nbody simulation can speed
up factor of 7 with 8 AMD CPUs running
simultaneously (Fig4.). - Although the frequency of AMD CPU is lower than
Intel CPU, the performance per GHz is almost the
same for different CPUs. Except the AMD2
dual-core, AMD2 is about 3 times of the others
(Fig6). - Because of limiting memory bandwidth, the
efficiency of dual CPU is better than dual-core
CPU (Fig2 and Fig4). The hyper-threading
technology of Intel CPU is the worst case. Do not
submit more than one thread at one single CPU. - In Linpack test, the performance of two CPUs is
two times of one CPU for a dual AMD machine
(Fig10). But the performance is only 1.01 for a
Intel dual-core machine (Fig8).
2007/3/20