Overview of Extreme-Scale Software Research in China (presentation transcript)

1
Overview of Extreme-Scale Software Research in
China
  • Depei Qian
  • Sino-German Joint Software Institute (JSI)
  • Beihang University
  • China-USA Computer Software Workshop
  • Sep. 27, 2011

2
Outline
  • Related R&D efforts in China
  • Algorithms and Computational Methods
  • HPC and e-Infrastructure
  • Parallel programming frameworks
  • Programming heterogeneous systems
  • Advanced compiler technology
  • Tools
  • Domain specific programming support

3
Related R&D efforts in China
  • NSFC
    • Basic algorithms and computable modeling for high performance scientific computing
    • Network based research environment
    • Many-core parallel programming
  • 863 program
    • High productivity computer and Grid service environment
    • Multicore/many-core programming support
    • HPC software for earth system modeling
  • 973 program
    • Parallel algorithms for large scale scientific computing
    • Virtual computing environment

4
Algorithms and Computational Methods
5
NSFC's Key Initiative on Algorithm and Modeling
  • Basic algorithms and computable modeling for high
    performance scientific computing
  • 8-year, launched in 2011
  • 180 million Yuan funding
  • Focused on
  • Novel computational methods and basic parallel
    algorithms
  • Computable modeling for selected domains
  • Implementation and verification of parallel
    algorithms by simulation

6
  • HPC e-Infrastructure

7
863's key projects on HPC and Grid
  • High Productivity Computer and Grid Service Environment
  • Period: 2006-2010
  • 940 million Yuan from the MOST and more than 1 billion Yuan in matching money from other sources
  • Major R&D activities
  • Developing PFlops computers
  • Building up a grid service environment (CNGrid)
  • Developing Grid and HPC applications in selected areas

8
CNGrid GOS Architecture
9
Abstractions
  • Grid community: Agora
  • persistent information storage and organization
  • Grid process: Grip
  • runtime control

10
CNGrid GOS deployment
  • CNGrid GOS is deployed on 11 sites and some application grids
  • Supports heterogeneous HPCs: Galaxy, Dawning, DeepComp
  • Supports multiple platforms: Unix, Linux, Windows
  • Uses public network connections, enabling only the HTTP port
  • Flexible clients
  • Web browser
  • Special client
  • GSML client

11
Tsinghua University: 1.33 TFlops, 158 TB storage, 29 applications, 100 users, IPv4/v6 access
CNIC: 150 TFlops, 1.4 PB storage, 30 applications, 269 users all over the country, IPv4/v6 access
IAPCM: 1 TFlops, 4.9 TB storage, 10 applications, 138 users, IPv4/v6 access
Shandong University: 10 TFlops, 18 TB storage, 7 applications, 60 users, IPv4/v6 access
GSCC: 40 TFlops, 40 TB storage, 6 applications, 45 users, IPv4/v6 access
SSC: 200 TFlops, 600 TB storage, 15 applications, 286 users, IPv4/v6 access
XJTU: 4 TFlops, 25 TB storage, 14 applications, 120 users, IPv4/v6 access
USTC: 1 TFlops, 15 TB storage, 18 applications, 60 users, IPv4/v6 access
HUST: 1.7 TFlops, 15 TB storage, IPv4/v6 access
SIAT: 10 TFlops, 17.6 TB storage, IPv4/v6 access
HKU: 20 TFlops, 80 users, IPv4/v6 access
12
CNGrid resources
  • 11 sites
  • >450 TFlops
  • 2,900 TB storage
  • Three PF-scale sites will be integrated into CNGrid soon

13
CNGrid: services and users
  • 230 services
  • >1,400 users
  • China Commercial Aircraft Corp.
  • Bao Steel
  • Automobile industry
  • Institutes of CAS
  • Universities

14
CNGrid: applications
  • Supporting >700 projects
  • 973, 863, NSFC, CAS Innovative, and Engineering projects

15
Parallel programming frameworks
16
JASMIN: A parallel programming framework
(Diagram: application models, stencils, and algorithms are separated into special and common parts; the common parts are provided as a library running on the underlying computers.)
  • Also supported by the 973 and 863 projects
17
Basic ideas
  • Hide the complexity of programming millions of cores
  • Integrate the efficient implementations of
    parallel fast numerical algorithms
  • Provide efficient data structures and solver
    libraries
  • Support software engineering for code
    extensibility.

18
Basic Ideas
(Diagram: scaling up from serial programming on a personal computer, to a TeraFlops cluster, to a PetaFlops MPP, using the JASMIN infrastructures.)
19
JASMIN
(Diagram: JASMIN supports structured grids, unstructured grids, and particle simulation, for applications including inertial confinement fusion, global climate modeling, CFD, and material simulations.)
20
JASMIN
  • User provides: physics, parameters, numerical methods, expert experiences, special algorithms, etc.
  • User interfaces: component-based parallel programming models (C++ classes)
  • Numerical algorithms: geometry, fast solvers, mature numerical methods, time integrators, etc.
  • HPC implementations (thousands of CPUs): data structures, parallelization, load balancing, adaptivity, visualization, restart, memory, etc.
  • Architecture: multilayered, modularized, object-oriented; codes in C/C++/F90/F77 with MPI/OpenMP, about 500,000 lines; installs on personal computers, clusters, and MPPs.
21
Numerical simulations on TianHe-1A
Codes      CPU cores    Codes              CPU cores
LARED-S    32,768       RH2D               1,024
LARED-P    72,000       HIME3D             3,600
LAP3D      16,384       PDD3D              4,096
MEPH3D     38,400       LARED-R            512
MD3D       80,000       LARED Integration  128
RT3D       1,000
Simulation duration: several hours to tens of hours.
22
Programming heterogeneous systems
23
GPU programming support
  • Source to source translation
  • Runtime optimization
  • Mixed programming model for multi-GPU systems

24
S2S translation for GPU
  • A source-to-source translator, GPU-S2S, for GPU
    programming
  • Facilitate the development of parallel programs
    on GPU by combining automatic mapping and static
    compilation

25
S2S translation for GPU (cont'd)
  • Insert directives into the source program (a directive sketch follows this list)
  • Guide implicit calls of the CUDA runtime libraries
  • Enable the user to control the mapping from the homogeneous CPU platform to the GPU's streaming platform
  • Optimization based on runtime profiling
  • Take full advantage of the GPU according to the application characteristics by collecting runtime dynamic information
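
The slides do not show GPU-S2S's concrete directive syntax, so the fragment below is only a hypothetical illustration of directive-guided source-to-source translation: the pragma name and clauses are assumptions, and the translator would be expected to generate the CUDA kernel, device allocations, and host-device copies implied by the annotation.

    /* Hypothetical input to a GPU-S2S style translator; the directive and its
       clauses are illustrative, not the tool's real syntax. */
    #define N (1 << 20)

    void vec_add(const float *a, const float *b, float *c)
    {
    #pragma gpus2s kernel copyin(a[0:N], b[0:N]) copyout(c[0:N])
        for (int i = 0; i < N; i++)
            c[i] = a[i] + b[i];
    }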

26
The GPU-S2S architecture
27
Program translation by GPU-S2S
28
Runtime optimization based on profiling
First level profiling (function level)
Second level profiling (memory access and kernel improvement)
Third level profiling (data partition)
29
First level profiling
  • Identify computing kernels
  • Instrument and scan the source code, get the execution time of every function, and identify the computing kernels

30
Second level profiling
  • Identify the memory access pattern and improve the kernels
  • Instrument the computing kernels
  • Extract and analyze the profile information, optimize according to the features of the application, and finally generate the CUDA code with optimized kernels (a tiling sketch follows)
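
The later comparison slides contrast kernels that use only global memory with kernels whose memory accesses have been optimized. A typical transformation of that kind is shared-memory tiling; the sketch below shows the idea for matrix multiplication, where the kernel name, tile size, and the assumption that n is a multiple of the tile size are illustrative and not taken from GPU-S2S.

    #define TILE 16

    /* Illustrative memory-access optimization: each thread block stages TILE x TILE
       tiles of A and B in shared memory so global memory is read once per tile
       instead of once per multiply-add. Launch with TILE x TILE thread blocks
       covering an n x n result, n assumed to be a multiple of TILE. */
    __global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
    {
        __shared__ float As[TILE][TILE];
        __shared__ float Bs[TILE][TILE];
        int row = blockIdx.y * TILE + threadIdx.y;
        int col = blockIdx.x * TILE + threadIdx.x;
        float acc = 0.0f;

        for (int t = 0; t < n / TILE; t++) {
            As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
            Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
            __syncthreads();
            for (int k = 0; k < TILE; k++)
                acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
            __syncthreads();
        }
        C[row * n + col] = acc;
    }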

31
Third level profiling
  • Optimization by improving the data partitioning (a stream sketch follows)
  • Get the copy time and computing time by instrumentation
  • Compute the number of streams and the data size of each stream
  • Generate the optimized CUDA code with streams
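
A minimal sketch of the stream-based data partition the third profiling level produces: the input is split into chunks, and each chunk's host-to-device copy, kernel launch, and device-to-host copy go on their own CUDA stream so transfers and computation overlap. The chunk count, the process_chunk kernel, and the pinned host buffers are assumptions for illustration.

    /* Assumed: N is divisible by NSTREAMS, h_in/h_out are pinned (cudaHostAlloc)
       buffers, and process_chunk is the application kernel. */
    #define NSTREAMS 4

    __global__ void process_chunk(const float *in, float *out, int n);

    void run_streamed(const float *h_in, float *h_out, float *d_in, float *d_out, int N)
    {
        cudaStream_t s[NSTREAMS];
        int chunk = N / NSTREAMS;
        for (int i = 0; i < NSTREAMS; i++)
            cudaStreamCreate(&s[i]);

        for (int i = 0; i < NSTREAMS; i++) {
            int off = i * chunk;
            cudaMemcpyAsync(d_in + off, h_in + off, chunk * sizeof(float),
                            cudaMemcpyHostToDevice, s[i]);
            process_chunk<<<(chunk + 255) / 256, 256, 0, s[i]>>>(d_in + off, d_out + off, chunk);
            cudaMemcpyAsync(h_out + off, d_out + off, chunk * sizeof(float),
                            cudaMemcpyDeviceToHost, s[i]);
        }
        for (int i = 0; i < NSTREAMS; i++) {
            cudaStreamSynchronize(s[i]);
            cudaStreamDestroy(s[i]);
        }
    }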

32
Matrix multiplication: performance comparison before and after profiling
The CUDA code with three-level profiling optimization achieves a 31% improvement over the CUDA code with only memory access optimization, and a 91% improvement over the CUDA code using only global memory for computing.
Execution performance comparison on different platforms
33
The CUDA code after three-level profiling optimization achieves a 38% improvement over the CUDA code with memory access optimization, and a 77% improvement over the CUDA code using only global memory for computing.
FFT (1,048,576 points): performance comparison before and after profiling
FFT (1,048,576 points): execution performance comparison on different platforms
34
Programming Multi-GPU systems
  • The memory of a CPU+GPU system is both distributed and shared, so it is feasible to use the MPI and PGAS programming models for this new kind of system.
  • MPI and PGAS
  • Using message passing or shared data for communication between parallel tasks or GPUs

35
Mixed Programming Model
(Diagram: the traditional programming model (MPI/UPC) is combined with CUDA for NVIDIA GPUs, giving MPI+CUDA and UPC+CUDA; CUDA handles program execution on the GPU.)
36
MPI+CUDA experiment
  • Platform
  • 2 NF5588 servers, each equipped with
  • 1 Xeon CPU (2.27 GHz), 12 GB main memory
  • 2 NVIDIA Tesla C1060 GPUs (GT200 architecture, 4 GB device memory)
  • 1 Gbit Ethernet
  • Red Hat Linux 5.3
  • CUDA Toolkit 2.3 and CUDA SDK
  • Open MPI 1.3
  • Berkeley UPC 2.1

37
MPI+CUDA experiment (cont'd)
  • Matrix multiplication program
  • Uses block matrix multiplication for the UPC programming.
  • Data is spread across the UPC threads.
  • The computing kernel carries out the multiplication of two blocks at a time, implemented in CUDA.
  • The total execution time: Tsum = Tcom + Tcuda = Tcom + Tcopy + Tkernel (a measurement sketch follows this list)
  • Tcom: UPC thread communication time
  • Tcuda: CUDA program execution time
  • Tcopy: data transmission time between host and device
  • Tkernel: GPU computing time
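
A sketch of how the decomposition above could be measured: MPI_Wtime brackets the communication phase (Tcom), and CUDA events bracket the host-device transfers (Tcopy) and the kernel (Tkernel), with Tcuda = Tcopy + Tkernel. The block_multiply kernel and the use of a barrier as a stand-in for the real data exchange are assumptions.

    #include <mpi.h>
    #include <cuda_runtime.h>

    __global__ void block_multiply(const float *a, const float *b, float *c, int n);

    /* Measures Tcom, Tcopy and Tkernel for one block multiplication step;
       n is assumed to be a multiple of 16. Times from CUDA events are in ms. */
    void measure_step(float *h_a, float *h_b, float *h_c,
                      float *d_a, float *d_b, float *d_c, int n,
                      double *t_com, float *t_copy_ms, float *t_kernel_ms)
    {
        double t0 = MPI_Wtime();
        MPI_Barrier(MPI_COMM_WORLD);          /* stands in for the UPC/MPI data exchange */
        *t_com = MPI_Wtime() - t0;

        cudaEvent_t e0, e1, e2, e3;
        cudaEventCreate(&e0); cudaEventCreate(&e1);
        cudaEventCreate(&e2); cudaEventCreate(&e3);

        cudaEventRecord(e0);
        cudaMemcpy(d_a, h_a, (size_t)n * n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(d_b, h_b, (size_t)n * n * sizeof(float), cudaMemcpyHostToDevice);
        cudaEventRecord(e1);
        block_multiply<<<dim3(n / 16, n / 16), dim3(16, 16)>>>(d_a, d_b, d_c, n);
        cudaEventRecord(e2);
        cudaMemcpy(h_c, d_c, (size_t)n * n * sizeof(float), cudaMemcpyDeviceToHost);
        cudaEventRecord(e3);
        cudaEventSynchronize(e3);

        float copy_in, copy_out;
        cudaEventElapsedTime(&copy_in, e0, e1);      /* host-to-device copies */
        cudaEventElapsedTime(t_kernel_ms, e1, e2);   /* kernel time           */
        cudaEventElapsedTime(&copy_out, e2, e3);     /* device-to-host copy   */
        *t_copy_ms = copy_in + copy_out;
    }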

38
MPI+CUDA experiment (cont'd)
(Charts: 2 servers with at most 8 MPI tasks vs. 1 server with 2 GPUs)
  • For 4096 x 4096 matrices, the speedup of 1 MPI+CUDA task (using 1 GPU for computing) is 184x of the case with 8 MPI tasks.
  • For small-scale data, such as 256 or 512, the execution time with 2 GPUs is even longer than with 1 GPU:
  • the computing scale is too small, and the communication between the two tasks overwhelms the reduction in computing time.

39
PKU Manycore Software Research Group
  • Software tool development for GPU clusters
  • Unified multicore/manycore/clustering programming
  • Resilience technology for very-large GPU clusters
  • Software porting service
  • Joint project, <3k-line code, supporting Tianhe
  • Advanced training program

40
PKU-Tianhe Turbulence Simulation
  • PKUFFT (using GPUs)
  • Reached a scale 43 times larger than that achieved on the Earth Simulator
  • 7168 nodes / 14336 CPUs / 7168 GPUs
  • FFT speed: 1.6x that of Jaguar
  • Proof of feasibility of GPU speed-up for large-scale systems
(Chart legends: PKUFFT (using GPUs), MKL (not using GPUs), Jaguar)
41
Advanced Compiler Technology
42
Advanced Compiler Technology (ACT) Group at the
ICT, CAS
  • ACT's current research
  • Parallel programming languages and models
  • Optimized compilers and tools for HPC (Dawning)
    and multi-core processors (Loongson)
  • Will lead the new multicore/many-core programming
    support project

43
PTA: Process-based TAsk parallel programming model
  • New process-based task construct
  • With properties of isolation, atomicity, and deterministic submission
  • Annotates a loop into two parts, prologue and task segment (a usage sketch follows this list)
  • #pragma pta parallel [clauses]
  • #pragma pta task
  • #pragma pta propagate (varlist)
  • Suitable for expressing coarse-grained, irregular parallelism on loops
  • Implementation and performance
  • PTA compiler, runtime system, and assistant tool (helps writing correct programs)
  • Speedup: 4.62 to 43.98 (average 27.58) on 48 cores; 3.08 to 7.83 (average 6.72) on 8 cores
  • Code changes are within 10 lines, much fewer than with OpenMP
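
The pragma names above come from the slide, but their exact syntax and semantics are not shown, so the fragment below is only a guess at how a loop might be annotated: the prologue/task split, the propagate clause usage, and the helper functions (fetch_item, process_item, item_t) are all hypothetical.

    /* Hypothetical PTA-annotated loop: each iteration's task segment runs as an
       isolated, atomic task process, and updates to 'result' are propagated
       back at deterministic submission. */
    #pragma pta parallel
    for (int i = 0; i < nitems; i++) {
        item_t it = fetch_item(i);        /* prologue: runs in the parent process */
    #pragma pta task
        {
            process_item(&it);            /* task segment: isolated and atomic */
    #pragma pta propagate(result)
            result[i] = it.score;         /* propagated back on commit */
        }
    }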

44
UPC-H: A Parallel Programming Model for Deep Parallel Hierarchies
  • Hierarchical UPC
  • Provides multi-level data distribution
  • Implicit and explicit hierarchical loop parallelism
  • Hybrid execution model: SPMD with fork-join
  • Multi-dimensional data distribution and super-pipelining
  • Implementations on CUDA clusters and the Dawning 6000 cluster
  • Based on Berkeley UPC
  • Enhanced optimizations such as localization and communication optimization
  • Supports SIMD intrinsics
  • CUDA cluster: 72% of the hand-tuned version's performance, with code size reduced to 68%
  • Multi-core cluster: better process mapping and cache reuse than UPC

45
OpenMP and Runtime Support for Heterogeneous
Platforms
  • Heterogeneous platforms consisting of CPUs and GPUs
  • Multiple GPUs or CPU-GPU cooperation bring extra data transfers that hurt the performance gain
  • Programmers need a unified data management system
  • OpenMP extension (a sketch follows this list)
  • Specify the partitioning ratio to optimize data transfers globally
  • Specify heterogeneous blocking sizes to reduce false sharing among computing devices
  • Runtime support
  • DSM system based on the specified blocking size
  • Intelligent runtime prefetching with the help of compiler analysis
  • Implementation and results
  • On the OpenUH compiler
  • Gains 1.6x speedup through prefetching on NPB/SP (class C)
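
The slide names the extension only informally, so the directive below is hypothetical: a data-parallel loop whose iterations are split between CPU and GPU with assumed device(), ratio(), and blocksize() clauses standing in for the described partitioning-ratio and heterogeneous-blocking extensions; none of these clauses are standard OpenMP.

    /* Hypothetical OpenMP extension (not standard OpenMP): give 30% of the
       iterations to the CPU and 70% to the GPU, with different blocking sizes
       per device to reduce false sharing in the runtime DSM. */
    void saxpy_hetero(int n, float a, const float *x, float *y)
    {
    #pragma omp parallel for device(cpu, gpu) ratio(0.3, 0.7) blocksize(64, 1024)
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }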

46
Analyzers based on Compiling Techniques for MPI
programs
  • Communication slicing and process mapping tool
  • Compiler part
  • PDG building and slice generation
  • Iteration set transformation for approximation
  • Optimized mapping tool
  • Weighted graph, hardware characteristics
  • Graph partitioning and feedback-based evaluation
  • Memory bandwidth measuring tool for MPI programs
  • Detect the burst of bandwidth requirements
  • Enhance the performance of MPI error checking
  • Redundant error checking removal by dynamically
    turning on/off the global error checking
  • With the help of compiler analysis on
    communicators
  • Integrated with a model checking tool (ISP) and a
    runtime checking tool (MARMOT)

47
LoongCC An Optimizing Compiler for Loongson
Multicore Processors
  • Based on Open64-4.2 and supporting C/C++/Fortran
  • Open source at http://svn.open64.net/svnroot/open64/trunk/
  • Powerful optimizer and analyzer with better performance
  • SIMD intrinsic support
  • Memory locality optimization
  • Data layout optimization
  • Data prefetching
  • Load/store grouping for 128-bit memory access
    instructions
  • Integrated with Aggressive Auto Parallelization
    Optimization (AAPO) module
  • Dynamic privatization
  • Parallel model with dynamic alias optimization
  • Array reduction optimization

48
Tools
49
Testing and evaluation of HPC systems
  • A center led by Tsinghua University (Prof.
    Wenguang Chen)
  • Developing accurate and efficient testing and
    evaluation tools
  • Developing benchmarks for HPC evaluation
  • Provide services to HPC developers and users

50
LSP3AS: large-scale parallel program performance analysis system
  • Designed for performance tuning on peta-scale HPC systems
  • Method
  • Source code is instrumented
  • Instrumented code is executed, generating profiling/tracing data files
  • The profiling/tracing data is analyzed and a visualization report is generated
  • Instrumentation is based on TAU from the University of Oregon

51
LSP3AS: large-scale parallel program performance analysis system
  • Scalable performance data collection
  • Distributed data collection and transmission: eliminates bottlenecks in the network and in data processing
  • Dynamic compensation: reduces the influence of the performance data volume
  • Efficient data transmission: uses Remote Direct Memory Access (RDMA) to achieve high bandwidth and low latency

52
LSP3AS: large-scale parallel program performance analysis system
  • Analysis and visualization
  • Data analysis: iteration-based clustering is used
  • Visualization: clustering visualization based on hierarchical classification

53
SimHPC Parallel Simulator
  • Challenge for HPC simulation: performance
  • Target system: >1,000 nodes and processors
  • Difficult for traditional architecture simulators, e.g. Simics
  • Our solution
  • Parallel simulation
  • Using a cluster to simulate a cluster
  • Use the same type of node in the host system as in the target
  • Advantage: no need to model and simulate detailed components, such as processor pipelines and caches
  • Execution-driven, full-system simulation; supports execution of Linux and applications, including benchmarks (e.g. Linpack)

54
SimHPC Parallel Simulator (cont'd)
  • Analysis
  • The execution time of a process in the target system is composed of (see the sketch after this list):
  • Trun: execution time of instruction sequences
  • TIO: I/O blocking time, such as reading/writing files, sending/receiving messages
  • Tready: waiting time in the ready state
  • So our simulator needs to
  • Capture system events
  • process scheduling
  • I/O operations: read/write files, MPI send()/recv()
  • Simulate the I/O and interconnection network subsystems
  • Synchronize the timing of each application process
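
A minimal sketch of the per-process time accounting implied above (the type and field names are mine, not SimHPC's): the simulator accumulates the three components for each target process and uses their sum as that process's target-system time when synchronizing.

    /* Per-process time accounting: target time = Trun + TIO + Tready. */
    typedef struct {
        double t_run;    /* execution time of instruction sequences        */
        double t_io;     /* I/O blocking time: file r/w, MPI send()/recv() */
        double t_ready;  /* waiting time in the ready state                */
    } proc_time_t;

    static double target_time(const proc_time_t *p)
    {
        return p->t_run + p->t_io + p->t_ready;
    }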

55
SimHPC Parallel Simulator (cont'd)
  • System architecture
  • Application processes of multiple target nodes are allocated to one host node
  • number of host nodes << number of target nodes
  • Events are captured on the host node while the application is running
  • Events are sent to the central node for time analysis, synchronization, and simulation
56
SimHPC Parallel Simulator (cont'd)
  • Experiment results
  • Host: 5 IBM HS21 blades (2-way Xeon)
  • Target: 32 to 1,024 nodes
  • OS: Linux
  • App: Linpack (HPL)
  • Simulation slowdown

(Charts: simulation slowdown; simulation error test; communication time and Linpack performance for fat-tree and 2D-mesh interconnection networks)
57
System-level Power Management
  • Power-aware job scheduling algorithm (a sketch follows this list)
  • Suspend a node if its idle time > threshold
  • Wake up nodes if there are not enough nodes to execute jobs, while avoiding node thrashing between the busy and suspended states
  • The algorithm is integrated into OpenPBS
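
A minimal sketch of the scheduling policy described above, assuming helper hooks (suspend_node, wake_node) that wrap whatever node control a batch system such as OpenPBS exposes; the threshold values and the minimum-residency guard against thrashing are assumptions, not taken from the slides.

    /* Assumed batch-system hooks. */
    void suspend_node(int node_id);
    void wake_node(int node_id);

    #define IDLE_THRESHOLD 600.0   /* suspend after 10 minutes idle (assumed)     */
    #define MIN_RESIDENCY  300.0   /* minimum time in a state, to avoid thrashing */

    typedef struct {
        int id;
        int busy;            /* currently running a job             */
        int suspended;       /* currently suspended                 */
        double idle_s;       /* seconds since the node became idle  */
        double state_age_s;  /* seconds since the last state change */
    } node_t;

    /* One pass of the power-aware scheduler: suspend long-idle nodes, wake
       suspended nodes when queued jobs outnumber available ones, and never
       flip a node's state before MIN_RESIDENCY has elapsed. */
    void power_schedule(node_t *nodes, int n, int queued_jobs)
    {
        int available = 0;
        for (int i = 0; i < n; i++)
            if (!nodes[i].suspended && !nodes[i].busy)
                available++;

        for (int i = 0; i < n; i++) {
            node_t *nd = &nodes[i];
            if (!nd->suspended && !nd->busy && available > queued_jobs &&
                nd->idle_s > IDLE_THRESHOLD && nd->state_age_s > MIN_RESIDENCY) {
                suspend_node(nd->id);
                nd->suspended = 1;
                nd->state_age_s = 0;
                available--;
            } else if (nd->suspended && queued_jobs > available &&
                       nd->state_age_s > MIN_RESIDENCY) {
                wake_node(nd->id);
                nd->suspended = 0;
                nd->state_age_s = 0;
                available++;
            }
        }
    }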

58
System-level Power Management (cont'd)
  • Power Management Tool
  • Monitor the power-related status of the system
  • Reduce runtime power consumption of the machine
  • Multiple power management policies
  • Manual-control
  • On-demand control
  • Suspend-enable

Layers of Power Management
59
System-level Power Management (cont'd)
  • Power Management Test
  • On 5 IBM HS21 blades

Task Load        Power Management   Task Exec.   Power             Performance    Power
(tasks/hour)     Policy             Time (s)     Consumption (J)   Slowdown (%)   Saving (%)
20               On-demand          3.55         1,778,077          5.15          -1.66
20               Suspend            3.60         1,632,521          9.76          -12.74
200              On-demand          3.55         1,831,432          4.62          -3.84
200              Suspend            3.65         1,683,161         10.61          -10.78
800              On-demand          3.55         2,132,947          3.55          -7.05
800              Suspend            3.66         2,123,577         11.25          -9.34

Power management test for different task loads (compared to no power management)
60
Domain specific programming support
61
Parallel Computing Platform for Astrophysics
  • Joint work
  • Shanghai Astronomical Observatory, CAS (SHAO),
  • Institute of Software, CAS (ISCAS)
  • Shanghai Supercomputer Center (SSC)
  • Build a high performance parallel computing
    software platform for astrophysics research,
    focusing on the planetary fluid dynamics and
    N-body problems
  • New parallel computing models and parallel
    algorithms studied, validated and adopted to
    achieve high performance.

62
Architecture
63
PETSc Optimized (Speedup: 15-26x)
  • Method 1: Domain decomposition ordering method for field coupling
  • Method 2: Preconditioner for the domain decomposition method
  • Method 3: PETSc multi-physics data structure

(Charts: strong scalability, original code vs. new code (near ideal); left mesh 128 x 128 x 96, right mesh 192 x 192 x 128; computation speedup 15-26x; test environment: BlueGene/L at NCAR, HPCA 2009)
64
Strong Scalability on TianHe-1A
65
CLeXML Math Library
66
BLAS2 Performance MKL vs. CLeXML
67
HPC Software support for Earth System Modeling
  • Led by Tsinghua University
  • Tsinghua
  • Beihang University
  • Jiangnan Computing Institute
  • Peking University
  • Part of the national effort on climate change
    study

68
Earth System Model Development Workflow
69
Major research activities
71
Expected Results
72
Potential cooperation areas
  • Software for exa-scale computer systems
  • Power
  • Performance
  • Programmability
  • Resilience
  • CPU/GPU hybrid programming
  • Parallel algorithms and parallel program
    frameworks
  • Large scale parallel applications support
  • Applications requiring ExaFlops computers

73
Thank you!