1
Brief presentation of Earth Simulation Center
  • Jang, Jae-Wan

2
Hardware configuration
  • Highly parallel vector supercomputer of the
    distributed-memory type
  • 640 Processor nodes (PNs)
  • Each PN contains
  • 8 vector-type arithmetic processors (APs)
  • 16 GB main memory
  • Remote control and I/O parts

3
Arithmetic processor
4
Processor node
5
Processor node
6
Interconnection network
7
Interconnection Network
8
Earth Simulator Research and Development Center
[Facility image, approx. 65 m x 50 m]
9
Software
  • OS
  • NEC's UNIX-based OS, SUPER-UX
  • Programming model (see table below)
  • Supported languages
  • Fortran90, C, C++ (modified for ES)

                Hybrid                     Flat
  Inter-PN      HPF / MPI                  HPF / MPI
  Intra-PN      Microtasking / OpenMP      HPF / MPI
  AP            Automatic vectorization    Automatic vectorization
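
The table maps onto code roughly as follows. Below is a minimal hybrid-model sketch in C, assuming a generic MPI + OpenMP toolchain rather than the ES-specific HPF and microtasking compilers; the array size and the dot-product kernel are invented for illustration.

```c
/* Hybrid programming sketch: MPI across processor nodes (PNs),
 * OpenMP across the arithmetic processors (APs) inside a PN,
 * and a simple inner loop left for automatic vectorization.
 * Illustrative only; names and sizes are made up. */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(int argc, char **argv)
{
    static double a[N], b[N];
    int rank, nprocs;

    MPI_Init(&argc, &argv);                 /* inter-PN level */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    double local = 0.0;
    /* intra-PN level: one thread per AP (8 per PN on the ES) */
    #pragma omp parallel for reduction(+:local)
    for (int i = 0; i < N; i++) {
        a[i] = i * 0.5;
        b[i] = i * 0.25;
        local += a[i] * b[i];               /* AP level: vectorizable loop body */
    }

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("dot product (per-node copy) x %d nodes: %g\n", nprocs, global);

    MPI_Finalize();
    return 0;
}
```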
10
Earth Simulator Center
First results from the Earth Simulator
Resolution ≈ 300 km
11
Earth Simulator Center
First results from the Earth Simulator
Resolution ≈ 120 km
12
Earth Simulator Center
First results from the Earth Simulator
Resolution ≈ 20 km
13
Earth Simulator Center
First results from the Earth Simulator
Resolution ≈ 10 km
14
First results from the Earth Simulator
  • Resolution: 0.1° × 0.1° (≈ 10 km)
  • Initial condition: Levitus data (1982)
  • Computer resources: 175 nodes,
    elapsed time ≈ 8,100 hours
15
First results from the Earth Simulator
16
Terascale Cluster: System X
  • Virginia Tech, Apple, Mellanox, Cisco, and
    Liebert
  • 2003. 3. 16
  • Daewoo Lee

17
Terascale Cluster System X
  • A Groundbreaking Supercomputer Cluster with
    Industrial Assistance
  • Apple, Mellanox, Cisco, and Liebert
  • $5.2 million for hardware
  • 10,280 / 17,600 GFlops (sustained / peak) with 1,100 nodes
    (ranked 3rd on the TOP500 supercomputer list)

18
Goals
Dual Usage Mode (90% of computational cycles
devoted to production use)
19
Hardware Architecture
Node: Apple G5 platform, dual IBM PowerPC 970 (64-bit CPU)
Primary communication: InfiniBand by Mellanox (20 Gbps full duplex, fat-tree topology)
Secondary communication: Gigabit Ethernet by Cisco
Cooling: system by Liebert
20
Software
  • Mac OS X (FreeBSD-based)
  • MPI-2 (MPICH2); see the sketch below
  • Supports C/C++/Fortran compilation
  • Déjà vu transparent fault-tolerance system
  • Maintains system stability by transparently
    migrating a failed application to another node,
    keeping the application running intact.
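
As an illustration of the MPI-2 feature set mentioned above, here is a small one-sided-communication sketch in C. It is a generic MPICH2-style example, not code specific to System X; the buffer layout is invented for the example.

```c
/* MPI-2 one-sided communication sketch: every non-zero rank puts its
 * rank number into a window exposed by rank 0. Illustrative only. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, nprocs;
    int *buf = NULL;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    if (rank == 0) {
        /* Rank 0 owns the memory the window exposes. */
        MPI_Alloc_mem(nprocs * (MPI_Aint)sizeof(int), MPI_INFO_NULL, &buf);
        for (int i = 0; i < nprocs; i++)
            buf[i] = (i == 0) ? 0 : -1;
    }

    /* Other ranks expose an empty (zero-byte) window. */
    MPI_Win_create(buf, rank == 0 ? nprocs * (MPI_Aint)sizeof(int) : 0,
                   sizeof(int), MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_fence(0, win);                  /* open the access epoch */
    if (rank != 0)
        MPI_Put(&rank, 1, MPI_INT, 0, rank, 1, MPI_INT, win);
    MPI_Win_fence(0, win);                  /* close the epoch */

    if (rank == 0)
        for (int i = 0; i < nprocs; i++)
            printf("slot %d holds rank %d\n", i, buf[i]);

    MPI_Win_free(&win);                     /* free window before its memory */
    if (rank == 0)
        MPI_Free_mem(buf);
    MPI_Finalize();
    return 0;
}
```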

21
Reference
  • Terascale Cluster Web Site
  • http://computing.vt.edu/research_computing/terascale

22
4th fastest supercomputer: Tungsten
PAK, EUNJI
23
4th NCSA Tungsten
  • Top500.org
  • National Center for Supercomputing Applications
    (NCSA)
  • University of Illinois at Urbana-Champaign

24
Tungsten Architecture 1/3
  • Tungsten
  • Xeon 3.0 GHz Dell cluster
  • 2,560 processors
  • 3 GB memory/node
  • Peak performance: 15.36 TF
  • Top 500 list debut: #4 (9.819 TF, November 2003)
  • Currently 4th fastest supercomputer in the world

25
Tungsten Architecture 2/3
  • Components

26
Tungsten Architecture 3/3
  • 1450 nodes
  • Dell PowerEdge 1750 Server
  • Intel Xeon 3.06 GHz, peak performance 6.12 GFLOPS per processor
  • 1280 compute nodes, 104 I/O nodes
  • Parallel I/O
  • 11.1 Gigabytes per second (GB/s) of I/O
    throughput
  • Complements the cluster's 9.8 TFLOPS of
    computational capability
  • 104-node I/O sub-cluster with more than 120 TB
  • Node-local: 73 GB; shared: 122 TB

27
Applications on Tungsten 1/3
  • PAPI and PerfSuite
  • PAPI: portable interface to hardware performance
    counters (see the sketch below)
  • PerfSuite: set of tools for performance analysis
    on Linux platforms
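
A minimal sketch of how PAPI's low-level C API is typically used to read hardware counters around a kernel. The chosen events (PAPI_TOT_CYC, PAPI_FP_OPS) and the measured loop are assumptions for illustration, since counter availability varies by platform.

```c
/* Read hardware counters around a small kernel with PAPI's low-level API. */
#include <stdio.h>
#include <papi.h>

int main(void)
{
    int eventset = PAPI_NULL;
    long long counts[2];

    if (PAPI_library_init(PAPI_VER_CURRENT) != PAPI_VER_CURRENT) {
        fprintf(stderr, "PAPI init failed\n");
        return 1;
    }
    PAPI_create_eventset(&eventset);
    PAPI_add_event(eventset, PAPI_TOT_CYC);   /* total cycles */
    PAPI_add_event(eventset, PAPI_FP_OPS);    /* floating-point operations */

    double sum = 0.0;
    PAPI_start(eventset);
    for (int i = 1; i <= 1000000; i++)        /* the measured kernel */
        sum += 1.0 / i;
    PAPI_stop(eventset, counts);

    printf("sum=%f cycles=%lld fp_ops=%lld\n", sum, counts[0], counts[1]);
    return 0;
}
```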

28
Applications on Tungsten 2/3
  • PAPI and PerfSuite

29
Applications on Tungsten 3/3
  • CHARMM (Harvard Version)
  • Chemistry at Harvard Macromolecular Mechanics
  • General purpose molecular mechanics, molecular
    dynamics and vibrational analysis packages
  • Amber 7.0
  • A set of molecular mechanical force fields for
    the simulation of biomolecules
  • Package of molecular simulation programs

30
MPP2 Supercomputer: The world's largest Itanium2
cluster
  • Molecular Science Computing Facility
  • Pacific Northwest National Laboratory
  • 2004. 3. 16
  • Presentation: Kim SangWon

31
Contents
  • MPP2 Supercomputer Overview
  • Configuration
  • HP rx2600 (Longs Peak) Node
  • QsNet ELAN Interconnect Network
  • System/Application Software
  • File System
  • Future Plan

32
MPP2 Overview
  • MPP2
  • The High Performance Computing System-2
  • At the Molecular Science Computing Facility in
    the William R. Wiley Environmental Molecular
    Sciences Laboratory at Pacific Northwest National
    Laboratory
  • The fifth-fastest supercomputer in the world on
    the November 2003 TOP500 list

33
MPP2 Overview
  • System name: MPP2
  • Linux supercomputer cluster
  • 11.8 Teraflops peak (8.633 sustained)
  • 6.8 Terabytes of memory
  • Purpose: production
  • Platform: HP Integrity rx2600,
    dual Itanium2 1.5 GHz
  • Nodes: 980 (processors: 1,960)
  • ¾ Megawatt of power
  • 220 tons of air conditioning
  • 4,000 sq. ft.
  • Cost: $24.5 million (estimated)
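
For reference, the peak figure is consistent with the processor count and clock rate above, given that Itanium2 can perform up to four floating-point operations per cycle (a processor fact, not stated on the slide):

```latex
1{,}960\ \text{CPUs} \times 1.5\ \text{GHz} \times 4\ \tfrac{\text{FP ops}}{\text{cycle}}
  = 11.76\ \text{Tflops} \approx 11.8\ \text{Tflops (peak)}
```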

[Slide images: UPS and generator]
34
Configuration (Phase 2b)
[System diagram: operational September 2003; 1,900 next-generation Itanium
processors; 11.4 TF, 6.8 TB memory; 1,856 Madison batch CPUs on 928 compute
nodes; Elan4 interconnect not yet operational, Elan3 in use; Lustre SAN / 53 TB;
2 system management nodes; 4 login nodes with 4Gb-Enet]
35
HP rx2600 Longs Peak Node Architecture
  • Each node has
  • 2 Intel Itanium2 processors (1.5 GHz)
  • 6.4 GB/s system bus
  • 8.5 GB/s memory bus
  • 12 GB of RAM
  • 1 1000Base-T connection
  • 1 100Base-T connection
  • 1 serial connection
  • 2 Elan3 connections

[Node block diagram: two Elan3 adapters on PCI-X (1 GB/s), dual SCSI-160 channels]
36
QsNet ELAN Interconnect Network
  • High bandwidth, ultra-low latency, and scalability
  • 900 MB/s user-space to user-space bandwidth
  • 1,024 nodes in a standard QsNet configuration,
    rising to 4,096 in QsNetII systems
  • Optimized libraries for common distributed memory
    programming models exploit the full capabilities
    of the base hardware.

37
Software on MPP2 (1/2)
  • System Software
  • Operating System: Red Hat Linux 7.2 Advanced
    Server
  • NWLinux, tailored to IA-64 clusters (2.4.18
    kernel with various patches)
  • Cluster Management: Resource Management
    System (RMS) by Quadrics
  • A single-point interface to the system for
    resource management
  • Monitoring, fault diagnosis, data collection,
    allocating CPUs, parallel job execution
  • Job Management Software
  • LSF (Load Sharing Facility) batch scheduler
  • QBank: controls and manages CPU resources
    allocated to projects or users
  • Compiler Software
  • C/C++ (ecc), F77/F90/F95 (efc), GNU compilers
  • Code Development
  • Etnus TotalView
  • A parallel and multithreaded application debugger
  • Vampir
  • A GUI-driven front end used to visualize
    profiling data from a program run
  • gdb

38
Software on MPP2 (2/2)
  • Application Software
  • Quantum Chemistry Codes
  • GAMESS (The General Atomic and Molecular
    Electronic Structure System)
  • Performs a variety of ab initio molecular
    orbital (MO) calculations
  • MOLPRO
  • an advanced ab initio quantum chemistry software
    package
  • NWChem
  • computational chemistry software developed by
    EMSL
  • ADF (Amsterdam Density Functional) 2000
  • software for first-principle electronic structure
    calculations via Density-Functional Theory (DFT)
  • General Molecular Modeling Software
  • Amber
  • Unstructured Mesh Modeling Codes
  • NWGrid (Grid Generator)
  • hybrid mesh generation, mesh optimization, and
    dynamic mesh maintenance
  • NWPhys (Unstructured Mesh Solvers)
  • a 3D, full-physics, first principles,
    time-domain, free-Lagrange code for parallel
    processing using hybrid grids.

39
File System on MPP2
  • Four file systems available on the cluster
  • Local filesystem (/scratch)
  • On each of the compute nodes
  • Non-persistent storage area provided to a
    parallel job running on that node.
  • NFS filesystem (/home)
  • Where user home directories and files are located
  • Uses RAID-5 for reliability
  • Lustre global filesystem (/dtemp)
  • Designed for the world's largest high-performance
    compute clusters.
  • Aggregate write rate of 3.2 Gbyte/s.
  • Restart files and files needed for post analysis.
  • Long term global scratch space
  • AFS filesystem (/msrc)
  • On the front-end (non-compute) nodes

40
Future Plan
  • MPP2 will be upgraded with the faster Quadrics
    QsNetII interconnect in early 2004

[Upgraded system diagram: 928 compute nodes, 1,856 Madison batch CPUs,
Elan4 interconnect, Lustre SAN / 53 TB, 4 login nodes with 4Gb-Enet,
2 system management nodes]
41
Bluesky Supercomputer
  • Top 500 Supercomputers
  • CS610 Parallel Processing
  • Donghyouk Lim
  • (Dept of Computer Science, KAIST)

42
Contents
  • Introduction
  • National Center for Atmospheric Research
  • Scientific Computing Division
  • Hardware
  • Software
  • Recommendations for usage
  • Related Link

43
Introduction
  • Bluesky
  • 13th-fastest supercomputer in the world
  • Clustered Symmetric Multi-Processing (SMP) system
  • 1,600 IBM POWER4 processors
  • Peak of 8.7 TFLOPS

44
National Center for Atmospheric Research
  • Established in 1960
  • Located in Boulder, Colorado
  • Research area
  • Earth system
  • Climate change
  • Changes in atmospheric composition

45
Scientific Computing Division
  • Research on high-performance supercomputing
  • Computing resources
  • Bluesky (IBM Cluster 1600 running AIX): 13th
    place
  • blackforest (IBM SP RS/6000 running AIX): 80th
    place
  • Chinook complex: Chinook (SGI Origin3800 running
    IRIX) and Chinook (SGI Origin2100 running IRIX)

46
Hardware
  • Processor
  • 1,600 POWER4 processors at 1.3 GHz
  • Each can perform up to 4 floating-point
    operations per cycle
  • Peak of 8.7 TFLOPS
  • Memory
  • 2 GB memory per processor
  • memory on a node is shared between processors on
    that node
  • Memory Caches
  • L1 cache: 64 KB I-cache, 32 KB D-cache, direct
    mapped
  • L2 cache: 1.44 MB shared per pair of processors,
    8-way set associative
  • L3 cache: 32 MB, 512-byte cache line, 8-way set
    associative

47
Hardware
  • Computing Nodes
  • 8-way processor nodes: 76
  • 32-way processor nodes: 25
  • 32-processor nodes for running interactive
    jobs: 4
  • Separate nodes for user logins
  • System support nodes
  • 12 nodes dedicated to the General Parallel File
    System (GPFS)
  • Four nodes dedicated to HiPPI communications to
    the Mass Storage System
  • Two master nodes dedicated to controlling
    LoadLeveler operations
  • One dedicated system monitoring node
  • One dedicated test node for system
    administration, upgrades, testing

48
Hardware
  • Storage
  • RAID disk storage capacity: 31.0 TB total
  • Each user application can access 120 GB of
    temporary space
  • Interconnect fabric
  • SP Switch2 (Colony switch)
  • Two full-duplex network paths to increase
    throughput
  • Bandwidth: 1.0 GB per second bidirectional
  • Worst-case latency: 2.5 microseconds
  • HiPPI (High-Performance Parallel Interface) to
    the Mass Storage System
  • Gigabit Ethernet network

49
Software
  • Operating System: AIX (IBM-proprietary UNIX)
  • Compilers: Fortran (95/90/77), C, C++
  • Batch subsystem: LoadLeveler
  • Manages serial and parallel jobs over a cluster
    of servers
  • File System: General Parallel File System (GPFS)
  • System information commands: spinfo for general
    information, lslpp for information about
    libraries

50
Related Links
  • NCAR: http://www.ncar.ucar.edu/ncar/
  • SCD: http://www.scd.ucar.edu/
  • Bluesky: http://www.scd.ucar.edu/computers/bluesky/
  • IBM p690: http://www-903.ibm.com/kr/eserver/pseries/highend/p690.html

51
About Cray X1
  • Kim, SooYoung (sykim@camars.kaist.ac.kr)
  • (Dept of Computer Science, KAIST)

52
Features (1/2)
  • Contributing areas
  • weather and climate prediction, aerospace
    engineering, automotive design, and a wide
    variety of other applications important in
    government and academic research
  • Army High Performance Computing Research Center
    (AHPCRC), Boeing, Ford, Warsaw Univ., U.S.
    Government, Department of Energy's Oak Ridge
    National Laboratory (ORNL)
  • Operating System: UNICOS/mp (derived from UNICOS
    and UNICOS/mk)
  • True single system image (SSI)
  • Scheduling algorithms for parallel applications
  • Accelerated application mode and migration
  • Variable processor utilization: each CPU has four
    internal processors
  • Together as a closely coupled, multistreaming
    processor (MSP)
  • Individually as four single-streaming processors
    (SSPs)
  • Flexible system partitioning

53
Features (2/2)
  • Scalable system architecture
  • Distributed shared memory (DSM)
  • Scalable cache coherence protocol
  • Scalable address translation
  • Parallel programming models
  • Shared-memory parallel models
  • Traditional distributed-memory parallel models:
    MPI and SHMEM (see the sketch after this list)
  • Up-and-coming global distributed-memory parallel
    models: Unified Parallel C (UPC)
  • Programming environments
  • Fortran compiler, C and C++ compilers
  • High-performance scientific library (LibSci),
    language support libraries, system libraries
  • Etnus TotalView debugger, CrayPat (Cray
    Performance Analysis Tool)
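
A small sketch of the SHMEM one-sided model listed above, written in C against the modern OpenSHMEM-style interface. Cray's SHMEM of the X1 era used slightly different initialization calls (e.g. start_pes), so the exact function names here are assumptions for illustration.

```c
/* SHMEM-style one-sided put with a barrier: each PE writes its rank
 * into a symmetric variable on its right-hand neighbor. */
#include <stdio.h>
#include <shmem.h>

static int neighbor_rank = -1;   /* symmetric variable, one copy per PE */

int main(void)
{
    shmem_init();
    int me   = shmem_my_pe();
    int npes = shmem_n_pes();
    int next = (me + 1) % npes;

    /* One-sided put: write my PE number into the next PE's copy. */
    shmem_int_p(&neighbor_rank, me, next);
    shmem_barrier_all();

    printf("PE %d of %d: received %d from my left neighbor\n",
           me, npes, neighbor_rank);

    shmem_finalize();
    return 0;
}
```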

54
Node Architecture
Figure 1. Node, Containing Four MSPs
55
System Conf. Examples
Cabinets    CPUs     Memory               Peak Performance
1 (AC)      16       64 - 256 GB          204.8 Gflops
1           64       256 - 1,024 GB       819.0 Gflops
4           256      1,024 - 4,096 GB     3.3 Tflops
8           512      2,048 - 8,192 GB     6.6 Tflops
16          1,024    4,096 - 16,384 GB    13.1 Tflops
32          2,048    8,192 - 32,768 GB    26.2 Tflops
64          4,096    16,384 - 65,536 GB   52.4 Tflops
56
Technical Data (1/2)
Technical specifications
  Peak performance: 52.4 Tflops in a 64-cabinet configuration
  Architecture: Scalable vector MPP with SMP nodes
Processing element
  Processor: Cray custom-design vector CPU; 16 vector floating-point
    operations per clock cycle; 32- and 64-bit IEEE arithmetic
  Memory size: 16 to 64 GB per node
  Data error protection: SECDED
  Vector clock speed: 800 MHz
  Peak performance: 12.8 Gflops per CPU
  Peak memory bandwidth: 34.1 GB/sec per CPU
  Peak cache bandwidth: 76.8 GB/sec per CPU
  Packaging: 4 CPUs per node; up to 4 nodes per AC cabinet, up to 4
    interconnected cabinets; up to 16 nodes per LC cabinet, up to 64
    interconnected cabinets
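
For reference, the per-CPU and maximum-system peaks above follow from the vector clock speed and the 16 floating-point operations per clock:

```latex
800\ \text{MHz} \times 16\ \tfrac{\text{FP ops}}{\text{clock}} = 12.8\ \text{Gflops per CPU},
\qquad
4{,}096\ \text{CPUs} \times 12.8\ \text{Gflops} \approx 52.4\ \text{Tflops}
```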
57
Technical Data (2/2)
Memory
  Technology: RDRAM with 204 GB/sec peak bandwidth per node
  Architecture: Cache coherent, physically distributed, globally addressable
  Total system memory size: 32 GB to 64 TB
Interconnect network
  Topology: Modified 2D torus
  Peak global bandwidth: 400 GB/sec for a 64-CPU Liquid Cooled (LC) system
I/O
  I/O system port channels: 4 per node
  Peak I/O bandwidth: 1.2 GB/sec per channel