Title: High-Performance Computing on the Windows Server Platform
1. High-Performance Computing on the Windows Server Platform
- Marvin Theimer
- Software Architect, Windows Server HPC Group
- hpcinfo@microsoft.com
- Microsoft Corporation
2. Session Outline
- Brief introduction to HPC
  - Definitions
  - Market trends
- Overview of V1 of Windows Server 2003 CCE
  - Features
  - System architecture
- Key challenges for future HPC systems
  - Too many factors affect performance
  - Grid computing economics
  - Data management
3. Brief Introduction to HPC
4. Defining High-Performance Computing (HPC)
- HPC definition: using compute resources to solve computationally intensive problems
- Different platforms for achieving results: technical and scientific computing
[Diagram: HPC's role in science, shown as a pipeline from sensors through computational modeling, persisting the results (DB, FS, ...), and on to mining and interpretation of the data.]
5. Cluster HPC Scenario
[Diagram: users submit jobs (sensors, workflow, computation) via web page, web service, or command line; an admin handles management input, policy, and reports. The head node provides user, cluster, resource, and job management. Each cluster node runs a job manager, node manager, resource manager, MPI, and the user application; nodes are connected by a high-speed, low-latency interconnect (1GE, InfiniBand, Myricom). Job data flows to a DB or file system for data mining, visualization, workflow, and remote query.]
6. Top 500 Supercomputer Trends
- Clusters: over 50% of systems
- Industry usage is rising
- GigE is gaining
- IA is winning
7. Commoditized HPC Systems Are Affecting Every Vertical
- Leverage volume markets of industry-standard hardware and software
- Rapid procurement, installation, and integration of systems
- Cluster-ready applications accelerating market growth:
  - Engineering
  - Bioinformatics
  - Oil & Gas
  - Finance
  - Entertainment
  - Government/Research
- The convergence of affordable high-performance hardware and commercial apps is making supercomputing a mainstream market
8. Supercomputing Yesterday vs. Today
9. Cheap, Interactive HPC Systems Are Making Supercomputing Personal
- Grids of personal & departmental clusters
- Personal workstations & departmental servers
- Minicomputers
- Mainframes
10. The Evolving Nature of HPC
11. Windows Server HPC
12. Windows-based HPC Today
- Technical solution
- Partner-driven solution stack:
  - Management: LSF, PBSPro, DataSynapse, MSTI
  - Applications: parallel applications
  - Middleware: MPI/Pro, MPICH-1.2, WMPI, MPI-NT
  - OS: Windows, Visual Studio
  - Protocol: TCP
  - Interconnect: Gigabit Ethernet, Fast Ethernet
  - Platform: Intel (32-bit & 64-bit), AMD x64
- Ecosystem
  - Partnerships with ISVs to develop on Windows
  - Partnership with Cornell Theory Center
13. What Windows-based HPC Needs to Provide
- Users require:
  - An integrated, supported solution stack leveraging the Windows infrastructure
  - Simplified job submission, status, and progress monitoring
  - Maximum compute performance and scalability
  - A simplified environment from desktops to HPC clusters
- Administrators require:
  - Ease of setup and deployment
  - Better cluster monitoring and management for maximum resource utilization
  - Flexible, extensible, policy-driven job scheduling and resource allocation
  - High availability
  - Secure process startup and complete cleanup
- Developers require:
  - A programming environment that enables high productivity
  - Availability of optimized compilers (Fortran) and math libraries
  - Parallel debugger, profiler, and visualization tools
  - Parallel programming models (MPI)
14. V1 Plans
- Introduce a compute cluster solution
  - Windows Server 2003 Compute Cluster Edition, based on Windows Server 2003 SP1 x64 Standard Edition
  - Features for job management, IT admins, and developers
- Build a partner ecosystem around Windows Server Compute Cluster Edition from day one
- Establish Microsoft credibility in the HPC community
- Create worldwide Centers of Innovation
15. Technologies
- Platform
  - Windows Server 2003 SP1 64-bit Edition
  - x64 processors (Intel EM64T, AMD Opteron)
  - Ethernet, Ethernet over RDMA, and InfiniBand support
- Administration
  - Prescriptive, simplified cluster setup and administration
  - Scripted, image-based compute node management
  - Active Directory-based security, impersonation, and delegation
  - Cluster-wide job scheduling and resource management
- Development
  - MPICH-2 from Argonne National Labs (see the sketch after this list)
  - Cluster scheduler accessible via DCOM, HTTP, and Web Services
  - Visual Studio 2005 compilers and parallel debugger
  - Partner-delivered compilers and libraries
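To ground the development bullets, here is a minimal sketch of the kind of C program a developer would build against MPICH-2 and hand to the cluster scheduler. Nothing here is CCE-specific; it is plain MPI:

    /* Minimal MPI "hello world"; builds against MPICH-2 or any MPI
       implementation.  Compile with mpicc, launch with mpiexec.   */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, size;
        MPI_Init(&argc, &argv);                 /* start the MPI runtime */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* this process's id     */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* total number of ranks */
        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();                         /* shut down cleanly     */
        return 0;
    }

For example: mpicc hello.c -o hello, then mpiexec -n 4 hello; node allocation is the scheduler's job.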
16. Windows HPC Environment
[Diagram: the same cluster scenario realized on Windows Server 2003, Compute Cluster Edition. Microsoft Operations Manager and Active Directory back the head node, which provides user, cluster, resource, and job management. Users submit jobs (sensors, workflow, computation) via web page, web service, or command line; an admin handles management input, policy, and reports. Each cluster node runs a job manager, node manager, resource manager, MPI, and the user application over a high-speed, low-latency interconnect (Ethernet over RDMA, InfiniBand). Job data flows to a DB or file system for data mining, visualization, workflow, and remote query.]
17. Architectural Overview
[Diagram: a user workstation (Windows XP on x86/x64 with GigE and disk; application, job scripts, data, Job Sched UI; WSE and COM) submits jobs over HTTP to the head node. The head node runs Windows Server 2003 CCE with the Job Scheduler layered on IIS6, WSE3, MSDE, RIS, and AD, plus Whidbey tooling. A developer workstation (Windows XP, Whidbey, compilers, libraries, SFU, WSE, COM) targets the HPC SDK, which exposes MPI, scheduler, Web Service, and policy APIs. Cluster nodes run Windows Server 2003 CCE with a Node Manager; HPC applications use MPI-2, which runs over TCP, shared memory (SHM), or WSD/SDP across GigE/RDMA and InfiniBand interconnects. The legend distinguishes applications, 3rd-party components, the Windows OS, MS components, and HPC components.]
18. Key Challenges for Future HPC Systems
19. Difficult to Tune Performance
- Example: tightly coupled MPI applications
- Very sensitive to network performance characteristics
  - Communication times are measured in microseconds: O(10 µs) for interconnects such as InfiniBand, O(100 µs) for GigE
  - The OS network stack is a significant factor; things like RDMA can make a big difference
- Excited about the prospects of industry-standard RDMA hardware
  - We are working with InfiniBand and GigE vendors to ensure our stack supports them
  - Driver quality is an important facet
  - We are supporting the OpenIB initiative
  - Considering the creation of a WHQL program for InfiniBand
- Very sensitive to mismatched node performance
  - Random OS activities can add millisecond delays to microsecond communication times (see the ping-pong sketch below)
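A minimal sketch of how such microsecond-scale latencies are typically measured: a standard MPI ping-pong between two ranks (plain MPI, nothing Windows-specific assumed; run with at least two ranks):

    /* Ping-pong microbenchmark: time round trips between ranks 0 and 1.
       Half the round-trip time approximates the one-way latency the
       slide cites: O(10 µs) on InfiniBand, O(100 µs) on GigE.        */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        const int ITERS = 10000;
        char byte = 0;
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        MPI_Barrier(MPI_COMM_WORLD);
        double t0 = MPI_Wtime();
        for (int i = 0; i < ITERS; i++) {
            if (rank == 0) {
                MPI_Send(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(&byte, 1, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
            } else if (rank == 1) {
                MPI_Recv(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD,
                         MPI_STATUS_IGNORE);
                MPI_Send(&byte, 1, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();
        if (rank == 0)
            printf("one-way latency ~ %.1f usec\n",
                   (t1 - t0) / (2.0 * ITERS) * 1e6);
        MPI_Finalize();
        return 0;
    }

A single OS-induced millisecond stall during the loop visibly skews the average, which is exactly the node-mismatch sensitivity described above.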
20. Need for Self-Tuning Systems
- Application configuration has a significant impact
  - Incorrect assumptions about the hardware/communications architecture can dramatically affect performance
  - Choice of communication strategy (a sketch of one such choice follows this list)
  - Choice of communication granularity
  - ...
- Tuning is an end-to-end issue
  - OS support
  - ISV library support
  - ISV application support
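As one concrete instance of a communication-strategy choice, a hedged sketch of a non-blocking exchange that overlaps communication with computation; whether the overlap actually pays off depends on the MPI library and interconnect, which is why tuning is end to end. Run with an even number of ranks:

    /* Strategy choice: post non-blocking sends/receives, compute on
       data that needs no remote input, then wait.  The benefit varies
       by interconnect and MPI library; that variance is the point.  */
    #include <mpi.h>
    #include <stdio.h>

    #define N 1024                           /* toy message size */

    int main(int argc, char **argv)
    {
        int rank;
        double send[N], recv[N];
        MPI_Request reqs[2];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int peer = rank ^ 1;                 /* pair ranks 0<->1, 2<->3, ...
                                                (requires an even rank count) */
        for (int i = 0; i < N; i++) send[i] = rank;

        /* Post the exchange first ...                                  */
        MPI_Irecv(recv, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(send, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);

        /* ... do local work here that does not depend on remote data ... */

        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);  /* remote data usable */
        if (rank == 0) printf("got %.0f from peer\n", recv[0]);
        MPI_Finalize();
        return 0;
    }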
21. Computational Grid Economics
- What $1 will buy you (roughly):
  - Computers cost ~$1,000, so 1 CPU-day (~10 Tera-ops) ≈ $1 (roughly, assuming a 3-year use cycle)
  - ⇒ 10 TB of intra-cluster network transfer costs ≈ $1 (roughly, assuming a 1 Gbps interconnect)
  - Internet bandwidth costs roughly $100/Mbps/month (not including routers and management)
  - ⇒ 1 GB of network transfer over the Internet costs ≈ $1 (roughly)
- Some observations (reproduced arithmetically in the sketch below):
  - HPC cluster communication is 10,000x cheaper than WAN communication
  - Break-even point for instructions computed per byte transferred:
    - Cluster: O(1) instrs/byte
    - WAN: O(10,000) instrs/byte
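The observations follow arithmetically from the prices above; a small sketch that reproduces them (the constants are the slide's own rough figures, not measured values):

    /* Back-of-envelope check of the break-even arithmetic, using the
       slide's rough prices: $1 per GB over the WAN, $1 per 10 TB
       within a cluster, and ~10 Tera-ops of CPU per $1.            */
    #include <stdio.h>

    int main(void)
    {
        double ops_per_usd       = 10e12;    /* ~1 CPU-day per $1 */
        double wan_bytes_per_usd = 1e9;      /* 1 GB per $1       */
        double lan_bytes_per_usd = 10e12;    /* 10 TB per $1      */

        printf("cluster vs WAN cost ratio: %.0fx\n",
               lan_bytes_per_usd / wan_bytes_per_usd);   /* 10000x */
        printf("WAN break-even: %.0f instrs/byte\n",
               ops_per_usd / wan_bytes_per_usd);         /* 10000  */
        printf("cluster break-even: %.0f instrs/byte\n",
               ops_per_usd / lan_bytes_per_usd);         /* 1      */
        return 0;
    }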
22. Computational Grid Economics: Implications
- Small-data, high-compute applications work well across the Internet, such as SETI@home and Folding@home
- MPI-style parallel, distributed applications work well in clusters and across LANs, but are uneconomic and do not work well in wide-area settings
- Data analysis is usually best done by moving the programs to the data, not the data to the programs
  - Move questions and answers, not petabyte-scale datasets
- The Internet is NOT the CPU backplane (Internet2 will not change this)
23. Exploding Data Sizes
- Experimental data: TBs → PBs
- Modeling data
  - Today: 10s to 100s of GB is the common case
  - Tomorrow: TBs
- Near-future example: CFD simulation of a turbine engine (the arithmetic is spelled out in the sketch below)
  - 10^9 mesh nodes, each containing 16 double-precision variables
  - ⇒ 128 GB per time step
  - Simulating 1000s of time steps ⇒ 100s of TBs per simulation
  - Archived for future reference
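The example's numbers check out directly: 10^9 nodes x 16 variables x 8 bytes per double-precision value. A small sketch of the arithmetic:

    /* Verify the slide's data sizes for the CFD turbine example. */
    #include <stdio.h>

    int main(void)
    {
        double nodes = 1e9, vars = 16, bytes_per_var = 8;
        double step_bytes = nodes * vars * bytes_per_var;
        double sim_bytes  = step_bytes * 1000;         /* 1000 time steps */

        printf("per time step: %.0f GB\n", step_bytes / 1e9);  /* 128 GB */
        printf("per simulation: %.0f TB\n", sim_bytes / 1e12); /* 128 TB */
        return 0;
    }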
24. Whole-System Modeling and Workflow
- Today: mostly about computation
  - Stand-alone, static simulations of individual parts/phenomena
  - Mostly batch
  - Simple workflows: short, deterministic pipelines (though some are massively parallel)
- Future: mostly about data that is produced and consumed by computational steps
  - Dynamic whole-system modeling via multiple, interacting simulations
  - More complex workflows (we don't yet know how complex)
  - More interactive analysis
  - More sharing
25. Whole-System Modeling Example: Turbine Engine
- Interacting simulations
  - CFD simulation of dynamic airflow through the turbine
  - FE stress analysis of engine & wing parts
  - "Impedance" issues between the various simulations (time steps, meshes, ...)
- Serial workflow steps
  - Crack analysis of engine & wing parts
  - Visualization of results
26. Interactive Workflow Example
- The base CFD simulation produces huge output
  - Points of interest may not be easy to find
- Find, and then focus on, the important details:
  - Data analysis/mining of the output
  - Restart the simulation at a desired point in time/space (see the restart sketch below)
  - Visualize the simulation from that point forward
  - Modify the simulation from that point forward (e.g., at higher fidelity)
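A hedged sketch of the restart pattern this implies: persist state every time step, then resume from any saved step. The file naming, state layout, and stand-in "physics" here are invented placeholders, not anything CCE-specific:

    /* Checkpoint/restart skeleton: save state each step; on startup,
       resume from a chosen checkpoint if it exists.                 */
    #include <stdio.h>

    #define N 1000                               /* toy state size */

    static void save(const double *s, int step)
    {
        char name[32];
        sprintf(name, "step_%04d.ckpt", step);
        FILE *f = fopen(name, "wb");
        if (!f) return;
        fwrite(s, sizeof(double), N, f);
        fclose(f);
    }

    static int load(double *s, int step)
    {
        char name[32];
        sprintf(name, "step_%04d.ckpt", step);
        FILE *f = fopen(name, "rb");
        if (!f) return -1;                       /* no such checkpoint */
        size_t got = fread(s, sizeof(double), N, f);
        fclose(f);
        return got == N ? 0 : -1;
    }

    int main(void)
    {
        double state[N] = {0};
        /* Resume after step 500 if that checkpoint exists, else start cold. */
        int start = (load(state, 500) == 0) ? 501 : 0;
        for (int step = start; step < 1000; step++) {
            for (int i = 0; i < N; i++)
                state[i] += 1.0;                 /* stand-in "physics"      */
            save(state, step);                   /* persist every time step */
        }
        return 0;
    }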
27. Data Analysis and Mining
- Traditional approach
  - Keep data in flat files
  - Write C or Perl programs to compute specific analysis queries (a sketch of this pattern follows this list)
- Problems with this approach
  - Imposes significant development times
  - Scientists must reinvent DB indexing and query technologies
- Results from the astronomy community
  - Relational databases can yield speed-ups of one to two orders of magnitude
  - SQL with application/domain-specific stored procedures greatly simplifies the creation of analysis queries
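For contrast, a sketch of the "traditional approach" bullet: a one-off C scan over a flat file that hard-codes a single query. The record schema is hypothetical, chosen to echo the astronomy example:

    /* The traditional approach the slide describes: a one-off C scan
       over a flat file of records, re-implementing by hand what a
       database index and query engine would provide.               */
    #include <stdio.h>

    struct record { double ra, dec, magnitude; };   /* hypothetical schema */

    int main(int argc, char **argv)
    {
        if (argc < 2) { fprintf(stderr, "usage: %s <datafile>\n", argv[0]); return 1; }
        FILE *f = fopen(argv[1], "rb");
        if (!f) { perror("fopen"); return 1; }

        struct record r;
        long hits = 0;
        while (fread(&r, sizeof r, 1, f) == 1)      /* full sequential scan   */
            if (r.magnitude < 20.0)                 /* one hard-coded "query" */
                hits++;
        fclose(f);
        printf("%ld records matched\n", hits);
        return 0;
    }

A relational engine answers the same predicate through an index rather than a full sequential scan, which is where the order-of-magnitude speed-ups reported by the astronomy community come from.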
28. Combining Simulation with Experimental Data: Drug Discovery
- A clinical-trial database describes the toxicity side effects observed for tested drugs
- A simulation searches for candidate compounds that have a desired effect on a biological system
- The clinical data is searched for drugs that contain a candidate compound or a "near neighbor"; the toxicity results are retrieved and used to decide whether the candidate compound should be rejected
29. Sharing
- Simulations (or ensembles of simulations) are mostly done in isolation
  - No sharing except for archival output
- Some coarse-grained sharing
  - Check-out/check-in of large components
  - Example: automotive design
    - Check out a component
    - CAE-based design simulation of the component
    - Check in, with a design-rule-checking step
- Data warehouses typically only need coarse-grained update granularity
  - Bulk or coarse-grained updates
  - Modeling simulations are done in the context of particular versions of the data
- Audit trails and reproducible workflows are becoming increasingly important
30. Data Management Needs
- Cluster file systems and/or parallel DBs to handle the I/O bandwidth needs of large, parallel, distributed applications
- Data warehouses for experimental data and archived simulation output
- Coarse-grained geographic replication to accommodate distributed workforces and workflows
- Indexing and query capabilities for data mining and analysis
- Audit trails, workflow recorders, etc.
31. Windows HPC Roadmap
32. Call to Action
- IHVs
  - Develop Winsock Direct drivers for your RDMA cards
    - Automatically lets our MPI stack take advantage of low latency
  - Develop support for diskless scenarios (e.g., iSCSI)
- OEMs
  - Offer turn-key clusters
    - Pre-wired for management and RDMA networks
  - Support boot-from-net diskless scenarios
  - Support WS-Management
  - Consider noise and power requirements for personal and workgroup configurations
33. Community Resources
- Windows Hardware Driver Central (WHDC)
  - www.microsoft.com/whdc/default.mspx
- Technical Communities
  - www.microsoft.com/communities/products/default.mspx
- Non-Microsoft Community Sites
  - www.microsoft.com/communities/related/default.mspx
- Microsoft Public Newsgroups
  - www.microsoft.com/communities/newsgroups
- Technical Chats and Webcasts
  - www.microsoft.com/communities/chats/default.mspx
  - www.microsoft.com/webcasts
- Microsoft Blogs
  - www.microsoft.com/communities/blogs
34. Related WinHEC Sessions
- TWNE05005: Winsock Direct Value Proposition - Partner Concepts
- TWNE05006: Implementing Convergent Networking - Partner Concepts
35. To Learn More
- Microsoft
  - Microsoft HPC website: http://www.microsoft.com/hpc/
- Other sites
  - CTC activities: http://cmssrv.tc.cornell.edu/ctc/winhpc/
  - 3rd Party Windows Cluster Resource Centre: www.windowsclusters.org
  - HPC-related links: http://www.microsoft.com/windows2000/hpc/miscresources.asp
- Some useful articles and presentations
  - "Supercomputing in the Third Millennium," by George Spix: http://www.microsoft.com/windows2000/hpc/supercom.asp
  - Introduction to the book "Beowulf Cluster Computing with Windows" by Thomas Sterling, Gordon Bell, and Janusz Kowalik
  - "Distributed Computing Economics," by Jim Gray, MSR-TR-2003-24: http://research.microsoft.com/research/pubs/view.aspx?tr_id=655
  - "Web Services, Large Databases, and what Microsoft is doing in the Grid Computing Space," a presentation by Jim Gray: http://research.microsoft.com/Gray/talks/WebServices_Grid.ppt
- Send questions to hpcinfo@microsoft.com