Title: IBM and GRID Computing
1 How the Linux and Grid Communities can Build the
Next-Generation Internet Platform
Ian Foster, Argonne National Lab, University of
Chicago, Globus Project
2 Ottawa Linux Symposium, July 24, 2003
- Linux has gained tremendous traction as a
server operating system. However, a variety of
technology trends, the Grid being one, are
converging to create a service-based future in
which functions such as computing and storage are
virtualized and services and resources are
increasingly integrated within and across
enterprises. The servers that will power this
sort of environment will require new capabilities
including high scalability, integrated resource
management, and RAS. I discuss what I see as
development priorities if Linux is to retain its
leadership role as a server operating system.
3 The (Power) Grid: On-Demand Access to Electricity
[Figure: axes labeled "Quality, economies of scale" and "Time"]
4 By Analogy, a Computing Grid
- Decouple production and consumption
- Enable on-demand access
- Achieve economies of scale
- Enhance consumer flexibility
- Enable new devices
- On a variety of scales
- Department
- Campus
- Enterprise
- Internet
5 Requirements
- Dynamically link resources/services
- From collaborators, customers, eUtilities,
(members of evolving virtual organization)
- Into a virtual computing system
- Dynamic, multi-faceted system spanning
institutions and industries
- Configured to meet instantaneous needs, for
- Multi-faceted QoX for demanding workloads
- Security, performance, reliability,
6 For Example: Real-Time Online Processing
[Figure: diagram pairing Applications with Delivery,
Application Services with Distribution, and Servers
with Execution]
7 Examples of Linux-Based Grids: High Energy Physics
- Production Run on the Integration Testbed
- Simulate 1.5 million full CMS events for physics
studies (500 sec per event on 850 MHz processor)
- 2 months continuous running across 5 testbed
sites
- Managed by a single person at the US-CMS Tier 1
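The figures on this slide imply a rough processor count for the production run. The sketch below works through that arithmetic; it assumes "2 months" means about 60 days of wall-clock time and perfect utilization, neither of which the slide states, and the function name is purely illustrative.

```python
def cpus_required(events, sec_per_event, wall_clock_days):
    """Average number of processors kept busy for the whole run."""
    cpu_seconds = events * sec_per_event          # total compute demand
    wall_seconds = wall_clock_days * 24 * 3600    # available wall-clock time
    return cpu_seconds / wall_seconds


# Slide figures: 1.5 million events, 500 s/event; 60 days is an assumption.
estimate = cpus_required(1_500_000, 500, 60)
print(f"about {estimate:.0f} processors busy on average")
```

Under these assumptions the run keeps on the order of 145 processors of the 850 MHz class busy across the five sites.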
8 Examples of Linux-Based Grids: Earthquake
Engineering
U. Nevada Reno
www.neesgrid.org
9 Grid Technologies Community
- Grid technologies developed since mid-90s
- Product of work on resource sharing for
scientific collaboration; commercial adoption
- Open source Globus Toolkit has emerged as a de
facto standard
- International community of contributors
- Thousands of deployments worldwide
- Commercial support providers
- Global Grid Forum serves as a community and
standards body
- Home to recent OGSA work
10 The Emergence of Open Grid Standards
[Figure: timeline from 1990 to 2010 showing increased
functionality and standardization, starting from
custom solutions]
11 Open Grid Services Infrastructure (OGSI)
[Figure: interaction diagram. Components: service
requestor (e.g. user application), service registry,
service factory, service instances. Labeled
interactions: service discovery, register service,
create service, Grid Service Handle, resource
allocation, service data, keep-alives, notifications,
service invocation]
- Authentication and authorization are applied to all
requests
- Interactions standardized using WSDL and SOAP
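The pattern named on this slide (a factory creates service instances, each identified by a handle, listed in a registry, and kept alive by its requestor) can be illustrated with a toy in-memory model. Everything below is a hypothetical sketch: the class names, the handle format, and the explicit `now` clock argument are assumptions made for illustration. It is not the Globus Toolkit API, and it uses no WSDL or SOAP.

```python
import itertools


class ServiceInstance:
    """A transient service with a soft-state lifetime."""

    def __init__(self, handle, expires_at):
        self.handle = handle
        self.expires_at = expires_at

    def keep_alive(self, now, lifetime):
        # A keep-alive pushes the expiry time forward.
        self.expires_at = now + lifetime

    def is_alive(self, now):
        return now < self.expires_at


class Registry:
    """Maps handles to instances and forgets expired ones."""

    def __init__(self):
        self._by_handle = {}

    def register(self, instance):
        self._by_handle[instance.handle] = instance

    def discover(self, handle, now):
        instance = self._by_handle.get(handle)
        if instance is not None and not instance.is_alive(now):
            del self._by_handle[handle]   # expired: reclaim the entry
            return None
        return instance


class Factory:
    """Creates instances and registers them on creation."""

    def __init__(self, registry, lifetime):
        self._registry = registry
        self._lifetime = lifetime
        self._ids = itertools.count(1)

    def create_service(self, now):
        handle = f"handle-{next(self._ids)}"   # hypothetical handle format
        instance = ServiceInstance(handle, now + self._lifetime)
        self._registry.register(instance)
        return instance
```

A requestor that stops sending keep-alives simply lets its instance expire, so resources are reclaimed without an explicit destroy call.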
12 Open Grid Services Architecture
[Figure: Distributed Virtual Integration Architecture,
layers from top to bottom]
- Users in Problem Domain X
- Applications in Problem Domain X
- Application Integration Technology for Problem
Domain X
- Generic Virtual Service Access and Integration
Layer
- OGSA
- OGSI: Interface to Grid Infrastructure
- Compute, Data and Storage Resources
13 But It's Not Turtles All the Way Down
- Our ability to deliver virtualized services
efficiently and with desired QoX ultimately
depends on the underlying platform!
- At multiple levels, including but not limited to
- Dynamic provisioning and resource management
- Reliability, availability, manageability
- Performance and parallelism
- New demands on the OS in each area
14 (1) Dynamic Provisioning
- Static provisioning dedicates resources
- Typical of co-lo hosting
- Reprovision manually as needed
- But load is dynamic
- Must overprovision for surges
- High variable cost of capacity
- Need dynamic provisioning to achieve true
economies of scale
- Load multiplexing
- Tradeoff cost vs. quality
- Service level agreements
- Dynamic resource recruitment
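The contrast between static and dynamic provisioning on this slide can be made concrete with a small calculation. The sketch below is illustrative only: the load trace, the per-server capacity, and the headroom parameter are hypothetical values, and the function names are not from any real provisioning system.

```python
import math


def servers_needed(load, capacity_per_server, headroom=0.0):
    """Servers required to carry `load` with a fractional safety margin."""
    return max(1, math.ceil(load * (1 + headroom) / capacity_per_server))


def static_vs_dynamic(trace, capacity_per_server, headroom=0.0):
    """Return (static, dynamic) server-intervals for a load trace.

    Static provisioning sizes for the peak and holds that size throughout.
    Dynamic provisioning resizes at every interval.
    """
    peak = servers_needed(max(trace), capacity_per_server, headroom)
    static_cost = peak * len(trace)
    dynamic_cost = sum(
        servers_needed(load, capacity_per_server, headroom) for load in trace
    )
    return static_cost, dynamic_cost


# Hypothetical trace with a 3x surge in the middle interval.
static_cost, dynamic_cost = static_vs_dynamic([100, 300, 100], 100)
print(static_cost, dynamic_cost)
```

The gap between the two numbers is capacity that a static deployment pays for but does not use; multiplexing many such loads onto shared servers is what lets a provider recover it.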
15 Load Is Dynamic
- ibm.com external site
- February 2001
- Daily fluctuations (3x)
- Workday cycle
- Weekends off
[Figure: load over one week, Monday through Sunday]
- World Cup soccer site
- May-June 1998
- Seasonal fluctuations
- Event surges (11x)
- ita.ee.lbl.gov
[Figure: load over weeks 6 to 8]
16 For Example: Energy-Conscious Provisioning
- Light load: concentrate traffic on a minimal set
of servers
- Step down surplus servers to low-power state
- APM and ACPI
- Activate surplus servers on demand
- Wake-On-LAN
- Browndown: provision for a specified energy
target
- Even smarter: also manage air conditioning
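Wake-On-LAN, listed above as the way to activate surplus servers on demand, works by sending a "magic packet": six 0xFF bytes followed by the target MAC address repeated 16 times, usually as a UDP broadcast. The sketch below builds and sends such a packet. The function names, the broadcast address, and the choice of port 9 are illustrative assumptions; the slides do not say how any particular system issues its wake-ups.

```python
import socket


def magic_packet(mac):
    """Build a Wake-On-LAN magic packet for a MAC like '00:11:22:33:44:55'."""
    hex_digits = mac.replace(":", "").replace("-", "")
    if len(hex_digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    mac_bytes = bytes.fromhex(hex_digits)   # raises ValueError on non-hex input
    return b"\xff" * 6 + mac_bytes * 16


def wake(mac, broadcast="255.255.255.255", port=9):
    """Broadcast the magic packet on the local network (assumed address/port)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(magic_packet(mac), (broadcast, port))
```

A provisioning controller could call `wake()` for each server it wants to bring back when load rises, provided the target's network card and firmware have Wake-On-LAN enabled.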
17 Power Management via MUSE: IBM Trace Run (Before)
[Figure: power draw (watts), latency (ms50), and
throughput (requests/s)]
MUSE: Jeff Chase et al., Duke University (SOSP
2003)
18 Power Management via MUSE: IBM Trace Run (After)
[Figure: the IBM trace run with MUSE power management]
MUSE: Jeff Chase et al., Duke University (SOSP
2003)
19 Dynamic Provisioning: OS Issues
- Hot plug memory, CPU, and I/O
- For partitioning, core virtualization
capabilities
- Security
- Containment and data integrity in a virtualized
environment; user-mode Linux?
- Scheduler improvements for resource and workload
management
- Allocate for required resource consumption
- Dynamic, sub-processor logical partitioning
- Improved instrumentation and accounting
- Determine actual resource consumption
20 (2) Reliability, Availability, Manageability
- Error log and diagnostics frameworks
- Foundation for automated error analysis and
recovery of distributed and remote systems
- Enable problem determination, automated
reconfiguration, localization of failure
- Configuration management
- Determine hardware configuration/inventory
- Apply/remove service/support patches
- Isolate failing components quickly
21 (3) Performance and Parallelism: E.g., Data
Integration
- Assume
- Remote data at 1 GB/s
- 10 local bytes per remote
- 100 operations per byte
[Figure: remote data arrives over a wide area link
(end-to-end switched lambda?) at 1 GB/s; the local
network feeds parallel I/O at 10 GB/s and parallel
computation at 1000 Gop/s]
>1 GByte/s achievable today (FAST, 7 streams,
LA-Geneva)
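The 10 GB/s and 1000 Gop/s figures on this slide follow from the three stated assumptions. The sketch below reproduces them; it assumes the 100 operations per byte apply to the local bytes, which is the reading that matches the slide's numbers, and the function name is hypothetical.

```python
def data_integration_demand(remote_gb_per_s, local_bytes_per_remote, ops_per_byte):
    """Return (local I/O in GB/s, computation in Gop/s) for a remote data rate."""
    local_io = remote_gb_per_s * local_bytes_per_remote   # GB/s of local data
    compute = local_io * ops_per_byte                     # Gop/s over local bytes
    return local_io, compute


io_rate, op_rate = data_integration_demand(1, 10, 100)
print(f"parallel I/O: {io_rate} GB/s, computation: {op_rate} Gop/s")
```

Both numbers scale linearly with the wide-area rate, so any gain in achievable network throughput raises the local I/O and compute requirement by the same factor.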
22 Performance and Parallelism
- Distributed/cluster/parallel file systems
- Optimized TCP/IP stacks
- Scheduling of computation and communication
- Web100 configuration and instrumentation
23 Web100: Overcome TCP/IP Wizard Gap
24 Web100 Kernel Instrument Set
- Definition
- Set of instruments designed to collect as much
information as possible, so that a user can
isolate the performance problems of a TCP
connection
- How it is implemented
- Each instrument is a variable in a "stats"
structure that is linked through the kernel
socket structure
- Linux /proc interface is used to expose these
instruments outside the kernel
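The layout of Web100's /proc files is not given in these slides and is not reproduced here. As a loosely analogous illustration of per-connection kernel instruments, the sketch below reads the `TCP_INFO` socket option, a different mechanism from Web100 that Linux provides through `getsockopt`. It decodes the first 104 bytes of `struct tcp_info` as laid out in `linux/tcp.h`; the function names are hypothetical, and `read_tcp_info` is assumed to run on Linux with a connected TCP socket.

```python
import socket
import struct

# First 104 bytes of struct tcp_info: seven u8 fields, one byte of
# bit-fields (skipped here), then 24 u32 counters.
_FMT = "=7Bx24I"
_U32_NAMES = (
    "tcpi_rto", "tcpi_ato", "tcpi_snd_mss", "tcpi_rcv_mss",
    "tcpi_unacked", "tcpi_sacked", "tcpi_lost", "tcpi_retrans",
    "tcpi_fackets", "tcpi_last_data_sent", "tcpi_last_ack_sent",
    "tcpi_last_data_recv", "tcpi_last_ack_recv", "tcpi_pmtu",
    "tcpi_rcv_ssthresh", "tcpi_rtt", "tcpi_rttvar", "tcpi_snd_ssthresh",
    "tcpi_snd_cwnd", "tcpi_advmss", "tcpi_reordering", "tcpi_rcv_rtt",
    "tcpi_rcv_space", "tcpi_total_retrans",
)


def parse_tcp_info(raw):
    """Decode the leading fields of a raw tcp_info buffer into a dict."""
    size = struct.calcsize(_FMT)
    if len(raw) < size:
        raise ValueError(f"need at least {size} bytes, got {len(raw)}")
    fields = struct.unpack(_FMT, raw[:size])
    info = dict(zip(_U32_NAMES, fields[7:]))
    info["tcpi_state"] = fields[0]
    info["tcpi_retransmits"] = fields[2]
    return info


def read_tcp_info(sock):
    """Fetch and decode TCP_INFO for a connected TCP socket (Linux only)."""
    raw = sock.getsockopt(socket.IPPROTO_TCP, socket.TCP_INFO, 104)
    return parse_tcp_info(raw)
```

Polling values such as `tcpi_snd_cwnd`, `tcpi_rtt`, and `tcpi_total_retrans` during a transfer gives an application-level view of the kind of question these instruments are meant to answer: whether a drop in data rate comes from loss, from the congestion window, or from the sender itself.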
25 For Example
- Recent transatlantic transfer showed frequent
drops in data rate
- But no loss or retransmit
- Web100 identified problem as Linux send stall
congestion events
26 Grid/Linux Cooperation: We Have Testbeds, Users,
Applications
27 Evolution of the Server
[Figure: increased flexibility (and complexity) over
time]
Significant implications for the underlying
operating system
28 Summary
- The Grid community is creating middleware for
distributed resource and service sharing
- Open source software for resource and service
virtualization, service management/integration
- Motivated by wonderful applications
- But we need help from the OS
- Linux: the next-generation Internet platform?
- Could be, but significant evolution is required
to address provisioning/resource management;
availability and manageability; performance and
parallelism; and other issues
- Grid community can provide testbeds, users,
requirements, applications
29 For More Information
- The Globus Project
- www.globus.org
- Global Grid Forum
- www.ggf.org
- Background information
- www.mcs.anl.gov/foster
- GlobusWORLD 2004
- www.globusworld.org
- Jan 20-23, San Francisco
2nd Edition November 2003