1 From Clusters to Grids
- October 2003, Linköping, Sweden
- Andrew Grimshaw
- Department of Computer Science, University of Virginia
- CTO & Founder, Avaki Corporation
2 Agenda
- Grid Computing Background
- Legion
- Existing Systems & Standards
- Summary
3 Grid Computing
4 First, What is a Grid System?
- A Grid system is a collection of distributed resources connected by a network.
- Examples of distributed resources:
  - Desktops
  - Handheld hosts
  - Devices with embedded processing resources, such as digital cameras and phones
  - Tera-scale supercomputers
5 What is a Grid?
A grid is all about gathering together resources and making them accessible to users and applications.
- A grid enables users to collaborate securely by sharing processing, applications, and data across heterogeneous systems and administrative domains, for collaboration, faster application execution, and easier access to data.
- Compute Grids
- Data Grids
6 What are the characteristics of a Grid system?
- Ownership by Mutually Distrustful Organizations & Individuals
- Connected by Heterogeneous, Multi-Level Networks
- Different Security Requirements & Policies Required
- Different Resource Management Policies
- Potentially Faulty Resources
- Geographically Separated
- Resources are Heterogeneous
8 Technical Requirements of a Successful Grid Architecture
- Simple
- Secure
- Scalable
- Extensible
- Site Autonomy
- Persistence & I/O
- Multi-Language
- Legacy Support
- Single Namespace
- Transparency
- Heterogeneity
- Fault-Tolerance & Exception Management
Manage Complexity!!
9 Implication: Complexity is THE Critical Challenge
How should complexity be addressed?
10 As Application Complexity Increases, Differences Between the Systems Increase Dramatically
[Chart: high-level versus low-level solutions, plotting time/cost against robustness, each ranging from low to high]
11 The Importance of Integration in a Grid Architecture
- If separate pieces are used, then the programmer must integrate the solutions.
- If all the pieces are not present, then the programmer must develop enough of the missing pieces to support the application.
Bottom line: both raise the bar by putting the cognitive burden on the programmer.
12 Misconceptions about Grids
- Grids are simple cycle aggregation
- The current state of the art is essentially scheduling and queuing for CPU cluster management
- These definitions sell short the promise of Grid technology
- AVAKI believes grids are not just about aggregating and scheduling CPU cycles, but also about
  - Virtualizing many types of resources, internally and across domains
  - Empowering anyone to have secure access to any and all resources through easy administration
13 Compute Grid Categories
- Sons of SETI@home
  - United Devices, Entropia, Data Synapse
  - Low-end, desktop cycle aggregation
  - A hard sell in corporate America
- Cluster Load Management
  - LSF, PBS, SGE
  - High end; great for management of local clusters, but not well proven in multi-cluster environments
- As soon as you go outside of the local cluster to cross-domain, multi-cluster environments, the game changes dramatically with the introduction of three major issues:
  - Data
  - Security
  - Administration
To address these issues, you need a fully-integrated solution, or a toolkit to build one.
14 Typical Grid Scenarios
- Global Grids
  - Multiple enterprises, owners, platforms, domains, file systems, locations, and security policies
  - Legion, Avaki, Globus
- Enterprise Grids
  - Single enterprise; multiple owners, platforms, domains, file systems, locations, and security policies
  - Sun SGE EE, Platform MultiCluster
- Cluster / Departmental Grids
  - Single owner, platform, domain, file system, and location
  - Sun SGE, Platform LSF, PBS
- Desktop Cycle Aggregation
  - Desktop only
  - United Devices, Entropia, Data Synapse
15 What are grids being used for today?
- Multiple sites with multiple data sources (public and private)
- Need secure access to data and applications for sharing
- Have partnership relationships with other organizations: internal, partners, or customers
- Computationally challenging applications
- Distributed R&D groups across companies, networks, and geographies
- Staging large files
- Want to utilize and leverage heterogeneous compute resources
- Need for accounting of resources
- Need to handle multiple queuing systems
- Considering purchasing compute cycles for spikes in demand
16 Legion
17 Legion Grid Software
Wide-area access to data, processing, and application resources in a single, uniform operating environment that is secure and easy to administer.
[Diagram: users and applications access the Legion grid, which spans servers, desktops, and clusters (each with local load management and queuing, applications, and data) across Department A, Department B, a partner, and a vendor]
18 Legion Combines Data and Compute Grid
[Diagram: users and applications reach both compute and data resources through the Legion grid, spanning servers, desktops, and clusters across Department A, Department B, a partner, and a vendor]
19 The Legion Data Grid
20 Data Grid
Wide-area access to data at its source location based on business policies, eliminating manual copying and errors caused by accessing out-of-date copies.
[Diagram: users and applications access data on servers, desktops, and clusters through the Legion grid, across Department A, Department B, a partner, and a vendor]
21 Data Grid: Share
The Legion Data Grid transparently handles client and application requests, maps them to the global namespace, and returns the data. Data is mapped into the grid namespace via Legion ExportDir.
[Diagram: Linux, NT, and Solaris hosts at headquarters, an informatics partner, a tools vendor, and a research center all export data into the shared namespace]
22 Data Grid: Access
- Access files using the standard NFS protocol or Legion commands
  - NFS security issues eliminated
  - Caches exploit semantics
- Access files using a global name
- Access based on specified privileges
[Diagram: users and applications at headquarters, an informatics partner, a tools vendor, and a research center access shared files (sequence_a, sequence_b, sequence_c) and applications (App_A, BLAST) on clusters and servers such as HQ-1, RD-2, and PM-1]
23 Data Grid: Access using virtual NFS
- Complexity: servers + clients
- Clients mount the grid
- Servers share files to the grid
- Clients access data using the NFS protocol
- Wide-area access to data outside the administrative domain
[Diagram: clients in Department A and Department B access files (sequence_a, sequence_c) shared by a partner]
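Because the grid is exposed through the standard NFS protocol, unmodified applications see grid files as ordinary local files. A minimal sketch of that idea, using a local temporary directory to stand in for a hypothetical grid mount point (the paths and file names here are invented for illustration):

```python
import os
import tempfile

# Stand-in for an NFS-mounted grid namespace; a real deployment would
# mount the grid at a path such as /grid (hypothetical mount point).
grid_mount = tempfile.mkdtemp(prefix="grid_")

# A server in another administrative domain has shared this file
# into the global namespace.
shared = os.path.join(grid_mount, "sequence_a")
with open(shared, "w") as f:
    f.write(">seq_a\nACGT\n")

# An unmodified client application reads it with plain POSIX I/O;
# location and administrative domain are hidden by the namespace.
with open(shared) as f:
    data = f.read()
assert data.startswith(">seq_a")
```

Because access is plain file I/O, legacy and commercial applications need no code changes, which is the main point of the virtual NFS interface.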
24 Keeping Data in the Grid
- Legion storage servers
  - Data is copied into Legion storage servers that execute on a set of hosts.
  - The particular set of hosts used is a configuration option; here five hosts are used.
  - Access to the different files is completely independent and asynchronous.
  - Very high sustained read/write bandwidth is possible using commodity resources.
[Diagram: five hosts, each with local disk, acting as storage servers]
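The properties above can be illustrated with a small striping sketch. This is not Legion's actual storage-server implementation, just a toy model (all names invented) of why spreading blocks across independent servers permits parallel, asynchronous access and high aggregate bandwidth:

```python
# Toy round-robin striping across independent storage servers.
NUM_SERVERS = 5     # matches the five hosts in the slide
BLOCK_SIZE = 4      # tiny blocks so the example is easy to follow

# Each "server" is just an in-memory dict here (illustration only).
servers = [dict() for _ in range(NUM_SERVERS)]

def write_file(name, payload):
    """Split payload into blocks and spread them round-robin."""
    blocks = [payload[i:i + BLOCK_SIZE]
              for i in range(0, len(payload), BLOCK_SIZE)]
    for idx, block in enumerate(blocks):
        # Block idx lands on server idx mod NUM_SERVERS, so consecutive
        # blocks live on different hosts and can be served in parallel.
        servers[idx % NUM_SERVERS][(name, idx)] = block
    return len(blocks)

def read_file(name, nblocks):
    """Reassemble a file by fetching each block from its server."""
    return b"".join(servers[i % NUM_SERVERS][(name, i)]
                    for i in range(nblocks))

n = write_file("genome.dat", b"ACGTACGTACGTACGTAC")
assert read_file("genome.dat", n) == b"ACGTACGTACGTACGTAC"
```

Since each block request goes to a different server, clients can issue them concurrently, which is how commodity hosts add up to high sustained bandwidth.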
25 I/O Performance
Read performance with NFS, Legion-NFS, and the Legion I/O libraries. The x-axis indicates the number of clients that simultaneously perform 1 MB reads on 10 MB files, and the y-axis indicates total read bandwidth. All results are the average of multiple runs. All clients ran on 400 MHz Intel machines, with the NFS server on an 800 MHz Intel server.
26 Data Grid Benefits
- Easy, convenient, wide-area access to data regardless of location, administrative domain, or platform
- Eliminates time-consuming copying and obtaining accounts on machines where data resides
- Provides access to the most recent data available
- Eliminates confusion and errors caused by inconsistent naming of data
- Caches remote data for improved performance
- Requires no changes to legacy or commercial applications
- Protects data with fine-grained security and limits access privileges to those required
- Eases data administration and management
- Eases migration to new storage technologies
27 The Legion Compute Grid
28 Compute Grid
Wide-area access to processing resources based on business policies, managing utilization of processing resources for fast, efficient job completion.
[Diagram: users and applications submit work through the Legion grid to servers, desktops, and clusters across Department A, Department B, a partner, and a vendor]
29 Compute Grid: Access
The grid:
- Locates resources
- Authenticates and grants access privileges
- Stages applications and data
- Detects failures and recovers
- Writes output to the specified location
- Accounts for usage
[Diagram: users and applications submit jobs (App_A, BLAST) through scheduling, queuing, usage management, accounting, and recovery services to an NT server (PM-1), a Solaris server (RD-2), and a Linux cluster (HQ-1), spanning headquarters, an informatics partner, a tools vendor, and a research center]
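The steps the grid performs can be sketched as a simple orchestration loop. This is only an illustration of the lifecycle (locate, authenticate, stage, run with recovery, account); the host names and failure model are invented, and real Legion scheduling is far more involved:

```python
import random

random.seed(7)                       # deterministic for the example

hosts = ["HQ-1", "RD-2", "linux-cluster"]   # invented host names
usage_log = []                       # accounting records

def locate():                        # 1. locate a resource
    return random.choice(hosts)

def authenticate(user):              # 2. grant access privileges
    return user in {"alice", "bob"}

def stage(app, data, host):          # 3. stage application and data
    return f"{app}+{data}@{host}"

def run(staged):                     # 4. execution may fail
    if random.random() < 0.5:
        raise RuntimeError("host failed")
    return f"output-of-{staged}"

def submit(user, app, data):
    if not authenticate(user):
        raise PermissionError(user)
    while True:                      # 5. detect failure, retry elsewhere
        host = locate()
        try:
            result = run(stage(app, data, host))
            usage_log.append((user, host))   # 6. account for usage
            return result
        except RuntimeError:
            continue

out = submit("alice", "BLAST", "sequence_a")
assert out.startswith("output-of-BLAST")
```

The point is that the retry loop and the accounting record are the grid's responsibility, not the application's: the user only calls `submit`.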
30 Tools: All are cross-platform
- legion_make: remote builds
- Fault-tolerant MPI libraries
- Post-mortem debugger
- Console objects
- Parallel 2D file objects
- Collections
- MPI
- Parameter-space studies (multi-run)
- Parallel C
- Parallel object-based Fortran
- CORBA binding
- Object migration
- Accounting
31 One Favorite
32 Related Work
33 Related Work
- Avaki
- All distributed systems literature
- Globus
- AFS/DFS
- LSF, PBS, …
- Global Grid Forum: OGSA
34 Avaki Company Background
- Grid pioneers: a Legion spin-off
- Over $20M capitalization
- The only commercial grid software provider with a solution that addresses data access, security, and compute power challenges
- Standards efforts leader
[Logos: standards organizations, partners, customers]
35 AFS/DFS Comparison with the Legion Data Grid
- AFS presumes that all files are kept in AFS: no federation with other file systems. Legion allows data to be kept in Legion, or in an NFS, XFS, PFS, or Samba file system.
- AFS presumes all sites use Kerberos and that realms trust each other. Legion assumes nothing about the local authentication mechanism, and there is no need for cross-realm trust.
- AFS semantics are fixed (copy on open). Legion can support multiple semantics; the default is Unix semantics.
- AFS is volume-oriented (sub-trees). Legion can be volume-oriented or file-oriented.
- AFS caching semantics are not extensible. Legion caching semantics are extensible.
36 Legion & Globus GT2
- Projects with many common goals
  - Metacomputing (or the Grid)
  - Middleware for wide-area systems
  - Heterogeneous resource sets
  - Disjoint administrative domains
  - High-performance, large-scale applications
37 Legion-Specific Goals
- Shared collaborative environment, including a shared file system
- Fault-tolerance and high availability
- Both HPC applications and distributed applications
- Complete security model, including access control
- Extensible
- Integrated: create a meta-operating system
38 Many Similar Features
- Resource Management Support
- Message-Passing Libraries (e.g., MPI)
- Distributed I/O Facilities
  - Globus GASS/remote I/O vs. the Avaki Data Grid
- Security Infrastructure
39 Globus
- The toolkit approach
  - Provides services as separate libraries, e.g., Nexus, GASS, LDAP
- Pros
  - Decoupled architecture
  - Easy to add new services into the mix
  - Low buy-in: use only what you like! (In practice, all the pieces use each other.)
- Cons
  - No unifying abstractions
  - Very complex environment to learn in full
  - Composition of services becomes difficult as the number of services grows
  - Interfaces keep changing due to an ever-evolving design
  - Does not cover the space of problems
40 Standards: GGF
- Background
  - Grid standards are now being developed at the Global Grid Forum (GGF)
  - The in-development Open Grid Services Infrastructure (OGSI) standard will extend Web Services (SOAP/XML, WSDL, etc.)
    - Names and a two-level naming scheme
    - Factories and lifetime management
    - A mandatory set of interfaces, e.g., discovery interfaces
- OGSA: Open Grid Services Architecture
  - The over-arching architecture
  - Still in development
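The factory and lifetime-management pattern that OGSI specifies can be sketched as follows. This is only an illustration of the idea (a factory creates service instances with an explicit termination time, after which they can be reclaimed); the class and method names here are invented, not the OGSI interfaces themselves:

```python
import itertools

class ServiceInstance:
    def __init__(self, handle, termination_time):
        self.handle = handle              # stable, abstract name (cf.
        self.termination_time = termination_time  # two-level naming)

    def expired(self, now):
        return now >= self.termination_time

class Factory:
    """Creates instances and reclaims them when their lifetime ends."""
    _ids = itertools.count()

    def __init__(self):
        self.instances = {}

    def create(self, lifetime, now):
        handle = f"svc-{next(self._ids)}"
        inst = ServiceInstance(handle, now + lifetime)
        self.instances[handle] = inst
        return inst

    def sweep(self, now):
        """Destroy instances whose termination time has passed."""
        for h in [h for h, i in self.instances.items() if i.expired(now)]:
            del self.instances[h]

f = Factory()
a = f.create(lifetime=10, now=0)
b = f.create(lifetime=100, now=0)
f.sweep(now=50)          # a has expired, b survives
assert a.handle not in f.instances
assert b.handle in f.instances
```

Explicit, renewable lifetimes mean a grid of services cleans up after client failures instead of leaking instances forever, which is why OGSI makes lifetime management mandatory.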
41 Summary
- Grids are about resource federation and sharing
- Grids are here today. They are being used in production computing in industry to solve real problems and provide real value.
  - Compute Grids
  - Data Grids
- We believe that users want high-level abstractions, and don't want to think about the grid.
  - Need low activation energy and legacy support
- There are a number of challenges to be solved, and different applications and organizations want to solve them differently
  - Policy heterogeneity
  - Strong separation of policy and mechanism
- Several areas where really good policies are still lacking
  - Scheduling
  - Security and security policy interactions
  - Failure recovery (and the interaction of different policies)