Title: Grid Application Programming Environments: Ibis, the GAT, and beyond
1 Grid Application Programming Environments: Ibis, the GAT, and beyond
- Thilo Kielmann
- Vrije Universiteit, Amsterdam
- kielmann_at_cs.vu.nl
2 CoreGRID Network of Excellence
- Funded by European Commission (IST, 6th Framework)
- 8.2 MEuro, for 4 years, started Sep 2004
- Goal: integrating the research of the major European groups working on Grids
- Currently 42 partner sites
- 6 Virtual Institutes
- Knowledge and Data Management
- Programming Models (VUA)
- System Architecture
- Information and Monitoring Services
- Resource Management and Scheduling
- Systems, Tools, and Environments (VUA)
3 A Grid Application Execution Scenario
4 Functional Properties
- What applications need to do:
- Access to compute resources, job spawning and scheduling
- Access to file and data resources
- Communication between parallel and distributed processes
- Application monitoring and steering
5 Non-functional Properties
- What else needs to be taken care of:
- Performance
- Fault tolerance
- Security and trust
- Platform independence
6 Ibis
7 The Ibis System
- Java-centric: write once, run anywhere
- Efficient communication (pure Java or native)
- Programming models
- RMI (remote method invocation)
- GMI (group method invocation)
- Collective communication (MPI-like) and more
- RepMI (replicated method invocation)
- Strong consistency
- Satin (divide and conquer)
- MPJ (Java binding for MPI)
8 Satin: Divide-and-conquer
- Effective paradigm for Grid applications (hierarchical)
- Satin: Grid-friendly load balancing (aware of cluster hierarchy)
- Missing support for:
- Fault tolerance
- Malleability
- Migration
9 Satin Example: Fibonacci
    class Fib {
        int fib (int n) {
            if (n < 2) return n;
            int x = fib(n-1);
            int y = fib(n-2);
            return x + y;
        }
    }
Single-threaded Java
10 Satin Example: Fibonacci
    public interface FibInter extends ibis.satin.Spawnable {
        public int fib (int n);
    }

    class Fib extends ibis.satin.SatinObject
              implements FibInter {
        public int fib (int n) {
            if (n < 2) return n;
            int x = fib(n-1); /* spawned */
            int y = fib(n-2); /* spawned */
            sync();
            return x + y;
        }
    }
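Running the Satin version requires the Ibis/Satin libraries. For readers who want to try the same spawn/sync structure with only a stock JDK, the sketch below uses the standard `java.util.concurrent` fork/join framework as a rough analogy: `fork()` stands in for a spawned call and `join()` for `sync()`. The class name `ForkJoinFib` is made up for this example, and nothing here reflects how Satin itself is implemented.

```java
import java.util.concurrent.ForkJoinPool;
import java.util.concurrent.RecursiveTask;

// Analogy only: standard JDK fork/join, not Satin.
class ForkJoinFib extends RecursiveTask<Integer> {
    private final int n;

    ForkJoinFib(int n) { this.n = n; }

    @Override
    protected Integer compute() {
        if (n < 2) return n;
        ForkJoinFib x = new ForkJoinFib(n - 1);
        ForkJoinFib y = new ForkJoinFib(n - 2);
        x.fork();                    // roughly a spawned call: may run on another worker
        int yResult = y.compute();   // compute the second branch in the current worker
        int xResult = x.join();      // roughly sync(): wait for the forked branch
        return xResult + yResult;
    }

    // Convenience entry point running on the JDK's common pool.
    static int fib(int n) {
        return ForkJoinPool.commonPool().invoke(new ForkJoinFib(n));
    }

    public static void main(String[] args) {
        System.out.println(fib(20)); // prints 6765
    }
}
```

As in the Satin version, the recursive calls are identical to the single-threaded code; only the points where work may be handed off and where results are needed are marked.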
11 Satin: Grid-friendly load balancing (aware of cluster hierarchy)
- Random Stealing (RS)
- Provably optimal on a single cluster (Cilk)
- Problems on multiple clusters:
- (C-1)/C stealing over WAN
- Synchronous protocol
- Satin: Cluster-aware Random Stealing (CRS)
- When idle:
- Send asynchronous steal request to random node in different cluster
- In the meantime steal locally (synchronously)
- Only one wide-area steal request at a time
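The CRS rules above can be sketched as a small victim-selection policy. This is a minimal illustration, not Satin's code: it assumes C equal-sized clusters with nodes numbered consecutively per cluster, at least two clusters and two nodes per cluster, and it models no networking. All names (`CrsSketch`, `onIdle`, `onWanReply`, `wanFraction`) are hypothetical. `wanFraction` reads the slide's (C-1)/C as the approximate share of plain random-stealing attempts that hit a remote cluster under those assumptions.

```java
import java.util.Random;

// Hypothetical sketch of CRS victim selection only; no real communication.
class CrsSketch {
    private final int self, clusters, nodesPerCluster;
    private final Random rng;
    private boolean wanRequestPending = false;

    CrsSketch(int self, int clusters, int nodesPerCluster, long seed) {
        this.self = self;
        this.clusters = clusters;
        this.nodesPerCluster = nodesPerCluster;
        this.rng = new Random(seed);
    }

    // Approximate share of random steals that cross the WAN (assumes equal clusters).
    static double wanFraction(int clusters) {
        return (clusters - 1) / (double) clusters;
    }

    int clusterOf(int node) { return node / nodesPerCluster; }

    // Random node in our own cluster, never ourselves.
    int pickLocalVictim() {
        int base = clusterOf(self) * nodesPerCluster;
        int victim = base + rng.nextInt(nodesPerCluster - 1);
        return victim >= self ? victim + 1 : victim;
    }

    // Random node in a random other cluster.
    int pickRemoteVictim() {
        int c = rng.nextInt(clusters - 1);
        if (c >= clusterOf(self)) c++;
        return c * nodesPerCluster + rng.nextInt(nodesPerCluster);
    }

    // Called when idle. Returns {remote victim or -1, local victim}.
    // A remote (asynchronous) request is issued only if none is outstanding.
    int[] onIdle() {
        int remote = -1;
        if (!wanRequestPending) {
            wanRequestPending = true;
            remote = pickRemoteVictim();
        }
        return new int[] { remote, pickLocalVictim() };
    }

    // Called when the wide-area steal reply (job or failure) arrives.
    void onWanReply() { wanRequestPending = false; }
}
```

The `wanRequestPending` flag is what enforces "only one wide-area steal request at a time"; local steals proceed on every idle round regardless.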
12 Satin: Fault-tolerance, malleability, migration
- Can be implemented by handling processors joining or leaving the ongoing computation
- Processors may leave either unexpectedly (crash) or gracefully
- Handling joining processors is trivial:
- Let them start stealing jobs
- Handling leaving processors is harder:
- Recompute missing jobs
- Problems: orphan jobs, partial results from gracefully leaving processors
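To make "recompute missing jobs" concrete, here is one simple way the bookkeeping could look: each node remembers which jobs were stolen from it and by whom, and puts them back on its own work queue when the thief leaves. This is an illustrative assumption, not a description of Satin's actual mechanism; the class `StolenJobTable` and its methods are hypothetical, and the sketch does not address orphan jobs or partial results.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical bookkeeping for jobs stolen from this node.
class StolenJobTable {
    private final Map<String, List<Integer>> stolenBy = new HashMap<>();
    private final Deque<Integer> workQueue = new ArrayDeque<>();

    // A thief took a job from us.
    void recordSteal(String thief, int jobId) {
        stolenBy.computeIfAbsent(thief, k -> new ArrayList<>()).add(jobId);
    }

    // The thief returned the job's result, so it is no longer outstanding.
    void recordResult(String thief, int jobId) {
        List<Integer> jobs = stolenBy.get(thief);
        if (jobs != null) jobs.remove(Integer.valueOf(jobId));
    }

    // A processor left (crash or graceful): requeue its unfinished jobs.
    // Returns the number of jobs that will be recomputed.
    int onLeave(String processor) {
        List<Integer> lost = stolenBy.remove(processor);
        if (lost == null) return 0;
        workQueue.addAll(lost);
        return lost.size();
    }

    Deque<Integer> queue() { return workQueue; }
}
```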
13 Summary: Ibis
- Java-centric Grid programming environment
- Write once, run anywhere
- Efficient communication
- Satin: Divide-and-Conquer
- CRS work stealing for clusters
- Fault tolerance integrated
14 The Grid Application Toolkit (GAT)
15 Simple API for Grid Applications (SAGA)
- Follow-up to GAT efforts
- SAGA-RG within GGF is working on an upcoming standard for a simple Grid API
- 80 / 20 rule:
- With 20% of the effort, achieve 80% of the functionality
- Strictly driven by application use cases
- API defines large part of a Grid programming model
16 What belongs to a Grid API?
- Functionality, e.g. (as in SAGA today)
- Jobs (submission and management)
- Files (and logical/replicated files)
- Streams
- Later in SAGA
- Steering, monitoring, workflow, GridRPC, GridCPR
- Security
- Error handling
- Asynchronous operation (-> Tasks)
17 SAGA Jobs
    interface Job {
        void getJobId         (out string        jobId);
        void getJobState      (out JobState      state);
        void getJobInfo       (out JobInfo       info);
        void getJobDefinition (out JobDefinition jobDef);
        void getJobExitStatus (out JobExitStatus exitStatus);
        void suspend          ();
        void resume           ();
        void hold             ();
        void release          ();
        void checkpoint       ();
        void migrate          (in  JobDefinition jobDef);
        void terminate        ();
        void signal           (in  int           signum);
    }
18 SAGA JobService
    interface JobService {
        void submitJob (in  JobDefinition   jobDef,
                        out Job             job);
        void runJob    (in  string          host,
                        in  string          commandline,
                        out opaque          stdin,
                        out opaque          stdout,
                        out opaque          stderr,
                        out Job             job);
        void list      (out array<string,1> jobIdList);
        void getJob    (in  string          jobID,
                        out Job             job);
    }
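The two interfaces above are language-neutral; how they map to a concrete language is not shown on the slides. As a rough idea of what application code against them might look like, the sketch below uses hypothetical Java interfaces in which `out` parameters become return values. Only a subset of the operations is included, `JobDefinition` is simplified to a `String`, the `JobState` values are invented (the slide does not list them), and `InMemoryJobService` is a stand-in that starts no real jobs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical state values; the slide does not list them.
enum JobState { RUNNING, SUSPENDED, TERMINATED }

// Subset of the Job interface, with out-parameters turned into return values.
interface Job {
    String getJobId();
    JobState getJobState();
    void suspend();
    void resume();
    void terminate();
}

// Subset of JobService; JobDefinition is simplified to a String here.
interface JobService {
    Job submitJob(String jobDefinition);
    List<String> list();
    Job getJob(String jobId);
}

// In-memory stand-in: tracks state transitions only, runs nothing.
class InMemoryJobService implements JobService {
    private final Map<String, Job> jobs = new LinkedHashMap<>();
    private int nextId = 1;

    @Override
    public Job submitJob(String jobDefinition) {
        final String id = "job-" + nextId++;
        Job job = new Job() {
            private JobState state = JobState.RUNNING;
            @Override public String getJobId() { return id; }
            @Override public JobState getJobState() { return state; }
            @Override public void suspend() {
                if (state == JobState.RUNNING) state = JobState.SUSPENDED;
            }
            @Override public void resume() {
                if (state == JobState.SUSPENDED) state = JobState.RUNNING;
            }
            @Override public void terminate() { state = JobState.TERMINATED; }
        };
        jobs.put(id, job);
        return job;
    }

    @Override public List<String> list() { return new ArrayList<>(jobs.keySet()); }
    @Override public Job getJob(String jobId) { return jobs.get(jobId); }
}
```

A caller would submit through the service, keep the returned `Job` handle (or look it up again by ID via `getJob`), and then manage the job through that handle.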
19 SAGA Files
    class File {
        void read   (in    long        len_in,
                     out   string      buffer,
                     out   long        len_out);
        void write  (in    long        len_in,
                     in    string      buffer,
                     out   long        len_out);
        void seek   (in    long        offset,
                     in    SeekMode    whence,
                     out   long        position);
        void readV  (inout array<ivec> ivec);
        void writeV (inout array<ivec> ivec);
        ...
    }
- (directories left out for brevity)
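The `read`/`write`/`seek` triple follows the familiar POSIX-style file model: a current position that `read` and `write` advance and `seek` moves relative to some origin. The sketch below illustrates that model on an in-memory buffer. It is an assumption-laden illustration, not the SAGA binding: `InMemoryFile` is hypothetical, the `SeekMode` values are invented (the slide does not list them), `out` parameters become return values, and clamping seeks to the file bounds is a choice made here for simplicity.

```java
// Hypothetical seek origins; the slide does not list the SeekMode values.
enum SeekMode { START, CURRENT, END }

// In-memory stand-in for a file with a current position.
class InMemoryFile {
    private final StringBuilder data;
    private int position = 0;

    InMemoryFile(String initial) { data = new StringBuilder(initial); }

    // Reads up to lenIn characters from the current position and advances it.
    String read(int lenIn) {
        int n = Math.max(0, Math.min(lenIn, data.length() - position));
        String out = data.substring(position, position + n);
        position += n;
        return out;
    }

    // Overwrites from the current position, growing the file if needed.
    // Returns the number of characters written.
    int write(String buffer) {
        int end = position + buffer.length();
        if (end > data.length()) data.setLength(end);
        data.replace(position, end, buffer);
        position = end;
        return buffer.length();
    }

    // Moves the position relative to whence; returns the new position.
    // Out-of-range targets are clamped to [0, length] in this sketch.
    int seek(int offset, SeekMode whence) {
        int base = whence == SeekMode.START ? 0
                 : whence == SeekMode.CURRENT ? position
                 : data.length();
        position = Math.max(0, Math.min(base + offset, data.length()));
        return position;
    }
}
```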
20 SAGA Replicated Files
    class LogicalFile {
        void addLocation    (in  name);
        void removeLocation (in  name);
        void listLocations  (out array<string,1> names);
        void replicate      (in  name);
    }
- (directories left out for brevity)
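A logical file is essentially a name that maps to a set of physical replica locations. The following sketch shows that bookkeeping only. It assumes `replicate` means "copy the data to a new location and register it", which the slide does not spell out; the actual data transfer is stubbed, and `SketchLogicalFile` and the example URLs are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical replica catalogue entry: one logical file, many locations.
class SketchLogicalFile {
    private final Set<String> locations = new LinkedHashSet<>();

    void addLocation(String name) { locations.add(name); }

    void removeLocation(String name) { locations.remove(name); }

    List<String> listLocations() { return new ArrayList<>(locations); }

    // Assumed semantics: copy to the new location, then register it.
    void replicate(String name) {
        // A real implementation would transfer the data here.
        addLocation(name);
    }

    public static void main(String[] args) {
        SketchLogicalFile lf = new SketchLogicalFile();
        lf.addLocation("gsiftp://host-a.example.org/data/input.dat");
        lf.replicate("gsiftp://host-b.example.org/data/input.dat");
        System.out.println(lf.listLocations().size()); // prints 2
    }
}
```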
21 SAGA Security
    enum contextType {
        X509     = 0,
        MyProxy  = 1,
        SSH      = 2,
        Kerberos = 3,
        UserPass = 4
    }

    interface Context extends-all SAGA.Attribute {
        constructor (in  contextType type);
        getType     (out contextType type);
    }
- Every SAGA object gets a Context as parameter to its constructor.
22 SAGA Error Handling
    enum ExceptionCategory {
        LibraryRecoverableError,
        LibraryFatalError,
        BackEndRecoverableError,
        BackEndFatalError
    }

    interface Exception extends sidl.SIDLException {
        getExceptionCategory (out ExceptionCategory category);
        getMessage           (out String            message);
    }

    interface ErrorHandler {
        hasError       (out boolean   state);
        getErrorObject (out Exception error);
    }
- Each SAGA object implements ErrorHandler.
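Because every object implements `ErrorHandler`, callers can check for errors after an operation instead of relying on thrown exceptions, which suits languages without exception support. The sketch below shows that calling pattern in Java. It is illustrative only: `SagaError` and `ErrorReportingDirectory` are hypothetical, `out` parameters become return values, and which category a given failure falls into is an arbitrary choice made for the example.

```java
import java.util.HashSet;
import java.util.Set;

enum ExceptionCategory {
    LibraryRecoverableError, LibraryFatalError,
    BackEndRecoverableError, BackEndFatalError
}

// Hypothetical error object carrying a category and a message.
class SagaError {
    final ExceptionCategory category;
    final String message;

    SagaError(ExceptionCategory category, String message) {
        this.category = category;
        this.message = message;
    }
}

interface ErrorHandler {
    boolean hasError();
    SagaError getErrorObject();
}

// Hypothetical object that records the error of its most recent operation.
class ErrorReportingDirectory implements ErrorHandler {
    private final Set<String> entries = new HashSet<>();
    private SagaError lastError; // null when the last operation succeeded

    void mkdir(String name) {
        lastError = null;
        if (name == null || name.isEmpty()) {
            // Category choice is an assumption for illustration.
            lastError = new SagaError(ExceptionCategory.LibraryRecoverableError,
                                      "empty directory name");
        } else if (!entries.add(name)) {
            lastError = new SagaError(ExceptionCategory.BackEndRecoverableError,
                                      "already exists: " + name);
        }
    }

    @Override public boolean hasError() { return lastError != null; }
    @Override public SagaError getErrorObject() { return lastError; }

    // Callers might retry or continue only on recoverable categories.
    static boolean isRecoverable(SagaError e) {
        return e.category == ExceptionCategory.LibraryRecoverableError
            || e.category == ExceptionCategory.BackEndRecoverableError;
    }
}
```

Splitting categories along library vs. back end and recoverable vs. fatal lets an application decide whether to retry, switch back ends, or give up without parsing messages.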
23 SAGA Tasks
- Asynchronous operations
- Bulk (async.) operations
- Single-threaded implementation support
    package Task {
        enum State {
            Pending   = 0,
            Running   = 1,
            Finished  = 2,
            Cancelled = 3
        }
        ...
24 Tasks and Containers
    interface Task {
        void run      ();
        void wait     (in  double  timeout,
                       out boolean finished);
        void cancel   ();
        void getState (out State   state);
    }

    class TaskContainer {
        void addTask    (in  Task           task);
        void removeTask (in  Task           task);
        void run        ();
        void wait       (in  double         timeout,
                         out array<Task,1>  finished);
        void cancel     ();
        void getStates  (out array<State,1> states);
        void listTasks  (out array<Task,1>  tasks);
    }
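The task state model and the container can be illustrated with a deliberately single-threaded sketch, in line with the "single-threaded implementation support" goal. Everything below is an assumption for illustration: `SketchTask` and `SketchTaskContainer` are hypothetical, tasks execute synchronously inside `run()`, so there is no real waiting and no timeout, and `finishedTasks()` stands in for the `finished` output of `wait`.

```java
import java.util.ArrayList;
import java.util.List;

enum State { Pending, Running, Finished, Cancelled }

// Hypothetical task that runs its body synchronously when run() is called.
class SketchTask {
    private final Runnable body;
    private State state = State.Pending;

    SketchTask(Runnable body) { this.body = body; }

    void run() {
        if (state != State.Pending) return; // already run or cancelled
        state = State.Running;
        body.run();
        state = State.Finished;
    }

    // In this sketch only tasks that have not started can be cancelled.
    void cancel() {
        if (state == State.Pending) state = State.Cancelled;
    }

    State getState() { return state; }
}

// Hypothetical container applying bulk operations to its tasks.
class SketchTaskContainer {
    private final List<SketchTask> tasks = new ArrayList<>();

    void addTask(SketchTask t) { tasks.add(t); }
    void removeTask(SketchTask t) { tasks.remove(t); }

    void run() { for (SketchTask t : tasks) t.run(); }
    void cancel() { for (SketchTask t : tasks) t.cancel(); }

    // Stand-in for wait(): reports which tasks have finished so far.
    List<SketchTask> finishedTasks() {
        List<SketchTask> done = new ArrayList<>();
        for (SketchTask t : tasks) {
            if (t.getState() == State.Finished) done.add(t);
        }
        return done;
    }

    List<State> getStates() {
        List<State> states = new ArrayList<>();
        for (SketchTask t : tasks) states.add(t.getState());
        return states;
    }

    List<SketchTask> listTasks() { return new ArrayList<>(tasks); }
}
```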
25 Instantiating Tasks
- Have three versions of each operation:
- Synchronous
- Asynchronous (start immediately)
- Task (start explicitly)
    d.mkdir ("test/");
    saga::task t_1 = d.mkdir_async ("test/");
    saga::task t_2 = d.mkdir_task  ("test/");
    t_2.run  ();
    t_1.wait ();
    t_2.wait ();
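The same three flavours can be mimicked with standard JDK classes, which may help readers place them: a plain method call is the synchronous version, `CompletableFuture.runAsync` starts immediately, and a `FutureTask` does nothing until its `run()` is called. This is an analogy under stated assumptions, not the SAGA API: `AsyncDirectory` and its method names are hypothetical, and "creating a directory" is reduced to adding a name to a set.

```java
import java.util.Set;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.Future;
import java.util.concurrent.FutureTask;

// Hypothetical directory offering sync, async, and task flavours of mkdir.
class AsyncDirectory {
    private final Set<String> entries = ConcurrentHashMap.newKeySet();

    // Synchronous: done when the call returns.
    void mkdir(String name) { entries.add(name); }

    // Asynchronous: starts immediately on a background thread.
    CompletableFuture<Void> mkdirAsync(String name) {
        return CompletableFuture.runAsync(() -> mkdir(name));
    }

    // Task: created but not started; the caller must invoke run().
    FutureTask<Void> mkdirTask(String name) {
        return new FutureTask<Void>(() -> { mkdir(name); return null; });
    }

    boolean exists(String name) { return entries.contains(name); }

    // Blocks until the operation completes, wrapping checked exceptions.
    static void await(Future<?> f) {
        try {
            f.get();
        } catch (InterruptedException | ExecutionException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        AsyncDirectory d = new AsyncDirectory();
        d.mkdir("test/");
        CompletableFuture<Void> t1 = d.mkdirAsync("test1/");
        FutureTask<Void> t2 = d.mkdirTask("test2/");
        t2.run();   // explicit start; executes on the calling thread
        await(t1);
        await(t2);
        System.out.println(d.exists("test/") && d.exists("test1/") && d.exists("test2/"));
    }
}
```

The explicit-start flavour is what makes bulk operations possible: tasks can be collected first and started together later.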
26 Summary: SAGA
- Standardize a simple API for Grid applications
- Driven by user communities
- API completion: Q1/2006
- First engine implementation in C (SAGA-A) almost complete
27 Beyond Ibis and GAT/SAGA: Synthesizing a Generic Architecture
28 Beyond Ibis and GAT/SAGA: Building Component-based Grid Systems
29 Conclusions
- Ibis: efficient communication for (Java-based) Grid applications
- GAT/SAGA: simple API to various Grid systems and middlewares
- Work in progress:
- designing/building a generic grid platform
- www.coregrid.net
- www.cs.vu.nl/ibis
- wiki.cct.lsu.edu/saga/space/start
30 Acknowledgements
- Ibis:
- Henri Bal, Jason Maassen, Rob van Nieuwpoort, Gosia Wrzesinska, Ceriel Jacobs
- GAT / SAGA:
- Rob van Nieuwpoort, Jason Maassen, Andre Merzky, Michel Zandstra, Stephan Hirmer, Hartmut Kaiser, Shantenu Jha, Tom Goodale, ...
- Funding:
- European Commission (CoreGRID, GridLab)
- Dutch Ministry of Education, Culture, and Science (OCW) via www.vl-e.nl