Title: The Grid: Experience and Practice
1The Grid: Experience and Practice
Seminar April 14th 2004
- Mark Baker
- The Distributed Systems Group
- University of Portsmouth,
- http://dsg.port.ac.uk/mab/
2Outline
- Characterisation of the Grid.
- What is not a grid!
- Evolution of the Grid.
- Experiences with grid middleware.
- Comments on grid software.
- Observations and Summary.
- DSG Projects
- GridRM,
- jGMA,
- Semantic Logging,
- MPJ.
3Characterisation of the Grid
- In 1998, Ian Foster and Carl Kesselman provided
an initial definition in The Grid: Blueprint for
a New Computing Infrastructure (see ref 1).
- "A computational grid is a hardware and software
infrastructure that provides dependable,
consistent, pervasive, and inexpensive access to
high-end computational capabilities."
- This particular definition stems from the earlier
roots of the Grid, that of inter-connecting high
performance facilities at various US laboratories
and universities.
4Characterisation of the Grid
- Since this early definition there have been a
number of other attempts to define what a grid
is.
- For example:
- A grid is a software framework providing layers
of services to access and manage distributed
hardware and software resources (CCA - see ref
2).
- A widely distributed network of high-performance
computers, stored data, instruments, and
collaboration environments shared across
institutional boundaries (IPG - see ref 3).
5Characterisation of the Grid
- In 2001, Foster, Kesselman and Tuecke refined
their definition of a grid to:
- "co-ordinated resource sharing and problem
solving in dynamic, multi-institutional virtual
organizations" (see ref 4).
- This latest definition is the one most commonly
used today to abstractly define a grid.
6Characterisation of the Grid
- Foster later produced a three-part checklist (see
ref 5) that could be used to help understand
exactly what can be identified as a grid system:
- Co-ordinated resource sharing with no centralised
point of control, where the users reside
within different administrative domains.
- If not true, it is probably the case that this is
not a grid system!
- Standard, open, general-purpose protocols and
interfaces.
- If not, it is unlikely that system components
will be able to communicate or inter-operate, and
it is likely that we are dealing with an
application-specific system, and not the Grid.
7Characterisation of the Grid
- Delivering non-trivial qualities of service -
here we are considering how the components that
make up a grid can be used in a co-ordinated way
to deliver combined services, which are
appreciably greater than the sum of the individual
components.
- These services may be associated with throughput,
response time, mean time between failures,
security, or many other facets.
8Characterisation of the Grid
- From a commercial viewpoint, IBM defines a grid
as:
- a standards-based application/resource sharing
architecture that makes it possible for
heterogeneous systems and applications to share
compute and storage resources transparently (see
ref 6).
9What is not a Grid!
- A cluster, a network attached storage device, a
desktop PC, a scientific instrument, a network:
these are not grids.
- Each might be an important component of a grid,
but by itself, it does not constitute a grid.
- Screen saver/cycle stealers:
- SETI@home, fold@home, etc.,
- Other application-specific distributed computing.
- Most of the current Grid providers:
- Proprietary technology with closed model of
operation.
- Globus:
- It is a toolkit to build a system that might work
as or within a grid.
- Sun Grid Engine, Platform LSF and related.
- Most anything referred to as a Grid by marketeers!
10The Evolution of the Grid: The First Generation
- The early to mid 1990s marks the emergence of the
early metacomputing or grid environments.
- Typically, the objective of these early
metacomputing projects was to provide
computational resources to a range of high
performance applications.
- Two representative projects in the vanguard of
this type of technology were FAFNER (see ref 7)
and I-WAY (see ref 8), both circa 1995.
11Convergence of Technologies
- Both projects attempted to provide metacomputing
resources from opposite ends of the computing
spectrum:
- FAFNER was Web-based for factoring the RSA
challenge, capable of running on any workstation
with more than 4 Mbytes of memory, and was
aimed at a trivially parallel application.
- I-WAY was a means of unifying the resources of
large US supercomputing centres, and was targeted
at high-performance applications (compute/data
intensive).
- Each project was in the vanguard of metacomputing
and helped pave the way for many of the
succeeding projects.
- FAFNER was the forerunner of the likes of
SETI@home, fold@home and Distributed.Net,
- I-WAY was the same for Globus, Legion, and
UNICORE.
12Convergence of Technologies
- Since the emergence of the second generation of
systems (e.g. Globus/Legion circa 1995) a
number of classes of wide-area
systems have been developed:
- Grid-based, aimed at HPC compute/data
intensive, e.g. Globus/Legion/UNICORE,
- Object-based, e.g. CORBA/CCA/Jini/Java-RMI,
- Web, e.g. Javelin, SETI@home, Charlotte,
fold@home, ParaWeb, distributed.net,
- Enterprise - bespoke systems, such as IBM's
WebSphere, BEA's WebLogic, and Microsoft's .Net
platform.
13Convergence of Technologies
- The developers in these four areas, over the
years, evolved their systems: there were many
overlaps, various collaborations started, and, to
an extent, a realisation was reached that a
unified approach to the development of middleware
to support wide-area applications was needed.
- Unifying standards bodies helped this process,
for example GGF, OASIS, W3C, and IETF.
- Convergence of WS, HPC, OO, SOA, ...
- A result of this was that the Open Grid Services
Architecture (OGSA) was announced at GGF4 in Feb
2002, and was declared their flagship
architecture in March 2004.
- OGSA was based on Web Services technologies.
14Convergence of Technologies
- The OGSA document, first released at GGF11 in
June 2004, gave current thinking on the required
capabilities and was released in order to
stimulate further discussion.
- Note: instantiations of OGSA depend on emerging
specifications.
- Currently the OGSA document does not contain
sufficient information to develop an actual
implementation of an OGSA-based system.
- The first OGSA-based reference implementation was
GT3 OGSI, released in July 2003.
- Major problems were identified with OGSI; some
were political and others were technical.
15Convergence of Technologies
- In Jan 2004, a significant shift happened when
WS-RF was announced.
- Problems were identified with OGSI:
- Re-implementation of a lot of layers which are
already standardised in commodity WS, for example
GSDL,
- Felt to put too much in one specification,
- Did not work well with existing tooling for WS,
- Too OO!
- Whereas with WS-RF:
- New mechanisms build on top of existing WS
standards and add a few,
- Basically rebuilding OGSI functionality using WS
tooling, extending where necessary,
- Dependent on six new or emerging WS
specifications!
16Grid and Web Services: Convergence!
(Figure: two tracks that started far apart, Grid (GT1, GT2, OGSI) and Web (HTTP, WSDL, WS-), converging on WSRF (WSDL 2, WSDM).)
- WSRF means that Grid and Web communities are
moving forward on a common base!
17Emerging Grid Standards
Latest issue of IEEE Computer
18Emerging Grid Standards
19Experiences with the Grid
- Background:
- First installed Globus at Portsmouth back in
early 2000: GT1.
- Developed monitoring system based on Globus
MDS, and Liquid Crystal Portal,
- Oct 2003: funded to lead the OGSA Testbed:
- Consortium of Daresbury, Manchester, Reading and
Westminster,
- Funded to explore, investigate and feed back our
experiences installing, maintaining and using
OGSI (GT3/OGSILite) and deploying our
applications across the testbed,
- Details at http://dsg.port.ac.uk/projects/ogsa-testbed/.
20The OGSA Testbed Project
21Recap - Core Globus Services
- GridFTP - high-performance, secure, reliable data
transfer protocol for wide-area networks.
- GRAM (Globus Resource Allocation Manager)
provides a standard interface for requesting and
using remote resources for the execution of
"jobs".
- The most common use is remote job submission and
control.
- MDS (Monitoring and Discovery System) is the
information services component and provides
information about the available resources and
their status.
- GSI (Grid Security Infrastructure) for secure
authentication and communication over an open
network.
- GSI provides a number of useful services for
Grids, including mutual authentication and single
sign-on.
22Experiences with Globus
- Documentation:
- 3.2 installation guide is better, <3.0 was a
nightmare.
- Earlier documents had gaps which were glossed
over and things did not happen for us as the docs
described.
- Size of install (GT3):
- 251 Mbytes for 3.0.2, 320 Mbytes for 3.2.
- Time to compile:
- 6 hours on a 1 GHz 256 Mbyte PC,
- 2 hours on dual 2.8 GHz with 2 Gbytes of RAM.
- Setting up GT security and certificates:
- Getting e-Science certificate OK,
- gridmap file is an ACL, fairly easy; problem is
you need to hand edit the file.
- For small organisations with few users this is
fine, but many users means more work: need to
add GridPP patch.
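The hand-editing step above can be scripted. The following is a minimal sketch, assuming the usual grid-mapfile layout of one entry per line with a quoted certificate DN followed by a local account name. The class name `GridMapSketch`, the DNs and the account names are all invented for illustration, and this is not the GridPP patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: parses and extends grid-mapfile text held in memory.
class GridMapSketch {

    // Parse lines of the assumed form: "<certificate DN>" <local account>
    static Map<String, String> parse(String text) {
        Map<String, String> entries = new LinkedHashMap<>();
        for (String line : text.split("\n")) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue; // skip blanks and comments
            int close = line.lastIndexOf('"');
            if (!line.startsWith("\"") || close <= 0) continue;   // no quoted DN, ignore line
            String dn = line.substring(1, close);
            String account = line.substring(close + 1).trim();
            if (!account.isEmpty()) entries.put(dn, account);
        }
        return entries;
    }

    // Render the entries back into grid-mapfile text, one entry per line.
    static String render(Map<String, String> entries) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, String> e : entries.entrySet()) {
            sb.append('"').append(e.getKey()).append("\" ").append(e.getValue()).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Invented example content; a real file would be read from disk.
        String file = "# example entries (invented)\n"
                + "\"/C=UK/O=eScience/OU=Portsmouth/CN=alice example\" alice\n";
        Map<String, String> entries = parse(file);
        entries.put("/C=UK/O=eScience/OU=Portsmouth/CN=bob example", "bob");
        System.out.print(render(entries));
    }
}
```

Keeping the entries in an insertion-ordered map preserves the order of the existing file, which makes a scripted change easy to review against the hand-edited original.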
23Experiences with Globus
- Test programs:
- Yes, but they do not test whether a service is
functioning correctly.
- We used/developed GT3GITS scripts.
- Bugs/features - reporting!
- Yes, via http://bugzilla.globus.org/globus/
- Week commencing March 20th 2005 - 25 bugs (just
Monday), marked as new 241 ...
- Total unresolved 438 and resolved 1950, just for
globus.org.
- Issue with which ones get prioritised.
- When we asked a non-standard (newbie) question on
the mailing list we never got a useful reply,
just lots of people saying "yep same problem
here".
- Application versus installation answers.
24Experiences with Globus
- Hardwired software, pinned to a platform!?
- Pretty good now for Java!
- Usually works for 32-bit platforms; 64-bit
platforms, like IBM SP/HP, painful.
- Works on one Linux platform, but not another!
- Strange how it did not work on some
distributions, and worked better on Debian than
some versions of Redhat.
- Nowhere near as good as most portable projects
(e.g. Apache) which build on everything,
correctly.
- Implications of frequent updates and reinstalls:
- Often a complete rebuild/reinstall,
- Software not backwardly compatible,
- No direct path or bridge from GT 2.x -> 3.2 -> 4.0.
25Experiences with Globus
- Opening ports: many open!
- Globus container: 8080,
- Gatekeeper: 2119,
- GridFTP: 2811 and a range of TCP ports (roughly
256 ports as recommended by UK grid-support
centre).
- Makes systems people VERY unhappy!
- Apache Tomcat as service container:
- GT comes with Tomcat, this was just a
development environment,
- Needed to deploy GT in Tomcat container:
- Memory problems, GC having problems, eventually
failed,
- Sorted out issues and now been running
continuously for 5 months.
- Needed to figure out ways of working with Tomcat
and GT.
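When negotiating with systems people it helps to know which of the ports listed on this slide are actually reachable. The sketch below tries a plain TCP connection to the container, gatekeeper and GridFTP control ports. The class name `PortCheckSketch`, the default host and the one-second timeout are assumptions, and a successful connection only shows that something is listening, not that the Globus service behind it works.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Hypothetical helper: reports which ports accept TCP connections.
class PortCheckSketch {

    // Returns true if a TCP connection to host:port succeeds within timeoutMs.
    static boolean isOpen(String host, int port, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMs);
            return true;
        } catch (IOException e) {
            return false; // refused, filtered by a firewall, or timed out
        }
    }

    public static void main(String[] args) {
        String host = args.length > 0 ? args[0] : "localhost"; // assumed target host
        int[] ports = {8080, 2119, 2811}; // container, gatekeeper, GridFTP control
        for (int port : ports) {
            String state = isOpen(host, port, 1000) ? "open" : "closed or filtered";
            System.out.println(port + " " + state);
        }
    }
}
```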
26Other Experiences
- Portsmouth firewall committee decided to stop
access to FTP on University systems and also
updated the firewall system:
- GridFTP stopped working!
- Took two weeks to convince systems people that
GridFTP was not working and was secure!
- OGSA-DAI: middleware based on OGSI for accessing
distributed databases.
- Natural to run examples first to test all was
well, so we did:
- One did not run, had a feature that we were
informed will be fixed in the next release! One
contained bugs,
- Confused roadmap now! Trying to support too many
grid platforms.
- January 2004 refactoring exercise:
- OGSI to WS-RF,
- No consultation, a slight hiccup!
27Summary of Experiences
- Globus is an ambitious effort to produce middleware
that satisfies the needs of wide-area distributed
applications.
- Good for people who are familiar with GT - like
us now, but it's a total disaster for a newbie:
- Expects application scientists to have tech
knowhow!
- Very steep learning curve.
- Globus is a worthy effort, but it is still
research software, with all the implications of
such.
- Many projects are staying with GT2.4, as this
provides a more stable platform.
- No new services developed over the last few
years.
- DataGrid/EGEE are having a significant effect on
future grid middleware offerings.
28Other Observations
- In UK, went to GT2 too early (probably
12 months), GT3 deprecated, now awaiting GT4
early 2005.
- OGSI -> WS-RF, done for the right reasons, but
announcement confounded the community and
frustrated many developers.
- GT is not production quality software yet, so
expect the associated problems.
- GSI is a success, being used widely by the
community.
- Need alternative OGSA instantiations; emerging
systems such as WSRFLite and UNICORE will help
this diversity.
- Need hardened and usable software, otherwise
the Grid will encounter its own AI Winter.
- UK OMII addressing this area.
29Other Observations
- Need money to develop robust middleware
infrastructure, not just money to do further
research in future infrastructure and
applications:
- Question: Is a research council (e.g. UK EPSRC)
the right place to allocate these funds from!?
- Currently much confusion as to which standard
to follow:
- WS-RF GT4 - 1st/2nd Quarter 2005,
- WS-GAF,
- WS/WS-I/WS-I.
- Many developers in the UK are using just Web
Services, mainly SOAP and WSDL.
- UDDI does not satisfy the needs for a grid
information service.
30DSG Projects
- The development of a selection of grid and cluster
middleware.
31DSG Projects
- GridRM: a unifying resource monitoring system,
capable of being used for a number of diverse
purposes including scheduling, performance,
faults, and policing QoS or SLA.
- jGMA: an event-based messaging system with an
integrated P2P-based registry.
- Semantic Logging: RDF-based system unifying and
annotating log data for a more complete
analysis of distributed systems.
- MPJ: Java MPI-based message-passing system and
runtime infrastructure.
- Others: Portals, OGSA-DAI, investigation of
NaradaBroker; can mention further if interested!
32GridRM
- A data gathering framework for monitoring and
managing the Grid
- http://gridrm.org/
33Background
- Lack of knowledge about the status of the
resources in any distributed system will hamper
strategies for optimal scheduling, allocation and
usage.
- There is a need for a ubiquitous framework that
provides information about the health and status
of Grid resources.
- Gathering resource information, such as:
- Compute (nodes, CPU, memory),
- Network (inter-site communications links, network
devices),
- Sensors (specialised devices, Web cam,
microphone),
- Software services (information services,
schedulers).
- Need a generic system that does not need another
local agent, but can utilise whatever exists:
- SNMP, Network Weather Service, NetLogger,
Ganglia, /proc, MDS, or other services.
34GridRM Structure
- Global layer of peer-related gateways.
- Each in turn has a local layer that interacts
with the local data sources, and/or a hierarchy
of child gateways.
35GridRM Architecture
36GridRM Local Layer
37GridRM Layered View
38GridRM Query API
- Producing an API is fairly simple, but creating
one that will be taken up and accepted is another
matter.
- We are using an API based on JDBC from Java.
- Example of API:
- Agent Driver Interface:
- forName("GridRM.sql.agent.NWSDriver")
- forName("GridRM.sql.agent.SNMPv1Driver")
- Connection Interface:
- String agentURL = "GridRMNWS/barney5550/PerfData"
- Connection con = DriverManager.getConnection(agentURL)
- Statement Interface:
- Statement stmt = con.createStatement()
- ResultSet rs = stmt.executeQuery("get CPU table")
- Manipulating Results:
- ResultSet is another interface; it contains a handful
of methods for manipulating the data returned
from the agent.
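The driver, connection, statement and result pattern on this slide can be shown end to end without the real GridRM classes. The sketch below is only an illustration of that JDBC-style flow: `AgentDriver`, `AgentConnection`, `FakeCpuDriver`, the `fake:` URL prefix and the canned CPU value are all hypothetical stand-ins, not the GridRM API.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins that mirror the JDBC pattern
// (driver -> connection -> query -> result rows); not GridRM's classes.
class QueryApiSketch {

    interface AgentDriver {
        boolean acceptsUrl(String url);
        AgentConnection connect(String url);
    }

    interface AgentConnection {
        List<Map<String, Object>> executeQuery(String sql);
    }

    // Registry of loaded drivers, playing the role of java.sql.DriverManager.
    static final List<AgentDriver> DRIVERS = new ArrayList<>();

    // Pick the first registered driver that recognises the URL.
    static AgentConnection getConnection(String url) {
        for (AgentDriver driver : DRIVERS) {
            if (driver.acceptsUrl(url)) return driver.connect(url);
        }
        throw new IllegalArgumentException("No driver for " + url);
    }

    // Fake driver that returns canned data instead of contacting a real agent.
    static class FakeCpuDriver implements AgentDriver {
        public boolean acceptsUrl(String url) {
            return url.startsWith("fake:");
        }

        public AgentConnection connect(String url) {
            return sql -> {
                Map<String, Object> row = new LinkedHashMap<>();
                row.put("host", url.substring("fake:".length()));
                row.put("cpuLoad", 0.42); // canned value, not a measurement
                List<Map<String, Object>> rows = new ArrayList<>();
                rows.add(row);
                return rows;
            };
        }
    }

    public static void main(String[] args) {
        DRIVERS.add(new FakeCpuDriver());
        AgentConnection con = getConnection("fake:barney");
        for (Map<String, Object> row : con.executeQuery("SELECT * FROM cpu")) {
            System.out.println(row);
        }
    }
}
```

The point of the pattern is that client code only ever sees the URL and the query; which agent protocol sits behind the connection is decided by whichever driver claims the URL.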
39GridRM Naming Schema
- No single naming schema for this area at the
moment.
- We needed something that can mark up the
information that can be gathered by the local
agents:
- Static and dynamic information:
- Name/IP/OS/Processor/NIC/...
- CPU load/memory available/disk space/network/...
- Did not want to produce our own schema, so chose
an emerging one that is increasingly being used -
Grid Laboratory Uniform Environment (GLUE)
schema:
- A schema that defines the attributes of computer
system resources (CE/NE/...).
- Others: CIM, UNICORE, etc.
40GridRM Drivers and Manager
- The GridRM Driver Manager gets data from the
Agent API and translates it into something that
the local agents can understand.
- The Driver Manager also provides other
functionality that is particular to GridRM, such
as configuration, caching, streaming or
pushing/pulling data to/from clients.
- The Driver Manager includes a simple low-level
API to interact with the local agents based on
a common sub-set of information that can be
retrieved from all the agents.
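Of the Driver Manager functions listed above, caching is the easiest to illustrate. The sketch below assumes a simple time-to-live policy keyed on the query string; the slides do not say how GridRM actually caches, so the class `CachingSketch`, the TTL policy and the injected clock are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.function.LongSupplier;

// Hypothetical sketch of query-result caching with a time-to-live;
// not GridRM's actual Driver Manager code.
class CachingSketch {

    private static final class Entry {
        final String value;
        final long fetchedAt;

        Entry(String value, long fetchedAt) {
            this.value = value;
            this.fetchedAt = fetchedAt;
        }
    }

    private final Map<String, Entry> cache = new HashMap<>();
    private final Function<String, String> agent; // stands in for a call to a local agent
    private final LongSupplier clock;             // millisecond clock, injectable for testing
    private final long ttlMs;
    int agentCalls = 0;                           // counts how often the agent was contacted

    CachingSketch(Function<String, String> agent, LongSupplier clock, long ttlMs) {
        this.agent = agent;
        this.clock = clock;
        this.ttlMs = ttlMs;
    }

    String query(String sql) {
        long now = clock.getAsLong();
        Entry cached = cache.get(sql);
        if (cached != null && now - cached.fetchedAt < ttlMs) {
            return cached.value; // still fresh: serve from the cache
        }
        agentCalls++;
        String value = agent.apply(sql); // stale or missing: ask the agent again
        cache.put(sql, new Entry(value, now));
        return value;
    }

    public static void main(String[] args) {
        CachingSketch manager = new CachingSketch(
                sql -> "result of [" + sql + "]", System::currentTimeMillis, 5_000);
        manager.query("SELECT * FROM cpu");
        manager.query("SELECT * FROM cpu"); // repeated query within the TTL
        System.out.println("agent calls: " + manager.agentCalls);
    }
}
```

A short TTL suits fast-changing values such as CPU load, while static attributes such as OS or processor type could be cached for much longer.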
41Local Layer Use of SQL
- SQL used extensively throughout the framework.
- All resources are seen as databases and queried
using SQL.
- Resource queries enter the framework as SQL
syntax.
- Pluggable resource drivers are implemented as
JDBC drivers:
- Translate SQL requests into native protocol.
- Normalise results according to selected schema.
- Framework benefits from a single, flexible
approach to resource interaction.
- Makes for a simple, extensible framework.
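The two driver duties named on this slide, translating an SQL request into native lookups and normalising the result to the selected schema, can be sketched as follows. This is a toy under stated assumptions: it only handles `SELECT a, b FROM t`, the native data is a plain map rather than a real agent protocol, and the attribute names on both sides of the mapping are invented rather than taken from GLUE or any agent.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of request translation and result normalisation.
class SqlTranslationSketch {

    // Assumed mapping from schema attribute names to one agent's native keys
    // (both sides invented for illustration).
    static final Map<String, String> SCHEMA_TO_NATIVE = new LinkedHashMap<>();
    static {
        SCHEMA_TO_NATIVE.put("HostName", "sysName");
        SCHEMA_TO_NATIVE.put("LoadLast1Min", "laLoad1");
    }

    // Only the simplest SELECT form is recognised in this sketch.
    private static final Pattern SELECT =
            Pattern.compile("(?i)\\s*SELECT\\s+(.+?)\\s+FROM\\s+(\\w+)\\s*");

    // Translate the column list into native keys, fetch the values,
    // and return them under the schema's names.
    static Map<String, String> query(String sql, Map<String, String> nativeData) {
        Matcher m = SELECT.matcher(sql);
        if (!m.matches()) {
            throw new IllegalArgumentException("Unsupported query: " + sql);
        }
        Map<String, String> row = new LinkedHashMap<>();
        for (String column : m.group(1).split("\\s*,\\s*")) {
            String nativeKey = SCHEMA_TO_NATIVE.get(column);
            if (nativeKey == null) {
                throw new IllegalArgumentException("Unknown attribute: " + column);
            }
            row.put(column, nativeData.get(nativeKey)); // schema name -> native value
        }
        return row;
    }

    public static void main(String[] args) {
        Map<String, String> nativeData = new LinkedHashMap<>();
        nativeData.put("sysName", "barney");
        nativeData.put("laLoad1", "0.42");
        System.out.println(query("SELECT HostName, LoadLast1Min FROM Host", nativeData));
    }
}
```

Supporting a new kind of resource then means supplying a new mapping and a new native fetch, while clients keep issuing the same SQL.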
42GridRM GUI
Homogeneous view of the data sources
43GridRM Portal
- The GridRM Portal (gridrm.org) is a demonstration
of gateways, data sources, SQL and data
normalisation.
- An example of the use of GridRM, particularly its
ability to discover and utilise resource data.
- An example of a GridRM client which:
- Allows the use of GridRM with no knowledge of the
underlying technologies.
- Hides details like SQL, XML, etc.
- Provides an abstraction everyone can use
(clickerity click!).
44GridRM Portal
45International Testbed
46GridRM
47GridRM
48Summary
- Heterogeneous information returned from a diverse
range of possible data sources.
- Need to harvest data into a homogeneous form:
- Hide underlying complexity from clients.
- Provide data in a format that meets a client's
requirements.
- Combine legacy resources with modern cluster and
Grid information servers to provide:
- An over-arching grid information system.
- Independent of particular middleware and
services.
- GridRM promotes homogeneity through:
- JDBC-like data source driver,
- Standard SQL syntax,
- The GLUE naming schemas,
- Request translation and result normalisation.
49Future Work
- Provide an example of a job submission system
using GridRM, several options:
- Other schedulers, Condor, SGE,
- Further security:
- Integrate UK e-Science certificates for resource
access control.
- Secure interface for remote Gateway
administration.
- Performance and scalability testing.
- More translation schema for different resources:
- DBMS, telescope, surf conditions!
- Use of portlet technologies to provide a better
Web interface - GridSphere.
50jGMA
- An event-based messaging system
- http://dsg.port.ac.uk/projects/jGMA/
51jGMA
- Needed a lightweight implementation of the GGF
Grid Monitoring Architecture in Java for GridRM.
- There are others:
- R-GMA,
- pyGMA,
- Autopilot, MDS, NWS, CODE.
- Found that existing systems were heavyweight,
complex or not standalone.
- Decided to produce our own version.
- Aims:
- GMA compliant,
- Easy to install and use,
- Easy to program and extend,
- Java-based.
52jGMA Architecture
- GMA Compliance
- 21 features,
- GGF document is only a guide,
- It is very easy to claim to be compliant,
- For now jGMA is GMA like.
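For readers new to the Grid Monitoring Architecture, its three roles are a producer of events, a consumer of events, and a registry (directory) through which consumers find producers. The sketch below shows that interaction inside a single JVM. It is not the jGMA API: the class names, the topic string and the event text are invented, and a real implementation would exchange messages over the network.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Consumer;

// Hypothetical in-process sketch of the GMA roles; not the jGMA API.
class GmaSketch {

    // A producer pushes events to every consumer that has subscribed.
    static class Producer {
        final String name;
        private final List<Consumer<String>> subscribers = new ArrayList<>();

        Producer(String name) {
            this.name = name;
        }

        void subscribe(Consumer<String> consumer) {
            subscribers.add(consumer);
        }

        void publish(String event) {
            for (Consumer<String> subscriber : subscribers) {
                subscriber.accept(name + ": " + event);
            }
        }
    }

    // The registry only brokers discovery; events flow producer -> consumer directly.
    static class Registry {
        private final Map<String, Producer> byTopic = new HashMap<>();

        void register(String topic, Producer producer) {
            byTopic.put(topic, producer);
        }

        Producer lookup(String topic) {
            return byTopic.get(topic); // null if nothing is registered for the topic
        }
    }

    public static void main(String[] args) {
        Registry registry = new Registry();
        Producer cpu = new Producer("barney");
        registry.register("cpu.load", cpu);

        List<String> received = new ArrayList<>();
        registry.lookup("cpu.load").subscribe(received::add); // consumer finds the producer
        cpu.publish("load=0.42");
        System.out.println(received);
    }
}
```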
53jGMA Infrastructure
54jGMA Demo
55jGMA Demo
56jGMA Status and Future Work
- jGMA messaging API complete.
- Currently completing the virtual registry:
- Text file/MySQL interfaces complete,
- Implementing the P2P part.
- Testing implementation in June versus
NaradaBroker and R-GMA.
- jGMA v1 can be downloaded from
http://dsg.port.ac.uk/projects/jGMA/
- A couple of demos linked to the web page.
- Applying jGMA to GridRM, myGrid applications, and
eventually as an on-line gaming infrastructure.
57jGMA
58Semantic Logging
- Semantic Logging using RDF
- http://dsg.port.ac.uk/projects/UISB/
59UISB
- Had a desire to investigate Semantic Web
technologies for the purposes of unifying
Information Services (LDAP/MDS/LUS/UDDI); used
RDF as an information store.
- We can harvest and annotate IS data and store it
in a centralised RDF store.
- Initially funded by an IBM Innovation award;
used the Eclipse platform.
- Have developed all the components needed and
discovered a number of hurdles and issues.
60UISB
61UISB
62Semantic Logging
- UISB project has diverged.
- There was a keen interest to investigate the idea
that UISB components could be used to unify log
events from various sources and provide a better
overall source of information for analysis of the
behaviour of distributed systems and
applications.
- The idea is we harvest log data (events) from
the OS, executing middleware and applications
from a range of systems, store them in our RDF-based
repository, and then visualise ALL the various
events in the logs in order to better understand
the systems' overall behaviour.
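The harvesting idea above comes down to two steps: turn each log event into RDF statements, and bring events from different sources onto one timeline. The sketch below does both with plain strings in N-Triples syntax. It is only an illustration: the `http://example.org/log#` namespace, the predicate names and the `LogEvent` fields are invented, and a real system would use an RDF library and store rather than string building.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch: namespace, predicates and event fields are invented.
class SemanticLogSketch {

    static class LogEvent {
        final String source;   // e.g. "os", "middleware", "application"
        final long timestamp;  // milliseconds since the epoch
        final String message;

        LogEvent(String source, long timestamp, String message) {
            this.source = source;
            this.timestamp = timestamp;
            this.message = message;
        }
    }

    static final String NS = "http://example.org/log#"; // placeholder namespace

    // Express one event as three subject-predicate-object statements.
    static List<String> toTriples(String id, LogEvent e) {
        String subject = "<" + NS + id + ">";
        List<String> triples = new ArrayList<>();
        triples.add(subject + " <" + NS + "source> \"" + escape(e.source) + "\" .");
        triples.add(subject + " <" + NS + "timestamp> \"" + e.timestamp + "\" .");
        triples.add(subject + " <" + NS + "message> \"" + escape(e.message) + "\" .");
        return triples;
    }

    // Escape backslashes and double quotes only (enough for this sketch).
    static String escape(String s) {
        return s.replace("\\", "\\\\").replace("\"", "\\\"");
    }

    // Merge events from several logs into one timeline ordered by timestamp.
    static List<LogEvent> merge(List<List<LogEvent>> logs) {
        List<LogEvent> all = new ArrayList<>();
        for (List<LogEvent> log : logs) {
            all.addAll(log);
        }
        all.sort(Comparator.comparingLong((LogEvent e) -> e.timestamp));
        return all;
    }

    public static void main(String[] args) {
        List<LogEvent> osLog = new ArrayList<>();
        osLog.add(new LogEvent("os", 2000L, "disk nearly full"));
        List<LogEvent> appLog = new ArrayList<>();
        appLog.add(new LogEvent("application", 1000L, "job started"));

        List<List<LogEvent>> logs = new ArrayList<>();
        logs.add(osLog);
        logs.add(appLog);

        int n = 0;
        for (LogEvent e : merge(logs)) {
            for (String triple : toTriples("event" + (n++), e)) {
                System.out.println(triple);
            }
        }
    }
}
```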
63Semantic Logging RDF view!
64Semantic Logging A View of Events
65MPJ
- A Java-based message passing system
- http://dsg.port.ac.uk/projects/MPJ/
66Introduction
- A lot of interest in a Java messaging system:
- Wanted to produce a reference pure Java messaging
system that follows the MPJ API specification,
- Create an MPJ implementation which is the
counterpart of MPICH.
- What does a Java messaging system have to offer?
- Portability:
- Write once run anywhere.
- Object oriented programming concepts:
- Higher level of abstraction for parallel
programming,
- An extensive set of API libraries:
- Avoids reinventing the wheel.
- Multi-threaded language:
- Thread-safe.
- Automatic memory management.
- Popularity: first language and therefore good
for teaching MP as well.
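The message-passing style that MPJ targets can be shown in miniature using only the standard library. In the sketch below threads stand in for processes and in-memory queues stand in for the transport, so it illustrates ranked send and receive rather than the MPJ API itself. The class `MessagePassingSketch` and its `send`/`recv` methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: threads stand in for processes, queues for the transport.
// This is not the MPJ API.
class MessagePassingSketch {

    // One mailbox per rank.
    private final List<BlockingQueue<int[]>> mailboxes = new ArrayList<>();

    MessagePassingSketch(int size) {
        for (int rank = 0; rank < size; rank++) {
            mailboxes.add(new LinkedBlockingQueue<>());
        }
    }

    // Buffered send: copy the data into the destination rank's mailbox.
    void send(int[] data, int destination) {
        mailboxes.get(destination).add(data.clone());
    }

    // Blocking receive: wait until a message arrives for this rank.
    int[] recv(int rank) throws InterruptedException {
        return mailboxes.get(rank).take();
    }

    // Rank 0 sends an array to rank 1, which sends the sum back to rank 0.
    static int runDemo() throws InterruptedException {
        MessagePassingSketch comm = new MessagePassingSketch(2);
        Thread worker = new Thread(() -> {
            try {
                int sum = 0;
                for (int value : comm.recv(1)) {
                    sum += value;
                }
                comm.send(new int[] {sum}, 0);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        worker.start();
        comm.send(new int[] {1, 2, 3, 4}, 1);
        int result = comm.recv(0)[0];
        worker.join();
        return result;
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println("sum = " + runDemo()); // prints "sum = 10"
    }
}
```

Copying the buffer on send mimics the fact that separate processes do not share memory, which is the main discipline a message-passing program has to respect.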
67MPJ
68The New Design
69(No Transcript)
70MPJ Status
- MPJ API complete and being tested,
- MPJ runtime infrastructure being developed: want
something that works the same on UNIX/Linux and
Windows!
- Installation being looked at!
- Further devices (SHMEM).
- Release of beta version in June.
- Applications being ported.
71Other Projects
- Grid integration tests.
- Optimisation of complex distributed queries using
OGSA-DAI based on GIS/SDSS data.
- Investigation of NaradaBroker leading to the
development of a P2P file store.
- Portal work for JISC: development of a range of
JSR-168 compliant services.
72Summary
- The DSG is involved in a range of projects that
are developing middleware for clusters and the
Grid.
- Attempting to use generally accepted and widely
used standards; many of those being purported
today are ephemeral!
- Want to create relatively simple and easy to use
software, not trying to reinvent the wheel,
which seems a common happening.
73Shameless Plug
http://www.amazon.co.uk/exec/obidos/ASIN/0470094176/qid3D1113207878/202-7878523-7639008
74The End!
75References
- Ian Foster and Carl Kesselman (Editors), The
Grid: Blueprint for a New Computing
Infrastructure, Morgan Kaufmann
Publishers, 1st edition (November 1, 1998), ISBN
1558604758.
- CCA, http://www.extreme.indiana.edu/ccat/glossary.html
- IPG, http://www.ipg.nasa.gov/ipgflat/aboutipg/glossary.html
- I. Foster, C. Kesselman, and S. Tuecke, The
Anatomy of the Grid: Enabling Scalable Virtual
Organizations, International J. Supercomputer
Applications, 15(3), 2001.
76References
- Checklist, http://www.gridtoday.com/02/0722/100136.html
- IBM Grid Computing, http://www-1.ibm.com/grid/grid_literature.shtml
- FAFNER, http://www.npac.syr.edu/factoring.html
- I. Foster, J. Geisler, W. Nickless, W. Smith, S.
Tuecke, Software Infrastructure for the I-WAY
High Performance Distributed Computing
Experiment, in Proc. 5th IEEE Symposium on High
Performance Distributed Computing, pp. 562-571,
1997.
- LCG, http://lcg.web.cern.ch/LCG/
- WS-GAF, http://www.neresc.ac.uk/ws-gaf
- WS-I, http://www.ws-i.org
- WS-RF, http://www.globus.org/wsrf