Title: NorduGrid: the light-weight Grid solution
1. NorduGrid: the light-weight Grid solution
- LCSC 2003, Linköping, October 23, 2003
- Oxana Smirnova
2. Some facts
- NorduGrid is:
  - A Globus-based Grid middleware solution for Linux clusters
  - A large international 24/7 production-quality Grid facility
  - A resource routinely used by researchers since summer 2002
  - Freely available software
  - A project in development
- NorduGrid is NOT:
  - Derived from other Grid solutions (e.g. EU DataGrid)
  - An application-specific tool
  - A testbed anymore
  - A finalized solution
3. Some history
- Initiated by several Nordic universities: Copenhagen, Lund, Stockholm, Oslo, Bergen, Helsinki
- Started in January 2001
  - Initial budget: 2 years, 3 new positions
  - Initial goal: deploy EU DataGrid middleware to run the ATLAS Data Challenge
- Cooperation with EU DataGrid
  - Common Certification Authority and Virtual Organization tools, Globus2 configuration
  - Common applications (high-energy physics research)
- Switched from deployment to R&D in February 2002
  - Forced by the necessity to execute the ATLAS Data Challenges
  - Deployed a light-weight yet reliable and robust Grid solution in time for the ATLAS DC tests in May 2002
- Will continue for 4-5 more years (and more?)
  - Forms the North European Grid Federation together with the Dutch Grid, Belgium and Estonia
  - Will provide middleware for the Nordic Data Grid Facility, as well as for the Swedish Grid facility SWEGRID, the Danish Center for Grid Computing, Finnish Grid projects etc.
4. The resources
- Almost everything the Nordic academics can provide (ca. 1000 CPUs in total)
  - 4 dedicated test clusters (3-4 CPUs)
  - Some junkyard-class second-hand clusters (4 to 80 CPUs)
  - A few university production-class facilities (20 to 60 CPUs)
  - Two world-class clusters in Sweden, listed in the Top500 (238 and 398 CPUs)
- Other resources come and go
  - Canada, Japan: test set-ups
  - CERN, Dubna: clients
  - It is open so far: anybody can join or leave
  - The number of other installations is unknown
- People
  - The core team keeps growing
  - Local sysadmins are only called upon when users need an upgrade
5. A snapshot
6. NorduGrid specifics
- It is stable by design
  - The nervous system: a distributed yet stable Information System (Globus MDS 2.2 + patches)
  - The heart(s): the Grid Manager, the service to be installed on master nodes (based on Globus, replaces GRAM)
  - The brain(s): the User Interface, the client/broker that can be installed anywhere as a standalone module (makes use of Globus)
- It is light-weight, portable and non-invasive
  - Resource owners retain full control: the Grid Manager is effectively yet another user (with many faces, though)
  - Nothing has to be installed on worker nodes
  - No requirements w.r.t. OS, resource configuration, etc.
  - Clusters need not be dedicated
  - Runs on top of an existing Globus installation (e.g. VDT)
  - Works with any Linux flavor, Solaris, Tru64
- Strategy: start with something simple that works for users and add functionality gradually
7. How does it work?
- The Information System knows everything
  - Substantially re-worked and patched Globus MDS
  - Distributed and multi-rooted
  - Allows for a pseudo-mesh topology
  - No need for a centralized broker
- The server (Grid Manager) on each gatekeeper does most of the job
  - Pre- and post-stages files
  - Interacts with the LRMS
  - Keeps track of job status
  - Cleans up the mess
  - Sends mails to users
- The client (User Interface) does the brokering, Grid job submission, monitoring, termination, retrieval, cleaning etc.
  - Interprets the user's job task
  - Gets the testbed status from the information system
  - Forwards the task to the best Grid Manager
  - Does some file uploading, if requested
8. Information System
- Uses Globus MDS 2.2, an LDAP-based service (see the query sketch below)
  - Soft-state registration allows creation of any dynamic structure
  - Multi-rooted tree
  - GIIS caching is not used by the clients
  - Several patches and bug fixes are applied
- A new schema was developed to serve clusters
  - Clusters are expected to be fairly homogeneous
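Since MDS is plain LDAP underneath, the published information can be inspected with standard LDAP tools. The query below is a minimal sketch, assuming a front-end that publishes on the conventional MDS port 2135 under the base mds-vo-name=local,o=grid; the host name and the selected attributes are illustrative and depend on the deployed nordugrid schema.

  # Query one cluster's local information tree (sketch; host and attribute names assumed)
  ldapsearch -x -h grid.example.org -p 2135 \
      -b 'mds-vo-name=local,o=grid' \
      '(objectClass=nordugrid-cluster)' \
      nordugrid-cluster-name nordugrid-cluster-totalcpus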
9. Front-end and the Grid Manager
- The Grid Manager replaces the Globus GRAM, still using Globus Toolkit™ 2 libraries
- All transfers are made via GridFTP
- Added the possibility to pre- and post-stage files, optionally using Replica Catalog information
- Caching of pre-staged files is enabled
- Runtime environment support (a sketch of such an environment follows below)
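A runtime environment is essentially a site-provided shell script that prepares the requested application software before the job starts. The fragment below is only a hypothetical sketch of what an ATLAS-6.0.2 environment script might look like at one site; the installation path and variables are assumptions, not the actual ATLAS setup.

  #!/bin/sh
  # Hypothetical runtime environment script "ATLAS-6.0.2" (illustrative only)
  ATLAS_ROOT=/opt/atlas/6.0.2          # assumed site-local installation path
  export ATLAS_ROOT
  PATH=$ATLAS_ROOT/bin:$PATH           # make the experiment's executables visible to the job
  LD_LIBRARY_PATH=$ATLAS_ROOT/lib:$LD_LIBRARY_PATH
  export PATH LD_LIBRARY_PATH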
10. Summary of Grid services on the front-end machine
- GridFTP server
  - Plugin for job submission via a virtual directory
  - Conventional file access with Grid access control
- LDAP server for information services
- Grid Manager
11. The User Interface
- Provides a set of utilities to be invoked from the command line (listed below, with a usage sketch after the list)
- Contains a broker that polls the MDS and decides to which queue at which cluster a job should be submitted
  - The user must be authorized to use the cluster and the queue
  - The cluster's and queue's characteristics must match the requirements specified in the xRSL string (max CPU time, required free disk space, installed software etc.)
  - If the job requires a file that is registered in a Replica Catalog, the brokering gives priority to clusters where a copy of the file is already present
  - From all the queues that fulfil the criteria, one is chosen randomly, with a weight proportional to the number of free CPUs available for the user in each queue
  - If there are no available CPUs in any of the queues, the job is submitted to the queue with the lowest number of queued jobs per processor
ngsub to submit a task
ngstat to obtain the status of jobs and clusters
ngcat to display the stdout or stderr of a running job
ngget to retrieve the result from a finished job
ngkill to cancel a job request
ngclean to delete a job from a remote cluster
ngrenew to renew a user's proxy
ngsync to synchronize the local job info with the MDS
ngcopy to transfer files to, from and between clusters
ngremove to remove files
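A typical session with these utilities might look like the sketch below. The -f option of ngsub, the file name myjob.xrsl and the job ID format are assumptions for illustration and should be checked against the actual command documentation.

  grid-proxy-init                                    # obtain a Grid proxy (standard Globus utility)
  ngsub -f myjob.xrsl                                # submit the xRSL job description; a job ID is printed
  ngstat gsiftp://grid.example.org:2811/jobs/12345   # poll the status of the submitted job (ID assumed)
  ngget  gsiftp://grid.example.org:2811/jobs/12345   # retrieve the output once the job has finished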
12. Job description: extended Globus RSL (xRSL)
(&(executable="recon.gen.v5.NG")
  (arguments="dc1.002000.lumi02.01101.hlt.pythia_jet_17.zebra"
             "dc1.002000.lumi02.recon.007.01101.hlt.pythia_jet_17.eg7.602.ntuple"
             "eg7.602.job" "999")
  (stdout="dc1.002000.lumi02.recon.007.01101.hlt.pythia_jet_17.eg7.602.log")
  (stdlog="gridlog.txt")(join="yes")
  (|
    (&(|(cluster="farm.hep.lu.se")(cluster="lscf.nbi.dk")
        (cluster="seth.hpc2n.umu.se")(cluster="login-3.monolith.nsc.liu.se"))
      (inputfiles
        ("dc1.002000.lumi02.01101.hlt.pythia_jet_17.zebra"
         "rc://grid.uio.no/lc=dc1.lumi02.002000,rc=NorduGrid,dc=nordugrid,dc=org/zebra/dc1.002000.lumi02.01101.hlt.pythia_jet_17.zebra")
        ("recon.gen.v5.NG" "http://www.nordugrid.org/applications/dc1/recon/recon.gen.v5.NG.db")
        ("eg7.602.job" "http://www.nordugrid.org/applications/dc1/recon/eg7.602.job.db")
        ("noisedb.tgz" "http://www.nordugrid.org/applications/dc1/recon/noisedb.tgz")))
    (inputfiles
      ("dc1.002000.lumi02.01101.hlt.pythia_jet_17.zebra"
       "rc://grid.uio.no/lc=dc1.lumi02.002000,rc=NorduGrid,dc=nordugrid,dc=org/zebra/dc1.002000.lumi02.01101.hlt.pythia_jet_17.zebra")
      ("recon.gen.v5.NG" "http://www.nordugrid.org/applications/dc1/recon/recon.gen.v5.NG")
      ("eg7.602.job" "http://www.nordugrid.org/applications/dc1/recon/eg7.602.job")))
  (outputFiles
    ("dc1.002000.lumi02.recon.007.01101.hlt.pythia_jet_17.eg7.602.log"
     "rc://grid.uio.no/lc=dc1.lumi02.recon.002000,rc=NorduGrid,dc=nordugrid,dc=org/log/dc1.002000.lumi02.recon.007.01101.hlt.pythia_jet_17.eg7.602.log")
    ("histo.hbook"
     "rc://grid.uio.no/lc=dc1.lumi02.recon.002000,rc=NorduGrid,dc=nordugrid,dc=org/histo/dc1.002000.lumi02.recon.007.01101.hlt.pythia_jet_17.eg7.602.histo")
    ("dc1.002000.lumi02.recon.007.01101.hlt.pythia_jet_17.eg7.602.ntuple"
     "rc://grid.uio.no/lc=dc1.lumi02.recon.002000,rc=NorduGrid,dc=nordugrid,dc=org/ntuple/dc1.002000.lumi02.recon.007.01101.hlt.pythia_jet_17.eg7.602.ntuple"))
  (jobname="dc1.002000.lumi02.recon.007.01101.hlt.pythia_jet_17.eg7.602")
  (runTimeEnvironment="ATLAS-6.0.2")
  (CpuTime=1440)(Disk=3000)(ftpThreads=10))
13. Task flow
(Diagram: task flow between the User Interface / Resource Broker (UI/RB), the MDS information system, the Replica Catalog (RC), Storage Elements (SE), and the front-ends of Cluster A and Cluster B, each running a Gatekeeper/GridFTP service and the Grid Manager.)
14. Performance
- The main load: ATLAS Data Challenge 1 (DC1)
  - April 5th, 2002: first job submitted
  - May 10th, 2002: first pre-DC1-validation job
  - End of May 2002: it was clear that the system was mature enough to do and manage real production
- DC1, phase 1 (detector simulation)
  - Total number of jobs: 1300, ca. 24 hours of processing 2 GB of input each
  - Total output size: 762 GB
  - All files uploaded to Storage Elements and registered in the Replica Catalog
- DC1, phase 2 (pile-up of data)
  - Piling up the events above with a background signal
  - 1300 jobs, ca. 4 hours each
- DC1, phase 3 (reconstruction of signal)
  - 2150 jobs, 5-6 hours of processing 1 GB of input each
- Other applications
  - Calculations for string fragmentation models (Quantum Chromodynamics)
  - Quantum lattice model calculations (sustained load of 150 long jobs at any given moment for several days)
  - Particle physics analysis and modeling
- At peak production, up to 500 jobs were managed by NorduGrid at the same time
15. What is needed for installation
- A cluster, or even a single machine
- For a server:
  - Any Linux flavor (binary RPMs exist for RedHat and Mandrake, possibly also for Debian)
  - A local resource management system, e.g. PBS
  - A Globus installation (NorduGrid has its own distribution in a single RPM)
  - A host certificate (and user certificates)
  - Some open ports (depends on the cluster size)
  - One day to go through all the configuration details (a rough installation sketch follows this list)
- The owner always retains full control
  - Installing NorduGrid does not give automatic access to the resources
  - And the other way around
  - But with a bit of negotiation, one can get access to very considerable resources on a very good network
- The current stable release is 0.3.28; daily CVS snapshots are available
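As a rough illustration of the server route, the commands below sketch an RPM-based installation on a RedHat-style system; the package file names and certificate paths are assumptions, and the real packages and instructions should be taken from the NorduGrid download pages.

  # Hypothetical installation sketch (package names are illustrative)
  rpm -Uvh globus-*.rpm                # the single-RPM Globus distribution mentioned above
  rpm -Uvh nordugrid-*.rpm             # NorduGrid server packages (Grid Manager, GridFTP, info system)
  # Install the host certificate obtained from the Certification Authority (standard Globus location):
  cp hostcert.pem hostkey.pem /etc/grid-security/
  # Then work through the site-specific configuration (LRMS queues, storage, open ports).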
16. Summary
- The NorduGrid pre-release (currently 0.3.28) works reliably
- Release 1.0 is slowly but surely on its way; many fixes are still needed
- We welcome developers: much functionality is still missing, such as
  - Bookkeeping, accounting
  - Group- and role-based authorization
  - A scalable resource discovery and monitoring service
  - Interactive tasks
  - Integrated, scalable and reliable data management
  - Interfaces to other resource management systems
- We welcome new users and resources
  - The Nordic Data Grid Facility will provide support