NorduGrid: the light-weight Grid solution


1
NorduGrid: the light-weight Grid solution
  • LCSC 2003, Linköping, October 23, 2003
  • Oxana Smirnova

2
Some facts
  • NorduGrid is:
  • A Globus-based Grid middleware solution for Linux
    clusters
  • A large international 24/7 production quality
    Grid facility
  • A resource routinely used by researchers since
    summer 2002
  • Freely available software
  • A project in development
  • NorduGrid is NOT:
  • Derived from other Grid solutions (e.g. EU
    DataGrid)
  • An application-specific tool
  • A testbed anymore
  • A finalized solution

3
Some history
  • Initiated by several Nordic universities
  • Copenhagen, Lund, Stockholm, Oslo, Bergen,
    Helsinki
  • Started in January 2001
  • Initial budget: 2 years, 3 new positions
  • Initial goal: to deploy EU DataGrid middleware to
    run the ATLAS Data Challenge
  • Cooperation with EU DataGrid
  • Common Certification Authority and Virtual
    Organization tools, Globus2 configuration
  • Common applications (high-energy physics
    research)
  • Switched from deployment to R&D in February 2002
  • Forced by the necessity to execute ATLAS Data
    Challenges
  • Deployed a light-weight and yet reliable and
    robust Grid solution in time for the ATLAS DC
    tests in May 2002
  • Will continue for 4-5 years more (and more?..)
  • Form the North European Grid Federation
    together with the Dutch Grid, Belgium and Estonia
  • Will provide middleware for the Nordic Data Grid
    Facility
  • as well as for the Swedish Grid facility
    SWEGRID, Danish Center for Grid Computing,
    Finnish Grid projects etc

4
The resources
  • Almost everything the Nordic academics can
    provide (ca 1000 CPUs in total)
  • 4 dedicated test clusters (3-4 CPUs)
  • Some junkyard-class second-hand clusters (4 to 80
    CPUs)
  • A few university production-class facilities (20
    to 60 CPUs)
  • Two world-class clusters in Sweden, listed in
    Top500 (238 and 398 CPUs)
  • Other resources come and go
  • Canada, Japan: test set-ups
  • CERN, Dubna: clients
  • It is open so far; anybody can join or leave
  • Number of other installations unknown
  • People
  • the core team keeps growing
  • local sysadmins are only called up when users
    need an upgrade

5
A snapshot
6
NorduGrid specifics
  • It is stable by design
  • The nervous system: a distributed yet stable
    Information System (Globus MDS 2.2 + patches)
  • The heart(s): the Grid Manager, the service to be
    installed on master nodes (based on Globus,
    replaces GRAM)
  • The brain(s): the User Interface, the
    client/broker that can be installed anywhere as a
    standalone module (makes use of Globus)
  • It is light-weight, portable and non-invasive
  • Resource owners retain full control: the Grid
    Manager is effectively just another user (though
    with many faces)
  • Nothing has to be installed on worker nodes
  • No requirements w.r.t. OS, resource
    configuration, etc.
  • Clusters need not be dedicated
  • Runs on top of existing Globus installation (e.g.
    VDT)
  • Works with any Linux flavor, Solaris, Tru64
  • Strategy: start with something simple that works
    for users and add functionality gradually

7
How does it work
  • Information system knows everything
  • Substantially re-worked and patched Globus MDS
  • Distributed and multi-rooted
  • Allows for a pseudo-mesh topology
  • No need for a centralized broker
  • The server (Grid Manager) on each gatekeeper
    does most of the work
  • Pre- and post- stages files
  • Interacts with LRMS
  • Keeps track of job status
  • Cleans up the mess
  • Sends mails to users
  • The client (User Interface) does the brokering,
    Grid job submission, monitoring, termination,
    retrieval, cleaning, etc.
  • Interprets the user's job task
  • Gets the testbed status from the information
    system
  • Forwards the task to the best Grid Manager
  • Does some file uploading, if requested

8
Information System
  • Uses Globus MDS 2.2
  • Soft-state registration allows creation of any
    dynamic structure
  • Multi-rooted tree
  • GIIS caching is not used by the clients
  • Several patches and bug fixes are applied
  • A new schema was developed to describe clusters
  • Clusters are expected to be fairly homogeneous
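
Because the information system is plain LDAP, any LDAP client can query
it directly. Below is a minimal sketch in Python using the python-ldap
module; the host name is a placeholder, and the attribute names shown
(nordugrid-cluster-name, nordugrid-cluster-totalcpus) are examples to be
checked against the published NorduGrid schema.

  # Minimal sketch: query a NorduGrid information system (Globus MDS) over LDAP.
  # Host name and attribute names are examples only.
  import ldap

  con = ldap.initialize("ldap://grid.example.org:2135")  # 2135 is the usual MDS port
  con.simple_bind_s()                                    # anonymous bind
  entries = con.search_s(
      "mds-vo-name=local,o=grid",                        # standard MDS base DN
      ldap.SCOPE_SUBTREE,
      "(objectClass=nordugrid-cluster)",
      ["nordugrid-cluster-name", "nordugrid-cluster-totalcpus"])
  for dn, attrs in entries:
      print(dn, attrs)
  con.unbind_s()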

9
Front-end and the Grid Manager
  • Grid Manager replaces Globus GRAM, still using
    Globus Toolkit™ 2 libraries
  • All transfers are made via GridFTP
  • Added a possibility to pre- and post-stage files,
    optionally using Replica Catalog information
  • Caching of pre-staged files is enabled
  • Runtime environment support
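
The caching of pre-staged files can be pictured roughly as follows. This
is only a conceptual sketch, with assumed helper names, an assumed cache
directory, and plain HTTP downloads instead of GridFTP; it is not the
Grid Manager's actual code. The idea: an input is fetched once into a
shared cache and then linked into the job's session directory.

  # Conceptual sketch of cache-aware pre-staging (not the Grid Manager code).
  import hashlib, os, shutil, urllib.request

  CACHE_DIR = "/var/spool/nordugrid/cache"   # assumed location, for illustration only

  def prestage(url, session_dir, filename):
      key = hashlib.md5(url.encode()).hexdigest()   # cache key derived from the URL
      cached = os.path.join(CACHE_DIR, key)
      if not os.path.exists(cached):                # download only on a cache miss
          os.makedirs(CACHE_DIR, exist_ok=True)
          with urllib.request.urlopen(url) as src, open(cached, "wb") as dst:
              shutil.copyfileobj(src, dst)
      os.makedirs(session_dir, exist_ok=True)
      # link the cached copy into the job's session directory
      os.link(cached, os.path.join(session_dir, filename))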

10
Summary of Grid services on the front-end machine
  • GridFTP server
  • Plugin for job submission via a virtual directory
  • Conventional file access with Grid access control
  • LDAP server for information services
  • Grid Manager

11
The User Interface
  • Provides a set of utilities to be invoked from
    the command line
  • Contains a broker that polls MDS and decides to
    which queue at which cluster a job should be
    submitted
  • The user must be authorized to use the cluster
    and the queue
  • The cluster's and queue's characteristics must
    match the requirements specified in the xRSL
    string (max CPU time, required free disk space,
    installed software, etc.)
  • If the job requires a file that is registered in
    a Replica Catalog, the brokering gives priority
    to clusters where a copy of the file is already
    present
  • From all queues that fulfil the criteria, one is
    chosen randomly, with a weight proportional to
    the number of free CPUs available for the user in
    each queue (see the sketch below)
  • If there are no available CPUs in any of the
    queues, the job is submitted to the queue with
    the lowest number of queued jobs per processor
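
The queue selection described above can be summarised in a short Python
sketch. This is an illustration of the weighting scheme only, with a
simplified data model; it is not the actual User Interface broker code.
The candidates are assumed to be only those queues for which the user is
authorized and whose characteristics match the xRSL requirements.

  # Illustrative sketch of the brokering weights described above.
  import random
  from dataclasses import dataclass

  @dataclass
  class Queue:
      name: str
      free_cpus: int      # CPUs currently free for this user
      queued_jobs: int    # jobs waiting in this queue
      total_cpus: int

  def choose_queue(candidates):
      """Pick one of the queues that already match the xRSL requirements."""
      with_free = [q for q in candidates if q.free_cpus > 0]
      if with_free:
          # random choice weighted by the number of free CPUs available to the user
          return random.choices(with_free, weights=[q.free_cpus for q in with_free])[0]
      # otherwise: the queue with the lowest number of queued jobs per processor
      return min(candidates, key=lambda q: q.queued_jobs / q.total_cpus)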

ngsub to submit a task
ngstat to obtain the status of jobs and clusters
ngcat to display the stdout or stderr of a running job
ngget to retrieve the result from a finished job
ngkill to cancel a job request
ngclean to delete a job from a remote cluster
ngrenew to renew the user's proxy
ngsync to synchronize the local job info with the MDS
ngcopy to transfer files to, from and between clusters
ngremove to remove files
12
Job Description: extended Globus RSL (xRSL)

  (&(executable="recon.gen.v5.NG")
    (arguments="dc1.002000.lumi02.01101.hlt.pythia_jet_17.zebra"
     "dc1.002000.lumi02.recon.007.01101.hlt.pythia_jet_17.eg7.602.ntuple"
     "eg7.602.job" "999")
    (stdout="dc1.002000.lumi02.recon.007.01101.hlt.pythia_jet_17.eg7.602.log")
    (stdlog="gridlog.txt")(join="yes")
    (|(&(|(cluster="farm.hep.lu.se")(cluster="lscf.nbi.dk")
          (cluster="seth.hpc2n.umu.se")(cluster="login-3.monolith.nsc.liu.se"))
        (inputfiles=
         ("dc1.002000.lumi02.01101.hlt.pythia_jet_17.zebra"
          "rc://grid.uio.no/lc=dc1.lumi02.002000,rc=NorduGrid,dc=nordugrid,dc=org/zebra/dc1.002000.lumi02.01101.hlt.pythia_jet_17.zebra")
         ("recon.gen.v5.NG" "http://www.nordugrid.org/applications/dc1/recon/recon.gen.v5.NG.db")
         ("eg7.602.job" "http://www.nordugrid.org/applications/dc1/recon/eg7.602.job.db")
         ("noisedb.tgz" "http://www.nordugrid.org/applications/dc1/recon/noisedb.tgz")))
      (inputfiles=
       ("dc1.002000.lumi02.01101.hlt.pythia_jet_17.zebra"
        "rc://grid.uio.no/lc=dc1.lumi02.002000,rc=NorduGrid,dc=nordugrid,dc=org/zebra/dc1.002000.lumi02.01101.hlt.pythia_jet_17.zebra")
       ("recon.gen.v5.NG" "http://www.nordugrid.org/applications/dc1/recon/recon.gen.v5.NG")
       ("eg7.602.job" "http://www.nordugrid.org/applications/dc1/recon/eg7.602.job")))
    (outputFiles=
     ("dc1.002000.lumi02.recon.007.01101.hlt.pythia_jet_17.eg7.602.log"
      "rc://grid.uio.no/lc=dc1.lumi02.recon.002000,rc=NorduGrid,dc=nordugrid,dc=org/log/dc1.002000.lumi02.recon.007.01101.hlt.pythia_jet_17.eg7.602.log")
     ("histo.hbook"
      "rc://grid.uio.no/lc=dc1.lumi02.recon.002000,rc=NorduGrid,dc=nordugrid,dc=org/histo/dc1.002000.lumi02.recon.007.01101.hlt.pythia_jet_17.eg7.602.histo")
     ("dc1.002000.lumi02.recon.007.01101.hlt.pythia_jet_17.eg7.602.ntuple"
      "rc://grid.uio.no/lc=dc1.lumi02.recon.002000,rc=NorduGrid,dc=nordugrid,dc=org/ntuple/dc1.002000.lumi02.recon.007.01101.hlt.pythia_jet_17.eg7.602.ntuple"))
    (jobname="dc1.002000.lumi02.recon.007.01101.hlt.pythia_jet_17.eg7.602")
    (runTimeEnvironment="ATLAS-6.0.2")
    (CpuTime=1440)(Disk=3000)(ftpThreads=10))

13
Task flow
[Diagram: job flow from the User Interface / Resource Broker (UI+RB) via
the MDS information system and the Replica Catalog (RC) to the
gatekeeper/GridFTP front-ends with Grid Managers on clusters A and B,
and their Storage Elements (SE)]
14
Performance
  • The main load: ATLAS Data Challenge 1 (DC1)
  • April 5, 2002: first job submitted
  • May 10, 2002: first pre-DC1 validation job
  • End of May 2002: it became clear that the system
    was mature enough to run and manage real production
  • DC1, phase1 (detector simulation)
  • Total number of jobs: 1300, ca. 24 hours of
    processing and 2 GB of input each
  • Total output size: 762 GB
  • All files uploaded to Storage Elements and
    registered in the Replica Catalog.
  • DC1, phase2 (pile-up of data)
  • Piling up the events above with a background
    signal
  • 1300 jobs, ca. 4 hours each
  • DC1, phase3 (reconstruction of signal)
  • 2150 jobs, 5-6 hours of processing and 1 GB of
    input each
  • Other applications
  • Calculations for string fragmentation models
    (Quantum Chromodynamics)
  • Quantum lattice models calculations (sustained
    load of 150 long jobs at any given moment for
    several days)
  • Particle physics analysis and modeling
  • At peak production, up to 500 jobs were managed
    by NorduGrid at the same time

15
What is needed for installation
  • A cluster or even a single machine
  • For a server
  • Any Linux flavor (binary RPMs exist for RedHat
    and Mandrake, ev. for Debian)
  • A local resource management system, e.g., PBS
  • Globus installation (NorduGrid has its own
    distribution in a single RPM)
  • Host certificate (and user certificates)
  • Some open ports (depends on the cluster size)
  • One day to go through all the configuration
    details
  • The owner always retains full control
  • Installing NorduGrid does not give automatic
    access to the resources
  • And the other way around
  • But with a bit of negotiation, one can get
    access to very considerable resources on a very
    good network
  • Current stable release is 0.3.28; daily CVS
    snapshots are available

16
Summary
  • NorduGrid pre-release (currently 0.3.28) works
    reliably
  • Release 1.0 is slowly but surely on its way; many
    fixes are still needed
  • We welcome developers: much functionality is
    still missing, such as
  • Bookkeeping, accounting
  • Group- and role-based authorization
  • Scalable resource discovery and monitoring
    service
  • Interactive tasks
  • Integrated, scalable and reliable data management
  • Interfaces to other resource management systems
  • We welcome new users and resources
  • Nordic Data Grid Facility will provide support