1
OxGrid, a campus grid environment for Oxford
  • Dr. David Wallom
  • Technical Manager

2
Outline
  • Aims of OxGrid
  • How have we made OxGrid?
  • Central Systems
  • Software
  • Resources
  • Users
  • The future direction of the project

3
Aim
  • To develop and deploy Grid technology to
  • increase utilisation of current and future
    university resources
  • substantially increase the research computing
    power available
  • capitalise on our status as a core node of the
    NGS and evangelise usage of grids
  • lead the international best-practice effort in
    running production grids
  • provide user authentication through either UK
    e-Science certificates or the university single
    sign-on system

4
OxGrid, a University Campus Grid
  • Single entry point for users to shared and
    dedicated resources
  • Seamless access to NGS and OSC for registered
    users

5
OxGrid Central System
  • Resource Broker
  • Distribution of submitted tasks
  • User access point
  • Information Service
  • Central repository of system status information
  • Virtual Organisation Management and Resource
    Usage Service
  • User/resource control within the grid
  • Records and analyses accounting information so
    that usage of the full system, as well as of
    individual resources, can be tracked
  • Systems monitoring
  • Monitoring system for support, providing first
    point of user contact
  • Storage
  • Dynamic virtual file system independent of rest
    of the system

6
Grid Middleware
  • Virtual Data Toolkit
  • Contains
  • Globus Toolkit version 2.4 with several
    enhancements
  • GSI enhanced OpenSSH
  • MyProxy client and server
  • Has a defined support structure

7
Resource Broker
  • Built on top of Condor-G
  • Allows treatment of a Globus (i.e. remote)
    resource as a local resource
  • Command-line tools available to perform job
    management (submit, query, cancel, etc.) with
    detailed logging
  • Simple job submission language which is
    translated into remote scheduler specific
    language
  • Custom script to determine resource status and
    priority
  • Integrates the Condor resource-description
    mechanism with the Globus Monitoring and
    Discovery Service (MDS)
  • Automated resource discovery
  • Underlying capability matched against the system
    database, including information such as installed
    software and the system-load high watermark
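The "simple job submission language translated into remote scheduler specific language" above is what Condor-G's classic "globus" universe provides. As a sketch of what the broker emits per job (hostnames, file names, and the helper itself are invented for illustration; the field names are standard Condor-G ones):

```python
def condor_g_submit(executable, gatekeeper, arguments="", rsl=""):
    """Build a Condor-G submit description targeting a Globus GRAM
    gatekeeper (the classic 'globus' universe). Illustrative sketch:
    hostnames and output file names are invented."""
    lines = [
        "universe = globus",
        f"globusscheduler = {gatekeeper}",
        f"executable = {executable}",
    ]
    if arguments:
        lines.append(f"arguments = {arguments}")
    if rsl:
        # Extra Globus RSL, e.g. requesting an MPI job
        lines.append(f"globusrsl = {rsl}")
    lines += ["output = job.out", "error = job.err", "log = job.log", "queue"]
    return "\n".join(lines)

print(condor_g_submit("sim.x", "ngs.ox.ac.uk/jobmanager-pbs",
                      rsl="(jobType=mpi)(count=8)"))
```

Condor-G then handles the GRAM submission, logging, and job management (`condor_q`, `condor_rm`) as for a local job.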

8
Resource Broker Operation
9
Currently registered systems
10
Job Submission Script
  • Separates users from the underlying Condor system
  • Requires the following information to submit a
    task
  • Executable name,
  • Transfer exe (y/n)
  • Command line arguments to exe
  • Input files
  • Output files
  • Necessary installed software
  • Extra Globus RSL parameters e.g. MPI Job and
    number of concurrent processes
  • When a job is submitted, the script contacts the
    system database to retrieve the list of systems
    the user has permission to use
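The permission lookup in the last step can be sketched as follows. The database schema here is entirely hypothetical (a dict standing in for the real system database); the point is the filter on user permissions and required installed software:

```python
def permitted_targets(user, required_software, db):
    """Return the systems the user may use that also have the job's
    required software installed. 'db' is a stand-in for the real
    system database; its layout is assumed, not documented."""
    return [name for name, info in db.items()
            if user in info["users"]
            and required_software <= set(info["software"])]

# Illustrative fake database contents
db = {
    "oucs-condor": {"users": {"alice"}, "software": ["R"]},
    "ngs-oxford":  {"users": {"alice", "bob"}, "software": ["R", "mpi"]},
}
print(permitted_targets("alice", {"mpi"}, db))  # ['ngs-oxford']
```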

11
Additional User Tools
  • oxgrid_certificate_import
  • Simplifies the installation of a user digital
    certificate to a single command
  • oxgrid_q
  • Displays the user's current queue at the resource
    broker, with an option to show the full task
    queue
  • oxgrid_status
  • Displays the resources available to the user,
    with an option to list all resources currently
    registering with the resource broker
  • oxgrid_cleanup
  • Removes either a single submitted process or a
    range of child processes with their master

12
Virtual Organisation Management
  • Globus uses a mapping from certificate
    Distinguished Names (DNs) to local usernames on
    each resource
  • For each resource a user expects to use, their DN
    must be mapped locally
  • Make sure the correct resources are registered
  • OxVOM
  • Postgres database, web server, CGI scripts
  • Custom in-house designed Web based user interface
  • Persistent information stored in relational
    databases
  • User DN lists are retrieved by remote resources
    from an LDAP database using standard tools
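The mapping each resource ultimately installs is a Globus grid-mapfile: one quoted DN plus a local account name per line. A minimal sketch of rendering the retrieved DN list into that format (the DN and pool-account name are invented):

```python
def write_grid_mapfile(mappings):
    """Render DN -> local-username pairs in Globus grid-mapfile
    format: a quoted certificate DN followed by the local account
    name, one mapping per line."""
    return "\n".join(f'"{dn}" {user}' for dn, user in mappings)

# Illustrative entry: DN and pool account are made up
entries = [
    ("/C=UK/O=eScience/OU=Oxford/CN=alice example", "grid001"),
]
print(write_grid_mapfile(entries))
```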

13
OxVOM
14
Resource Usage Service and Accounting
  • Jobmanagers altered to include commands to
    determine job start and stop time as well as
    interface with host scheduling system
  • Information returned from client to RUS server
    when job completed and stored in relational
    database
  • Stored information for a single job includes
  • Start and end times
  • Execution host and scheduler
  • CPU and wallclock time
  • Memory used
  • Resource-owner-controlled cost variable
  • Lets owners tune usage from the campus grid
  • Version 2 will use GGF Usage Record standard
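The per-job record and relational store described above can be sketched with an in-memory SQLite table. Table and column names are illustrative, not the RUS schema; the fields mirror the list above:

```python
import sqlite3

# Minimal sketch of the RUS store: one row per completed job.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE usage_record (
    job_id TEXT, start_time TEXT, end_time TEXT,
    exec_host TEXT, scheduler TEXT,
    cpu_time REAL, wall_time REAL, memory_kb INTEGER, cost REAL)""")

# The client would report these values when the job completes
conn.execute("INSERT INTO usage_record VALUES (?,?,?,?,?,?,?,?,?)",
             ("job-42", "2006-11-01T10:00", "2006-11-01T11:00",
              "node07.oucs", "condor", 3400.0, 3600.0, 512000, 1.0))

# Whole-system accounting is then a simple aggregate query
total_wall, = conn.execute(
    "SELECT SUM(wall_time) FROM usage_record").fetchone()
print(total_wall)  # 3600.0
```

Aggregating per execution host instead (`GROUP BY exec_host`) gives the single-resource view; this is what makes both whole-grid and per-resource reporting possible from the same store.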

15
Resource Usage Service
  • Enables presentation of system use to users as
    well as system owners
  • Can form the basis of a charging model

16
Overall system interactions with the VOM Database
[Diagram: the VOM web interface (user/system addition, removal, and
listing) and the resource broker (user job submission, user queue
query, system listing query) both connect to the central database;
remote systems connect for user task RUS information and system
accounting, local systems via the resource broker, and users through
the command-line interface.]
17
Core Resources
  • Available to all users of the campus grid
  • Individual Departmental Clusters (PBS, SGE)
  • Grid software interfaces installed
  • Management of users through pool accounts or
    manual account creation.
  • Clusters of PCs
  • Running Condor/SGE
  • Single master running up to 500 nodes
  • Masters run either by owners or OeRC
  • Execution environment on a second OS (Linux),
    Windows, or a virtual machine

18
External Resources
  • Only accessible to users that have registered
    with them
  • National Grid Service
  • Peered access with individual systems
  • OSC
  • Gatekeeper system
  • User management done through standard account
    issuing procedures and manual DN mapping
  • Controlled grid submission to Oxford
    Supercomputing Centre
  • Some departmental resources
  • Used as method to bring new resources initially
    online
  • Show the benefits of joining the grid
  • Limited access to resources donated by other
    departments, to maintain the incentive to become
    full participants

19
Services necessary to connect to OxGrid
  • For a system to connect to OxGrid
  • Must support a minimum software set (without
    which it is impossible to submit jobs from the
    Resource Broker)
  • GT2 GRAM with the RUS-modified jobmanager
  • MDS compatible information provider
  • Desirable though not mandated
  • OxVOM compatible grid-mapfile installation
    scripts
  • Scheduling system to give fair-share to users of
    the resource
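The checklist above amounts to a simple admission test when a new system offers to join. A sketch, with made-up service labels standing in for the components listed (the real check is administrative, not automated, as far as the talk says):

```python
# Labels are illustrative names for the components listed above.
REQUIRED = {"gt2-gram", "rus-jobmanager", "mds-info-provider"}
DESIRABLE = {"oxvom-gridmap-scripts", "fair-share-scheduler"}

def connection_status(services):
    """Classify an offered service set against the OxGrid minimum.
    Returns ('rejected', missing-required) or
    ('accepted', missing-desirable)."""
    missing = REQUIRED - services
    if missing:
        return ("rejected", sorted(missing))
    return ("accepted", sorted(DESIRABLE - services))

print(connection_status({"gt2-gram", "mds-info-provider"}))
# ('rejected', ['rus-jobmanager'])
```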

20
Environmental Condor
  • Cost and environmental considerations of using
    spare resources (£7000/yr for the OUCS Condor
    pool)
  • New daemon for Condor
  • System starts and stops registering systems
    depending on currently queued tasks.
  • Currently only works with Linux Condor systems
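The daemon's policy can be sketched as follows. The talk only says it "starts and stops registering systems depending on currently queued tasks"; the exact rule below (register one machine per queued task, up to the pool size) is an assumption for illustration:

```python
def machines_to_register(queued_jobs, registered, max_machines, headroom=0):
    """Sketch of the environmental daemon's policy: keep only as many
    execute machines registered as there are queued tasks (plus
    optional headroom). Returns the change to apply: positive means
    start/register machines, negative means stop/deregister them.
    The policy details are assumed, not taken from the talk."""
    wanted = min(queued_jobs + headroom, max_machines)
    return wanted - registered

print(machines_to_register(queued_jobs=12, registered=5, max_machines=250))  # 7
print(machines_to_register(queued_jobs=0, registered=5, max_machines=250))   # -5
```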

21
Current Compute System Layout
  • Central management services running on single
    server
  • Current resources
  • All Users
  • OUCS Linux Pool (Condor, 250 CPU)
  • Oxford NGS node (PBS, 128 CPU)
  • Condensed Matter Physics (Condor, 10 CPU)
  • Theoretical Physics (SGE, 14 CPU)
  • OeRC Cluster (SGE, 5 CPU)
  • High Energy Physics (LCG-PBS, 120 CPU), not
    registering with RB
  • Registered users
  • OSC (Zuse, 40 CPU)
  • NGS (all nodes, 342 CPU)
  • Biochemistry (SGE, 30 CPU)
  • Materials Science (SGE, 20 CPU)

22
Planned System Additions
  • Physics Mac teaching laboratory (end Nov)
  • OUCS Mac systems (end Nov, have agreement just
    need time!)
  • Humanities cluster (Nov)
  • Statistics cluster (end Dec)
  • Biochemistry remaining two clusters (end Dec)
  • OSC SRIF3 cluster tranche 1 (2007)
  • Chemistry clusters (contacted department)
  • NGS2 All resources (2007)

23
Data Management
  • Engage data users as well as computational users
  • Provide a remote store for those groups that
    cannot resource their own
  • Distribute the client software as widely as
    possible to departments that are not currently
    engaged in e-Research

24
Data Management
  • Two possible candidates for creation of system
  • Storage Resource Broker to create large virtual
    datastore
  • Through a central metadata catalogue, users see a
    single virtual file system, though the physical
    volumes may be on several networked resources
  • In-built metadata capability
  • Disk Pool Manager
  • Similar virtual disk presentation,
  • Internationally recognised using SRM standard
    interface,
  • No metadata capability,
  • Integrated easily with VO server.

25
Supporting OxGrid
  • First point of contact is the OUCS Helpdesk,
    through the support email
  • Given a preset list of questions to ask and log
    files to request, if available
  • Not expected to do any actual debugging
  • Problems are passed on to Grid experts, who then
    pass them, on a system-by-system basis, to each
    system's own maintenance staff
  • Significant cluster support expertise within the
    OeRC
  • As one of the UK e-Science Centres we also have
    access to the Grid Operations and Support Centre.

26
Users
  • Focused on users with serial computation
    problems, individual researchers
  • Statistics (1 user)
  • Materials Science (3 users)
  • Inorganic Chemistry (3 users)
  • Theoretical Chemistry (4 users)
  • Biochemistry (8 users)
  • Computational Biology (2 users)
  • Condensed Matter Physics (2 users)
  • Quantum Computational Physics (1 user)

27
User Code Porting
  • User forwards to the OeRC code that operates on
    either a single node or a cluster
  • Design a wrapper script
  • Creates scratch directory in which all operations
    occur
  • formats configuration information for each child
    process from main configuration
  • Creates execution script and zip file for
    remote execution
  • Submits child process onto grid
  • Waits until all child processes have completed
  • Collates results and archives temp files etc.
  • Deposits scratch directory into SRB repository
  • (Can remove the scratch directory from the
    Resource Broker if asked)
  • Hand the code back to the user: an example of a
    real computational task they wanted done, and a
    possible basis for further code porting by
    themselves
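The wrapper-script steps above can be sketched as a Python skeleton. The callables are placeholders: `submit` stands in for the grid submission script (returning a handle with a `wait()` method), `collate` for the user's result merger, and the SRB deposit step is left to the caller; none of these names are from the talk:

```python
import os
import tempfile

def run_farm(split_config, submit, collate, n_children):
    """Skeleton of the code-porting wrapper. 'split_config' yields one
    configuration string per child process; 'submit' and 'collate'
    are hypothetical stand-ins for the real submission and collation
    steps."""
    scratch = tempfile.mkdtemp(prefix="oxgrid_")   # scratch dir for all work
    jobs = []
    for i, cfg in enumerate(split_config(n_children)):
        child_dir = os.path.join(scratch, f"child{i}")
        os.makedirs(child_dir)
        with open(os.path.join(child_dir, "config"), "w") as f:
            f.write(cfg)                           # per-child configuration
        jobs.append(submit(child_dir))             # package and submit child
    for job in jobs:
        job.wait()                                 # wait for all children
    collate(scratch)                               # collate and archive results
    return scratch                                 # caller deposits into SRB
```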

28
OxGrid, Users
Simulation of the quantum dynamics of correlated
electrons in a laser field: "OxGrid/NGS made
serious computational power easily available and
was crucial for making the simulation algorithm
work." Dr Dmitrii Shalashilin (Theoretical
Chemistry)

Orbitals and Electron Charge Distribution in
Boron Nitride Nanostructures. Dr Amanda Barnard
(Materials Science)

Molecular evolution of a large antigen gene
family in African trypanosomes: "OxGrid has been
key to my research and has allowed me to complete
within a few weeks calculations which would have
taken months to run on my desktop." Dr Jay Taylor
(Statistics)
29
Problems
  • Sociological
  • Getting academics to share resources
  • IT officers in departments and colleges
  • Technical
  • Minimal firewall problems
  • Information servers
  • OS Versions
  • Programming languages
  • Time

30
The Future
  • Improve central service software
  • RB usage algorithm
  • Remove central information server
  • Resource broker querying individual remote
    systems is actually more efficient
  • Update Condor-G to latest version to allow
    seamless transition from Pre-WS to WS based
    middleware
  • Design and construct user training courses

31
The Future, 2
  • Develop Windows/Linux Condor pools so that all
    shared systems can be included
  • Develop an experimental system to harvest spare
    disk space, to ensure complete ROI on shared
    systems
  • Connect MS Windows Cluster system
  • Package central server modules for public
    distribution
  • Already running on systems in Porto and Barcelona
    universities as well as Monash University
  • Continue contacting users to expand the user base

32
Conclusions
  • Users are already able to log onto the Resource
    Broker and schedule work onto the NGS and OUCS
    Condor Systems
  • Working as quickly as possible to engage more
    users
  • These users will encourage their local system
    owners (in departments and colleges) to donate
    resources!
  • Need these users to then go out and evangelise.

33
Thanks
  • Co-Designer of parts of the system
  • Jon Wakelin (CeRB)
  • Oxford Sys Administrators
  • Ian Atkin (OUCS)
  • Jon Lockley (OSC)
  • Steven Young (Ox NGS)
  • Users
  • Amanda Barnard (Materials Science)
  • Dr Jay Taylor (Statistics)
  • Dr Dmitry Shalashilin (Theoretical Chemistry)

34
Contact
  • Email david.wallom@oerc.ox.ac.uk
  • Telephone 01865 283378