Digital Sherpa: Custom Grid Applications on the TeraGrid and Beyond

1
Digital Sherpa: Custom Grid Applications on the
TeraGrid and Beyond
  • GGF18 / GridWorld 2006
  • Ronald C. Price, Victor E. Bazterra, Wayne
    Bradford, Julio C. Facelli
  • Center for High Performance Computing at the
    University of Utah
  • Partially funded by NSF ITR award 0326027

2
First Things First
  • HAPPY BIRTHDAY GLOBUS!!!

3
Roles / Acknowledgments
  • Ron: Grid Architect and Software Engineer
  • Victor: Research Scientist / Grid Researcher,
    user of many HPC resources
  • Wayne: Grid Sys Admin
  • Julio: Director
  • Globus mailing list, and especially the Globus
    Alliance
  • The entire Center for High Performance Computing
    (University of Utah) staff

4
Overview
  • Problem & Solution
  • general problem
  • Solution
  • traditional approaches
  • Past
  • sys admin caveats (briefly)
  • concepts and implementation
  • Present
  • examples
  • applications
  • Future
  • applications
  • features

5
General Problem & Solution
  • General Problem
  • Many High Performance Computing (HPC) scientific
    projects require a large number of loosely coupled
    executions on numerous HPC resources, which cannot
    be managed manually.
  • Solution (Digital Sherpa)
  • Distribute the jobs of HPC scientific
    applications across a grid, allowing access to
    more resources, with automatic staging, job
    submission, monitoring, fault recovery, and
    efficiency improvement.

6
Traditional Approach: "babysitter" scripts
  • Babysitter scripts are common, but in general
    they have some problems:
  • not scalable (written to work with a specific
    scheduler)
  • hard to maintain (typically a hack)
  • not portable (system specific)

7
Digital Sherpa Perspective
  • A different perspective
  • Schedulers: system-oriented perspective
  • Many jobs on one HPC resource; the user doesn't
    have control
  • Sherpa: user-oriented perspective
  • Many jobs on many resources; the user has control

8
Digital Sherpa In General
  • Digital Sherpa is a grid application for
    executing HPC applications across many
    grid-enabled HPC resources.
  • It automates non-scalable tasks such as staging,
    job submission, and monitoring, and includes
    recovery features such as resubmission of failed
    jobs.
  • The goal is to allow any HPC application to
    easily interoperate with Digital Sherpa and become
    a custom grid application.
  • Distributing jobs across HPC resources
    increases the amount of computing resources that
    can be accessed at a given time.
  • Digital Sherpa has been used successfully on the
    TeraGrid, and many more applications of it are in
    progress.

9
So, what is Digital Sherpa?
  • Naming convention for the rest of the slides:
    Digital Sherpa = Sherpa
  • Sherpa is a multi-threaded custom extension of
    the GT4 WS-GRAM client (see the sketch after this
    slide).
  • Sherpa has been designed and planned to be
    scalable, maintainable, and usable directly by
    people or by other applications.
  • It is based on the Web Services Resource Framework
    (WSRF) and is implemented in Java 1.5 using
    the Globus Toolkit 4.0 (GT4).
  • Sherpa can perform a complete HPC
    submission (stage data in, run/monitor a PBS job,
    stage data out, automatically restart failed jobs,
    and improve efficiency).
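
As a rough illustration of the handler-per-job design described above, the following sketch uses Java 5's java.util.concurrent to start one handler thread per WS-GRAM job description file, mirroring the "Handler 1/2/3" console output shown later in these slides. The SherpaHandler class and its submitAndBabysit() method are hypothetical placeholders; Sherpa's real classes and the actual GT4 WS-GRAM client calls are not shown in the presentation.

import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch only: one handler per WS-GRAM job description file.
public class SherpaSketch {

    static class SherpaHandler implements Callable<Boolean> {
        private final int id;
        private final String jobDescriptionFile;

        SherpaHandler(int id, String jobDescriptionFile) {
            this.id = id;
            this.jobDescriptionFile = jobDescriptionFile;
        }

        public Boolean call() {
            System.out.println("Handler " + id + " Starting..." + jobDescriptionFile);
            // Placeholder for the real work: stage data in, submit the WS-GRAM
            // job, monitor its state, stage data out, clean up, and resubmit
            // on failure. The GT4 client logic is elided.
            boolean ok = submitAndBabysit(jobDescriptionFile);
            System.out.println("Handler " + id + (ok ? " Complete." : " Failed"));
            return ok;
        }

        private boolean submitAndBabysit(String file) {
            return true; // stub
        }
    }

    public static void main(String[] args) {
        if (args.length == 0) {
            System.err.println("usage: Sherpa <job.xml> [more job files...]");
            return;
        }
        // One handler thread per job description passed on the command line.
        ExecutorService pool = Executors.newFixedThreadPool(args.length);
        for (int i = 0; i < args.length; i++) {
            System.out.println("Starting job in " + args[i]);
            pool.submit(new SherpaHandler(i + 1, args[i]));
        }
        pool.shutdown(); // accept no new work; handlers run to completion
    }
}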

10
Why the name Sherpa?
  • Digital Sherpa takes its name from the Sherpa
    people, who are known for their great mountaineering
    skills in the Himalayas as expert route finders and
    porters. Like a sherpa, Digital Sherpa can:
  • find the route for you (find an HPC resource for
    your needs; future feature)
  • carry gear in for you (stage data in)
  • climb to the top (execute the job and restart it
    if necessary)
  • and carry gear out for you (stage data out).

11
Benefits and Significance
  • Benefits
  • Automation of login, data stage-in and stage-out,
    job submission, monitoring, automatic restart if
    a job fails, and efficiency improvement
  • Distribute your jobs across various HPC resources
    to increase the amount of resources that can be
    used at a time
  • Reduction of queue wait time by submitting jobs
    to several queues, resulting in increased
    efficiency
  • Load balancing from increased granularity
  • Can be called from a separate application
  • Significance
  • Automates the flow of large numbers of jobs within
    grid environments
  • Increases throughput of HPC scientific
    applications

12
Globus Toolkit 4
  • The Globus Toolkit is an open source software
    toolkit used for building Grid systems and
    applications
  • Globus Toolkit 4.0.x (GT4) is the most recent
    release
  • GT4 is best thought of as a Grid Development Kit
    (GDK)
  • GT4 has four main components
  • Grid Security Infrastructure (GSI)
  • Reliable File Transfer (RFT)
  • Web Services - Monitoring and Discovery Service
    (WS-MDS)
  • Web Services - Grid Resource Allocation and
    Management (WS-GRAM)

13
Sherpa Requirements
  • Globus Toolkit 4
  • Dependent GT4 Components
  • WS-GRAM (Execution Management)
  • RFT (Data Management)
  • Java 1.5

14
Past Sys Admin Caveats
  • Did a lot of initial testing and configuration
  • Build notes:
  • http://wiki.chpc.utah.edu/index.php/System_Administration_and_GT4_An_Addendum_to_the_Globus_Alliance_Quick_Start_Guide
  • GT 4.0.2 doesn't require Postgres config

15
Motivations for Creating Sherpa
  • Reasons for creating Digital Sherpa (motivations):
  • Allow scientists to be scientists in their own
    fields; don't force them to become computer
    scientists
  • Eliminate error-prone, time-consuming, non-scalable
    tasks of job submission, monitoring, and data
    staging
  • Allow easy access to more resources
  • Reduce total queue time
  • Increase efficiency

16
Before Sherpa: BabySitter
  • BabySitter
  • before GT4
  • Conceptual details of BabySitter
  • Resource manager and handler
  • Proprietary states similar to the external states
    of the managed job services in WS-GRAM
  • Not a general solution, scheduler specific
  • Took GT4 into the lab as it became available

17
Sherpa Conceptually: Past and Present States
  • Past
  • Null, Idle, Running, Done
  • Realized the Globus Alliance had already defined
    the states as GT4 was finalized
  • Present: external states of the managed job
    services in WS-GRAM (modeled in the sketch after
    this slide)
  • Unsubmitted, StageIn, Pending, Active, Suspended,
    StageOut, CleanUp, Done, Failed
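
As an illustration only, the external WS-GRAM states listed above could be modeled as a simple Java 5 enum. The state names come straight from the slide; the enum and its helper methods are a hypothetical sketch, not Sherpa's or GT4's actual type (GT4 exposes these states through its own generated state enumeration).

// Hypothetical sketch of the external WS-GRAM job states a Sherpa handler tracks.
public enum JobState {
    UNSUBMITTED, STAGE_IN, PENDING, ACTIVE, SUSPENDED,
    STAGE_OUT, CLEAN_UP, DONE, FAILED;

    // A handler mainly needs to know when the job has reached a final state...
    public boolean isTerminal() {
        return this == DONE || this == FAILED;
    }

    // ...and whether that final state should trigger an automatic resubmission.
    public boolean needsResubmit() {
        return this == FAILED;
    }
}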

18
Digital Sherpa Implementation: Choice of API,
Past and Present
  • Past: babysitter
  • Java app using J2SSH to log in to an HPC resource
    and then query the output from the scheduler
  • Present: GT4 GDK
  • WS-GRAM API
  • When I wrote the Sherpa code, JavaCOG and GAT did
    not work with GT4, and I needed GT4
  • WS-GRAM hides scheduler-specific complexities

19
The BLAH Example Test Jobs
  • A test case for Sherpa: _blah.xml corresponds
    to _blah.out and _blahblah.xml corresponds
    to blahblah.out
  • Stage In
  • local blahsrc.txt → remote RFT server blah.txt
  • Run
  • /bin/more blah.txt (stdout to blahtemp.out)
  • Stage Out
  • remote RFT server blahtemp.out → local blah.out
  • Clean Up
  • deletes blahtemp.out at the remote HPC resource

20
Sherpa Input File
  • Made use of the WS-GRAM XML schema
  • Example: argonne_blah.xml (a sketch of what such
    a file might look like follows this slide)
  • File walk-through
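
The presentation does not reproduce argonne_blah.xml itself, but a WS-GRAM job description for the BLAH test on slide 19 might look roughly like the sketch below. The element names follow the GT4 WS-GRAM job description schema as commonly documented, and the hosts, ports, and paths are placeholders, so treat this as illustrative rather than the actual input file.

<!-- Hypothetical sketch of a job description similar to argonne_blah.xml.
     Hostnames and paths are placeholders, not the real TeraGrid endpoints. -->
<job>
  <executable>/bin/more</executable>
  <argument>blah.txt</argument>
  <stdout>blahtemp.out</stdout>
  <fileStageIn>
    <transfer>
      <sourceUrl>gsiftp://local.example.edu:2811/home/user/blahsrc.txt</sourceUrl>
      <destinationUrl>gsiftp://remote.hpc.example.org:2811/home/user/blah.txt</destinationUrl>
    </transfer>
  </fileStageIn>
  <fileStageOut>
    <transfer>
      <sourceUrl>gsiftp://remote.hpc.example.org:2811/home/user/blahtemp.out</sourceUrl>
      <destinationUrl>gsiftp://local.example.edu:2811/home/user/blah.out</destinationUrl>
    </transfer>
  </fileStageOut>
  <fileCleanUp>
    <deletion>
      <file>gsiftp://remote.hpc.example.org:2811/home/user/blahtemp.out</file>
    </deletion>
  </fileCleanUp>
</job>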

21
BLAH on TeraGrid: Sherpa in Action

-bash-3.00$ java -DGLOBUS_LOCATION=$GLOBUS_LOCATION Sherpa argonne_blah.xml purdue_blahblahblah.xml ncsamercury_blahblah.xml
Starting job in argonne_blah.xml
Handler 1 Starting...argonne_blah.xml
Starting job in purdue_blahblahblah.xml
Handler 2 Starting...purdue_blahblahblah.xml
Starting job in ncsamercury_blahblah.xml
Handler 3 Starting...ncsamercury_blahblah.xml
Handler 3 StageIn
Handler 2 StageIn
Handler 1 StageIn
Handler 3 Pending
Handler 1 Pending
Handler 2 Pending
Handler 2 Active
Handler 2 StageOut
Handler 1 Active
Handler 2 CleanUp
Handler 2 Done
Handler 2 Complete.
Handler 3 Active
Handler 1 StageOut
Handler 3 StageOut
Handler 1 CleanUp
Handler 3 CleanUp
Handler 1 Done
Handler 1 Complete.
Handler 3 Done
Handler 3 Complete.
-bash-3.00$ hostname -f
watchman.chpc.utah.edu

22
Sherpa Purdue Test Results
-bash-3.00$ more *.out
blahblahblah.out
BLAH BLAH BLAH
  • No PBS epilogue or prologue

23
Sherpa NCSA Mercury Results

blahblah.out
----------------------------------------
Begin PBS Prologue Thu Apr 27 13:17:09 CDT 2006
Job ID:        612149.tg-master.ncsa.teragrid.org
Username:      price
Group:         oor
Nodes:         tg-c421
End PBS Prologue Thu Apr 27 13:17:13 CDT 2006
----------------------------------------
BLAH BLAH
----------------------------------------
Begin PBS Epilogue Thu Apr 27 13:17:20 CDT 2006
Job ID:        612149.tg-master.ncsa.teragrid.org
Username:      price
Group:         oor
Job Name:      STDIN
Session:       4042
Limits:        ncpus=1,nodes=1,walltime=00:10:00
Resources:     cput=00:00:01,mem=0kb,vmem=0kb,walltime=00:00:06
Queue:         dque
Account:       mud
Nodes:         tg-c421
Killing leftovers...
End PBS Epilogue Thu Apr 27 13:17:24 CDT 2006
----------------------------------------

24
Sherpa UC/ANL Test Results

blah.out
----------------------------------------
Begin PBS Prologue Thu Apr 27 13:16:53 CDT 2006
Job ID:        251168.tg-master.uc.teragrid.org
Username:      rprice
Group:         allocate
Nodes:         tg-c061
End PBS Prologue Thu Apr 27 13:16:54 CDT 2006
----------------------------------------
BLAH
----------------------------------------
Begin PBS Epilogue Thu Apr 27 13:17:00 CDT 2006
Job ID:        251168.tg-master.uc.teragrid.org
Username:      rprice
Group:         allocate
Job Name:      STDIN
Session:       11367
Limits:        nodes=1,walltime=00:15:00
Resources:     cput=00:00:01,mem=0kb,vmem=0kb,walltime=00:00:02
Queue:         dque
Account:       TG-MCA01S027
Nodes:         tg-c061
Killing leftovers...
End PBS Epilogue Thu Apr 27 13:17:16 CDT 2006
----------------------------------------

25
MGAC Background
  • Modified Genetic Algorithms for Crystals and
    Atomic Clusters (MGAC), an HPC chemistry
    application written in C
  • In short, based on an energy criterion, MGAC
    tries to predict the chemical structure
  • Computing needs: local serial computations and
    distributed parallel computations

26
MGAC Circular Flow
27
MGAC-CGA Real Science
28
Efficiency and HPC Resources
  • Scheduler side effect
  • 1 job submitted requiring 5 calculations
  • 4 calculations require 1 hour of compute time each
  • 1 calculation requires 10 hours of compute time
  • The other 4 nodes are still reserved although not
    being used, and they can't be used by anyone else
    until the 10-hour job has finished: 4 × 9 hrs = 36 hrs
    of wasted compute time (see the breakdown after
    this slide)
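
Spelling out the node-hour arithmetic for this example (derived only from the numbers above):

  Reserved: 5 nodes × 10 hr       = 50 node-hours
  Used:     4 × 1 hr + 1 × 10 hr  = 14 node-hours
  Wasted:   4 nodes × 9 hr        = 36 node-hours (72% of the reservation sits idle)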

29
Minimization Waste Chart (MGAC)
30
Minimization Use Chart (MGAC-CGA)
31
Efficiency and HPC Resources
  • Guesstimate: in one common MGAC run our average
    efficiency due to the scheduler side effect is 46%,
  • i.e. 54% of resources are wasted
  • Sherpa continuously submits one job at a time,
    which reduces the scheduler side effect because
    multiple schedulers are involved and jobs are
    submitted in a more granular fashion
  • Improved efficiency 1: increased granularity
  • Necessary sharing policies prohibit large numbers
    of jobs from being submitted all at one HPC
    resource; queue times become too long
  • Improved efficiency 2: access to more resources
  • Guesstimate: total computational time (including
    queue time) reduced by 60-89% in our initial
    testing.

32
Sherpa Performance / Load Capability
  • Performance
  • Sherpa is lightweight; computationally intensive
    operations are done at the HPC resource
  • Memory intensive
  • Load capability
  • Hard to create a huge test case; need unique file
    names
  • Ran out of file handles around 100,000 jobs
    without any HPC submission (turned out system
    image software was misconfigured)
  • Successfully initiated 500 jobs
  • Emphasis on initiated: 500 jobs appeared in the
    test queue, and although many ran to completion we
    did not have time to let them all run to
    completion

33
Host Cert and Sherpa
  • Globus GSI
  • Uses PKI to verify that users and hosts are who
    they claim to be; creates trust
  • User certs and host certs are different and
    provide different functionality
  • Sherpa requires a Globus host certificate
  • ORNL granted us one
  • Policy changed; the cert got CRL'd (revoked)
  • Confusion: either WS-GRAM or RFT was requiring a
    valid host cert
  • Had to know if there was a way around the
    situation
  • Did some testing to investigate and troubleshoot

34
Testing/Troubleshooting
35
TeraGrid CA Caveats
  • How do you allow your machines to fully
    interoperate with the TeraGrid without a host
    cert from a trusted CA?
  • Not Possible.
  • How do you get a host cert for the TeraGrid?
  • From least scalable to most scalable:
  • Work with site-specific orgs to accept your CA's
    certs (tedious for multiple sites)
  • Get the TeraGrid security working group's approval
    for a local university CA (time consuming, not EDU
    scalable)
  • Get a TeraGrid-trusted CA to issue you one
    (unlikely, as site policy seems to contradict
    this)
  • Become a TG member
  • Side note: a satisfactory scalable solution does
    not seem to be currently in place, and it's our
    understanding that Shibboleth and/or the
    International Grid Trust Federation (IGTF) will
    eventually offer this service for EDUs.

36
Not the End: Sherpa is Flexible
  • Sherpa can work between any two machines that
    have GT4 installed and configured
  • Flexible
  • Can work in many locations
  • Implicitly follows open standards

37
Future Projects
  • MGAC-CGA is the first example; we have other
    projects with Sherpa
  • Nanotechnology simulation (web application)
  • Biomolecular docking (circular flow)
  • AKA protein docking, drug discovery
  • Combustion simulation (web application)

38
Future Features and Implementation
  • Future efforts will be directed towards:
  • implementing monitoring and discovery client
    logic
  • a polling feature that will help identify when
    system-related issues have occurred (e.g. network
    down, scheduler unavailable)
  • grid proxy auto-renewal
  • Implementation (move to a more general API):
  • Simple API for Grid Applications Research Group
    (SAGA-RG)
  • Grid Application Toolkit (GAT)
  • JavaCOG

39
How do I get a Hold of Sherpa?
  • We are interested in collaborative efforts.
  • Sorry, you can't download Sherpa because we don't
    have the manpower for support right now.

40
Q&A With Audience
  • Mail questions to ronald.charles.price@gmail.com
  • Slides available at
    http://www.chpc.utah.edu/rprice/grid_world_2006/ron_price_grid_world_presentation.ppt