1. Grid Laboratory of Wisconsin (GLOW)
UW-Madison's Campus Grid
Dan Bradley, Department of Physics & CS
Representing the GLOW, Condor, and DISUN Teams
2. Grid Laboratory of Wisconsin
2003 initiative funded by NSF/UW. Six initial GLOW sites:
- Computational Genomics, Chemistry
- AMANDA, IceCube, Space Science
- CMS, High Energy Physics
- Materials by Design, Chemical Engineering
- Radiation Therapy, Medical Physics
- Condor, Computer Science
- Plus New Members
- ATLAS, High Energy Physics
- Plasma Physics
- Multiscalar, Computer Science
Diverse users with different deadlines and usage patterns → high CPU utilization.
3. UW-Madison Campus Grid
- Condor pools in various departments, made accessible via Condor flocking (a configuration sketch follows this list).
- Users submit jobs to their own private or department Condor scheduler.
- Jobs are dynamically matched to available machines.
- Crosses multiple administrative domains.
- No common uid-space across campus.
- No cross-campus NFS for file access.
- Users rely on Condor remote I/O, file staging, AFS, SRM, GridFTP, etc.
- Must deal with firewalls.
- Need to interoperate with outside collaborators.
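To illustrate the flocking arrangement above, a department submit machine only needs to name the central managers it is allowed to flock to. A minimal sketch of the submit-side Condor configuration, with hypothetical host names:

    ## condor_config.local on a department submit machine (host names are hypothetical)
    ## Jobs that cannot be matched locally flock to these central managers, in order.
    FLOCK_TO = cm.glow.wisc.edu, condor.cs.wisc.edu

The pools being flocked to must in turn authorize this scheduler; a sketch of that side appears with slide 14.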
4. UW Campus Grid Machines
- GLOW Condor pool is distributed across the campus to provide locality for big users.
  - 1200 2.8 GHz Xeon CPUs
  - 400 1.8 GHz Opteron cores
  - 100 TB disk
- Computer Science Condor pool
  - 1000 1 GHz CPUs
  - testbed for new Condor releases
- Other private pools
  - job submission and execution
  - private storage space
  - excess jobs flock to GLOW and CS pools
5. What About The Grid?
- Who needs a campus grid?
- Why not have each cluster join The Grid independently?
6. The Value of Campus Scale
- simplicity: software stack is just Linux + Condor
- fluidity: high common denominator makes sharing easier and provides a richer feature-set
- collective buying power: we speak to vendors with one voice
- standardized administration: e.g. GLOW uses one centralized cfengine
- synergy: face-to-face technical meetings; mailing list scales well at campus level
7. The Value of the Big G
- Our users want to collaborate outside the bounds of the campus (e.g. ATLAS and CMS are international).
- We don't want to be limited to sharing resources with people who have made identical technological choices.
- The Open Science Grid gives us the opportunity to operate at both scales, which is ideal.
8. On the OSG Map
Any GLOW member is free to link their resources to other grids.
(OSG map: facility WISC, site UWMadisonCMS.)
9. Submitting Jobs within UW Campus Grid
(Diagram: a UW HEP user's jobs go to the HEP matchmaker, flock to the CS and GLOW matchmakers, and beyond, e.g. an NCSA matchmaker.)
- Supports the full feature-set of Condor (a submit-file sketch follows this list):
  - matchmaking
  - remote system calls
  - checkpointing
  - MPI
  - suspension / VMs
  - preemption policies
  - computing on demand
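A sketch of what a user-level submission looks like; the file and executable names are made up, but the submit commands are standard Condor. The standard universe is what supplies the checkpointing and remote system calls listed above:

    # simulate.sub -- standard-universe job: checkpointing + remote system calls.
    # The executable must first be relinked with condor_compile; names are illustrative.
    universe   = standard
    executable = simulate
    arguments  = --seed 42
    output     = simulate.$(Cluster).$(Process).out
    error      = simulate.$(Cluster).$(Process).err
    log        = simulate.log
    queue 50

The user's own scheduler matches these jobs locally first and flocks them to the GLOW and CS pools when local machines are busy.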
10. Submitting Jobs through OSG to UW Campus Grid
(Diagram: an Open Science Grid user's jobs enter the UW campus grid through OSG.)
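A hedged sketch of how an outside OSG user might reach the campus grid with Condor-G: a grid-universe job aimed at a UW gatekeeper, which hands it to the Condor pool. The gatekeeper host name below is a placeholder, and a valid grid proxy is assumed to be in place:

    # osg-to-uw.sub -- Condor-G job sent through a UW-Madison gatekeeper
    # (gatekeeper host name is a placeholder)
    universe      = grid
    grid_resource = gt2 gatekeeper.hep.wisc.edu/jobmanager-condor
    executable    = analyze.sh
    transfer_executable = true
    output        = analyze.out
    error         = analyze.err
    log           = analyze.log
    queue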
11. Routing Jobs from UW Campus Grid to OSG
(Diagram: a GLOW user's jobs go to the HEP, CS, and GLOW matchmakers and, via the Grid JobRouter, out to OSG.)
- Combining both worlds:
  - simple, feature-rich local mode
  - when possible, transform into a grid job for traveling globally (a JobRouter configuration sketch follows this list)
  - using the OSG Registration Authority within the DOE Certification Authority
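A rough sketch of how a JobRouter route could be configured to turn idle local jobs into grid jobs bound for an OSG site. The route name, gatekeeper, and limits are placeholders, not GLOW's actual settings:

    ## condor_config on the submit host running the JobRouter (values are placeholders)
    DAEMON_LIST = $(DAEMON_LIST) JOB_ROUTER

    JOB_ROUTER_ENTRIES = \
      [ name = "OSG_Site_Example"; \
        GridResource = "gt2 osg-gw.example.edu/jobmanager-condor"; \
        Requirements = target.WantJobRouter is True; \
        MaxIdleJobs = 10; \
        MaxJobs = 200; \
      ]

With a route like this, a job opts in by setting +WantJobRouter = True in its submit file; jobs that stay local keep the full Condor feature set.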
12. Condor Glidein
(Diagram: the condor_gridmanager launches startds on remote grid resources; the glidein startds report back to the GLOW user's matchmaker.)
- full Condor feature set on top of generic resources (a glidein submit sketch follows this list)
- GCB required to provide connectivity to worker nodes
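One way to picture a glidein is as an ordinary grid job whose payload is a Condor startd. A sketch under that assumption; the wrapper script and gatekeeper name are hypothetical:

    # glidein.sub -- run a Condor startd on a remote grid resource as a grid job;
    # once started, the worker advertises itself to the home pool's matchmaker
    # (via GCB when the worker has no inbound connectivity).
    universe      = grid
    grid_resource = gt2 remote-gw.example.edu/jobmanager-pbs
    executable    = glidein_startup.sh          # hypothetical wrapper: unpacks Condor, starts the startd
    arguments     = -collector cm.glow.wisc.edu
    transfer_executable = true
    output        = glidein.$(Process).out
    error         = glidein.$(Process).err
    log           = glidein.log
    queue 20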
13. Interoperating from a Condor-centric point of view
- Question: can we live in one Condor pool?
- Appropriate scale: a few thousand CPUs (and counting!)
- Need a port range through the firewall, or use GCB.
- Agree on pool-wide policies (a configuration sketch follows this list):
  - user priorities, preemption, and fair share (e.g. immediate rights to one's own machines)
  - security policy: GSI, Kerberos, IP-based
- Worker node policies may still be independent (e.g. requirements and ranking of jobs).
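A sketch of the kinds of configuration knobs involved; the port range, user names, and thresholds are illustrative, not the actual GLOW policy:

    ## Worker-node / negotiator policy sketch (values are illustrative only)

    # Confine Condor's dynamically chosen ports to a range the campus firewall opens:
    LOWPORT  = 9600
    HIGHPORT = 9700

    # Give the owning group immediate rights to its own machines by ranking
    # its users' jobs above everyone else's (user names are hypothetical):
    RANK = (Owner == "hep_user1") || (Owner == "hep_user2")

    # Pool-wide fair share: preempt only when the incoming user's priority
    # is substantially better than the running user's:
    PREEMPTION_REQUIREMENTS = RemoteUserPrio > SubmittorPrio * 1.2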
14. Interoperating by Flocking
- We don't want to merge pools, but we both run Condor, so we will flock (the accept-side configuration sketch follows this list).
- Requires minimal change to existing configuration.
- More independence in pool policies.
- Must still open firewalls or use GCB. (Current work on GCB aims to make this a less intrusive option.)
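The accept side of flocking is similarly small, complementing the submit-side FLOCK_TO sketch earlier. A sketch with hypothetical host names:

    ## On the central manager of the pool accepting flocked jobs (host names hypothetical)
    FLOCK_FROM = submit.physics.wisc.edu, submit.cs.wisc.edu

    ## Execute machines must also authorize the remote schedulers:
    ALLOW_WRITE = $(ALLOW_WRITE), submit.physics.wisc.edu, submit.cs.wisc.edu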
15. Interoperating through OSG
- Are you a member of a Virtual Organization in the Open Science Grid?
- If so, our GUMS service probably already recognizes you.
- Collaborators within GLOW may grant you something better than opportunistic priority (e.g. CS is collaborating with CMS, ATLAS, NanoHub, ...).
- Allocating storage is currently a trickier proposition.
16. A to-do list
- Core Condor technologies: scalability, interconnectivity, security.
- Exporting jobs to non-Condor resources: mixed success so far. Can we get our non-HEP users on board too?
- Dealing with data: we don't seem to be heading towards a common storage solution within our campus grid. Should we?