Title: FermiGrid - Fermilab Grid Gateway
1. FermiGrid - Fermilab Grid Gateway
- Keith Chadwick
- Bonnie Alcorn
- Steve Timm
2. FermiGrid - Strategy and Goals
- In order to better serve the entire program of the laboratory, the Computing Division will place all of its production resources in a Grid infrastructure called FermiGrid. This strategy will continue to allow the large experiments that currently have dedicated resources to have first-priority usage of certain resources that are purchased on their behalf. It will allow access to these dedicated resources, as well as to other shared Farm and Analysis resources, for opportunistic use by the various Virtual Organizations (VOs) that participate in FermiGrid (i.e. all of our lab programs) and by certain VOs that use the Open Science Grid. (Add something about prioritization and scheduling via new lab/CD forums.)
- The strategy will allow us
  - to optimize use of resources at Fermilab
  - to make a coherent way of putting Fermilab on the Open Science Grid
  - to save some effort and resources by implementing certain shared services and approaches
  - to work together more coherently to move all of our applications and services to run on the Grid
  - to better handle a transition from Run II to LHC (and eventually to BTeV) in a time of shrinking budgets and possibly shrinking resources for Run II worldwide
  - to fully support the Open Science Grid and the LHC Computing Grid and gain positive benefit from this emerging infrastructure in the US and Europe.
3. FermiGrid - What It Is
- FermiGrid is a meta-facility composed of a number of existing resources, many of which are currently dedicated to the exclusive use of a particular stakeholder.
- FermiGrid (the facility) provides a way for jobs of one VO to run either on shared facilities (such as the current General Purpose Farm or a new GridFarm?) or on the Farms primarily provided for other VOs. (>>> needs wordsmithing to say what, not how)
- FermiGrid will require some development and test facilities to be put in place in order to make it happen.
- FermiGrid will provide access to storage elements and to storage and data movement services for jobs running on any of the compute elements of FermiGrid.
- The resources that comprise FermiGrid will continue to be accessible in local mode as well as in Grid mode.
4. The FermiGrid Project
- This is a cooperative project across the Computing Division and its stakeholders to define and execute the steps necessary to achieve the goals of FermiGrid.
- Effort is expected to come from
  - Providers of shared resources and services (CSS and CCF)
  - Stakeholders and providers of currently dedicated resources (Run II, CMS, MINOS, SDSS)
- The total program of work is not fully known at this time, but the WBS is being fleshed out. It will involve at least the following:
  - Adding services required by some stakeholders to other stakeholders' dedicated resources
  - Work on authorization and accounting
  - Providing some common FermiGrid services (e.g. ...)
  - Providing some head nodes and gateway machines
  - Modifying some stakeholders' scripts, codes, etc. to run in the FermiGrid environment
  - Working with OSG technical activities to make sure FermiGrid and OSG (and thereby LCG) are well aligned and interoperable
  - Working on monitoring, web pages, and whatever else it takes to make this all work and happen
  - Evolving and defining forums for prioritizing access to resources and scheduling
5. FermiGrid - Some Notation
- In these slides, "Condor" means Condor / Condor-G as necessary.
6. FermiGrid - The Situation Today
- Many separate clusters
  - CDF (x3), CMS, D0 (x3), GP Farms, FNALU Batch, etc.
- When the cluster landlord does not fully utilize the cluster cycles, it is very difficult for others to opportunistically utilize the excess computing capacity.
- In the face of flat or declining budgets, we need to make the most effective use of the computing capacity.
- We need some sort of system to capture the unused available computing and put it to use.
7. FermiGrid - The State of Chaos Today
(Diagram: separate clusters - CDF Clusters, D0 Clusters, CMS Clusters, GP Farms.)
8. FermiGrid - The Vision
- The future is Grid-enabled computing.
  - Dedicated systems resources will be assimilated slowly...
  - Existing access to resources will be maintained.
  - "I am Chadwick of Grid. Prepare to be assimilated." Not!
- Enable Grid-based computing, but do not require all computing to be Grid.
  - Preserve existing access to resources for current installations.
- Let a thousand flowers bloom? Well, not quite.
  - Implement Grid interfaces to existing resources without perturbation of existing access mechanisms.
  - Once FermiGrid is in production, deploy new systems as Grid-enabled from the get-go.
- People will naturally migrate when they need expanded resources.
  - Help people with their migrations?
9. FermiGrid - The Mission
- FermiGrid is the Fermilab Grid Gateway: infrastructure to accept jobs from the Open Science Grid and, following appropriate credential authorization, schedule these jobs for execution on Fermilab Grid resources.
10. FermiGrid - The Rules
- First, do no harm.
  - Wherever possible, implement such that existing systems and infrastructure are not compromised.
  - Only when absolutely necessary, require changes to existing systems or infrastructure, and work with those affected to minimize and mitigate the impact of the required changes.
- Provide resources and infrastructure to help experiments transition to a Grid-enabled model of operation.
11. FermiGrid - Players and Roles
- CSS
  - Hardware / operating system management support.
- CCF
  - Grid infrastructure / application management support.
- OSG - a cast of thousands
  - Submit jobs / utilize resources.
  - CDF
  - D0
  - CMS
  - Lattice QCD
  - Sloan
  - MINOS
  - MiniBooNE
  - FNAL
  - Others?
12. FermiGrid - System Evolution
- Start small, but plan for success.
- Build the FermiGrid gateway system as a cluster of redundant server systems to provide 24x7 service.
  - The initial implementation will not be redundant; that will follow as soon as we learn how to implement the necessary failovers.
  - We're going to have to experiment a bit and learn how to operate these services.
  - We will need the capability of testing upgrades without impacting production services.
- Schedule OSG jobs on excess/unused cycles from existing systems and infrastructure.
  - How? Initial thoughts were to utilize the checkpoint capability within Condor.
  - Feedback from D0 and CMS is that this is not an acceptable solution.
  - Alternatives: 24-hour CPU limit? nice? other?
  - Will think about this more (policy?).
- Just think of FermiGrid like PACMAN (munch, munch, munch).
13. FermiGrid - Software Components
- Operating System and Tools
  - Scientific Linux 3.0.3
  - VDT / Globus toolkit
  - Cluster tools
    - Keep the cluster sane.
    - Migrate services as necessary.
  - Cluster-aware file system
    - Google file system? Lustre? other?
- Applications and Tools
  - VOMS / VOMRS
  - GUMS
  - Condor-G / GRIS / GIIS
14. FermiGrid - Overall Architecture
(Diagram: FermiGrid Common Gateway Services, including SAZ and Storage (SRM / dCache), fronting the head nodes (HN) and storage of the CDF, D0, CMS, GP Farm, SDSS, and Lattice QCD clusters.)
15. FermiGrid - General Purpose Farm Example
(Diagram: GP Farm users and FermiGrid both submit, via Globus / Condor, to the Farm head node running FBS. Caption: the D0 wolf stealing food out of the mouth of babies.)
16. FermiGrid - D0 Example
(Diagram: D0 jobs arrive through SamGrid and through FermiGrid, via Globus / Condor, onto FNSF0 / SamGfarm running FBS. Caption: babies stealing food out of the mouth of the D0 wolf.)
17. FermiGrid - Future Grid Farms?
(Diagram: FermiGrid submitting via Globus / Condor to future Grid farms.)
18. FermiGrid - Gateway Software
- See http://computing.fnal.gov/docs/products/voprivilege/index.html
19. FermiGrid - Gateway Hardware Architecture
(Diagram: FermiGrid gateway hardware within FNAL.)
20. FermiGrid - Gateway Hardware Roles
- FermiGate1
  - Primary for Condor / GRIS / GIIS
  - Backup for FermiGate2
  - Secondary backup for FermiGate3
- FermiGate2
  - Primary for VOMS / VOMRS
  - Backup for FermiGate3
  - Secondary backup for FermiGate1
- FermiGate3
  - Primary for GUMS / PRIMA (eventually)
  - Backup for FermiGate1
  - Secondary backup for FermiGate2
- All FermiGate systems will have the VDT Globus job manager. (A sketch of this primary/backup arrangement follows below.)
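As an illustration of the primary / backup / secondary-backup arrangement above, here is a minimal sketch in Python. The host names follow the slide; the failover-resolution logic itself is an assumption added for illustration, not the mechanism that will actually be deployed.

    # Hypothetical sketch of the FermiGate service placement and failover chain.
    # Host names follow the slide; the resolution logic is illustrative only.
    SERVICE_HOSTS = {
        # service:          (primary,      backup,       secondary backup)
        "condor/gris/giis": ("FermiGate1", "FermiGate3", "FermiGate2"),
        "voms/vomrs":       ("FermiGate2", "FermiGate1", "FermiGate3"),
        "gums/prima":       ("FermiGate3", "FermiGate2", "FermiGate1"),
    }

    def resolve_host(service, is_alive):
        """Return the first live host for a service, walking primary -> backups."""
        for host in SERVICE_HOSTS[service]:
            if is_alive(host):
                return host
        raise RuntimeError("no live host for " + service)

    # Example: with FermiGate2 down, VOMS / VOMRS would fail over to FermiGate1.
    print(resolve_host("voms/vomrs", is_alive=lambda h: h != "FermiGate2"))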
21. FermiGrid - Gateway Hardware Specification
- 3 x PowerEdge 6650
  - Dual processor 3.0 GHz Xeon MP, 4 MB cache
  - Rapid rails for Dell rack
  - 4 GB DDR SDRAM (8 x 512 MB)
  - PERC3-DC, 128 MB (1 internal, 1 external channel)
  - 2 x 36 GB 15K RPM drives
  - 2 x 73 GB 10K RPM drives
  - Dual on-board 10/100/1000 NICs
  - Redundant power supply
  - Dell Remote Access Card, Version III, without modem
  - 24x IDE CD-ROM
  - PowerEdge Basic Setup
  - 3 yr same-day 4 hr response, parts and onsite labor, 24x7
  - $14,352.09 each
- Cyclades console, dual PM20, local switch, rack
- Total system cost: approximately $50K
- Expandable in place by adding processors or disks within the systems.
22. FermiGrid - Alternate Hardware Specification
- 3 x PowerEdge 2850 (2U server)
  - Dual processor 3.6 GHz Xeon, 1 MB cache, 800 MHz FSB
  - Rapid rails for Dell rack
  - 4 GB DDR2 400 MHz (4 x 1 GB)
  - Embedded PERC4e/i controller
  - 2 x 36 GB 15K RPM drives
  - 2 x 73 GB 10K RPM drives
  - Dual on-board 10/100/1000 NICs
  - Redundant power supply
  - Dell Remote Access Card, 4th generation
  - 24x IDE CD-ROM
  - PowerEdge Basic Setup
  - 3 yr same-day 4 hr response, 24x7, parts and onsite labor
  - $6,951.24 each
- Cyclades console, dual PM20, local switch, rack
- Total system cost: approximately $25K
- Limited CPU expandability: can only add whole systems or perform a forklift upgrade.
23. FermiGrid - Condor and Condor-G
- Condor (Condor-G) will be used for batch queue management (see the submission sketch below).
  - Within FermiGrid gateway systems, definitely.
  - May feed into other head-node batch systems (e.g. FBS) as necessary.
- VOs that own a resource will have priority access to that resource.
  - Policy? Guest VOs will only be allowed to utilize idle/unused resources.
  - Policy? How quickly must a guest VO free a resource when it is desired by the owner VO?
    - Condor checkpointing would provide this, but D0 and CMS jobs will not function in this environment.
    - Alternatives: 24-hour CPU limit? nice? other?
    - More thought required (perhaps helped by the policy decisions above?).
- For Condor information see
  - http://www.cs.wisc.edu/condor/
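To make the Condor-G path concrete, the sketch below generates a grid-universe submit description and hands it to condor_submit from Python. The gatekeeper name fermigate1.fnal.gov and the executable analysis.sh are assumptions for illustration only, not the actual FermiGrid configuration; older Condor-G releases express the same submission with universe = globus and a globusscheduler line.

    # Hypothetical sketch: submit a job through a FermiGrid gateway with Condor-G.
    # The gatekeeper host and executable below are illustrative assumptions.
    import subprocess
    import textwrap

    submit_description = textwrap.dedent("""\
        universe      = grid
        grid_resource = gt2 fermigate1.fnal.gov/jobmanager-condor
        executable    = analysis.sh
        output        = job.out
        error         = job.err
        log           = job.log
        queue
    """)

    with open("fermigrid_job.sub", "w") as f:
        f.write(submit_description)

    # condor_submit is the standard Condor / Condor-G submission tool.
    subprocess.run(["condor_submit", "fermigrid_job.sub"], check=True)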
24. FermiGrid - VO Management
- Currently, VO management is performed via CMS in a back-pocket fashion.
  - Not a viable solution for the long term.
  - CMS would probably like to direct that effort towards their own work.
- We recommend that the FermiGrid infrastructure take over the VO Management Server / services and migrate them onto the appropriate gateway system (FermiGate2).
- Existing VOs should be migrated to the new VO Management Server (in the FermiGrid gateway) once the FermiGrid gateway is commissioned.
  - Existing VO management roles delegated to appropriate members of the current VOs.
- New VOs for existing infrastructure clients (e.g. FNAL, CDF, D0, CMS, Lattice QCD, SDSS, others) should be created as necessary / authorized.
25. FermiGrid - VO Creation and Support
- All new VOs created on the new VO Management Server by FermiGrid project personnel or the Helpdesk.
  - Policy? VO creation authorization mechanism?
- VO management authority delegated to the appropriate members of the VO.
- Policy? FNAL VO membership administered by the Helpdesk?
  - Like accounts in the FNAL Kerberos domain and the Fermi Windows 2000 domain.
- Policy? Small experiments may apply to CD to have their VO managed by the Helpdesk also?
- Need to provide the Helpdesk with the necessary tools for VO membership management.
26. FermiGrid - GUMS
- Grid User Management System
  - Developed at BNL.
  - Translates a Grid identity to a local identity (certificate -> local user).
  - Think of it as an automated mechanism to maintain the gridmap file (see the sketch below).
- See
  - http://www.rhic.bnl.gov/hepix/talks/041018pm/carcassi.ppt
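To make the certificate-to-local-user mapping concrete: a grid-mapfile line pairs a quoted certificate DN with a local account name, and GUMS automates keeping such mappings up to date. The minimal sketch below parses two example entries; the DNs and account names are invented for illustration.

    # Hypothetical sketch of the gridmap lookup that GUMS automates.
    # The DNs and local account names below are invented for illustration.
    import shlex

    GRID_MAPFILE = '''
    "/DC=org/DC=doegrids/OU=People/CN=Jane Doe 123456" fermilab001
    "/DC=org/DC=doegrids/OU=People/CN=John Smith 654321" cdfgrid
    '''

    def load_gridmap(text):
        """Parse grid-mapfile lines into a {certificate DN: local user} dict."""
        mapping = {}
        for line in text.splitlines():
            parts = shlex.split(line)   # shlex handles the quoted DN
            if len(parts) >= 2:
                mapping[parts[0]] = parts[1]
        return mapping

    gridmap = load_gridmap(GRID_MAPFILE)
    print(gridmap["/DC=org/DC=doegrids/OU=People/CN=Jane Doe 123456"])  # -> fermilab001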
27. FermiGrid - Project Management
- Weekly FermiGrid project management meeting
  - Fridays from 2:00 PM to 3:00 PM in FCC1.
- We would like to empanel a set of Godparents
  - Representatives from
    - CMS
    - Run II
    - Grid developers?
    - Security team?
    - Other?
  - The Godparent panel would be used to provide (short-term?) guidance and feedback to the FermiGrid project management team.
- Longer-term guidance and policy from CD line management.
28. FermiGrid - Time Scale for Implementation
- Today: decide on and order hardware for gateway systems.
  - Explore / kick the tires on existing software.
- Jan 2005: hardware installation.
  - Begin software installation and initial configuration.
- Feb-Mar 2005: common Grid services available in non-redundant mode (Condor-G, VOMS, GUMS, etc.).
- Future: transition to redundant mode as the hardware/software matures.
29. FermiGrid - Open Questions
- Policy issues?
  - Lots of policy issues need direction from CD management.
- Role of FermiGrid?
  - Direct Grid access to Fermilab Grid resources, without going through FermiGrid?
  - Grid access to Fermilab Grid resources only via FermiGrid?
  - Guest VO access to Fermilab Grid resources only via FermiGrid?
- Resource allocation?
  - Owner VO vs. guest VO?
  - How fast?
  - Under what circumstances?
  - Grid Users Meeting, a la the Farm Users Meeting?
- Accounting?
  - Who, where, what, when, how?
  - Recording vs. access.
30. FermiGrid - Guest vs. Owner VO Access
(Diagram: guest vs. owner VO access paths - via the FermiGrid Gateway (Required? / Allowed) and directly to the Resource Head Node (Not Allowed? / Allowed?).)
31. FermiGrid - Fin