Title: FermiGrid - Fermilab Grid Gateway
1. FermiGrid - Fermilab Grid Gateway
- Keith Chadwick
- Bonnie Alcorn
- Steve Timm
2. FermiGrid - Strategy and Goals
- In order to better serve the entire program of the laboratory, the Computing Division will place all of its production resources in a Grid infrastructure called FermiGrid. This strategy will continue to allow the large experiments that currently have dedicated resources to have first-priority usage of certain resources that are purchased on their behalf. It will allow access to these dedicated resources, as well as to other shared Farm and Analysis resources, for opportunistic use by the various Virtual Organizations (VOs) that participate in FermiGrid (i.e. all of our lab programs) and by certain VOs that use the Open Science Grid. (Add something about prioritization and scheduling via new lab/CD forums.)
- The strategy will allow us
  - to optimize use of resources at Fermilab
  - to make a coherent way of putting Fermilab on the Open Science Grid
  - to save some effort and resources by implementing certain shared services and approaches
  - to work together more coherently to move all of our applications and services to run on the Grid
  - to better handle a transition from Run II to LHC (and eventually to BTeV) in a time of shrinking budgets and possibly shrinking resources for Run II worldwide
  - to fully support the Open Science Grid and the LHC Computing Grid and gain positive benefit from this emerging infrastructure in the US and Europe.
3. FermiGrid - What It Is
- FermiGrid is a meta-facility composed of a number of existing resources, many of which are currently dedicated to the exclusive use of a particular stakeholder.
- FermiGrid (the facility) provides a way for jobs of one VO to run either on shared facilities (such as the current General Purpose Farm or a new GridFarm?) or on the Farms primarily provided for other VOs. (>>> needs wordsmithing to say what, not how)
- FermiGrid will require some development and test facilities to be put in place in order to make it happen.
- FermiGrid will provide access to storage elements and to storage and data movement services for jobs running on any of the compute elements of FermiGrid.
- The resources that comprise FermiGrid will continue to be accessible in local mode as well as in Grid mode.
4. The FermiGrid Project
- This is a cooperative project across the Computing Division and its stakeholders to define and execute the steps necessary to achieve the goals of FermiGrid.
- Effort is expected to come from
  - Providers of shared resources and services (CSS and CCF)
  - Stakeholders and providers of currently dedicated resources (Run II, CMS, MINOS, SDSS)
- The total program of work is not fully known at this time, but the WBS is being fleshed out. It will involve at least the following:
  - Adding services required by some stakeholders to other stakeholders' dedicated resources
  - Work on authorization and accounting
  - Providing some common FermiGrid services (e.g. ...)
  - Providing some head nodes and gateway machines
  - Modifying some stakeholders' scripts, codes, etc. to run in the FermiGrid environment
  - Working with OSG technical activities to make sure FermiGrid and OSG (and thereby LCG) are well aligned and interoperable
  - Working on monitoring, web pages, and whatever else it takes to make this all work and happen
  - Evolving and defining forums for prioritizing access to resources and scheduling
5. FermiGrid - Some Notation
- In these slides, "Condor" means Condor / Condor-G as necessary.
6. FermiGrid - The Situation Today
- Many separate clusters
  - CDF (x3), CMS, D0 (x3), GP Farms, FNALU Batch, etc.
- When the cluster landlord does not fully utilize the cluster cycles, it is very difficult for others to opportunistically utilize the excess computing capacity.
- In the face of flat or declining budgets, we need to make the most effective use of the computing capacity.
- We need some sort of system to capture the unused available computing and put it to use.
7. FermiGrid - The State of Chaos Today
(Diagram: separate clusters - CDF Clusters, D0 Clusters, CMS Clusters, GP Farms.)
8. FermiGrid - The Vision
- The future is Grid-enabled computing.
  - Dedicated systems resources will be assimilated slowly...
  - Existing access to resources will be maintained.
  - "I am Chadwick of Grid. Prepare to be assimilated." Not!
- Enable Grid-based computing, but do not require all computing to be Grid.
  - Preserve existing access to resources for current installations.
- Let a thousand flowers bloom? Well, not quite.
  - Implement Grid interfaces to existing resources without perturbation of existing access mechanisms.
  - Once FermiGrid is in production, deploy new systems as Grid-enabled from the get-go.
- People will naturally migrate when they need expanded resources.
  - Help people with their migrations?
9. FermiGrid - The Mission
- FermiGrid is the Fermilab Grid Gateway: infrastructure to accept jobs from the Open Science Grid and, following appropriate credential authorization, schedule these jobs for execution on Fermilab Grid resources.
10. FermiGrid - The Rules
- First, do no harm.
  - Wherever possible, implement such that existing systems and infrastructure are not compromised.
  - Only when absolutely necessary, require changes to existing systems or infrastructure, and work with those affected to minimize and mitigate the impact of the required changes.
- Provide resources and infrastructure to help experiments transition to a Grid-enabled model of operation.
11. FermiGrid - Players and Roles
- CSS
  - Hardware / operating system management support.
- CCF
  - Grid infrastructure / application management support.
- OSG - a cast of thousands
  - Submit jobs / utilize resources.
  - CDF
  - D0
  - CMS
  - Lattice QCD
  - Sloan
  - MINOS
  - MiniBooNE
  - FNAL
  - Others?
12. FermiGrid - System Evolution
- Start small, but plan for success.
- Build the FermiGrid gateway system as a cluster of redundant server systems to provide 24x7 service.
  - The initial implementation will not be redundant; that will follow as soon as we learn how to implement the necessary failovers.
  - We're going to have to experiment a bit and learn how to operate these services.
  - We will need the capability of testing upgrades without impacting production services.
- Schedule OSG jobs on excess/unused cycles from existing systems and infrastructure.
  - How? Initial thoughts were to utilize the checkpoint capability within Condor.
  - Feedback from D0 and CMS is that this is not an acceptable solution.
  - Alternatives: 24-hour CPU limit? nice? other?
  - Will think about this more (policy?).
- Just think of FermiGrid like PACMAN (munch, munch, munch).
13. FermiGrid - Software Components
- Operating System and Tools
  - Scientific Linux 3.0.3
  - VDT / Globus toolkit
  - Cluster tools
    - Keep the cluster sane.
    - Migrate services as necessary.
  - Cluster-aware file system
    - Google file system? Lustre? other?
- Applications and Tools
  - VOMS / VOMRS
  - GUMS
  - Condor-G / GRIS / GIIS
14. FermiGrid - Overall Architecture
(Diagram: FermiGrid Common Gateway Services, including SAZ and Storage (SRM / dCache), fronting the head nodes (HN) and storage of the CDF, D0, CMS, GP Farm, SDSS, and Lattice QCD clusters.)
15. FermiGrid - General Purpose Farm Example
(Diagram: GP Farm users and FermiGrid both submit, via Globus / Condor, to the Farm head node running FBS. Caption: the D0 wolf stealing food out of the mouth of babies.)
16. FermiGrid - D0 Example
(Diagram: D0 jobs arrive through SamGrid and through FermiGrid, via Globus / Condor, onto FNSF0 / SamGfarm running FBS. Caption: babies stealing food out of the mouth of the D0 wolf.)
17. FermiGrid - Future Grid Farms?
(Diagram: FermiGrid submitting via Globus / Condor to future Grid farms.)
18. FermiGrid - Gateway Software
- See http://computing.fnal.gov/docs/products/voprivilege/index.html
19. FermiGrid - Gateway Hardware Architecture
(Diagram: FermiGrid gateway hardware within FNAL.)
20. FermiGrid - Gateway Hardware Roles
- FermiGate1
  - Primary for Condor / GRIS / GIIS
  - Backup for FermiGate2
  - Secondary backup for FermiGate3
- FermiGate2
  - Primary for VOMS / VOMRS
  - Backup for FermiGate3
  - Secondary backup for FermiGate1
- FermiGate3
  - Primary for GUMS / PRIMA (eventually)
  - Backup for FermiGate1
  - Secondary backup for FermiGate2
- All FermiGate systems will have the VDT Globus job manager. (A sketch of this primary/backup arrangement follows below.)
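As an illustration of the primary / backup / secondary-backup arrangement above, here is a minimal sketch in Python. The host names follow the slide; the failover-resolution logic itself is an assumption added for illustration, not the mechanism that will actually be deployed.

    # Hypothetical sketch of the FermiGate service placement and failover chain.
    # Host names follow the slide; the resolution logic is illustrative only.
    SERVICE_HOSTS = {
        # service:          (primary,      backup,       secondary backup)
        "condor/gris/giis": ("FermiGate1", "FermiGate3", "FermiGate2"),
        "voms/vomrs":       ("FermiGate2", "FermiGate1", "FermiGate3"),
        "gums/prima":       ("FermiGate3", "FermiGate2", "FermiGate1"),
    }

    def resolve_host(service, is_alive):
        """Return the first live host for a service, walking primary -> backups."""
        for host in SERVICE_HOSTS[service]:
            if is_alive(host):
                return host
        raise RuntimeError("no live host for " + service)

    # Example: with FermiGate2 down, VOMS / VOMRS would fail over to FermiGate1.
    print(resolve_host("voms/vomrs", is_alive=lambda h: h != "FermiGate2"))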
21. FermiGrid - Gateway Hardware Specification
- 3 x PowerEdge 6650
  - Dual processor 3.0 GHz Xeon MP, 4 MB cache
  - Rapid rails for Dell rack
  - 4 GB DDR SDRAM (8 x 512 MB)
  - PERC3-DC, 128 MB (1 internal, 1 external channel)
  - 2 x 36 GB 15K RPM drives
  - 2 x 73 GB 10K RPM drives
  - Dual on-board 10/100/1000 NICs
  - Redundant power supply
  - Dell Remote Access Card, Version III, without modem
  - 24x IDE CD-ROM
  - PowerEdge Basic Setup
  - 3 yr same-day 4 hr response, parts and onsite labor, 24x7
  - $14,352.09 each
- Cyclades console, dual PM20, local switch, rack
- Total system cost: approximately $50K
- Expandable in place by adding processors or disks within the systems.
22. FermiGrid - Alternate Hardware Specification
- 3 x PowerEdge 2850 (2U server)
  - Dual processor 3.6 GHz Xeon, 1 MB cache, 800 MHz FSB
  - Rapid rails for Dell rack
  - 4 GB DDR2 400 MHz (4 x 1 GB)
  - Embedded PERC4e/i controller
  - 2 x 36 GB 15K RPM drives
  - 2 x 73 GB 10K RPM drives
  - Dual on-board 10/100/1000 NICs
  - Redundant power supply
  - Dell Remote Access Card, 4th generation
  - 24x IDE CD-ROM
  - PowerEdge Basic Setup
  - 3 yr same-day 4 hr response, 24x7, parts and onsite labor
  - $6,951.24 each
- Cyclades console, dual PM20, local switch, rack
- Total system cost: approximately $25K
- Limited CPU expandability: can only add whole systems or perform a forklift upgrade.
23. FermiGrid - Condor and Condor-G
- Condor (Condor-G) will be used for batch queue management (see the submission sketch below).
  - Within FermiGrid gateway systems, definitely.
  - May feed into other head-node batch systems (e.g. FBS) as necessary.
- VOs that own a resource will have priority access to that resource.
  - Policy? Guest VOs will only be allowed to utilize idle/unused resources.
  - Policy? How quickly must a guest VO free a resource when it is desired by the owner VO?
    - Condor checkpointing would provide this, but D0 and CMS jobs will not function in this environment.
    - Alternatives: 24-hour CPU limit? nice? other?
    - More thought required (perhaps helped by the policy decisions above?).
- For Condor information see
  - http://www.cs.wisc.edu/condor/
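To make the Condor-G path concrete, the sketch below generates a grid-universe submit description and hands it to condor_submit from Python. The gatekeeper name fermigate1.fnal.gov and the executable analysis.sh are assumptions for illustration only, not the actual FermiGrid configuration; older Condor-G releases express the same submission with universe = globus and a globusscheduler line.

    # Hypothetical sketch: submit a job through a FermiGrid gateway with Condor-G.
    # The gatekeeper host and executable below are illustrative assumptions.
    import subprocess
    import textwrap

    submit_description = textwrap.dedent("""\
        universe      = grid
        grid_resource = gt2 fermigate1.fnal.gov/jobmanager-condor
        executable    = analysis.sh
        output        = job.out
        error         = job.err
        log           = job.log
        queue
    """)

    with open("fermigrid_job.sub", "w") as f:
        f.write(submit_description)

    # condor_submit is the standard Condor / Condor-G submission tool.
    subprocess.run(["condor_submit", "fermigrid_job.sub"], check=True)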
24. FermiGrid - VO Management
- Currently, VO management is performed via CMS in a back-pocket fashion.
  - Not a viable solution for the long term.
  - CMS would probably like to direct that effort towards their own work.
- We recommend that the FermiGrid infrastructure take over the VO Management Server / services and migrate them onto the appropriate gateway system (FermiGate2).
- Existing VOs should be migrated to the new VO Management Server (in the FermiGrid gateway) once the FermiGrid gateway is commissioned.
  - Existing VO management roles delegated to appropriate members of the current VOs.
- New VOs for existing infrastructure clients (e.g. FNAL, CDF, D0, CMS, Lattice QCD, SDSS, others) should be created as necessary / authorized.
25. FermiGrid - VO Creation and Support
- All new VOs created on the new VO Management Server by FermiGrid project personnel or the Helpdesk.
  - Policy? VO creation authorization mechanism?
- VO management authority delegated to the appropriate members of the VO.
- Policy? FNAL VO membership administered by the Helpdesk?
  - Like accounts in the FNAL Kerberos domain and the Fermi Windows 2000 domain.
- Policy? Small experiments may apply to CD to have their VO managed by the Helpdesk also?
- Need to provide the Helpdesk with the necessary tools for VO membership management.
26. FermiGrid - GUMS
- Grid User Management System
  - Developed at BNL.
  - Translates a Grid identity to a local identity (certificate -> local user).
  - Think of it as an automated mechanism to maintain the gridmap file (see the sketch below).
- See
  - http://www.rhic.bnl.gov/hepix/talks/041018pm/carcassi.ppt
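To make the certificate-to-local-user mapping concrete: a grid-mapfile line pairs a quoted certificate DN with a local account name, and GUMS automates keeping such mappings up to date. The minimal sketch below parses two example entries; the DNs and account names are invented for illustration.

    # Hypothetical sketch of the gridmap lookup that GUMS automates.
    # The DNs and local account names below are invented for illustration.
    import shlex

    GRID_MAPFILE = '''
    "/DC=org/DC=doegrids/OU=People/CN=Jane Doe 123456" fermilab001
    "/DC=org/DC=doegrids/OU=People/CN=John Smith 654321" cdfgrid
    '''

    def load_gridmap(text):
        """Parse grid-mapfile lines into a {certificate DN: local user} dict."""
        mapping = {}
        for line in text.splitlines():
            parts = shlex.split(line)   # shlex handles the quoted DN
            if len(parts) >= 2:
                mapping[parts[0]] = parts[1]
        return mapping

    gridmap = load_gridmap(GRID_MAPFILE)
    print(gridmap["/DC=org/DC=doegrids/OU=People/CN=Jane Doe 123456"])  # -> fermilab001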
27. FermiGrid - Project Management
- Weekly FermiGrid project management meeting
  - Fridays from 2:00 PM to 3:00 PM in FCC1.
- We would like to empanel a set of Godparents
  - Representatives from
    - CMS
    - Run II
    - Grid developers?
    - Security team?
    - Other?
  - The Godparent panel would be used to provide (short-term?) guidance and feedback to the FermiGrid project management team.
- Longer-term guidance and policy from CD line management.
28. FermiGrid - Time Scale for Implementation
- Today: decide on and order hardware for gateway systems.
  - Explore / kick the tires on existing software.
- Jan 2005: hardware installation.
  - Begin software installation and initial configuration.
- Feb-Mar 2005: common Grid services available in non-redundant mode (Condor-G, VOMS, GUMS, etc.).
- Future: transition to redundant mode as the hardware/software matures.
29. FermiGrid - Open Questions
- Policy issues?
  - Lots of policy issues need direction from CD management.
- Role of FermiGrid?
  - Direct Grid access to Fermilab Grid resources, without going through FermiGrid?
  - Grid access to Fermilab Grid resources only via FermiGrid?
  - Guest VO access to Fermilab Grid resources only via FermiGrid?
- Resource allocation?
  - Owner VO vs. guest VO?
  - How fast?
  - Under what circumstances?
  - Grid Users Meeting, a la the Farm Users Meeting?
- Accounting?
  - Who, where, what, when, how?
  - Recording vs. access.
30. FermiGrid - Guest vs. Owner VO Access
(Diagram: guest vs. owner VO access paths - via the FermiGrid Gateway (Required? / Allowed) and directly to the Resource Head Node (Not Allowed? / Allowed?).)
31. FermiGrid - Fin