Title: FermiGrid
1 FermiGrid
- Steven Timm
- Fermilab
- Computing Division
- Fermilab Grid Support Center
2 People
- FermiGrid Operations Team
  - Keith Chadwick (CD/CCF/FTP) - Project Leader
  - Steve Timm (CD/CSS/FCS) - Linux OS Support
  - Dan Yocum (CD/CCF/FTP) - Application Support
- Thanks to
  - Condor Team: M. Livny, J. Frey, A. Roy and many others.
  - Globus Developers: C. Bacon, S. Martin.
  - GridX1: R. Walker, D. Vanderster et al.
  - Fermilab grid developers: G. Garzoglio, T. Levshina.
  - Representatives of the following OSG Virtual Organizations: CDF, DZERO, USCMS, DES, SDSS, FERMILAB, I2U2, NANOHUB, GADU.
- FermiGrid Web Site / Additional Documentation:
  - http://fermigrid.fnal.gov/
3 FCC (Feynman Computing Center)
4 Fermilab Grid Computing Center
5 Computing at Fermilab
- Reconstruction and analysis of data for High Energy Physics experiments
  - > 4 Petabytes on tape
  - Fast I/O to read files, many hours of computing, fast I/O to write
  - Each job independent of other jobs
- Simulation for future experiments (CMS at CERN)
  - In two years need to scale to > 50K jobs/day
- Each big experiment has an independent cluster or clusters
  - Diverse file systems, batch systems, management methods
- More than 3000 dual-processor Linux systems in all
6 FermiGrid Project
- FermiGrid is a meta-facility established by the Fermilab Computing Division
- Four elements:
  - Common Site Grid Services
    - Virtual Organization hosting (VOMS, VOMRS), Site-wide Globus GRAM gateway, Site AuthoriZation, MyProxy, GUMS
  - Bi-lateral interoperability between various experimental stakeholders
  - Interfaces to the Open Science Grid
  - Grid interfaces to mass storage systems
8 Hardware
- Dell 2850 servers with dual 3.6 GHz Xeons, 4 GBytes of memory, 1000TX, hardware RAID, Scientific Linux 3.0.4, VDT 1.3.9
- FermiGrid1: Site Wide Globus Gateway
- FermiGrid2: Site Wide VOMS / VOMRS Server
- FermiGrid3: Site Wide GUMS Server
- FermiGrid4: MyProxy server / Site AuthoriZation server
10 Site Wide Gateway Technique
- This technique is closely adapted from a technique first used at GridX1 in Canada to forward jobs from the LCG into their clusters.
- We begin by creating a new Job Manager script in VDT_LOCATION/globus/lib/perl/Globus/GRAM/JobManager/condorg.pm
- This script takes incoming jobs and resubmits them to Condor-G on fermigrid1.
- Condor matchmaking is used so that the jobs will be forwarded to the member cluster with the most open slots.
- Each member cluster runs a cron job every five minutes to generate a ClassAd for their cluster. This is sent to fermigrid1 using condor_advertise.
- Credentials to successfully forward the job are obtained in the following manner (see the sketch after this list):
  - The user obtains a voms-qualified proxy in the normal fashion with voms-proxy-init.
  - The user sets X509_USER_CERT and X509_USER_KEY to point to the proxy instead of the usercert.pem and userkey.pem files.
  - The user uses myproxy-init to store the credentials on the Fermilab MyProxy server, myproxy.fnal.gov.
  - jobmanager-condorg, which runs as the uid that the job will run under on FermiGrid, executes a myproxy-get-delegation to get a proxy with full rights to resubmit the job.
- Documentation of the steps to do this as a user is found in the FermiGrid User Guide: http://fermigrid.fnal.gov/user-guide.html
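A minimal command-line sketch of the credential sequence above, assuming a bash shell; the VO name, default proxy path, username, and output location are illustrative assumptions, not taken from the slides:

    # 1. Obtain a VOMS-qualified proxy ("fermilab" VO is an example)
    voms-proxy-init -voms fermilab

    # 2. Point the Globus tools at the proxy instead of usercert.pem/userkey.pem
    #    (the default proxy location /tmp/x509up_u<uid> is assumed here)
    export X509_USER_CERT=/tmp/x509up_u$(id -u)
    export X509_USER_KEY=/tmp/x509up_u$(id -u)

    # 3. Store the credentials on the Fermilab MyProxy server
    myproxy-init -s myproxy.fnal.gov

    # 4. Later, on the gateway, jobmanager-condorg (running as the mapped uid)
    #    retrieves a delegated proxy with full rights to resubmit the job
    myproxy-get-delegation -s myproxy.fnal.gov -l <username> -o /tmp/forwarded_proxy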
13 OSG Interfaces for Fermilab
- Four Fermilab clusters are directly accessible to OSG right now:
  - General Purpose Grid Cluster (FNAL_GPFARM)
  - US CMS Tier 1 Cluster (USCMS_FNAL_WC1_CE)
  - LQCD cluster (FNAL_LQCD)
  - SDSS cluster (SDSS_TAM)
- Two more clusters (CDF) are accessible only through the FermiGrid site gateway.
- Future Fermilab clusters will also only be accessible through the FermiGrid site gateway.
- A shell script is used to make a Condor ClassAd and send it with condor_advertise (a sketch follows this list).
- The match is done based on the number of free CPUs and the number of jobs waiting.
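A minimal sketch of what such a per-cluster advertising script might look like; the ClassAd attribute names, cluster name, slot-counting one-liners, and collector host are illustrative assumptions, not the production FermiGrid script:

    #!/bin/sh
    # Count free batch slots and idle (waiting) jobs on this cluster.
    # These counts are rough illustrations, not the real accounting.
    FREE_CPUS=$(condor_status -avail | grep -c LINUX)
    WAITING_JOBS=$(condor_q -constraint 'JobStatus == 1' | grep -c ' I ')

    # Build a ClassAd describing the cluster (attribute names assumed).
    {
      echo 'MyType = "Machine"'
      echo 'Name = "FNAL_GPFARM_gateway"'
      echo "FreeCpus = $FREE_CPUS"
      echo "WaitingJobs = $WAITING_JOBS"
    } > /tmp/cluster_ad

    # Send the ad to the collector on the site gateway (fermigrid1),
    # where matchmaking ranks member clusters by open slots.
    condor_advertise -pool fermigrid1.fnal.gov UPDATE_STARTD_AD /tmp/cluster_ad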
14 OSG Requirements
- OSG job flow:
  - User pre-stages applications and data via gridftp/srmcp to shared areas on the cluster (can be NFS or an SRM-based storage element); a transfer sketch follows this list.
  - User submits a set of jobs to the cluster.
  - Jobs take applications and data from cluster-wide shared directories.
  - Results are written to local storage on the cluster, then transferred across the WAN.
- Most OSG jobs expect common shared disk areas for applications, data, and user home directories. Our clusters are currently not shared.
- Most OSG jobs don't use MyProxy in the submission sequence.
- OSG makes use of monitoring to detect free resources; ours are not currently reported correctly.
- Need to make the gateway transparent to the OSG so it looks like any other OSG resource. Right now it only reports 4 CPUs.
- Want to add the possibility of VO affinity to the ClassAd advertising of the gateway.
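A minimal sketch of the pre-staging step in the job flow above; the hostnames (in angle brackets) and destination paths are illustrative assumptions:

    # Pre-stage an application tarball to a shared area via GridFTP
    globus-url-copy file:///home/user/myapp.tar.gz \
        gsiftp://<cluster-head>.fnal.gov/grid/app/myvo/myapp.tar.gz

    # Or pre-stage input data to an SRM-based storage element with srmcp
    # (note srmcp's four-slash file://// syntax for local files)
    srmcp file:////home/user/input.dat \
        srm://<se-host>.fnal.gov:8443/pnfs/fnal.gov/data/myvo/input.dat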
16 Shared data areas and storage elements
- At the moment OSG requires shared Application and Data areas.
- Also needed: a shared home directory for all users (FermiGrid has 226).
- It is planned to use a BlueArc NAS appliance to serve these to all the member clusters of FermiGrid. 24 TB of disk is in the process of being ordered; the NAS head is already in hand.
- Also being commissioned: a shared volatile Storage Element for FermiGrid, which supports SRM/dCache access for all grid users.
17 Getting rid of MyProxy
- Configure each individual cluster gatekeeper to accept a restricted Globus proxy from just one host, the site gateway.
- On the CDF clusters, for example, the gatekeeper is already restricted via tcp-wrappers to not take any connections from off-site. It could be restricted further to take connections only from the glidecaf head and fermigrid1.
- Then change the gatekeeper configuration to call it with the -accept_limited option. We would then be able to forward jobs without MyProxy, and could call this jobmanager-condor rather than jobmanager-condorg. This has been tested in our test cluster and will move to production soon. A configuration sketch follows.
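A minimal sketch of what those two restrictions might look like on a cluster gatekeeper; the tcp-wrappers service name, file contents, and xinetd layout are assumptions about a typical setup, not the actual CDF configuration:

    # /etc/hosts.allow -- only the site gateway (and the glidecaf head)
    # may reach the gatekeeper; the angle-bracket hostname is a placeholder
    globus-gatekeeper: fermigrid1.fnal.gov <glidecaf-head>.fnal.gov

    # /etc/hosts.deny -- refuse everything else
    globus-gatekeeper: ALL

    # xinetd server_args for the gatekeeper -- add the -accept_limited
    # option so the restricted proxy forwarded by Condor-G is accepted
    server_args = -conf /etc/globus/globus-gatekeeper.conf -accept_limited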
18 Reporting all resources
- MonALISA: just need a unified Ganglia view of all of FermiGrid and MonALISA will show the right number of CPUs, etc. Also make it so MonALISA queries all Condor pools in FermiGrid.
- GridCat/ACDC: have to change the Condor subroutines in MIS-CI to get the right total number of CPUs from the cluster ClassAds. Fairly straightforward.
- GIP: need to change the lcg-info-dynamic-condor script to report the right number of job slots per VO. We already had to do this once; it is not difficult.
19 Globus Gatekeeper Calls
20 VOMS access
21 GUMS user mappings