Title: FermiGrid
1 FermiGrid
- Steven Timm
- Fermilab
- Computing Division
- Fermilab Grid Support Center
2 People
- FermiGrid Operations Team
  - Keith Chadwick (CD/CCF/FTP) - Project Leader
  - Steve Timm (CD/CSS/FCS) - Linux OS Support
  - Dan Yocum (CD/CCF/FTP) - Application Support
- Thanks to
  - Condor Team: M. Livny, J. Frey, A. Roy and many others.
  - Globus Developers: C. Bacon, S. Martin.
  - GridX1: R. Walker, D. Vanderster et al.
  - Fermilab grid developers: G. Garzoglio, T. Levshina.
  - Representatives of the following OSG Virtual Organizations: CDF, DZERO, USCMS, DES, SDSS, FERMILAB, I2U2, NANOHUB, GADU.
- FermiGrid Web Site / Additional Documentation:
  - http://fermigrid.fnal.gov/
3 FCC (Feynman Computing Center)
4 Fermilab Grid Computing Center
5 Computing at Fermilab
- Reconstruction and analysis of data for High Energy Physics experiments
  - > 4 Petabytes on tape
  - Fast I/O to read files, many hours of computing, fast I/O to write
  - Each job independent of other jobs
- Simulation for future experiments (CMS at CERN)
  - In two years need to scale to > 50K jobs/day
- Each big experiment has an independent cluster or clusters
  - Diverse file systems, batch systems, management methods
- More than 3000 dual-processor Linux systems in all
6 FermiGrid Project
- FermiGrid is a meta-facility established by the Fermilab Computing Division
- Four elements:
  - Common Site Grid Services
    - Virtual Organization hosting (VOMS, VOMRS), Site-wide Globus GRAM gateway, Site AuthoriZation, MyProxy, GUMS
  - Bi-lateral interoperability between various experimental stakeholders
  - Interfaces to the Open Science Grid
  - Grid interfaces to mass storage systems
8 Hardware
- Dell 2850 servers with dual 3.6 GHz Xeons, 4 GBytes of memory, 1000TX, hardware RAID, Scientific Linux 3.0.4, VDT 1.3.9
- FermiGrid1: Site Wide Globus Gateway
- FermiGrid2: Site Wide VOMS / VOMRS Server
- FermiGrid3: Site Wide GUMS Server
- FermiGrid4: MyProxy server / Site AuthoriZation server
10 Site Wide Gateway Technique
- This technique is closely adapted from a technique first used at GridX1 in Canada to forward jobs from the LCG into their clusters.
- We begin by creating a new Job Manager script in VDT_LOCATION/globus/lib/perl/Globus/GRAM/JobManager/condorg.pm
- This script takes incoming jobs and resubmits them to Condor-G on fermigrid1.
- Condor matchmaking is used so that the jobs will be forwarded to the member cluster with the most open slots.
- Each member cluster runs a cron job every five minutes to generate a ClassAd for their cluster. This is sent to fermigrid1 using condor_advertise.
- Credentials to successfully forward the job are obtained in the following manner (see the sketch after this list):
  - The user obtains a voms-qualified proxy in the normal fashion with voms-proxy-init.
  - The user sets X509_USER_CERT and X509_USER_KEY to point to the proxy instead of the usercert.pem and userkey.pem files.
  - The user uses myproxy-init to store the credentials on the Fermilab MyProxy server, myproxy.fnal.gov.
  - jobmanager-condorg, which runs as the uid that the job will run under on FermiGrid, executes a myproxy-get-delegation to get a proxy with full rights to resubmit the job.
- Documentation of the steps to do this as a user is found in the FermiGrid User Guide: http://fermigrid.fnal.gov/user-guide.html
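A minimal command-line sketch of the credential sequence above, assuming a bash shell; the VO name, default proxy path, username, and output location are illustrative assumptions, not taken from the slides:

    # 1. Obtain a VOMS-qualified proxy ("fermilab" VO is an example)
    voms-proxy-init -voms fermilab

    # 2. Point the Globus tools at the proxy instead of usercert.pem/userkey.pem
    #    (the default proxy location /tmp/x509up_u<uid> is assumed here)
    export X509_USER_CERT=/tmp/x509up_u$(id -u)
    export X509_USER_KEY=/tmp/x509up_u$(id -u)

    # 3. Store the credentials on the Fermilab MyProxy server
    myproxy-init -s myproxy.fnal.gov

    # 4. Later, on the gateway, jobmanager-condorg (running as the mapped uid)
    #    retrieves a delegated proxy with full rights to resubmit the job
    myproxy-get-delegation -s myproxy.fnal.gov -l <username> -o /tmp/forwarded_proxy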
13 OSG Interfaces for Fermilab
- Four Fermilab clusters are directly accessible to OSG right now:
  - General Purpose Grid Cluster (FNAL_GPFARM)
  - US CMS Tier 1 Cluster (USCMS_FNAL_WC1_CE)
  - LQCD cluster (FNAL_LQCD)
  - SDSS cluster (SDSS_TAM)
- Two more clusters (CDF) are accessible only through the FermiGrid site gateway.
- Future Fermilab clusters will also only be accessible through the FermiGrid site gateway.
- A shell script is used to make a Condor ClassAd and send it with condor_advertise (a sketch follows this list).
- The match is done based on the number of free CPUs and the number of jobs waiting.
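A minimal sketch of what such a per-cluster advertising script might look like; the ClassAd attribute names, cluster name, slot-counting one-liners, and collector host are illustrative assumptions, not the production FermiGrid script:

    #!/bin/sh
    # Count free batch slots and idle (waiting) jobs on this cluster.
    # These counts are rough illustrations, not the real accounting.
    FREE_CPUS=$(condor_status -avail | grep -c LINUX)
    WAITING_JOBS=$(condor_q -constraint 'JobStatus == 1' | grep -c ' I ')

    # Build a ClassAd describing the cluster (attribute names assumed).
    {
      echo 'MyType = "Machine"'
      echo 'Name = "FNAL_GPFARM_gateway"'
      echo "FreeCpus = $FREE_CPUS"
      echo "WaitingJobs = $WAITING_JOBS"
    } > /tmp/cluster_ad

    # Send the ad to the collector on the site gateway (fermigrid1),
    # where matchmaking ranks member clusters by open slots.
    condor_advertise -pool fermigrid1.fnal.gov UPDATE_STARTD_AD /tmp/cluster_ad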
14 OSG Requirements
- OSG job flow:
  - User pre-stages applications and data via gridftp/srmcp to shared areas on the cluster (can be NFS or an SRM-based storage element); a transfer sketch follows this list.
  - User submits a set of jobs to the cluster.
  - Jobs take applications and data from cluster-wide shared directories.
  - Results are written to local storage on the cluster, then transferred across the WAN.
- Most OSG jobs expect common shared disk areas for applications, data, and user home directories. Our clusters are currently not shared.
- Most OSG jobs don't use MyProxy in the submission sequence.
- OSG makes use of monitoring to detect free resources; ours are not currently reported correctly.
- Need to make the gateway transparent to the OSG so it looks like any other OSG resource. Right now it only reports 4 CPUs.
- Want to add the possibility of VO affinity to the ClassAd advertising of the gateway.
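A minimal sketch of the pre-staging step in the job flow above; the hostnames (in angle brackets) and destination paths are illustrative assumptions:

    # Pre-stage an application tarball to a shared area via GridFTP
    globus-url-copy file:///home/user/myapp.tar.gz \
        gsiftp://<cluster-head>.fnal.gov/grid/app/myvo/myapp.tar.gz

    # Or pre-stage input data to an SRM-based storage element with srmcp
    # (note srmcp's four-slash file://// syntax for local files)
    srmcp file:////home/user/input.dat \
        srm://<se-host>.fnal.gov:8443/pnfs/fnal.gov/data/myvo/input.dat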
16 Shared data areas and storage elements
- At the moment OSG requires shared Application and Data areas.
- Also needed: a shared home directory for all users (FermiGrid has 226).
- It is planned to use a BlueArc NAS appliance to serve these to all the member clusters of FermiGrid. 24 TB of disk is in the process of being ordered; the NAS head is already in hand.
- Also being commissioned: a shared volatile Storage Element for FermiGrid, which supports SRM/dCache access for all grid users.
17 Getting rid of MyProxy
- Configure each individual cluster gatekeeper to accept a restricted Globus proxy from just one host, the site gateway.
- On the CDF clusters, for example, the gatekeeper is already restricted via tcp-wrappers to not take any connections from off-site. It could be restricted further to take connections only from the glidecaf head and fermigrid1.
- Then change the gatekeeper configuration to call it with the -accept_limited option. We would then be able to forward jobs without MyProxy, and could call this jobmanager-condor rather than jobmanager-condorg. This has been tested in our test cluster and will move to production soon. A configuration sketch follows.
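A minimal sketch of what those two restrictions might look like on a cluster gatekeeper; the tcp-wrappers service name, file contents, and xinetd layout are assumptions about a typical setup, not the actual CDF configuration:

    # /etc/hosts.allow -- only the site gateway (and the glidecaf head)
    # may reach the gatekeeper; the angle-bracket hostname is a placeholder
    globus-gatekeeper: fermigrid1.fnal.gov <glidecaf-head>.fnal.gov

    # /etc/hosts.deny -- refuse everything else
    globus-gatekeeper: ALL

    # xinetd server_args for the gatekeeper -- add the -accept_limited
    # option so the restricted proxy forwarded by Condor-G is accepted
    server_args = -conf /etc/globus/globus-gatekeeper.conf -accept_limited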
18 Reporting all resources
- MonALISA: just need a unified Ganglia view of all of FermiGrid and MonALISA will show the right number of CPUs, etc. Also make it so MonALISA queries all Condor pools in FermiGrid.
- GridCat/ACDC: have to change the Condor subroutines in MIS-CI to get the right total number of CPUs from the cluster ClassAds. Fairly straightforward.
- GIP: need to change the lcg-info-dynamic-condor script to report the right number of job slots per VO. We already had to do this once; it is not difficult.
19 Globus Gatekeeper Calls
20 VOMS access
21 GUMS user mappings