Title: Run II Computing Overview
1. Run II Computing Overview
- Victoria White
- Head, Computing Division
- Run II Computing Review
- September 12, 2005
2. 10th Review of Run II Computing
- 1996: Started joint projects for Run II
- 1997-2001: Building the Run II computing software, services, and environment; some successes in commonality, but many diverging approaches
- 2002: Rework of the CDF data handling system to use Enstore and dCache, with a decision to adopt SAM
- 2002-2005: Evolution of the Run II environment towards further commonality, scalability, and long-term supportability, and towards fully distributed Grid computing
- Many great successes; ongoing work
- Each experiment working (with CD) to get out of a couple of holes they have fallen into
3. Run II Computing and Software: A Collaborative Endeavor
- CDF and D0 collaborations
- Computing Division: Running Experiments department and 3 large departments:
  - CEPA (Computing and Engineering for Physics Analysis)
  - CCF (Computation and Communications Fabric)
  - CSS (Core Support Services)
- Grid Projects efforts worldwide
- Community support for some software (e.g. GEANT, ROOT)
- Computing centers and institutions with computing resources outside Fermilab
4. Computing Division Organization
5. Where Are We in This Endeavor?
- Successful: basically, physics results are coming
- Both experiments can record, reconstruct, and analyze their data in a reliable and timely way
- Both can create and store MC (albeit less than optimally)
- Both can make new releases of software
- Both make use of computing resources outside Fermilab (more on this later in the Grid talk)
- Together we are attacking the remaining problem areas
- Together we are implementing (and relying on) a Grid computing model in order to provide adequate resources for the next few years
6. Fermilab Budget for Run II Computing
- $2.6M of capital equipment money in FY05
  - Expect $3M in FY06
- $400K towards computing facility upgrades
  - Hope this is covered elsewhere in FY06
- $1M of operating expenses in FY05
  - For tapes, maintenance of machines, robots, network equipment, and tape drives (WAN costs and general facility costs covered elsewhere)
- 31 FTEs for Run II in the REX department
- A large fraction of the resources for shared services and solutions → support Run II (~90 FTEs)
7. CD Shared Resources: For Services, Development, R&D
- Farms procurements and administration
- Central storage systems (Enstore, dCache, SRM)
- SAM-Grid data handling
- Networking
- Wide area networking
- Databases infrastructure and operations
- Engineering support
- Equipment logistics and repair
- Linux support
- Monitoring infrastructure
- Computing facility operations and planning
- Budget, ESH, DOE reporting, admin/document support
- Cyber security → Grid security
- Helpdesk, contract management, Windows infrastructure
- Web servers
- Task forces to help improve performance / special needs
- Ongoing program of Grid-related projects
9. CDF dCache Bytes Read (plot)
10. D0 Reprocessing / SAM-Grid - II
- SAM-Grid enables a common environment and operation scripts, as well as effective book-keeping
- JIM's XML-DB used to ease bug tracing and provide fast recovery
- SAM avoids data duplication and defines recovery jobs
- Monitor speed and efficiency, by site or overall
  (http://samgrid.fnal.gov:8080/cgi-bin/plot_efficiency.cgi)
- Started end of March
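The per-site efficiency summary that the monitoring page plots can be sketched as follows. This is only an illustration: the record fields (`site`, `status`) and the function name are hypothetical, not taken from the actual JIM/SAM-Grid schema.

```python
# Hypothetical sketch of a per-site job-efficiency summary like the one
# the SAM-Grid monitoring page produces; field names are assumptions.
from collections import defaultdict

def efficiency_by_site(jobs):
    """Return {site: succeeded/submitted} plus an 'overall' entry."""
    counts = defaultdict(lambda: [0, 0])  # site -> [succeeded, submitted]
    for job in jobs:
        counts[job["site"]][1] += 1
        if job["status"] == "success":
            counts[job["site"]][0] += 1
    result = {site: ok / total for site, (ok, total) in counts.items()}
    submitted = sum(t for _, t in counts.values())
    succeeded = sum(o for o, _ in counts.values())
    result["overall"] = succeeded / submitted if submitted else 0.0
    return result

jobs = [
    {"site": "FNAL", "status": "success"},
    {"site": "FNAL", "status": "failed"},
    {"site": "IN2P3", "status": "success"},
    {"site": "IN2P3", "status": "success"},
]
print(efficiency_by_site(jobs))  # FNAL: 0.5, IN2P3: 1.0, overall: 0.75
```

The same aggregation works "by site or overall" simply by keeping both the per-site counters and their totals, which is the distinction the slide draws.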
11. Strategy and Emphasis of CD
- Address scalability and reliability issues (as they become understood); requirements are a moving target
- Common solutions at Fermilab and worldwide wherever possible (arguments are weak for experiment needs being special)
- Grid solutions leverage use of computing and human resources and assure interoperability with LHC experiments
- Increase efficiency of operations through common services, automation, better documentation and monitoring
- Task forces to address specific needs
- Plan to find a few places to take over tasks
12. Strategy and Emphasis of CD
- Address scalability and reliability issues (as they become understood): TALKS FROM BOTH EXPTS; requirements are a moving target
- Common solutions at Fermilab and worldwide wherever possible (arguments are weak for experiment needs being special): 2 TALKS FROM CD (missing CEPA talk this year on Engineering and Physics Tools help)
- Grid solutions leverage use of computing and human resources and assure interoperability with LHC experiments: TALK ON GRID STRATEGIES FROM RUTH
- Increase efficiency of operations through common services, automation, better documentation and monitoring: HOPE THIS WILL EMERGE THROUGHOUT THE TALKS
- Task forces to address specific needs: 2 TALKS
- Plan to find a few places to take over tasks
13. (Architecture diagram: experiment software and Grid middleware layers)
- Experiment-specific layer:
  - Experiment user applications: ROOT analysis, production scripts, batch analysis
  - Experiment user interface & frameworks: event data management & selection; physics allocations & accounting
- Common middleware services layer:
  - Resource selection; workflow management; virtual organization (physics group) administration
  - Data/file catalogs & data handling; job queues & workload management; information catalogs & repositories
  - Data movement & bandwidth scheduling; job scheduling & priority; monitoring information; security & authorization
  - Grid middleware interfaces
- Local and network resources layer:
  - Compute elements & local storage elements: disks, farms, storage, batch queues
  - Storage elements: permanent storage
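The layering in the diagram above can be illustrated with a minimal sketch, under stated assumptions: every class and method name here is hypothetical (none of this is actual SAM-Grid or middleware code). An experiment-specific application hands a job to a common middleware layer, which performs resource selection and dispatches to a local compute element.

```python
# Hypothetical illustration of the three-layer Grid model; all names
# are invented for this sketch, not taken from any Run II codebase.
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    dataset: str

class ComputeElement:            # local/network resources layer
    def __init__(self, site, free_slots):
        self.site, self.free_slots = site, free_slots
    def submit(self, job):
        return f"{job.name} queued at {self.site}"

class WorkloadManager:           # common middleware services layer
    def __init__(self, elements):
        self.elements = elements
    def run(self, job):
        # resource selection: pick the site with the most free slots
        best = max(self.elements, key=lambda ce: ce.free_slots)
        return best.submit(job)

class AnalysisApplication:       # experiment-specific layer
    def __init__(self, wm):
        self.wm = wm
    def analyze(self, dataset):
        return self.wm.run(Job(f"analysis-{dataset}", dataset))

wm = WorkloadManager([ComputeElement("FNAL", 10), ComputeElement("IN2P3", 40)])
app = AnalysisApplication(wm)
print(app.analyze("zmumu"))  # -> "analysis-zmumu queued at IN2P3"
```

The design point the layering makes is that the experiment code never names a site: only the middleware layer knows about concrete compute and storage elements, which is what lets resources be added or removed without touching experiment applications.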
14. Task Forces / Special Assignments
- D0 reconstruction code speedup
  - Qizhong Li will give a talk on this
- SAM at CDF, and common SAM for long-term support at both experiments
  - Gerry Guglielmo assigned for 6 months to help make this happen
- FermiGrid
  - Ruth Pordes will address this in her Grid talk
- CDF online administration brought into offline system administration; D0 in progress now
  - Stephan Lammel leading this effort
- CPU procurements task force / new economic model
  - Steve Wolbers will talk on this
- CDF HV system problems: engineers working on this
- D0 Trigger Database → greater CD involvement in database applications support for both expts
  - To begin in next 2 weeks
- New areas of need? We expect more
15. SAM
- Much progress at CDF (rework of production farm scripts using SAM), offsite usage, some onsite usage
  - But still no cigar
- SAM used by MINOS also; running smoothly at D0
- There have been real problems in the deployment of SAM at CDF
  - Problems both on the SAM team side (attitude, testing, understanding requirements of CDF) and on the CDF experiment side (attitude, inability to articulate requirements, apparent inability to make and follow through on a plan)
- CDF have made some changes: more leadership
- SAM team leadership has changed: Adam Lyon has taken on this job
- Some staff changes were made in the REX department
- Plus, I appointed a person, Gerry Guglielmo (reporting to me), to facilitate making and executing a plan to get SAM working at CDF and in shape for long-term (common) support at CDF and D0
  - Charge
- I will leave it to the experiment talks and the SAM talks to present the current situation, and to the reviewers to determine status and prognosis
16. Grid at Fermilab
- FermiGrid
- Open Science Grid
- CMS Tier 1 center and LHC Computing Grid
- Particle Physics Data Grid project
- SAM-Grid
- iVDGL project
- International Lattice Data Grid
- ...
17. (No transcript)
18. Computing Facilities and Networking
- Thanks to an enormous amount of hard work by Gerry Bellendir and the Facilities Engineering Section, plus tremendous support from the Directorate, we now have a functioning Grid Computing Facility and plans for adding space, power, cooling, and robotics for the next few years
- Anyone want a tour? Or more details? Let me know
- We also have our own network link to Starlight, which is working well; thanks to pressure on DOE/SC and collaboration with ESnet, ANL, NIU and Batavia, plans are in place for a Chicago Metropolitan Area Network between ANL, FNAL and Starlight
- You will hear more about this, and about ongoing WAN R&D efforts, on Thursday
19. Computer Security
- Increasingly hostile internet environment
- Increasing pressure to improve our cyber security program and stance
  - Rewrite of entire program plan
  - Assist visit in November, with penetration testing
- Grid security even more complicated
  - Working to maintain Open Science
- Will be more work for everyone in the next year
20. Summary
- Overall, things are working well, and the collaborations and CD are all working together
- Diminishing resources in the experiments and CD are stressful and require cultural changes
- Common and Grid solutions require more thought and specification of needs, with less emphasis on experiment-driven end-to-end solutions
- Scalability and interoperability are being addressed, although we may still have some surprises
- Computing continues to get cheaper, people more expensive; we have facility infrastructure in place and planned for the future, and will maintain CD staffing for Run II nearly flat
- We are on track for successful support of Run II through 2009 and beyond
21. SAM Special Assignment Charge
- To: Jerry Guglielmo
- From: Vicky White
- Subject: Charge for Special Assignment (Draft 2, 2/24/05)
- I am asking you to carry out a special assignment for a limited period of time, but not longer than through September 30th, 2005. There are two main goals for this assignment.
- The first goal is to get the CDF experiment fully operational, in production mode, using SAM as their data handling system for raw event recording through to individual user analyses on-site and off-site. This needs to be achieved in a manner that uses SAM in an appropriate fashion and integrates well with the storage systems, production farm processing for CDF, and the CAF software.
- The second goal is to assure that both CDF and D0 data handling operations are well documented and understood, and use common installations and operational approaches where feasible and reasonable. This will help the Run II department to support both CDF and D0 data handling in a consistent and efficient way into the future.
- In carrying out this assignment you will report directly to me. You will have no staff reporting to you in this assignment. You will need to work closely with the CDF and D0 Computing Coordinators (Ashutosh and Gustaaf), who will be responsible for experiment-related decisions and requirements, and for getting experiment participants to carry out required work, such as in the experiment framework code, CAF code, or farms environment codes, that is not the responsibility of the Run II department staff nor the SAM project team. You will need to work closely with the management of the Run II department (Amber and Rick), who will be responsible for operational decisions and requirements and for any changes in assignments of Run II staff. In particular, you will need to work closely with the new data handling group leader, Krzysztof Genser, on operational issues. You will need to work closely with the SAM project leader (Adam), who will be responsible for decisions and work plans for the SAM subprojects required in order to succeed on the above goals. You may also need to work closely with the CCF department management and the leaders of the Upper (or possibly Lower) Storage work areas, who will be responsible for decisions and work plans for storage system work.
- The above goals imply a coordinated program of work for the short, medium and long term. The job is therefore one of facilitating the planning, decision-making and program of work necessary to achieve the goals. In order to do this you will need to:
  - Carry out a fact-finding mission. In particular, understand and assess the status of data handling, both architecturally and operationally, at CDF and D0. Highlight potential issues where the requirements are unclear, where implementation decisions may not be clearly related to requirements, where operational practices and decisions may be less than optimal, and where potentially unnecessary differences in approaches have been taken at CDF and D0. Please note, this does not mean that we expect CDF and D0 systems to be exactly alike, or to use all of the components of SAM, dCache, and Enstore in identical manners. We do hope to end up with systems that are appropriate to the requirements and that use the underlying components of data handling and storage in appropriate ways.
  - Make a plan for what needs to be done to achieve the goals. This will involve working closely with all of the stakeholders, department management and project leaders listed above to develop a realistic plan.
  - Work with all of the above stakeholders, project leaders and department management to get resources allocated to the work items needed to achieve the plan. Monitor progress on all this work that needs to come together to achieve the goals. Adjust the plan as necessary.
  - Carry out some parts of the work plan yourself in cases where your expertise and knowledge can be used to maximum benefit to move the whole plan forward rapidly (such as in the online area at CDF and D0).
- You will need to make regular reports to me and to all those involved, as listed above. This should be done via the GDM forum. In cases of dispute, or where we are stalled in moving forward because of a lack of either agreement or resources, you will need to bring the issue to this forum for resolution.
- Jerry, this is not an easy assignment, but it is a terribly important one. I believe that all of the experiment stakeholders, the SAM project leader, the Storage systems people and the Run II department management want this extra assistance to get everything working in an optimal and supportable fashion, and they welcome the energy and dedicated focus that you will bring to this assignment.
- Thank you for taking it on.