Title: Run II Computing Overview


1
Run II Computing Overview
  • Victoria White
  • Head, Computing Division
  • Run II Computing Review
  • September 12, 2005

2
10th Review of Run II Computing
  • 1996 - Started Joint Projects for Run II
  • 1997-2001 - Building the Run II Computing
    software, services, environment
  • Some successes in commonality, but many diverging
    approaches
  • 2002 - Rework of CDF data handling system to use
    Enstore and dcache with a decision to adopt SAM
  • 2002-2005 - Evolution of Run II environment
    towards further commonality, scalability and
    long-term supportability and towards fully
    distributed Grid computing
  • Many great successes - ongoing work
  • Each experiment working (with CD) to get out of a
    couple of holes they have fallen into

3
Run II Computing and Software: A collaborative
endeavor
  • CDF and D0 collaborations
  • Computing Division: Running Experiments
    department and 3 large departments
  • CEPA (Computing and Engineering for Physics
    Analysis)
  • CCF (Computation and Communications Fabric)
  • CSS (Core Support Services)
  • Grid Projects efforts worldwide
  • Community support for some software (e.g. GEANT,
    ROOT)
  • Computing Centers and institutions with computing
    resources outside Fermilab

4
Computing Division Organization

5
Where are we in this endeavor?
  • Successful, basically - physics results are coming
  • Both experiments can record, reconstruct and
    analyze their data in a reliable and timely way.
  • Both can create and store MC (albeit less than
    optimally)
  • Both can make new releases of software.
  • Both make use of Computing Resources outside
    Fermilab (more on this later in Grid talk)
  • Together we are attacking the remaining problem
    areas
  • Together we are implementing (and relying on) a
    Grid Computing model in order to provide adequate
    resources for the next few years

6
Fermilab Budget for Run II Computing
  • $2.6M of capital equipment money in FY05
  • Expect $3M in FY06
  • $400K towards computing facility upgrades
  • Hope this is covered elsewhere in FY06
  • $1M of operating expenses in FY05
  • For tapes, maintenance of machines, robots,
    network equipment, tape drives (WAN costs and
    general facility costs covered elsewhere)
  • 31 FTEs for Run II in REX department
  • A large fraction of the resources for shared
    services and solutions -> support Run II (90
    FTEs)

7
CD Shared Resources - For Services, Development,
R&D
  • Farms procurements and administration
  • Central Storage Systems (Enstore, dcache, SRM)
  • SAM-GRID data handling
  • Networking
  • Wide Area Networking
  • Databases infrastructure and operations
  • Engineering Support
  • Equipment logistics and repair
  • Linux support
  • Monitoring infrastructure
  • Computing Facility operations and planning
  • Budget, ESH, DOE reporting, Admin document
    support
  • Cyber Security -> Grid security
  • Helpdesk, contract management, Windows
    infrastructure
  • Web servers
  • Task forces to help improve performance and
    address special needs
  • Ongoing program of Grid related projects

9
CDF dcache bytes read

10
D0 Reprocessing / SAM-Grid - II
  • SAM-Grid enables a common environment and
    operation scripts as well as effective
    book-keeping
  • JIM's XML-DB used to ease bug tracing and
    provide fast recovery
  • SAM avoids data duplication and defines
    recovery jobs
  • Monitor speed and efficiency, by site or overall
    (http://samgrid.fnal.gov:8080/cgi-bin/plot_efficiency.cgi)
  • Started end of March

11
Strategy and Emphasis of CD
  • Address scalability and reliability issues (as
    they become understood)
  • Requirements are a moving target
  • Common solutions at Fermilab and worldwide
    wherever possible (arguments are weak for
    experiment needs being special)
  • Grid solutions leverage use of computing and
    human resources and assure interoperability with
    LHC experiments
  • Increase efficiency of operations through common
    services, automation, better documentation and
    monitoring
  • Task forces to address specific needs
  • Plan to find a few places to take over tasks

12
Strategy and Emphasis of CD
  • Address scalability and reliability issues (as
    they become understood) - TALKS FROM BOTH EXPTS
  • Requirements are a moving target
  • Common solutions at Fermilab and worldwide
    wherever possible (arguments are weak for
    experiment needs being special) - 2 TALKS FROM CD
    (missing CEPA talk this year on Engineering and
    Physics Tools help)
  • Grid solutions leverage use of computing and
    human resources and assure interoperability with
    LHC experiments - TALK ON GRID STRATEGIES FROM
    RUTH
  • Increase efficiency of operations through common
    services, automation, better documentation and
    monitoring - HOPE THIS WILL EMERGE THROUGHOUT THE
    TALKS
  • Task forces to address specific needs - 2 TALKS
  • Plan to find a few places to take over tasks

13
ROOT Analysis
Production Scripts
Batch Analysis
Experiment User Applications
Event Data Management Selection
Physics Allocations Accounting
Experiment User Interface Frameworks
Experiment Specific
Resource Selection
Workflow Management
Virtual Organization (Physics Group) Administration
Common Middleware Services
Data/File Catalogs Data Handling
Job Queues Workload Mgmt
Information Catalogs Repositories
Data Movement Bandwidth Scheduling
Job Scheduling Priority
Monitoring Information
Security Authorization
Grid Middleware Interfaces
Local Network
Resources
Disks Farms Storage Batch Queues Compute
Elements local Storage Elements
Permanent Storage Storage Elements

14
Task Forces/Special Assignments
  • D0 Reconstruction Code speedup
  • Qizhong Li will give a talk on this
  • SAM at CDF and common SAM for long-term support
    at both experiments
  • Gerry Guglielmo assigned for 6 months to help
    make this happen.
  • FermiGrid
  • Ruth Pordes will address this in her Grid talk
  • CDF online administration brought into offline
    system administration. D0 in progress now
  • Stephan Lammel leading this effort
  • CPU procurements task force - new economic model
  • Steve Wolbers will talk on this
  • CDF HV system problems - engineers working on
    this
  • D0 Trigger Database -> greater CD involvement in
    database applications support for both expts
  • To begin in next 2 weeks
  • ? New areas of need - we expect more

15
SAM
  • Much progress at CDF (rework of production Farm
    scripts using SAM), offsite usage, some onsite
    usage
  • But still no cigar
  • SAM used by MINOS also. Running smoothly at D0.
  • There have been real problems in the deployment
    of SAM at CDF
  • Problems both on SAM team side (attitude,
    testing, understanding requirements of CDF) and
    on CDF experiment side (attitude, inability to
    articulate requirements, apparent inability to
    make and follow through on a plan)
  • CDF have made some changes - more leadership
  • SAM team leadership has changed - Adam Lyon has
    taken on this job
  • Some staff changes were made in the REX
    department
  • Plus - I appointed a person, Gerry Guglielmo,
    (reporting to me) to facilitate making and
    executing a plan to get SAM working at CDF and in
    shape for long term (common) support at CDF and
    D0
  • Charge
  • I will leave it to the experiment talks and the
    SAM talks to present the current situation and to
    the reviewers to determine status and prognosis

16
Grid at Fermilab
  • FermiGrid
  • Open Science Grid
  • CMS Tier 1 center and LHC Computing Grid
  • Particle Physics Data Grid project
  • SAM-Grid
  • iVDGL project
  • International Lattice Data Grid
  • ...

17
(No Transcript)

18
Computing Facilities and Networking
  • Thanks to an enormous amount of hard work by
    Gerry Bellendir and the Facilities Engineering
    Section plus tremendous support from the
    Directorate we now have a functioning Grid
    Computing Facility and plans for adding space,
    power, cooling, robotics for the next few years
  • Anyone want a tour? Or more details? Let me
    know
  • We also have our own network link to Starlight,
    which is working well, and thanks to pressure on
    DOE/SC and collaboration with ESNet, ANL, NIU and
    Batavia, plans are in place for a Chicago
    Metropolitan Area Network between ANL, FNAL and
    Starlight
  • You will hear more about this and about ongoing
    WAN R&D efforts on Thursday

19
Computer Security
  • Increasingly hostile internet environment
  • Increasing pressure to improve our cyber security
    program and stance
  • Rewrite of entire program plan
  • Assist visit in November with penetration
    testing
  • Grid Security even more complicated
  • Working to maintain Open Science
  • Will be more work for everyone in the next year

20
Summary
  • Overall, things are working well and
    collaborations and CD are all working together
  • Diminishing resources in experiments and CD are
    stressful and require cultural changes
  • Common and Grid solutions require more thought
    and specification of needs, with less emphasis
    on experiment-driven end-to-end solutions
  • Scalability and interoperability are being
    addressed, although we may still have some
    surprises
  • Computing continues to get cheaper, people more
    expensive - we have facility infrastructure in
    place and planned for the future and will
    maintain CD staff for Run II nearly flat
  • We are on track for successful support of Run
    II through 2009 and beyond

21
SAM Special assignment charge
  • To: Jerry Guglielmo
  • From: Vicky White
  • Subject: Charge for Special Assignment (Draft 2,
    2/24/05)
  • I am asking you to carry out a special assignment
    for a limited period of time, but not longer than
    through September 30th 2005. There are two main
    goals for this assignment.
  • The first goal is to get the CDF experiment fully
    operational, in production mode, using SAM as
    their data handling system for raw event
    recording through to individual user analyses
    on-site and off-site. This needs to be achieved
    in a manner that uses SAM in an appropriate
    fashion and integrates well with the storage
    systems, production farm processing for CDF and
    the CAF software.
  • The second goal is to assure that both CDF and D0
    data handling operations are well documented and
    understood and use common installations and
    operational approaches where feasible and
    reasonable. This will help the Run II department
    to support both CDF and D0 data handling in a
    consistent and efficient way into the future.
  • In carrying out this assignment you will report
    directly to me. You will have no staff reporting
    to you in this assignment. You will need to work
    closely with the CDF and D0 Computing
    Coordinators (Ashutosh and Gustaaf), who will be
    responsible for experiment-related decisions and
    requirements and for getting experiment
    participants to carry out work required such as
    in the experiment framework code, CAF code, or
    Farms environment codes that are not the
    responsibility of the Run II department staff nor
    the SAM project team. You will need to work
    closely with the management of the Run II
    department (Amber and Rick) who will be
    responsible for operational decisions and
    requirements and for any changes in assignments
    of Run II staff. In particular you will need to
    work closely with the new data handling group
    leader Krzysztof Genser on operational issues.
    You will need to work closely with the SAM
    project leader (Adam) who will be responsible for
    decisions and work plans for SAM subprojects
    required in order to succeed on the above goals.
    You may need also to work closely with the CCF
    department management and the leaders of the
    Upper (or possibly Lower) Storage work areas, who
    will be responsible for decisions and work plans
    for storage system work.
  • The above goals imply a coordinated program of
    work for the short, medium and long term. The job
    is therefore one of facilitating the planning,
    decision-making and program of work necessary to
    achieve the goals.
  • In order to do this you will need to:
  • Carry out a fact-finding mission. In particular
    understand and assess the status of data handling
    both architecturally and operationally at CDF and
    D0. Highlight potential issues where the
    requirements are unclear, where implementation
    decisions may not be clearly related to
    requirements, where operational practices and
    decisions may be less than optimal, and where
    potentially unnecessary differences in approaches
    have been taken at CDF and D0. Please note, this
    does not mean that we expect CDF and D0 systems
    to be exactly alike or use all of the components
    of SAM, dcache, and Enstore in identical manners.
    We do hope to end up with systems that are
    appropriate to the requirements and that use the
    underlying components of data handling and
    storage in appropriate ways.
  • Make a plan for what needs to be done to achieve
    the goals. This will involve working closely
    with all of the stakeholders, department
    management and project leaders listed above to
    develop a realistic plan.
  • Work with all of the above stakeholders, project
    leaders and department management to get
    resources allocated on the work items needed to
    achieve the plan. Monitor progress on all this
    work that needs to come together to achieve the
    goals. Adjust the plan as necessary.
  • Carry out some parts of the work plan yourself in
    cases where your expertise and knowledge can be
    used to maximum benefit to move the whole plan
    forward rapidly (such as in the online area at
    CDF and D0).
  • You will need to make regular reports to me and
    all those involved as listed above. This should
    be done via the GDM forum. In cases of dispute
    or where we are stalled in moving forward because
    of lack of either agreement or resources you will
    need to bring the issue to this forum for
    resolution.
  • Jerry, this is not an easy assignment, but it is
    a terribly important one. I believe that all of
    the experiment stakeholders, the SAM project
    leader, the Storage systems people and the Run II
    department management all want this extra
    assistance to get everything working in an
    optimal and supportable fashion and they welcome
    the energy and dedicated focus that you will
    bring to this assignment.
  • Thank you for taking it on.