D0 Grid Data Production Initiative: Phase 1 to Phase 2 Transition

1
D0 Grid Data Production Initiative: Phase 1 to Phase 2 Transition
  • Version 1.3
  • (v1.0 presented to D0 Spokes, CD Mgmt 06
    February 2009)
  • Presented to D0 CPB 27 February 2009
  • Rob Kennedy and Adam Lyon

2
Outline
  • Background (historical reference)
  • Overview, Major Issues, Roadmap
  • Phase 1 Summary
  • Work Done and Outcome (as seen with more
    experience)
  • Assessment
  • Capacity Model, cpu/event = f(L)
  • Phase 2 Work Plan
  • Work List Outline
  • Capacity Timeline skeleton draft

3
Initiative Overview (Sep 2008 presentation with updates in Green)
  • Initiative is an Umbrella Project to achieve a broad set of goals
  • Scope: D0 Grid Data Production (taking MC Production into consideration)
  • Charge:
  • Evaluate D0 Grid Data Production, especially Resource Utilization, by end Sep 08. DONE
  • Create and execute a Work Plan to achieve the goal. Phase 1 DONE
  • Goal: Stable Grid Data Production operations that efficiently utilize the resources available. DONE for conditions at the beginning of the Initiative. Phase 2 to address evolving conditions.
  • Constraints: Achieve improvements ASARP. No explicit end date or staff level limits set.
  • Initiative Team (Execution Phase): October 2008 to present
  • Project Manager: Rob Kennedy (CD OPMQA)
  • Project Co-Manager: Adam Lyon (D0 Collab and CD SCF/REX/PS Group Leader)
  • Communication with broad set of stakeholders: weekly meeting, Thursdays at 9am
  • D0 Production Coordinators: Mike Diesburg, Joel Snow. D0 Collaborators: Chip Brock, Qizhong Li
  • CD: FermiGrid Svcs (Steve Timm, Keith Chadwick), SAM-Grid Dev (Gabriele Garzoglio, Parag Mhashilkar, Andrew Baranovski), REX (Robert Illingworth, Joe Boyd), SCF Mgmt (Margaret Votava, Eileen Berman), Fermi Expt Facilities (Jason Allen, Glenn Cooper)
  • OSG: Abhishek Singh Rana
  • Documentation Home: http://d0db-prd.fnal.gov/rexipedia/common/SAMGridD0/GDPEval

4
Major Issues (Sep 2008 presentation)
  • Resource Utilization is lower than expected for a production system (the motivating concern).
  • CPUs allotted to Data Production use are not kept busy, even though there are jobs and data available to be run.
  • Causes: Shallow queues must be refilled often. Something is leading to slow filling of open CPU slots.
  • D0 Grid System Uptime and First-Time Success Rates are lower than expected for production.
  • Leads to re-running of jobs and/or manual checking of job records to determine success/failure.
  • Causes: Grid Batch System bugs (some known to be fixed in Condor 7), Context Event Server failures, ...
  • D0 Grid System requires too much effort for the customer (D0 Production Coordinator) to use.
  • Hours per day spent looking at failed jobs, or checking whether jobs failed. 1-2 touches per day to keep queues full (with scripts).
  • Sum of the above significantly reduces the MEvents/day that D0 actually reconstructs.
  • Mike Diesburg estimates (Sep 2008), confirmed by historical record, BEFORE the Initiative:
  • Max capacity of current system: 10 MEvents/day (million events per day).
  • Realistic sustained level: 8-9 MEvents/day. We expect about 10% endemic inefficiency due to issues not worth our fixing, like internal latencies, facility power outages, hardware failure recovery.
  • Observed sustained level: 5.2 MEvents/day, i.e. 60-65% of the expected value (see the sketch after this list).
  • Absolute numbers are not the focus as yet; rather, all agree the ratio is unacceptably low.
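A minimal back-of-envelope check of that ratio, using only the estimates quoted above (Python sketch for illustration; the variable names are ours, not from the deck):

  # Deck's Sep 2008 estimates, in MEvents/day
  max_capacity = 10.0       # theoretical max of the current system
  realistic = (8.0, 9.0)    # sustained level after ~10% endemic inefficiency
  observed = 5.2            # sustained level actually observed

  lo, hi = observed / realistic[1], observed / realistic[0]
  print(f"observed / realistic sustained: {lo:.0%} to {hi:.0%}")  # roughly 58% to 65%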

5
Roadmap (Sep 2008 presentation with updates in Green)
  • September 2008: Planning DONE
  • Rob Kennedy, working with Adam Lyon, charged by
    Vicky White to lead effort to pursue this.
  • First stage is to list, understand, and
    prioritize the problems and the work in progress.
  • Next, develop a broad coarse-grained plan to
    address issues and improve the efficiency.
  • October 2008 to December 2008: Phase 1 of the Initiative DONE
  • 1.1. Server Expansion and Decoupling Data/MC
    Production at Services
  • 1.2. Condor 7.0 Upgrade and Support
  • 1.3. Small Quick Wins
  • 1.4. Metrics
  • Follow-up on newly exposed issues as revealed, e.g. installer products, Fcpd upgrade, restart script fix
  • January 2009: Formal Re-Assessment with a long-term mindset DONE
  • Re-assess against metrics, downtime cause categorization, D0/CD staff-time in ops. Re-prioritize issues.
  • Capacity Management determined to be the primary
    theme for Phase 2 work.
  • Plan new work for the next layer of issues
    revealed. Ready to tackle MC Production-specific
    issues as well?
  • February 2009 to April 2009: Phase 2. Finish long lead-time work; treat the next layer.
  • Some work for Data Production is constrained to execute in 2009, e.g. applying virtualization.

6
Phase 1 Summary
  • Work Done: Add Servers, Decoupling of Data/MC Prod at Services, Condor 7 Upgrade (Grid Batch System layer)
  • Add 4th and 5th Forwarding Node, 2nd Queuing Node. Add new SAM Station and Context Server host. Document and productize installation procedures. Configured to decouple Data and MC at Fwd, Que Services.
  • Condor 7 is a major improvement! Several major issues fixed. More predictable behavior and latencies.
  • Outcome: Successful Pre-Thanksgiving Deployment
  • Mike D.: Dec/Jan Holidays was one of the least eventful periods ever.
  • Smooth enough now that we have begun testing hand-off of day-to-day coordination, with Mike D. oversight.
  • Numerous Operations issues resolved. Resource Utilization improved, reached goal.
  • Periodic Expressions 1/day hang cured. No more "Death Spirals" leading to downtimes.
  • Job Slot Utilization and CPU-time/Wall-time > 95% (in smooth operation). Confirmed over time. Success!
  • January 2009: some next-layer issues seen.
  • Events Processed per Day not really improved.
  • Increase in Tevatron Luminosity suspected... Confirmed.
  • CPU-time per Job increasing rapidly... Confirmed.
  • We have seen a 2X increase from Oct 08 to Dec 08/Jan 09!
  • 8E6 events/day goal was appropriate for lower luminosity.
  • Note: 1 month delay from data logging to production.

7
Assessment (January 2009)
  • Main focus: Understanding Events Produced per Day
  • Calculate the expected production rate from the existing system
  • Cpu/event with current Reco version = f(L)
  • Cpu power in Data Production queue
  • Luminosity increase in the Tevatron is the major driver of reduced production output
  • Consider the environment as well
  • Recent shutdown led to detector fixes. More good data per event -> more CPU/event (small effect)
  • Modest increase in CPU/event in new Reco version at higher luminosity (small effect)
  • Check CPU overheads (setting up, starting Reco) vs. Reco CPU consumption (small-ish effect)
  • Observe and compare system performance during smooth multi-day periods
  • Develop a Phase 2 Work Plan
  • Observation: Data Production is falling behind Data Logging now.
  • This is our top priority to address: understand what CAN be done and report to D0 for their planning.
  • Capacity increase options being explored, as well as impact on infrastructure, configuration
  • Model development continues to ensure no hidden inefficiencies at the 10% level.
  • Consensus: last effort to reduce cpu consumption by D0 Reco -> no room for improvement

8
Plots: Efficiency, cpu/evt = f(L)
  • THIS IS TEXT for the next three plot slides
  • Are there hidden inefficiencies? PBS Job Efficiency (CPU Use), from Mike D. AVAILABLE HERE.
  • Time base is the date that data was processed, not the date that data was recorded.
  • Job Efficiency = Run-time / (Run-time + Overheads)
  • After Phase 1 Deployment, the metric is at 95%... Very good!
  • Does not take into account the following:
  • Jobs that started, had data, but failed (1% effect). Nodes which are down (1% effect).
  • Merge jobs included in this (2% effect). Jobs that do not really start due to data delivery failure (1% effect).
  • Overall Duty Cycle (95%) to account for planned/unplanned downtimes.
  • For long-term planning: use 85-90% CPU efficiency (CPU cycles available that are used on Reco). Still, very good.
  • Execution Time = f(L), from Mike D. AVAILABLE HERE.
  • This is for the current version of Reco (previously was for the old version). Some increase in CPU used, perhaps, at higher L.
  • Also, detector improvements after shutdown -> more good data/event, more combinatorics -> more CPU/event.
  • GOOD FOR PHYSICS! But a challenge for the Reconstruction Farm.
  • Average Initial Luminosity, from Mike D. AVAILABLE HERE.
  • We appear to be around L = 165E30 nowadays. Combining this with an Execution Time of about 60 cpu-sec/event gives:
  • ~6 MEvents/day theoretically, and in the same time period, 5.1 MEvents/day observed under the same conditions (see the sketch after this list).
  • Given the width and uncertainty in the measurements above, we cannot say these two numbers are different.
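A minimal sketch of the capacity relation behind these numbers (Python, for illustration; the function and variable names are ours, while the ~60 cpu-sec/event, the 85-90% planning efficiency, and the ~6 vs. 5.1 MEvents/day figures come from the deck):

  def expected_events_per_day(cpu_sec_delivered_per_day, cpu_sec_per_event):
      """Theoretical reconstruction rate before efficiency corrections."""
      return cpu_sec_delivered_per_day / cpu_sec_per_event

  # ~6 MEvents/day theoretical at ~60 cpu-sec/event implies an effective
  # CPU budget of roughly 6e6 * 60 = 3.6e8 cpu-sec/day delivered to Reco.
  implied_budget = 6e6 * 60

  # Applying the 85-90% long-term planning efficiency brings this to about
  # 5.1-5.4 MEvents/day, consistent with the 5.1 MEvents/day observed.
  for eff in (0.85, 0.90):
      rate = expected_events_per_day(implied_budget, 60) * eff
      print(f"efficiency {eff:.0%}: {rate / 1e6:.1f} MEvents/day")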

9
PBS Job Efficiency (CPU Use)
[Plot. Annotations: smooth operation today >95%; one ops issue marked; no major downtimes after Phase 1 deploy.]
10
Execution Time = f(L) (initial luminosity at begin of run, not at begin of store)
[Plot. Annotations: past ~30 sec/evt; now ~60 sec/evt; eventually 120 sec/evt??? (watch, have plan in place).]
11
Average Initial Luminosity (initial luminosity at begin of run, not at begin of store)
[Plot. Annotations: long-term bracket?; now ~60 sec/evt.]
12
Phase 2 Work List Outline
  • 2.1 Capacity Management: Data Prod is not keeping up with data logging.
  • Capacity Planning Model: nEvents per Day forecast -> CPU needed (see the sketch after this list)
  • Capacity Deployment: Procure, acquire, borrow CPU. We believe the infrastructure is capable.
  • Resource Utilization: Use what we have as much as possible. Maintain improvements.
  • 2.2 Availability and Continuity Management: Expanded system needs higher reliability
  • Decoupling deferred. Phase 1 work has proven sufficient for the near term.
  • Stability, Reduced Effort: Deeper queues. Goal is fewer manual submissions per week.
  • Resilience: Add/improve redundancy at infrastructure service and CAB level.
  • Configuration Recovery: Capture configuration and artefacts in CVS consistently.
  • 2.3 Operations-Driven Projects
  • Monitoring: Execute a workshop to share what we have, identify gaps and cost/benefits.
  • Issues: Address the "stuck state" issue affecting both Data and MC Production
  • Features: Add state at the queuing node (from Phase 1). Distribute jobs evenly across FWD nodes.
  • Processes: Enable REX/Ops to deploy new Condor; new bug fixes coming soon.
  • Phase 1 Follow-up: A few minor tasks remain from the rush to deploy; dot the i's and cross the t's.
  • Deferred Work List: maintain, with reasons for deferring work.
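A minimal sketch of the Capacity Planning Model bullet above, inverting the capacity relation to forecast the CPU needed for a keep-up target (Python, for illustration; the target rate is a hypothetical assumption, while the ~60 cpu-sec/event and the 85-90% / 95% planning figures come from the deck):

  def cpu_sec_per_day_needed(target_events_per_day, cpu_sec_per_event,
                             cpu_efficiency=0.875, duty_cycle=0.95):
      """CPU budget (cpu-sec/day) required to reconstruct the target event rate."""
      return target_events_per_day * cpu_sec_per_event / (cpu_efficiency * duty_cycle)

  # Hypothetical keep-up target of 8 MEvents/day at ~60 cpu-sec/event:
  need = cpu_sec_per_day_needed(8e6, 60)
  print(f"{need:.2e} cpu-sec/day needed")
  # Converting this budget to physical job slots depends on per-core speed
  # relative to the reference CPU behind the cpu-sec/event measurement.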

13
Data Flow for Data Production
[Diagram: data flow, steps 0-7, among Enstore LTO4-G (Raw Data), SAM Cache (d0srv071, d0srv072), Worker Node scratch space (Unmerged TMB), Durable Store / Stager Space (d0srv063, d0srv065), and Enstore LTO4-F (Merged TMB). Tarball transfers initiated by the Reco Job; upload of merged output initiated by the Merge Job, via gridftp. Other data destined for tape storage includes In2p3 remote uploads. Notes: cache nodes also hold the 0-bias skim and LCG cache; Durable Storage and Stager Space are on separate partitions, shared with Analysis users; no automated failover between d0srv063 and d0srv065.]
14
Capacity Timeline: Working Draft
  • March to April 2009: Keep-Up Level + Work-through-Backlog Level
  • Added 115 old, slow, retired CDF worker nodes. D0Farm goes from 1600 to 1814 slots (as of 26 Feb 2009).
  • Upgrade PBS head nodes (FEF) during March 10 downtime. Last infrastructure improvement needed.
  • All CAB2 analysis nodes for use by Data Production, March 10 to May 1 (or until another end condition is met).
  • Work through the 178 MEvt backlog (less 1 week). A backlog has been there, BUT NOW we can REALLY do something (see the sketch after this list).
  • Scale up in steps quickly, to be sure the infrastructure can handle the load and to avoid wasting graciously allocated resources.
  • Exploit more opportunistic use of other-VO CPU during this same time period.
  • Purchase Req out in late March for more CPUs; they will be in service towards the end of summer.
  • (End April: end of Initiative. Task mgmt passes to existing CD groups. Close-out process in May 2009.)
  • May to July 2009: not Keep-Up Level. GAP TO BE FILLED.
  • Downsize the system as analysis CPU is returned; less opportunistic CPU available.
  • May develop a backlog again, but too late anyway to fully process for summer conferences.
  • New CPU may arrive in July, but will have to be burned in, infrastructure tested, etc.
  • Purchase Req out in summer for more infrastructure servers (if need proven).
  • August to December 2009: Keep-Up Level (+ headroom?)
  • Add CPU and infrastructure (from procurement) to support a long-term keep-up system.
  • Make up the backlog from May through June 2009 for winter conferences.
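A rough sketch of the work-through-backlog arithmetic above (Python, for illustration; the 178 MEvt backlog is from the deck, while the logging and expanded production rates are hypothetical placeholders):

  def weeks_to_clear(backlog_mevents, production_mevt_per_day, logging_mevt_per_day):
      """Weeks to clear a backlog while new data keeps arriving."""
      net = production_mevt_per_day - logging_mevt_per_day
      if net <= 0:
          raise ValueError("production must exceed logging to shrink the backlog")
      return backlog_mevents / net / 7

  # Example: 178 MEvt backlog, hypothetical 8 MEvt/day logged and 14 MEvt/day
  # produced with the temporarily expanded CAB2 farm -> about 4 weeks,
  # comfortably inside the March 10 to May 1 window.
  print(f"{weeks_to_clear(178, 14, 8):.1f} weeks")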

15
CAB2 Temp Expanded Use
  • Early March to April 2009: Keep-Up Level + Work-through-Backlog Level
  • Temp Expanded CAB2 Use by Data Production, 2/20/2009 via email:
  • "Regarding temporarily using the whole CAB2 for the production, D0 management has made a decision that from March 10, we will temporarily expand the d0 farm queue to be the whole CAB2. The purpose is to catch up the backlog in data production for the summer conference. This configuration is temporary. We will change it back to the current configuration when one of the following conditions happens:
    - when the backlog has been reduced to be less than one week of data, or
    - May 1, 2009, or
    - when there is an analysis need for more CPUs than CAB1 can provide.
    Although the configuration change will be done by FEF (thanks to FEF!), the SamGrid team may need to plan to adjust related parameters to handle a much larger production farm. The current d0 farm queue has 1800 job slots. The new d0 farm queue will have 1800 + 1400 job slots, temporarily. Thank you, Qizhong"

16
Next Steps, Conclusion
  • Conclusion: Phase 1 succeeded. Accommodate Tevatron success in Phase 2.
  • The D0 Grid Data Production system is certainly more stable than before.
  • Improvement in resource utilization metrics appears genuine.
  • The next layer of operations issues is addressable; we can improve even further.
  • Next Steps
  • Phase 2: Develop and implement a viable short-term and a draft long-term Capacity Plan.
  • And do so without losing the gains in stability and resource utilization achieved so far.
  • Work through the event backlog with loaned CAB2 slots.
  • Continue work on stability, resilience, optimal decoupled configuration, monitoring.
  • Take care, though: service scale-ups like this have revealed new weaknesses and behaviors.
  • Further Steps towards Maturing the D0 Grid Production System as a Service
  • More robust, capable, and manageable system requiring less effort to use.
  • Enable Service Management functions: Capacity Planning, Managed Growth.
  • Capacity Management can sensibly lead to a more
    formal statement of service levels.