Title: D0 Grid Data Production Initiative: Phase 1 to Phase 2 Transition
1. D0 Grid Data Production Initiative: Phase 1 to Phase 2 Transition
- Version 1.3
- (v1.0 presented to D0 Spokes, CD Mgmt, 06 February 2009)
- Presented to D0 CPB, 27 February 2009
- Rob Kennedy and Adam Lyon
2. Outline
- Background (historical reference)
- Overview, Major Issues, Roadmap
- Phase 1 Summary
- Work Done and Outcome (as seen with more experience)
- Assessment
- Capacity Model, CPU/event = f(L)
- Phase 2 Work Plan
- Work List Outline
- Capacity Timeline skeleton draft
3. Initiative Overview (Sep 2008 presentation with updates in green)
- Initiative is an Umbrella Project to achieve a broad set of goals
- Scope: D0 Grid Data Production (taking MC Production into consideration)
- Charge:
- Evaluate D0 Grid Data Production, especially Resource Utilization, by end of Sep 08. DONE
- Create and execute a Work Plan to achieve the goal. Phase 1 DONE
- Goal: Stable Grid Data Production operations that efficiently utilize the resources available. DONE for the conditions at the beginning of the Initiative; Phase 2 to address evolving conditions.
- Constraints: Achieve improvements ASARP. No explicit end date or staff level limits set.
- Initiative Team (Execution Phase): October 2008 - present
- Project Manager: Rob Kennedy (CD OPMQA)
- Project Co-Manager: Adam Lyon (D0 Collab and CD SCF/REX/PS Group Leader)
- Communication with a broad set of stakeholders: weekly meeting, Thursdays at 9am
- D0 Production Coordinators: Mike Diesburg, Joel Snow. D0 Collaborators: Chip Brock, Qizhong Li
- CD: FermiGrid Svcs (Steve Timm, Keith Chadwick), SAM-Grid Dev (Gabriele Garzoglio, Parag Mhashilkar, Andrew Baranovski), REX (Robert Illingworth, Joe Boyd), SCF Mgmt (Margaret Votava, Eileen Berman), Fermi Expt Facilities (Jason Allen, Glenn Cooper)
- OSG: Abhishek Singh Rana
- Documentation Home: http://d0db-prd.fnal.gov/rexipedia/common/SAMGridD0/GDPEval
4. Major Issues (Sep 2008 presentation)
- Resource Utilization is lower than expected for a production system (the motivating concern).
- CPUs allotted to Data Production are not kept busy, even though there are jobs whose data are available to be run.
- Causes: Shallow queues must be refilled often. Something is leading to slow filling of open CPU slots.
- D0 Grid System Uptime and First-Time Success Rates are lower than expected for production.
- Leads to re-running of jobs and/or manual checking of job records to determine success/failure.
- Causes: Grid Batch System bugs (some known to be fixed in Condor 7), Context Server failures, ...
- D0 Grid System requires too much effort for the customer (D0 Production Coordinator) to use.
- Hours per day spent looking at failed jobs or checking whether jobs failed. 1-2 touches per day to keep queues full (with scripts).
- The sum of the above significantly reduces the MEvents/day that D0 actually reconstructs.
- Mike Diesburg's estimates (Sep 2008), confirmed by the historical record, BEFORE the Initiative:
- Max capacity of the current system: 10 MEvents/day (million events per day).
- Realistic sustained level: 8-9 MEvents/day. We expect about 10% endemic inefficiency due to issues not worth fixing, like internal latencies, facility power outages, and hardware failure recovery.
- Observed sustained level: 5.2 MEvents/day, i.e. 60-65% of the expected value (see the arithmetic sketch below).
- Absolute numbers are not the focus as yet; rather, the ratio is agreed by all to be unacceptably low.
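A quick arithmetic sketch of the ratio quoted above, in Python; the only inputs are the 8-9 and 5.2 MEvents/day figures from this slide:

    # Observed vs. expected sustained throughput (Sep 2008 estimates).
    expected_low, expected_high = 8.0, 9.0   # MEvents/day, realistic sustained level
    observed = 5.2                           # MEvents/day, observed sustained level
    print(f"{observed / expected_high:.0%} - {observed / expected_low:.0%}")
    # prints roughly 58% - 65%, consistent with the "60-65% of expected" quoted above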
5. Roadmap (Sep 2008 presentation with updates in green)
- September 2008: Planning. DONE
- Rob Kennedy, working with Adam Lyon, charged by Vicky White to lead the effort to pursue this.
- First stage is to list, understand, and prioritize the problems and the work in progress.
- Next, develop a broad, coarse-grained plan to address issues and improve the efficiency.
- October 2008 - December 2008: Phase 1 of the Initiative. DONE
- 1.1. Server Expansion and Decoupling Data/MC Production at Services
- 1.2. Condor 7.0 Upgrade and Support
- 1.3. Small Quick Wins
- 1.4. Metrics
- Follow up on newly exposed issues as revealed, e.g. installer products, Fcpd upgrade, restart script fix
- January 2009: Formal re-assessment with a long-term mindset. DONE
- Re-assess against metrics, downtime cause categorization, D0/CD staff time in operations. Re-prioritize issues.
- Capacity Management determined to be the primary theme for Phase 2 work.
- Plan new work for the next layer of issues revealed. Ready to tackle MC Production-specific issues as well?
- February 2009 - April 2009: Phase 2. Finish long lead-time work; treat the next layer of issues.
- Some work for Data Production is constrained to execute in 2009, e.g. applying virtualization.
6. Phase 1 Summary
- Work Done: Add servers, decouple Data/MC Production at services, Condor 7 upgrade (Grid Batch System layer)
- Added 4th and 5th Forwarding Nodes and a 2nd Queuing Node. Added a new SAM Station and Context Server host. Documented and productized installation procedures. Configured to decouple Data and MC at the Forwarding and Queuing services.
- Condor 7 is a major improvement! Several major issues fixed. More predictable behavior and latencies.
- Outcome: Successful pre-Thanksgiving deployment
- Mike D.: the Dec/Jan holidays were one of the least eventful periods ever.
- Smooth enough now that we have begun testing hand-off of day-to-day coordination, with Mike D. oversight.
- Numerous operations issues resolved. Resource Utilization improved and reached its goal.
- Periodic Expressions hang (about 1/day) cured. No more "death spirals" leading to downtimes.
- Job Slot Utilization and CPU-time/Wall-time > 95% (in smooth operation). Confirmed over time. Success!
- January 2009: some next-layer issues seen.
- Events Processed per Day not really improved.
- Increase in Tevatron luminosity suspected... Confirmed.
- CPU-time per job increasing rapidly... Confirmed.
- We have seen a 2X increase from Oct 08 to Dec 08/Jan 09!
- The 8E6 events/day goal was appropriate for lower luminosity.
- Note: 1 month delay from data logging to production.
7. Assessment (January 2009)
- Main focus: understanding Events Produced per Day
- Calculate the expected production rate from the existing system (a sketch of this model follows below)
- CPU/event with current Reco version = f(L)
- CPU power in the Data Production queue
- Luminosity increase in the Tevatron is the major driver of reduced production output
- Consider the environment as well
- Recent shutdown led to detector fixes. More good data per event means more CPU/event (small effect)
- Modest increase in CPU/event in the new Reco version at higher luminosity (small effect)
- Check CPU overheads (setting up, starting Reco) vs. Reco CPU consumption (small-ish effect)
- Observe and compare system performance during smooth multi-day periods
- Develop a Phase 2 Work Plan
- Observation: Data Production is falling behind Data Logging now.
- This is our top priority to address: understand what CAN be done and report to D0 for their planning.
- Capacity increase options being explored, as well as impact on infrastructure and configuration
- Model development continues to ensure there are no hidden inefficiencies at the 10% level.
- Consensus from the last effort to reduce CPU consumption by D0 Reco: no room for improvement
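A minimal sketch of the capacity model described on this slide, written in Python. All inputs below are illustrative placeholders (in particular, the normalization of CPU-sec/event to the farm's actual processors is not given in these slides), so the output is not meant to reproduce the ~6 MEvents/day figure quoted on the next slide:

    # Hedged sketch: expected production rate from slot count, CPU/event = f(L), and efficiency.
    def expected_mevents_per_day(slots, cpu_sec_per_event_ref, speed_factor, cpu_efficiency):
        # slots                 : job slots dedicated to Data Production
        # cpu_sec_per_event_ref : Reco CPU-sec/event at the current luminosity, on the
        #                         reference CPU used for the f(L) measurement
        # speed_factor          : average farm-node speed relative to that reference CPU (assumed)
        # cpu_efficiency        : fraction of available cycles actually spent in Reco
        cpu_sec_per_event_farm = cpu_sec_per_event_ref / speed_factor
        return slots * 86400 * cpu_efficiency / cpu_sec_per_event_farm / 1e6  # MEvents/day

    # Hypothetical inputs, for illustration only:
    print(expected_mevents_per_day(slots=1600, cpu_sec_per_event_ref=60.0,
                                   speed_factor=2.0, cpu_efficiency=0.85))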
8. Plots: Efficiency, CPU/evt = f(L)
- THIS IS TEXT for the next three plot slides
- Are there hidden inefficiencies? PBS Job Efficiency (CPU use), from Mike D. AVAILABLE HERE.
- Time base is the date the data was processed, not the date the data was recorded.
- Job Efficiency = Run-time / (Run-time + Overheads)
- After the Phase 1 deployment, this metric is at 95%... Very good!
- Does not take into account the following:
- Jobs that started, had data, but failed (1% effect). Nodes which are down (1% effect).
- Merge jobs included in this (2% effect). Jobs that do not really start due to data delivery failure (1% effect).
- Overall duty cycle (95%) to account for planned/unplanned downtimes.
- For long-term planning: use 85-90% CPU efficiency (fraction of available CPU cycles used on Reco). Still, very good. (See the combination sketch below.)
- Execution Time = f(L), from Mike D. AVAILABLE HERE.
- This is for the current version of Reco (previously was for the old version). Some increase in CPU used, perhaps, at higher L.
- Also, detector improvements after the shutdown → more good data/event, more combinatorics → more CPU/event.
- GOOD FOR PHYSICS! But a challenge for the Reconstruction Farm.
- Average Initial Luminosity, from Mike D. AVAILABLE HERE.
- We appear to be around L = 165E30 nowadays. Combining this with the execution time: about 60 CPU-sec/event, which gives
- ~6 MEvents/day theoretically, while in the same time period 5.1 MEvents/day was observed under the same conditions.
- Given the width and uncertainty in the measurements above, we cannot say these two numbers are different.
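The derating factors listed above can be folded into a single planning efficiency. Treating them as independent multiplicative losses is an assumption made here for illustration; the slide itself only lists the individual effects:

    # Combine the PBS job efficiency with the effects that metric does not capture (slide values).
    job_efficiency = 0.95   # run-time / (run-time + overheads), post-Phase-1
    duty_cycle     = 0.95   # planned/unplanned downtimes
    losses = [0.01,         # jobs that started, had data, but failed
              0.01,         # worker nodes that are down
              0.02,         # merge jobs included in the metric
              0.01]         # jobs that never really start (data delivery failure)

    planning_eff = job_efficiency * duty_cycle
    for loss in losses:
        planning_eff *= (1.0 - loss)
    print(f"{planning_eff:.0%}")   # about 86%, consistent with the 85-90% planning figure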
9. PBS Job Efficiency (CPU Use)
[Plot: PBS job efficiency vs. time. Annotations: smooth operation today > 95%; one ops issue; no major downtimes after the Phase 1 deployment.]
10. Execution Time = f(L) (initial luminosity at the beginning of the run, not of the store)
[Plot: Reco execution time vs. initial luminosity. Annotations: past ~30 sec/evt; now ~60 sec/evt; eventually 120 sec/evt??? (watch, have a plan in place).]
11. Average Initial Luminosity (initial luminosity at the beginning of the run, not of the store)
[Plot: average initial luminosity vs. time. Annotations: long-term bracket?; marker for "now: ~60 sec/evt".]
12. Phase 2 Work List Outline
- 2.1 Capacity Management: Data Production is not keeping up with data logging.
- Capacity Planning: Model nEvents per day, forecast CPU needed.
- Capacity Deployment: Procure, acquire, or borrow CPU. We believe the infrastructure is capable.
- Resource Utilization: Use what we have as much as possible. Maintain improvements.
- 2.2 Availability and Continuity Management: The expanded system needs higher reliability.
- Decoupling: deferred. Phase 1 work has proven sufficient for the near term.
- Stability, Reduced Effort: Deeper queues. Goal is fewer manual submissions per week.
- Resilience: Add/improve redundancy at the infrastructure service and CAB level.
- Configuration Recovery: Capture configuration and artefacts in CVS consistently.
- 2.3 Operations-Driven Projects
- Monitoring: Execute a workshop to share what we have; identify gaps and cost/benefits.
- Issues: Address the "stuck state" issue affecting both Data and MC Production.
- Features: Add state at the queuing node (from Phase 1). Distribute jobs evenly across forwarding nodes.
- Processes: Enable REX/Ops to deploy new Condor versions; new bug fixes coming soon.
- Phase 1 Follow-up: A few minor tasks remain from the rush to deploy; dot the i's and cross the t's.
- Deferred Work List: maintain with reasons for deferring work.
13. Data Flow for Data Production
[Diagram: data flow for Data Production, steps 0-7. Elements: tarballs (transfer initiated by the Reco job); Enstore LTO4-G; SAM Cache (d0srv071, d0srv072), which also hosts the 0-bias skim and LCG cache; raw data; worker node scratch space; unmerged TMB; merge output (initiated by the Merge job, via gridftp); durable store / stager space (d0srv063, d0srv065); Enstore LTO4-F; merged TMB; other data destined for tape storage; IN2P3 remote uploads. Notes: Durable Storage and Stager Space are on separate partitions; shared with Analysis users; no automated failover between d0srv063 and d0srv065.]
14. Capacity Timeline (Working Draft)
- March - April 2009: Keep-Up Level + Work-through-Backlog Level
- Added 115 old, slow, retired CDF worker nodes. D0Farm grows from 1600 to 1814 slots (as of 26 Feb 2009).
- Upgrade PBS head nodes (FEF) during the March 10 downtime. Last infrastructure improvement needed.
- All CAB2 analysis nodes for use by Data Production, March 10 - May 1 (or until another end condition is met).
- Work through the 178 MEvt backlog (less 1 week). A backlog has been there all along, BUT NOW we can REALLY do something about it. (A rough timing sketch follows below.)
- Scale up quickly in steps to be sure the infrastructure can handle the load and to avoid wasting graciously allocated resources.
- Exploit more opportunistic use of other-VO CPU during this same time period.
- Purchase requisition out in late March for more CPUs; they will be in service towards the end of summer.
- (End of April: end of Initiative. Task management passes to existing CD groups. Close-out process in May 2009.)
- May - July 2009: not at Keep-Up Level. GAP TO BE FILLED.
- Downsize the system as analysis CPU is returned and less opportunistic CPU is available.
- May develop a backlog again, but too late anyway to fully process for the summer conferences.
- New CPU may arrive in July, but will have to be burned in, infrastructure tested, etc.
- Purchase requisition out in summer for more infrastructure servers (if the need is proven).
- August - December 2009: Keep-Up Level (+ headroom?)
- Add CPU and infrastructure (from procurement) to support a long-term keep-up system.
- Make up the backlog from May through June 2009 for the winter conferences.
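A rough Python sketch of how long the temporarily expanded farm might take to work through the backlog. Only the 178 MEvt backlog comes from this slide; the production and data-logging rates below are placeholders, since the logging rate is not quoted here:

    # Hedged backlog-clearing estimate (placeholder rates).
    backlog_mevt    = 178.0   # MEvents of backlog (from this slide)
    production_rate = 11.0    # MEvents/day with CAB2 temporarily added -- placeholder
    logging_rate    = 7.0     # MEvents/day of newly logged data -- placeholder

    days_to_clear = backlog_mevt / (production_rate - logging_rate)
    print(f"~{days_to_clear:.0f} days above keep-up needed to clear the backlog")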
15. CAB2 Temporary Expanded Use
- Early March - April 2009: Keep-Up Level + Work-through-Backlog Level
- Temporary expanded CAB2 use by Data Production, 2/20/2009 via email:
- "Regarding temporarily using the whole CAB2 for the production, D0 management has made a decision that from March 10, we will temporarily expand the d0 farm queue to be the whole CAB2. The purpose is to catch up the backlog in data production for the summer conference. This configuration is temporary. We will change it back to the current configuration when one of the following conditions happens:
- when the backlog has been reduced to be less than one week of data, or
- May 1, 2009, or
- when there is an analysis need for more CPUs than CAB1 can provide.
Although the configuration change will be done by FEF (thanks to FEF!), the SamGrid team may need to plan to adjust related parameters to handle a much larger production farm. The current d0 farm queue has 1800 job slots. The new d0 farm queue will have 1800+1400 job slots, temporarily. Thank you, Qizhong"
16. Next Steps, Conclusion
- Conclusion: Phase 1 succeeded. Accommodate the Tevatron's success in Phase 2.
- The D0 Grid Data Production system is certainly more stable than before.
- The improvement in resource utilization metrics appears genuine.
- The next layer of operations issues is addressable; we can improve even further.
- Next Steps:
- Phase 2: Develop and implement a viable short-term Capacity Plan and draft a long-term one.
- And do so without losing the gains in stability and resource utilization achieved so far.
- Work through the event backlog with the loaned CAB2 slots.
- Continue work on stability, resilience, optimal decoupled configuration, and monitoring.
- Take care, though: service scale-ups like this have revealed new weaknesses and behaviors.
- Further Steps towards Maturing the D0 Grid Production System as a Service
- A more robust, capable, and manageable system requiring less effort to use.
- Enable Service Management functions: Capacity Planning, Managed Growth.
- Capacity Management can sensibly lead to a more formal statement of service levels.