Title: D0 Grid Data Production Initiative: Phase 1 to Phase 2 Transition
1. D0 Grid Data Production Initiative: Phase 1 to Phase 2 Transition
- Version 1.3
- (v1.0 presented to D0 Spokes, CD Mgmt, 06 February 2009)
- Presented to D0 CPB, 27 February 2009
- Rob Kennedy and Adam Lyon
2. Outline
- Background (historical reference)
- Overview, Major Issues, Roadmap
- Phase 1 Summary
- Work Done and Outcome (as seen with more experience)
- Assessment
- Capacity Model, CPU/event = f(L)
- Phase 2 Work Plan
- Work List Outline
- Capacity Timeline skeleton draft
3. Initiative Overview (Sep 2008 presentation with updates in green)
- Initiative is an Umbrella Project to achieve a broad set of goals
- Scope: D0 Grid Data Production (taking MC Production into consideration)
- Charge:
- Evaluate D0 Grid Data Production, especially Resource Utilization, by end of Sep 08. DONE
- Create and execute a Work Plan to achieve the goal. Phase 1 DONE
- Goal: Stable Grid Data Production operations that efficiently utilize the resources available. DONE for the conditions at the beginning of the Initiative; Phase 2 to address evolving conditions.
- Constraints: Achieve improvements ASARP. No explicit end date or staff level limits set.
- Initiative Team (Execution Phase): October 2008 - present
- Project Manager: Rob Kennedy (CD OPMQA)
- Project Co-Manager: Adam Lyon (D0 Collab and CD SCF/REX/PS Group Leader)
- Communication with a broad set of stakeholders: weekly meeting, Thursdays at 9am
- D0 Production Coordinators: Mike Diesburg, Joel Snow. D0 Collaborators: Chip Brock, Qizhong Li
- CD: FermiGrid Svcs (Steve Timm, Keith Chadwick), SAM-Grid Dev (Gabriele Garzoglio, Parag Mhashilkar, Andrew Baranovski), REX (Robert Illingworth, Joe Boyd), SCF Mgmt (Margaret Votava, Eileen Berman), Fermi Expt Facilities (Jason Allen, Glenn Cooper)
- OSG: Abhishek Singh Rana
- Documentation Home: http://d0db-prd.fnal.gov/rexipedia/common/SAMGridD0/GDPEval
4. Major Issues (Sep 2008 presentation)
- Resource Utilization is lower than expected for a production system (the motivating concern).
- CPUs allotted to Data Production are not kept busy, even though there are jobs whose data are available to be run.
- Causes: Shallow queues must be refilled often. Something is leading to slow filling of open CPU slots.
- D0 Grid System Uptime and First-Time Success Rates are lower than expected for production.
- Leads to re-running of jobs and/or manual checking of job records to determine success/failure.
- Causes: Grid Batch System bugs (some known to be fixed in Condor 7), Context Server failures, ...
- D0 Grid System requires too much effort for the customer (D0 Production Coordinator) to use.
- Hours per day spent looking at failed jobs or checking whether jobs failed. 1-2 touches per day to keep queues full (with scripts).
- The sum of the above significantly reduces the MEvents/day that D0 actually reconstructs.
- Mike Diesburg's estimates (Sep 2008), confirmed by the historical record, BEFORE the Initiative:
- Max capacity of the current system: 10 MEvents/day (million events per day).
- Realistic sustained level: 8-9 MEvents/day. We expect about 10% endemic inefficiency due to issues not worth fixing, like internal latencies, facility power outages, and hardware failure recovery.
- Observed sustained level: 5.2 MEvents/day, i.e. 60-65% of the expected value (see the arithmetic sketch below).
- Absolute numbers are not the focus as yet; rather, the ratio is agreed by all to be unacceptably low.
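A quick arithmetic sketch of the ratio quoted above, in Python; the only inputs are the 8-9 and 5.2 MEvents/day figures from this slide:

    # Observed vs. expected sustained throughput (Sep 2008 estimates).
    expected_low, expected_high = 8.0, 9.0   # MEvents/day, realistic sustained level
    observed = 5.2                           # MEvents/day, observed sustained level
    print(f"{observed / expected_high:.0%} - {observed / expected_low:.0%}")
    # prints roughly 58% - 65%, consistent with the "60-65% of expected" quoted above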
5. Roadmap (Sep 2008 presentation with updates in green)
- September 2008: Planning. DONE
- Rob Kennedy, working with Adam Lyon, charged by Vicky White to lead the effort to pursue this.
- First stage is to list, understand, and prioritize the problems and the work in progress.
- Next, develop a broad, coarse-grained plan to address issues and improve the efficiency.
- October 2008 - December 2008: Phase 1 of the Initiative. DONE
- 1.1. Server Expansion and Decoupling Data/MC Production at Services
- 1.2. Condor 7.0 Upgrade and Support
- 1.3. Small Quick Wins
- 1.4. Metrics
- Follow up on newly exposed issues as revealed, e.g. installer products, Fcpd upgrade, restart script fix
- January 2009: Formal re-assessment with a long-term mindset. DONE
- Re-assess against metrics, downtime cause categorization, D0/CD staff time in operations. Re-prioritize issues.
- Capacity Management determined to be the primary theme for Phase 2 work.
- Plan new work for the next layer of issues revealed. Ready to tackle MC Production-specific issues as well?
- February 2009 - April 2009: Phase 2. Finish long lead-time work; treat the next layer of issues.
- Some work for Data Production is constrained to execute in 2009, e.g. applying virtualization.
6. Phase 1 Summary
- Work Done: Add servers, decouple Data/MC Production at services, Condor 7 upgrade (Grid Batch System layer)
- Added 4th and 5th Forwarding Nodes and a 2nd Queuing Node. Added a new SAM Station and Context Server host. Documented and productized installation procedures. Configured to decouple Data and MC at the Forwarding and Queuing services.
- Condor 7 is a major improvement! Several major issues fixed. More predictable behavior and latencies.
- Outcome: Successful pre-Thanksgiving deployment
- Mike D.: the Dec/Jan holidays were one of the least eventful periods ever.
- Smooth enough now that we have begun testing hand-off of day-to-day coordination, with Mike D. oversight.
- Numerous operations issues resolved. Resource Utilization improved and reached its goal.
- Periodic Expressions hang (about 1/day) cured. No more "death spirals" leading to downtimes.
- Job Slot Utilization and CPU-time/Wall-time > 95% (in smooth operation). Confirmed over time. Success!
- January 2009: some next-layer issues seen.
- Events Processed per Day not really improved.
- Increase in Tevatron luminosity suspected... Confirmed.
- CPU-time per job increasing rapidly... Confirmed.
- We have seen a 2X increase from Oct 08 to Dec 08/Jan 09!
- The 8E6 events/day goal was appropriate for lower luminosity.
- Note: 1 month delay from data logging to production.
7. Assessment (January 2009)
- Main focus: understanding Events Produced per Day
- Calculate the expected production rate from the existing system (a sketch of this model follows below)
- CPU/event with current Reco version = f(L)
- CPU power in the Data Production queue
- Luminosity increase in the Tevatron is the major driver of reduced production output
- Consider the environment as well
- Recent shutdown led to detector fixes. More good data per event means more CPU/event (small effect)
- Modest increase in CPU/event in the new Reco version at higher luminosity (small effect)
- Check CPU overheads (setting up, starting Reco) vs. Reco CPU consumption (small-ish effect)
- Observe and compare system performance during smooth multi-day periods
- Develop a Phase 2 Work Plan
- Observation: Data Production is falling behind Data Logging now.
- This is our top priority to address: understand what CAN be done and report to D0 for their planning.
- Capacity increase options being explored, as well as impact on infrastructure and configuration
- Model development continues to ensure there are no hidden inefficiencies at the 10% level.
- Consensus from the last effort to reduce CPU consumption by D0 Reco: no room for improvement
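A minimal sketch of the capacity model described on this slide, written in Python. All inputs below are illustrative placeholders (in particular, the normalization of CPU-sec/event to the farm's actual processors is not given in these slides), so the output is not meant to reproduce the ~6 MEvents/day figure quoted on the next slide:

    # Hedged sketch: expected production rate from slot count, CPU/event = f(L), and efficiency.
    def expected_mevents_per_day(slots, cpu_sec_per_event_ref, speed_factor, cpu_efficiency):
        # slots                 : job slots dedicated to Data Production
        # cpu_sec_per_event_ref : Reco CPU-sec/event at the current luminosity, on the
        #                         reference CPU used for the f(L) measurement
        # speed_factor          : average farm-node speed relative to that reference CPU (assumed)
        # cpu_efficiency        : fraction of available cycles actually spent in Reco
        cpu_sec_per_event_farm = cpu_sec_per_event_ref / speed_factor
        return slots * 86400 * cpu_efficiency / cpu_sec_per_event_farm / 1e6  # MEvents/day

    # Hypothetical inputs, for illustration only:
    print(expected_mevents_per_day(slots=1600, cpu_sec_per_event_ref=60.0,
                                   speed_factor=2.0, cpu_efficiency=0.85))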
8. Plots: Efficiency, CPU/evt = f(L)
- THIS IS TEXT for the next three plot slides
- Are there hidden inefficiencies? PBS Job Efficiency (CPU use), from Mike D. AVAILABLE HERE.
- Time base is the date the data was processed, not the date the data was recorded.
- Job Efficiency = Run-time / (Run-time + Overheads)
- After the Phase 1 deployment, this metric is at 95%... Very good!
- Does not take into account the following:
- Jobs that started, had data, but failed (1% effect). Nodes which are down (1% effect).
- Merge jobs included in this (2% effect). Jobs that do not really start due to data delivery failure (1% effect).
- Overall duty cycle (95%) to account for planned/unplanned downtimes.
- For long-term planning: use 85-90% CPU efficiency (fraction of available CPU cycles used on Reco). Still, very good. (See the combination sketch below.)
- Execution Time = f(L), from Mike D. AVAILABLE HERE.
- This is for the current version of Reco (previously was for the old version). Some increase in CPU used, perhaps, at higher L.
- Also, detector improvements after the shutdown → more good data/event, more combinatorics → more CPU/event.
- GOOD FOR PHYSICS! But a challenge for the Reconstruction Farm.
- Average Initial Luminosity, from Mike D. AVAILABLE HERE.
- We appear to be around L = 165E30 nowadays. Combining this with the execution time: about 60 CPU-sec/event, which gives
- ~6 MEvents/day theoretically, while in the same time period 5.1 MEvents/day was observed under the same conditions.
- Given the width and uncertainty in the measurements above, we cannot say these two numbers are different.
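The derating factors listed above can be folded into a single planning efficiency. Treating them as independent multiplicative losses is an assumption made here for illustration; the slide itself only lists the individual effects:

    # Combine the PBS job efficiency with the effects that metric does not capture (slide values).
    job_efficiency = 0.95   # run-time / (run-time + overheads), post-Phase-1
    duty_cycle     = 0.95   # planned/unplanned downtimes
    losses = [0.01,         # jobs that started, had data, but failed
              0.01,         # worker nodes that are down
              0.02,         # merge jobs included in the metric
              0.01]         # jobs that never really start (data delivery failure)

    planning_eff = job_efficiency * duty_cycle
    for loss in losses:
        planning_eff *= (1.0 - loss)
    print(f"{planning_eff:.0%}")   # about 86%, consistent with the 85-90% planning figure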
9. PBS Job Efficiency (CPU Use)
[Plot: PBS job efficiency vs. time. Annotations: smooth operation today > 95%; one ops issue; no major downtimes after the Phase 1 deployment.]
10. Execution Time = f(L) (initial luminosity at the beginning of the run, not of the store)
[Plot: Reco execution time vs. initial luminosity. Annotations: past ~30 sec/evt; now ~60 sec/evt; eventually 120 sec/evt??? (watch, have a plan in place).]
11. Average Initial Luminosity (initial luminosity at the beginning of the run, not of the store)
[Plot: average initial luminosity vs. time. Annotations: long-term bracket?; marker for "now: ~60 sec/evt".]
12. Phase 2 Work List Outline
- 2.1 Capacity Management: Data Production is not keeping up with data logging.
- Capacity Planning: Model nEvents per day, forecast CPU needed.
- Capacity Deployment: Procure, acquire, or borrow CPU. We believe the infrastructure is capable.
- Resource Utilization: Use what we have as much as possible. Maintain improvements.
- 2.2 Availability and Continuity Management: The expanded system needs higher reliability.
- Decoupling: deferred. Phase 1 work has proven sufficient for the near term.
- Stability, Reduced Effort: Deeper queues. Goal is fewer manual submissions per week.
- Resilience: Add/improve redundancy at the infrastructure service and CAB level.
- Configuration Recovery: Capture configuration and artefacts in CVS consistently.
- 2.3 Operations-Driven Projects
- Monitoring: Execute a workshop to share what we have; identify gaps and cost/benefits.
- Issues: Address the "stuck state" issue affecting both Data and MC Production.
- Features: Add state at the queuing node (from Phase 1). Distribute jobs evenly across forwarding nodes.
- Processes: Enable REX/Ops to deploy new Condor versions; new bug fixes coming soon.
- Phase 1 Follow-up: A few minor tasks remain from the rush to deploy; dot the i's and cross the t's.
- Deferred Work List: maintain with reasons for deferring work.
13. Data Flow for Data Production
[Diagram: data flow for Data Production, steps 0-7. Elements: tarballs (transfer initiated by the Reco job); Enstore LTO4-G; SAM Cache (d0srv071, d0srv072), which also hosts the 0-bias skim and LCG cache; raw data; worker node scratch space; unmerged TMB; merge output (initiated by the Merge job, via gridftp); durable store / stager space (d0srv063, d0srv065); Enstore LTO4-F; merged TMB; other data destined for tape storage; IN2P3 remote uploads. Notes: Durable Storage and Stager Space are on separate partitions; shared with Analysis users; no automated failover between d0srv063 and d0srv065.]
14. Capacity Timeline (Working Draft)
- March - April 2009: Keep-Up Level + Work-through-Backlog Level
- Added 115 old, slow, retired CDF worker nodes. D0Farm grows from 1600 to 1814 slots (as of 26 Feb 2009).
- Upgrade PBS head nodes (FEF) during the March 10 downtime. Last infrastructure improvement needed.
- All CAB2 analysis nodes for use by Data Production, March 10 - May 1 (or until another end condition is met).
- Work through the 178 MEvt backlog (less 1 week). A backlog has been there all along, BUT NOW we can REALLY do something about it. (A rough timing sketch follows below.)
- Scale up quickly in steps to be sure the infrastructure can handle the load and to avoid wasting graciously allocated resources.
- Exploit more opportunistic use of other-VO CPU during this same time period.
- Purchase requisition out in late March for more CPUs; they will be in service towards the end of summer.
- (End of April: end of Initiative. Task management passes to existing CD groups. Close-out process in May 2009.)
- May - July 2009: not at Keep-Up Level. GAP TO BE FILLED.
- Downsize the system as analysis CPU is returned and less opportunistic CPU is available.
- May develop a backlog again, but too late anyway to fully process for the summer conferences.
- New CPU may arrive in July, but will have to be burned in, infrastructure tested, etc.
- Purchase requisition out in summer for more infrastructure servers (if the need is proven).
- August - December 2009: Keep-Up Level (+ headroom?)
- Add CPU and infrastructure (from procurement) to support a long-term keep-up system.
- Make up the backlog from May through June 2009 for the winter conferences.
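A rough Python sketch of how long the temporarily expanded farm might take to work through the backlog. Only the 178 MEvt backlog comes from this slide; the production and data-logging rates below are placeholders, since the logging rate is not quoted here:

    # Hedged backlog-clearing estimate (placeholder rates).
    backlog_mevt    = 178.0   # MEvents of backlog (from this slide)
    production_rate = 11.0    # MEvents/day with CAB2 temporarily added -- placeholder
    logging_rate    = 7.0     # MEvents/day of newly logged data -- placeholder

    days_to_clear = backlog_mevt / (production_rate - logging_rate)
    print(f"~{days_to_clear:.0f} days above keep-up needed to clear the backlog")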
15. CAB2 Temporary Expanded Use
- Early March - April 2009: Keep-Up Level + Work-through-Backlog Level
- Temporary expanded CAB2 use by Data Production, 2/20/2009 via email:
- "Regarding temporarily using the whole CAB2 for the production, D0 management has made a decision that from March 10, we will temporarily expand the d0 farm queue to be the whole CAB2. The purpose is to catch up the backlog in data production for the summer conference. This configuration is temporary. We will change it back to the current configuration when one of the following conditions happens:
- when the backlog has been reduced to be less than one week of data, or
- May 1, 2009, or
- when there is an analysis need for more CPUs than CAB1 can provide.
Although the configuration change will be done by FEF (thanks to FEF!), the SamGrid team may need to plan to adjust related parameters to handle a much larger production farm. The current d0 farm queue has 1800 job slots. The new d0 farm queue will have 1800+1400 job slots, temporarily. Thank you, Qizhong"
16. Next Steps, Conclusion
- Conclusion: Phase 1 succeeded. Accommodate the Tevatron's success in Phase 2.
- The D0 Grid Data Production system is certainly more stable than before.
- The improvement in resource utilization metrics appears genuine.
- The next layer of operations issues is addressable; we can improve even further.
- Next Steps:
- Phase 2: Develop and implement a viable short-term Capacity Plan and draft a long-term one.
- And do so without losing the gains in stability and resource utilization achieved so far.
- Work through the event backlog with the loaned CAB2 slots.
- Continue work on stability, resilience, optimal decoupled configuration, and monitoring.
- Take care, though: service scale-ups like this have revealed new weaknesses and behaviors.
- Further Steps towards Maturing the D0 Grid Production System as a Service
- A more robust, capable, and manageable system requiring less effort to use.
- Enable Service Management functions: Capacity Planning, Managed Growth.
- Capacity Management can sensibly lead to a more formal statement of service levels.