McFarm Improvements and Re-processing DST

1
McFarm Improvements and Re-processing DSTs
  • D. Meyer for The UTA Team
  • 3rd DØ SAR Workshop
  • Louisiana Tech. University
  • 4/7 - 4/9/2004

2
Reasons for Using McFarm
  • McFarm is DØ MC Cluster-Control Software
    developed at UTA and used in DØSAR farms
  • Simplifies Monte Carlo Production and
    Re-Processing
  • Manages a cluster efficiently with minimum labor
  • Minimizes impact of changes to SAM, mc_runjob,
    and other DØ software
  • User (Operator) Oriented

3
McFarm Software Integration
  • DØ Binaries - minitars or full release
  • SAM and SAM-Grid - declaration, storage,
    retrieval, remote job submission
  • mc_runjob - job and metadata construction
  • NFS - access to binaries, minbias data sample
  • NIS - account management
  • ssh - intra-cluster monitoring and control
  • Batch Queues - PBS and Condor

4
Improvements - Procedural Changes
  • No longer storing d0gstar files, just reco and
    merged-tmb.
  • Farmers should replace all failed jobs even if
    the gen file was supplied by the requestor
  • SAM-Admin no longer needed

5
Bug Fixes
  • All log files purged by McFarm
  • Gathering congestion due to large backlogs
  • Merging was also having problems
  • Spurious gathering failures fixed
  • PBS-submissions are now re-tried.
  • Max Children (threads) now working on server
  • Other fixes improved robustness
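
The slides do not say how PBS submissions are re-tried. As a minimal sketch only, a bounded retry loop could look like the following. The names `SubmissionError` and `submit_with_retry`, and the `max_attempts` and `delay_s` parameters, are hypothetical and are not taken from McFarm.

```python
import time
from typing import Callable, Optional


class SubmissionError(Exception):
    """Hypothetical error raised when a batch submission fails."""


def submit_with_retry(submit: Callable[[], str],
                      max_attempts: int = 3,
                      delay_s: float = 0.0) -> str:
    """Call submit() until it succeeds or max_attempts is reached.

    Assumption: submit() returns a job identifier on success and raises
    SubmissionError on failure. McFarm's real interface is not shown in
    the slides.
    """
    last_error: Optional[SubmissionError] = None
    for attempt in range(1, max_attempts + 1):
        try:
            return submit()
        except SubmissionError as err:
            last_error = err
            if attempt < max_attempts:
                # Wait before the next attempt; skipped after the last one.
                time.sleep(delay_s)
    raise SubmissionError(
        f"gave up after {max_attempts} attempts") from last_error
```

Passing the submission step in as a callable keeps the sketch independent of any particular batch system, so the same loop would apply to PBS or Condor.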

6
Enhancements
  • Integrated with mc_runjob V06
  • Handles d0mess jobs normally
  • Centralized queue-status for PBS and Condor
  • Better info from failed jobs to promote recycling
  • New disposition CACHEHOLD
  • SAM-Grid preparation: remote_execute script
    tested by Hyun Woo
  • DST Reprocessing Implemented and Certified
    (www-d0.fnal.gov/computing/reprocessing/recocert/index.html)

7
Enhancements - 2
  • Many improvements to launch_request
  • Allow job-type default
  • Adapted to new Queue.py
  • Target event-range to re-cycle failed job
  • Run multiple launches on multiple nodes
  • Performs get_request automatically from SAM

8
Re-Processing DSTs
  • Basic approach is to run only the d0reco binary,
    with a DST file as input and reco and tmb as output.
  • Farmer supplies a request script which contains
    dataset-definition, framework RCP
  • launch_request distinguishes between an MC task
    and a Re-Processing task, and defaults to
    CACHEHOLD output disposition for re-proc
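
The defaulting rule in the last bullet can be illustrated with a small sketch. This is not launch_request's actual code: the request is modeled as a plain dictionary, and the keys `dataset_definition` and `disposition` and both function names are assumptions made for illustration. Only the CACHEHOLD default for re-processing comes from the slide.

```python
from typing import Optional

MC = "mc"
REPROCESSING = "reprocessing"


def classify_task(request: dict) -> str:
    """Guess the task type from the request contents.

    Assumption: a re-processing request carries a dataset definition
    and an MC request does not. The real test used by launch_request
    is not given in the slides.
    """
    return REPROCESSING if "dataset_definition" in request else MC


def output_disposition(request: dict) -> Optional[str]:
    """Return the output disposition for a request.

    An explicit disposition always wins. Otherwise re-processing tasks
    default to CACHEHOLD, as the slide states. The MC default is not
    stated in the slides, so None is returned for it here.
    """
    explicit = request.get("disposition")
    if explicit is not None:
        return explicit
    if classify_task(request) == REPROCESSING:
        return "CACHEHOLD"
    return None
```

For example, `output_disposition({"dataset_definition": "some-dataset"})` yields `"CACHEHOLD"`, while a request that sets its own `disposition` keeps that value.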

9
Re-Processing DSTs - 2
  • The initial DST-reprocessing trial, which focused
    on getting certified by matching output results,
    is now concluded.
  • UTA will assist other farms in certification
  • Next round of re-processing will focus on raw
    data, with attendant issues of new software and
    possibly database access

10
To Be Done
  • All farms should be certified for DST as a
    springboard to the next round, RAW data.
  • UTA will participate in the next round and promote
    the use of the request script (needs SAM changes).
  • TMB fix project in progress
  • SAM-Grid can be used to feed farms automatically;
    McFarm can accept such requests
  • More automation in the area of error recycling

11
Conclusions
  • McFarm continues to respond to the changing
    landscape for DØ processing
  • If all DØSAR farms get certified for
    re-processing, it would be a significant addition
    to the capacity.
  • IACs' use and comments prompted McFarm
    improvements (Thank you everyone!!)
  • Comments always appreciated