D0 Production Status and Plans

1
D0 Production Status and Plans
  • Michael Diesburg
  • Run II Computing Review
  • Sept 1, 2005

2
D0 Production Status and Plans
  • Current FNAL Farm Capacity and Performance
  • Expected Configuration Changes and Expansion
  • Remote SAMGrid Reprocessing
  • Monte Carlo Production

3
Current FNAL Farm
  • Current farm has compute capacity of 1550 GHz
    (PIII equivalent)
  • Mixture of PIII, Athlon, Xeon processors (448
    dual processor nodes)
  • 8 processor SGI Origin acts as central file
    server, output staging node
  • 12 nodes used exclusively for input staging,
    output file merging
  • 1 node used as SAMGrid head node
  • Nodes are distributed in three locations
  • FCC - SGI and older PIII nodes
  • NML - Athlon nodes (240)
  • HDCF - Newer Xeon nodes
  • Distributing nodes has worked without problem,
    but infrastructure in NML has not been stable
  • Frequent power and cooling outages
  • Will move nodes from NML to HDCF in October

4
Current FNAL Farm
  • Farm uses two co-existing operational modes
  • Traditional scripts written specifically for
    local farm
  • SAMGrid mode uses the same Grid installation as our
    remote sites
  • Both modes can operate simultaneously, sharing
    nodes and cache
  • The need to ensure continuity of operation required a
    non-optimal installation of the SAMGrid configuration
  • Only run a maximum of 400 processors under Grid
    mode
  • Expect to reconfigure system to optimize Grid
    operation
  • Will also enable SAMGrid Monte Carlo production
    when reconfiguration is done

5
Current FNAL Farm
  • Significant effort was put into improving
    execution speed of reconstruction program at high
    luminosities
  • P17 version is much improved over p14 version
    both in speed and robustness
  • Factor of 2 faster at 0.8E32 luminosity (see
    Figure 1)
  • P17.03.03 (used for reprocessing effort) has
    failure rate < 0.05
  • P17.05.01 (used for post-shutdown data) has even
    lower failure rate
  • Currently using 50% of capacity to keep up with
    new data
  • Assuming 80% farm efficiency and maximum detector
    output of 3.5M events/day (corresponds to 50 Hz
    max)
  • Can survive up to 0.6E32 at 100% accelerator duty
    cycle
  • Can survive at least to 1.2E32 at 33% duty cycle
  • Extrapolation to 2.0E32 difficult to do reliably
    with existing information
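As a rough check of the detector-output figure on this slide, the following sketch converts 3.5M events/day into an average rate and compares it with the 50 Hz peak. The variable names are illustrative, and reading the ratio as an average-to-peak fraction is an assumption; the slide does not say how the two numbers are related.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 3.5e6   # maximum detector output quoted on the slide
peak_rate_hz = 50.0      # peak rate quoted on the slide

# Average rate implied by the daily total
average_rate_hz = events_per_day / SECONDS_PER_DAY

# Events per day if the detector ran at the peak rate around the clock
events_at_peak = peak_rate_hz * SECONDS_PER_DAY

# Share of the peak-rate total that 3.5M events/day represents
# (interpreting this as an average-to-peak fraction is an assumption)
fraction_of_peak = events_per_day / events_at_peak

print(f"average rate: {average_rate_hz:.1f} Hz")
print(f"events/day at peak rate: {events_at_peak / 1e6:.2f}M")
print(f"fraction of peak: {fraction_of_peak:.2f}")
```

Under these assumptions the daily total corresponds to an average of roughly 40 Hz, somewhat below the 50 Hz peak.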

6
Expansion Plans
  • Will retire SGI server node in the next year.
  • Batch system has already been moved to another
    node
  • SAM station will be moved to a dedicated worker node
  • Disk buffers on SGI will be moved to 2TB Linux
    raid array
  • Reconfiguration will be done with attention to
    optimizing installation for SAMGrid operation.
  • Will not migrate old scripts to new configuration
    if we don't have to
  • Will add 140 dual processor worker nodes this
    year (3 GHz equivalent)
  • Total capacity will go to 2390 GHz
  • 240 nodes will fall off warranty in November
  • Will not be repaired. Expect a high attrition
    rate
  • Additional nodes to be added in FY 2006 TBD
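The capacity figures on this slide can be cross-checked with a short calculation. This is a minimal sketch that assumes "3 GHz equivalent" applies per processor; the variable names are illustrative and not taken from the talk.

```python
current_capacity_ghz = 1550   # current farm capacity (PIII equivalent), from the Current FNAL Farm slide
new_nodes = 140               # dual processor worker nodes to be added this year
processors_per_node = 2
ghz_per_processor = 3         # assumption: "3 GHz equivalent" is per processor

# Capacity contributed by the new nodes
added_capacity_ghz = new_nodes * processors_per_node * ghz_per_processor

# Farm capacity after the expansion, before any attrition of out-of-warranty nodes
total_capacity_ghz = current_capacity_ghz + added_capacity_ghz

print(f"added: {added_capacity_ghz} GHz, total: {total_capacity_ghz} GHz")
```

With the per-processor reading, the sum reproduces the 2390 GHz total quoted on the slide. It does not account for the 240 nodes coming off warranty, whose attrition rate the slide leaves unquantified.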

7
Remote SAMGrid Reprocessing
  • Plans were made to reprocess all data taken up to
    the Nov 2004 shutdown with p17 version of reco
  • Requires far more resources than available on
    FNAL farm
  • 10X larger project than remote portion of p14
    reprocessing in 2003
  • 1 B events, 250 TB raw data
  • Requires CALIB DB access
  • Merging and final store to SAM to be done
    remotely
  • Wanted uniform, reusable, processing tools used
    at all sites
  • I.e. needed Grid technology solution
  • Used SAMGrid which provides common run-time
    environment and common submission interface as
    well as monitoring tools
  • Does require some D0 specific installations at
    remote sites (SAM station, DB proxy servers, job
    manager)
  • Minor tweaking of installation will allow other
    types of processing, e.g. Monte Carlo production
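To give a sense of scale for the dataset described on this slide, the sketch below derives the average raw event size from the two quoted totals. It assumes decimal units (1 TB = 10^12 bytes) and treats "1 B events" and "250 TB" as exact, which the slide does not guarantee.

```python
total_events = 1.0e9        # "1 B events", treated as exact for this estimate
raw_data_tb = 250.0         # "250 TB raw data"
BYTES_PER_TB = 1.0e12       # assumption: decimal terabytes

# Average raw event size implied by the two totals
bytes_per_event = raw_data_tb * BYTES_PER_TB / total_events
kb_per_event = bytes_per_event / 1.0e3

print(f"average raw event size: {kb_per_event:.0f} kB")
```

An average of a few hundred kilobytes per event, multiplied across a billion events, is the volume that has to reach the remote sites for this project.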

8
Remote SAMGrid Reprocessing
  • Actual processing started on March 25th
  • Currently have 10 remote sites in operation
  • CCIN2P3 (Lyon)
  • CMS-FNAL
  • FZU (Prague)
  • GridKa (Karlsruhe)
  • Imperial
  • OSCER (Oklahoma)
  • SPRACE (Sao Paulo)
  • UTA (Texas, Arlington)
  • WestGrid (Canada)
  • Wisconsin
  • Improved speed of p17 has also allowed use of
    spare cycles on FNAL farm
  • Have finished 820M events
  • Expect to finish in mid October
  • SAMGrid installation can also be used for MC
    production, so we will be ready to shift
    production to MC as soon as reprocessing is
    finished
  • Some sites already shifting part of resources to
    MC
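The completion estimate on this slide can be compared with a simple constant-rate projection. This is only a sketch: it assumes processing started on March 25, 2005, that the 820M count was taken on the date of the talk (Sept 1, 2005), and that the 1 B event total quoted for the reprocessing project is exact. None of these is stated explicitly, and the real rate was not constant as sites came online.

```python
import math
from datetime import date, timedelta

start = date(2005, 3, 25)   # processing start; the year is an assumption
as_of = date(2005, 9, 1)    # date of the talk, assumed to be when the count was taken
total_events = 1.0e9        # "1 B events", treated as exact
done_events = 820e6         # "Have finished 820M events"

# Average throughput since the start of processing
elapsed_days = (as_of - start).days
average_rate = done_events / elapsed_days

# Days still needed at that average rate, rounded up to whole days
remaining_days = math.ceil((total_events - done_events) / average_rate)
projected_finish = as_of + timedelta(days=remaining_days)

print(f"elapsed: {elapsed_days} days, average: {average_rate / 1e6:.2f}M events/day")
print(f"projected finish: {projected_finish.isoformat()}")
```

The projection lands in the first half of October, which is broadly in line with the mid-October expectation given how crude the constant-rate assumption is.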

9
Remote SAMGrid Reprocessing
  • Some lessons learned from Grid reprocessing
    effort
  • Network is still a bottleneck.
  • Needed to prestage data to remote sites for
    efficient operation
  • Certification of remote sites is difficult.
  • Requires good certification tools.
  • Initial installation requires expert help.
  • Each new site is an adventure with unique
    problems and constraints
  • Entire operation is very manpower intensive
  • 1 FTE for each remote site
  • But it can be done.

10
Monte Carlo Production
  • Success of reprocessing effort has come partially
    at the expense of Monte Carlo production
  • Significant fraction of our resources were
    redirected from MC production to reprocessing
  • MC production dropped from 14M to 2M
    events/month once reprocessing started
  • Shift from p14 to p17 version of MC has also
    slowed production
  • Good news is we still produced about twice as
    many events last year as the year before
  • Produced 75M events last year. But it's
    never enough
  • Expect significant increase in resources
    dedicated to MC as reprocessing finishes

11
Monte Carlo Production
  • Cast of players for MC production is similar to
    reprocessing
  • CCIN2P3
  • FZU
  • GridKa
  • Lancaster
  • Langston U
  • Louisiana Tech
  • Manchester
  • Nikhef/LCG
  • Oklahoma
  • Sao Paulo
  • Tata
  • UTA
  • Wisconsin
  • Sites that were using MCFarm job manager are
    being converted to common SAMGrid tools
  • Nikhef/LCG is significant as it acts as gateway
    to LCG resources
  • Good demonstration that non-D0 resources can be
    used for MC production
  • Similar gateway mode can be used for OSG resources

12
Monte Carlo Production
  • Expect significant resources to be available for
    MC production as reprocessing ramps down
  • SAMGrid installation used for reprocessing can be
    used for MC with minor modifications
  • Sites previously not involved in MC production
    can contribute (Fermilab farms, CMS, WestGrid)
  • Additional sites are adding resources that will
    come online soon (Oklahoma, Rio de Janeiro, LSU,
    U. Mississippi)
  • Not unreasonable to expect 20M events/month
  • Interoperability of SAMGrid with LCG and OSG
    should also provide ample opportunity to use
    non-D0 resources