Parallel Reconstruction of CLEO III Data

1
Parallel Reconstruction of CLEO III Data
  • Gregory J. Sharp
  • Christopher D. Jones
  • Wilson Synchrotron Laboratory
  • Cornell University

2
Outline
  • Overview of CLEO reconstruction environment
  • The problems with the old reconstruction system
  • The solution - finer-grained parallelism
  • The benefits

3
CLEO III Reconstruction Environment
  • Uses a farm of more than 130 Sun Netras
  • Sun Grid Engine manages CPU allocation
  • Data is read from and written to Objectivity/DB
  • Events must be written to DB in event-number
    order
  • Reconstruction rate has to equal average DAQ rate

4
Former Reconstruction System
  • Output was written directly to the offline
    database
  • 130 runs may be processed in parallel on the
    farm
  • Each run is processed in its entirety by a
    single CPU
  • Up to 9 days to reconstruct a single run on a
    single CPU
  • All failures required operator and/or DBA
    intervention

5
Problems
  • Need to maximize CPU utilization
  • Load balancing between farms is difficult
  • Takes a long time to stop the farm safely
  • Output of the first few runs must be checked
  • Debugging reconstruction code

6
More Problems
  • Low I/O rates to the database
  • Many locks held for long periods
  • Large window for failures to occur
  • Failure leaves database in an invalid state
  • No automation of failure detection and recovery

7
The Solution
  • Split each run into roughly equal-sized chunks
  • Assign each chunk to a CPU
  • Save sub-job output in intermediate binary files
    in event-number order
  • Once all sub-jobs complete, collate binary files
    into database in event-number order
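
The splitting and collation steps above can be sketched as follows. This is a minimal Python illustration under stated assumptions, not the CLEO III code (the slides say the real system is written in Perl). The names `split_run` and `collate` are hypothetical, and in-memory lists of event numbers stand in for the intermediate binary files.

```python
from typing import List, Tuple


def split_run(first_event: int, last_event: int, n_chunks: int) -> List[Tuple[int, int]]:
    """Split an inclusive event range into n_chunks contiguous, roughly equal ranges."""
    total = last_event - first_event + 1
    if n_chunks < 1 or total < n_chunks:
        raise ValueError("need at least one event per chunk")
    base, extra = divmod(total, n_chunks)
    chunks = []
    start = first_event
    for i in range(n_chunks):
        # The first `extra` chunks take one additional event each.
        size = base + (1 if i < extra else 0)
        chunks.append((start, start + size - 1))
        start += size
    return chunks


def collate(chunk_outputs: List[List[int]]) -> List[int]:
    """Concatenate per-chunk outputs (each already in event-number order)
    so that the combined output is also in event-number order."""
    ordered = sorted((c for c in chunk_outputs if c), key=lambda c: c[0])
    merged: List[int] = []
    for chunk in ordered:
        if merged and chunk[0] <= merged[-1]:
            raise ValueError("chunks overlap or are out of order")
        merged.extend(chunk)
    return merged
```

Because each chunk covers a contiguous event range and is written in order, collation reduces to concatenating the chunk outputs by their first event number; no full sort is needed.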

8
The Job Manager
  • The JM submits all the reconstruction sub-jobs
    and monitors their progress, retrying failures
  • Once all reconstruction completes successfully
    the JM starts the collation sub-job
  • Once collation completes successfully the JM
    starts the merge histogram sub-job
  • Can be restarted at any time if it dies
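
The Job Manager's sequencing and restart behaviour can be illustrated with the sketch below. It is an assumption-laden simplification: sub-jobs run one after another here, whereas the real JM submits them to Sun Grid Engine in parallel. The function `run_job_manager`, the `run_subjob` callback, the JSON state file, and the retry limit are all hypothetical names and choices, not taken from the slides.

```python
import json
import os
from typing import Callable, List


def run_job_manager(
    chunk_ids: List[int],
    run_subjob: Callable[[str], bool],  # hypothetical: returns True on success
    state_path: str,
    max_retries: int = 3,
) -> List[str]:
    """Run reconstruction, then collation, then histogram merging.

    Completed sub-jobs are recorded in a state file, so a restarted
    manager skips work that has already finished.
    """
    done = set()
    if os.path.exists(state_path):
        with open(state_path) as f:
            done = set(json.load(f)["done"])

    def attempt(name: str) -> None:
        if name in done:
            return  # finished before a restart
        for _ in range(max_retries):
            if run_subjob(name):
                done.add(name)
                with open(state_path, "w") as f:
                    json.dump({"done": sorted(done)}, f)
                return
        raise RuntimeError(f"{name} failed after {max_retries} attempts")

    # Collation starts only after every reconstruction sub-job succeeds.
    for chunk_id in chunk_ids:
        attempt(f"reconstruct:{chunk_id}")
    attempt("collate")
    # Histogram merging starts only after collation succeeds.
    attempt("merge_histograms")
    return sorted(done)
```

Persisting progress after every successful sub-job is one simple way to make the manager safe to restart at any point; the slides do not say how the actual JM achieves this.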

9
Structure Diagram
  • [Diagram not reproduced in this transcript]

10
Automation
  • JM restarts sub-jobs with transient failures
  • Runs may be submitted automatically when SGE
    queue is (almost) empty
  • A cron job generates status web pages
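
The "submit when the queue is (almost) empty" policy could look something like the sketch below. The threshold and batch size are illustrative assumptions, `select_runs_to_submit` is a hypothetical name, and how the queue length is obtained from SGE is deliberately left out.

```python
from typing import List


def select_runs_to_submit(
    queued_jobs: int,
    pending_runs: List[int],
    low_water_mark: int = 5,  # assumed threshold for "almost empty"
    batch_size: int = 2,      # assumed number of runs submitted per check
) -> List[int]:
    """Return the runs to submit now, or an empty list if the queue is still busy."""
    if queued_jobs > low_water_mark:
        return []
    return pending_runs[:batch_size]
```

A periodic task (for example a cron job, as used for the status pages) could call such a function and submit whatever it returns.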

11
Implementation Details
  • Written in Perl
  • Uses Sun Grid Engine to submit and track jobs
  • Uses CLEO III software infrastructure for
    reconstruction and population
  • Uses PAW for merging histograms
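
Conceptually, merging the per-sub-job histograms means adding bin contents of histograms that share the same binning. The sketch below shows only that idea; it does not use or model PAW, and `merge_histograms` with plain lists of bin counts is a hypothetical stand-in.

```python
from typing import List, Sequence


def merge_histograms(histograms: Sequence[Sequence[float]]) -> List[float]:
    """Sum identically binned histograms bin by bin."""
    if not histograms:
        raise ValueError("no histograms to merge")
    n_bins = len(histograms[0])
    if any(len(h) != n_bins for h in histograms):
        raise ValueError("histograms must have identical binning")
    return [sum(bins) for bins in zip(*histograms)]
```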

12
Benefits
  • Less operator intervention/management
  • Faster debugging
  • Increased CPU utilization, which offsets extra
    CPU use
  • 20% faster completion of reconstruction
  • Just-in-time pre-staging of data from HSM file
    system
  • The January ice storm

13
Future Steps
  • Automate staging of data to cache disks
  • Automate posting of staged runs info to
    Reconstruction

14
Conclusions
  • Multiple file formats made this possible
  • Substantial productivity gains
  • Higher utilization of computing resources
  • For more details:
  • http://www.lepp.cornell.edu/gregor/projects/parallelpass2