1
STAR Computing Status, Out-source Plans, Residual Needs
  • Torre Wenaus
  • STAR Computing Leader
  • BNL
  • RHIC Computing Advisory Committee Meeting
  • BNL
  • October 11, 1999

2
Outline
  • STAR Computing Status
  • Out-source plans
  • Residual needs
  • Conclusions

3
Manpower
  • Very important development in the last 6 months: a big new influx of postdocs and students into computing and related activities
  • Increased participation and pace of activity in:
  • QA
  • online computing
  • production tools and operations
  • databases
  • reconstruction software
  • Planned dedicated database person never hired (funding); databases consequently late, but we are now transitioning from an interim to our final database
  • Still missing online/general computing systems
    support person
  • Open position cancelled due to lack of funding
  • Shortfall continues to be made up by the local
    computing group

4
Some of our Youthful Manpower
A partial list of young students and postdocs now
active in aspects of software...
  • Amy Hummel, Creighton, TPC, production
  • Holm Hummler, MPG, FTPC
  • Matt Horsley, Yale, RICH
  • Jennifer Klay, Davis, PID
  • Matt Lamont, Birmingham, QA
  • Curtis Lansdell, UT, QA
  • Brian Lasiuk, Yale, TPC, RICH
  • Frank Laue, OSU, online
  • Lilian Martin, Subatech, SSD
  • Marcelo Munhoz, Sao Paolo/Wayne, online
  • Aya Ishihara, UT, QA
  • Adam Kisiel, Warsaw, online, Linux
  • Frank Laue, OSU, calibration
  • Hui Long, UCLA, TPC
  • Vladimir Morozov, LBNL, simulation
  • Alex Nevski, RICH
  • Sergei Panitkin, Kent, online
  • Caroline Peter, Geneva, RICH

  • Li Qun, LBNL, TPC
  • Jeff Reid, UW, QA
  • Fabrice Retiere, calibrations
  • Christelle Roy, Subatech, SSD
  • Dan Russ, CMU, trigger, production
  • Raimond Snellings, LBNL, TPC, QA
  • Jun Takahashi, Sao Paolo, SVT
  • Aihong Tang, Kent
  • Greg Thompson, Wayne, SVT
  • Fuquian Wang, LBNL, calibrations
  • Robert Willson, OSU, SVT
  • Richard Witt, Kent
  • Gene Van Buren, UCLA, documentation, tools, QA
  • Eugene Yamamoto, UCLA, calibrations, cosmics
  • David Zimmerman, LBNL, Grand Challenge
  • Dave Alvarez, Wayne, SVT
  • Lee Barnby, Kent, QA and production
  • Jerome Baudot, Strasbourg, SSD
  • Selemon Bekele, OSU, SVT
  • Marguerite Belt Tonjes, Michigan, EMC
  • Helen Caines, Ohio State, SVT
  • Manuel Calderon, Yale, StMcEvent
  • Gary Cheung, UT, QA
  • Laurent Conin, Nantes, database
  • Wensheng Deng, Kent, production
  • Jamie Dunlop, Yale, RICH
  • Patricia Fachini, Sao Paolo/Wayne, SVT
  • Dominik Flierl, Frankfurt, L3 DST
  • Marcelo Gameiro, Sao Paolo, SVT
  • Jon Gangs, Yale, online
  • Dave Hardtke, LBNL, Calibrations, DB
  • Mike Heffner, Davis, FTPC
  • Eric Hjort, Purdue, TPC

5
Status of Computing Requirements
  • Internal review (particularly simulation) in progress, in connection with evaluating PDSF upgrade needs
  • No major changes with respect to earlier reviews
  • RCF resources should meet STAR reconstruction and central analysis needs (recognizing that the 1.5x re-reconstruction factor allows little margin for the unexpected)
  • Existing (primarily Cray T3E) offsite simulation
    facilities inadequate for simulation needs
  • Simulation needs addressed by PDSF ramp-up plans

6
Current STAR Software Environment
  • Current software base is a mix of C++ (55%) and Fortran (45%)
  • Rapid evolution from 20/80 in September 98
  • New development, and all physics analysis, in C++
  • ROOT as analysis tool and foundation for the framework adopted 11/98
  • Legacy Fortran codes and data structures supported without change
  • Deployed in offline production and analysis in Mock Data Challenge 2, Feb-Mar 99
  • ROOT adopted for the event data store after MDC2
  • Complemented by MySQL relational DB; no more Objectivity
  • Post-reconstruction C++/OO data model StEvent implemented (see the sketch after this list)
  • Initially a purely transient design unconstrained by I/O (ROOT or Objectivity)
  • Later implemented in persistent form using ROOT without changing the interface
  • Basis of all analysis software development
  • Next step: migrate the OO data model upstream to reconstruction
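
A minimal C++ sketch of the transient/persistent StEvent idea described above; the class and method names (Event, TransientEvent, Track) are invented for illustration and are not the actual StEvent interface.

    // Hypothetical sketch: an analysis-facing event interface that hides
    // whether its storage is transient or backed by persistent (ROOT) I/O.
    #include <cstddef>
    #include <vector>

    struct Track {                       // simplified stand-in for a reconstructed track
        float pt, eta, phi;
    };

    class Event {                        // interface seen by analysis code
    public:
        virtual ~Event() {}
        virtual std::size_t nTracks() const = 0;
        virtual const Track& track(std::size_t i) const = 0;
    };

    class TransientEvent : public Event {    // purely in-memory implementation
    public:
        void addTrack(const Track& t) { fTracks.push_back(t); }
        std::size_t nTracks() const { return fTracks.size(); }
        const Track& track(std::size_t i) const { return fTracks[i]; }
    private:
        std::vector<Track> fTracks;
    };

    // A persistent variant could implement the same Event interface on top of
    // ROOT I/O; analysis code written against Event would not need to change.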

7
MDC2 and Post-MDC2
  • STAR MDC2
  • Full production deployment of the ROOT-based offline chain and I/O; all MDC2 production based on ROOT
  • Statistics suffered from software and hardware problems and the short MDC2 duration: about 1/3 of the best-case scenario
  • Very active physics analysis and QA program
  • StEvent (OO/C++ data model) in place and in use
  • During and after MDC2: addressing the problems
  • Program size up to 850MB; reduced through a broad cleanup
  • Robustness of multi-branch I/O (multiple file streams) improved (see the sketch after this list)
  • XDF-based I/O maintained as a stably functional alternative
  • Improvements to Maker organization of component packages
  • Completed by late May; infrastructure stabilized
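
The multi-branch I/O mentioned above splits event components across separate ROOT tree branches (and potentially separate file streams). A minimal, generic ROOT sketch of the pattern follows; the file, tree and branch names are illustrative, not STAR's production layout.

    // Illustrative multi-branch ROOT I/O (not the STAR production code).
    #include "TFile.h"
    #include "TTree.h"

    void writeBranches() {
        TFile out("example.root", "RECREATE");          // hypothetical output file
        TTree tree("T", "event components");

        float dstEnergy = 0;                            // stand-ins for event components
        int   nHits     = 0;
        tree.Branch("dst",  &dstEnergy, "dstEnergy/F"); // one branch per component
        tree.Branch("hits", &nHits,     "nHits/I");

        for (int i = 0; i < 100; ++i) {                 // fill a few dummy events
            dstEnergy = 10.f * i;
            nHits     = i % 7;
            tree.Fill();
        }
        tree.Write();                                   // write the tree to the file
        out.Close();
    }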

8
Software Status for Engineering Run
  • Offline environment and infrastructure stabilized
  • Shift of focus to consolidation: usability improvements, documentation, user-driven enhancements, developing and responding to QA
  • DAQ-format data supported in offline from raw files through analysis
  • Stably functional data storage
  • Universal I/O interface transparently supports all STAR file types
  • DAQ raw data, XDF, ROOT (Grand Challenge and online pool to come)
  • ROOT I/O debugging proceeded through June; now stable
  • StEvent in wide use for physics analysis and QA
    software
  • Persistent version of StEvent implemented and
    deployed
  • Very active analysis and QA program
  • Calibration/parameter DB not ready (now, 10/99, being deployed)

9
Real Data Processing
  • Currently the live detector is the TPC
  • 75% of the TPC read out (beam data and cosmics)
  • Can read and analyze zero-suppressed TPC data all the way to DST
  • Real-data DSTs read and used in StEvent post-reco analysis
  • Bad channel suppression implemented and tested
  • First-order alignment worked out (1mm); the rest to come from residuals analysis
  • 10,000 cosmics with no field and several runs with field on
  • All interesting real data from the engineering run passed through regular production reconstruction and QA
  • Now preparing for a second iteration incorporating improvements in reconstruction codes and calibrations

10
Event Store and Data Management
  • Success of ROOT-based event data storage from
    MDC2 on relegated Objectivity to metadata
    management role, if any
  • ROOT provides storage for the data itself
  • We can use a simpler, safer tool in metadata role
    without compromising our data model, and avoid
    complexities and risks of Objectivity
  • MySQL adopted (relational DB, open software,
    widely used, very fast, but not a full-featured
    heavyweight like ORACLE)
  • Wonderful experience so far: excellent tools, very robust, extremely fast
  • Scalability OK so far (e.g. 2M rows of 100 bytes); multiple servers can be used as needed to address scalability needs
  • Not taxing the tool because metadata, not large-volume data, is stored
  • Objectivity is gone from STAR

11
Requirements STAR 8/99 View (My Version)
12
RHIC Data Management Factors For Evaluation
  • My perception of the changes in the STAR view from 97 to now, shown per factor as Objectivity vs. ROOT+MySQL ratings (the rating marks are not legible in this transcript); factors evaluated:
  • Cost
  • Performance and capability as data access solution
  • Quality of technical support
  • Ease of use, quality of doc
  • Ease of integration with analysis
  • Ease of maintenance, risk
  • Commonality among experiments
  • Extent, leverage of outside usage
  • Affordable/manageable outside RCF
  • Quality of data distribution mechanisms
  • Integrity of replica copies
  • Availability of browser tools
  • Flexibility in controlling permanent storage location
  • Level of relevant standards compliance, e.g. ODMG
  • Java access
  • Partitioning DB and resources among groups

13
STAR Production Database
  • MySQL-based production database (for want of a better term) in place with the following components (see the query sketch after this list):
  • File catalogs
  • Simulation data catalog
  • populated with all simulation-derived data in
    HPSS and on disk
  • Real data catalog
  • populated with all real raw and reconstructed
    data
  • Run log and online log
  • fully populated and interfaced to online run log
    entry
  • Event tag databases
  • database of DAQ-level event tags; populated by an offline scanner, needs to be interfaced to the buffer box and extended with downstream tags
  • Production operations database
  • production job status and QA info
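
As a concrete illustration of how a tool might query such a catalog, here is a minimal sketch using the MySQL C API; the host, credentials, table and column names are hypothetical, not the actual STAR schema.

    // Hypothetical sketch: querying a MySQL-based file catalog via the MySQL C API.
    // Host, credentials, table and column names are illustrative only.
    #include <mysql/mysql.h>
    #include <cstdio>

    int main() {
        MYSQL* db = mysql_init(NULL);
        if (!mysql_real_connect(db, "db.example.org", "reader", "secret",
                                "FileCatalog", 0, NULL, 0)) {
            std::fprintf(stderr, "connect failed: %s\n", mysql_error(db));
            return 1;
        }
        // Illustrative query: files of one simulation dataset still on disk.
        if (mysql_query(db, "SELECT path, size FROM files "
                            "WHERE dataset='example_dataset' AND on_disk=1") == 0) {
            MYSQL_RES* res = mysql_store_result(db);
            MYSQL_ROW  row;
            while ((row = mysql_fetch_row(res)))
                std::printf("%s  %s bytes\n", row[0], row[1]);
            mysql_free_result(res);
        }
        mysql_close(db);
        return 0;
    }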

14
ROOT Status in STAR
  • ROOT is with us to stay!
  • No major deficiencies or obstacles found; no post-ROOT visions contemplated
  • ROOT community growing: Fermilab Run II, ALICE, MINOS
  • We are leveraging community developments
  • First US ROOT workshop at FNAL in March
  • Broad participation, 50 from all major US labs and experiments
  • ROOT team present and heeded our priority requests
  • I/O improvements: robust multi-stream I/O and schema evolution
  • Standard Template Library support
  • Both emerging in subsequent ROOT releases
  • FNAL participation in development, documentation
  • ROOT guide and training materials recently
    released
  • Our framework is based on ROOT, but application codes need not depend on ROOT (neither is it forbidden to use ROOT in application codes); see the sketch below.
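
To illustrate the last point, here is a minimal sketch of ROOT-free application code sitting behind a thin framework adapter; all names are invented for illustration and do not correspond to actual STAR classes.

    // Hypothetical sketch: application-level analysis logic with no ROOT dependence.
    #include <cstddef>
    #include <vector>

    struct TrackKinematics { float pt, eta; };   // plain data passed in by the framework

    // Pure analysis logic; can be tested or reused outside the ROOT-based framework.
    int countHighPtTracks(const std::vector<TrackKinematics>& tracks, float ptCut) {
        int n = 0;
        for (std::size_t i = 0; i < tracks.size(); ++i)
            if (tracks[i].pt > ptCut) ++n;
        return n;
    }

    // A thin Maker-style wrapper on the framework side would gather the tracks
    // from the ROOT-based event structures and simply forward them, e.g.
    //   int n = countHighPtTracks(tracksFromEvent, 2.0f);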

15
Software Releases and Documentation
  • Release policy and mechanisms stable and working
    fairly smoothly
  • Extensive testing and QA: nightly (latest version) and weekly (higher-statistics testing before the dev version is released to new)
  • Software build tools switched from gmake to cons
    (perl)
  • more flexible, easier to maintain, faster
  • Major push in recent months to improve scope and
    quality of documentation
  • Documentation coordinator (coercer!) appointed
  • New documentation and code navigation tools
    developed
  • Needs prioritized, pressure being applied; new doc has started to appear
  • Ongoing monthly tutorial program
  • With cons, doc/code tools, database tools, perl
    has become a major STAR tool
  • Software by type: All / Modified in last 2 months
  • C: 18,938 / 1,264
  • C++: 115,966 / 52,491
  • FORTRAN: 93,506 / 54,383
  • IDL: 8,261 / 162
  • KUMAC: 5,578 / 0
  • MORTRAN: 7,122 / 3,043
  • Makefile: 3,009 / 2,323
  • scripts: 36,188 / 26,402

16
QA
  • Major effort during and since MDC2
  • Organized effort under QA Czar Peter Jacobs: weekly meetings and QA reports
  • QA signoff integrated with software release
    procedures
  • Suite of histograms and other QA measures in
    continuous use and development
  • Automated tools managing production and extraction of QA measures from test and production running recently deployed (see the sketch after this list)
  • Acts as a very effective driver for debugging and
    development of the software, engaging a lot of
    people
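
A minimal sketch of the kind of automated histogram comparison such QA tooling can perform, using ROOT's built-in Kolmogorov test; the file handling, histogram name and pass threshold are illustrative, not the actual STAR QA criteria.

    // Illustrative QA-style check (not the actual STAR QA suite): compare a test
    // histogram against a reference and report a Kolmogorov-Smirnov probability.
    #include "TFile.h"
    #include "TH1.h"
    #include <cstdio>

    bool qaCompare(const char* refFile, const char* testFile, const char* histName) {
        TFile ref(refFile);                           // reference production file
        TFile test(testFile);                         // newly produced file under test
        TH1* hRef  = (TH1*)ref.Get(histName);
        TH1* hTest = (TH1*)test.Get(histName);
        if (!hRef || !hTest) return false;            // missing histogram: flag it

        double prob = hRef->KolmogorovTest(hTest);    // compatibility probability
        std::printf("%s: KS probability = %.3f\n", histName, prob);
        return prob > 0.01;                           // illustrative pass threshold
    }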

17
Current Software Status
  • Infrastructure for year one pretty much there
  • Simulation stable
  • 7TB production simulation data generated
  • Reconstruction software for year one mostly there
  • lots of current work on quality, calibrations,
    global reconstruction
  • TPC in the best shape, EMC in the worst (two new FTEs should help EMC catch up; 10% installation in year 1)
  • well exercised in production: 2.5TB of reconstruction output generated in production
  • Physics analysis software now actively underway
    in all working groups
  • contributing strongly to reconstruction and QA
  • Major shift of focus in recent months away from
    infrastructure and towards reconstruction and
    analysis
  • Reflected in the program of STAR Computing Week last week: predominantly reco/analysis

18
Priority Work for Year One Readiness
  • In Progress...
  • Extending data management tools (MySQL DB, disk file management, HPSS file management, multi-component ROOT files)
  • Complete schema evolution, in collaboration with
    ROOT team
  • Completion of the DB integration of slow control as a data source, completion of online integration, extension to all detectors
  • Extend and apply OO data model (StEvent) to
    reconstruction
  • Continued QA development
  • Reconstruction and analysis code development
  • Responding to QA results and addressing year 1
    code completeness
  • Improving and better integrating visualization
    tools
  • Management of CAS processing and data distribution, both for mining and individual-physicist-level analysis
  • Integration and deployment of Grand Challenge

19
STAR Analysis CAS Usage Plan
  • CAS processing with DST input based on managed
    production by the physics working groups (PWG)
    using the Grand Challenge Architecture
  • Later stage processing on micro-DSTs
    (standardized at the PWG level) and nano-DSTs
    (defined by individuals or small groups) occurs
    under the control of individual physicists and
    small groups
  • Mix of LSF-based batch and interactive work
  • on both Linux and Sun, but with far greater emphasis on Linux
  • For I/O intensive processing, local Linux disks
    (14GB usable) and Suns available
  • Usage of local disks and availability of data to
    be managed through the file catalog
  • Web-based interface to management, submission and
    monitoring of analysis jobs in development

20
Grand Challenge
  • What does the Grand Challenge do for the user? (a usage sketch follows this list)
  • Optimizes access to HPSS based data store
  • Improves data access for individual users
  • Allows event access by query
  • Present a query string to the GCA (e.g. NumberLambdas > 1)
  • Receive iterator over events which satisfy query
    as files are extracted from HPSS
  • Pre-fetches files so that the next file is
    requested from HPSS while you are analyzing the
    data in your first file
  • Coordinates data access among multiple users
  • Coordinates ftp requests so that a tape is staged
    only once per set of queries which request files
    on that tape
  • General user-level HPSS retrieval tool
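
To make the query-and-iterate pattern concrete, here is a hypothetical usage sketch; EventQuery, EventHandle and their methods are invented names standing in for the Grand Challenge client interface, not the actual GCA API.

    // Hypothetical sketch of the query/iterator usage pattern described above.
    #include <cstdio>
    #include <string>

    struct EventHandle { int run, event; };            // stand-in for a delivered event

    class EventQuery {                                 // stand-in for a GCA-style client
    public:
        explicit EventQuery(const std::string& predicate) : fLeft(3) {
            std::printf("query: %s\n", predicate.c_str());
        }
        // In a real client this call would block until the next matching event's
        // file has been staged from HPSS; files are prefetched while you analyze.
        bool nextEvent(EventHandle& ev) {
            if (fLeft <= 0) return false;
            ev.run = 289005; ev.event = 3 - fLeft;     // dummy events for the sketch
            --fLeft;
            return true;
        }
    private:
        int fLeft;
    };

    int main() {
        EventQuery q("NumberLambdas > 1");             // tag-based predicate
        EventHandle ev;
        while (q.nextEvent(ev))
            std::printf("analyzing run %d event %d\n", ev.run, ev.event);
        return 0;
    }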

21
Grand Challenge queries
  • Queries based on physics tag selections
  • SELECT (component1, component2, ...)
  • FROM dataset_name
  • WHERE (predicate_conditions_on_properties)
  • Example
  • SELECT dst, hits
  • FROM Run00289005
  • WHERE (glb_trk_tot > 0)

Event components: fzd, raw, dst-xdf, dst-root, hits, StrangeTag, FlowTag, StrangeMuDst, ...
Mapping from run/event/component to file is done via the database.
The GC index assembles tags and component file locations for each event.
A tag-based query match yields the files requiring retrieval to serve up that event.
Event-list-based queries allow using the GCA for general-purpose coordinated HPSS retrieval.
Event-list-based retrieval example:
  SELECT dst, hits
  Run 00289005 Event 1
  Run 00293002 Event 24
  Run 00299001 Event 3 ...
22
Grand Challenge in STAR
23
STAR GC Implementation Plan
  • Interface GC client code to STAR framework
  • Already runs on Solaris and Linux
  • Needs integration into framework I/O management
  • Needs connections to STAR MySQL DB
  • Apply GC index builder to STAR event tags
  • Interface is defined
  • Has been used with non-STAR ROOT files
  • Needs connection to STAR ROOT and MySQL DB
  • (New) manpower for implementation now available
  • Experienced in STAR databases
  • Needs to come up to speed on GCA

24
Current STAR Status at RCF
  • Computing operations during the engineering run
    fairly smooth, apart from very severe security
    disruptions
  • Data volumes small, and direct DAQ-RCF data path
    not yet commissioned
  • Effectively using the newly expanded Linux farm
  • Steady reconstruction production on CRS; transition to year 1 operation should be smooth
  • New CRS job management software deployed in MDC2
    works well and meets our needs
  • Analysis software development and production
    underway on CAS
  • Tools managing analysis operations under
    development
  • Integration of Grand Challenge data management
    tools into production and physics analysis
    operations to take place over the next few months
  • Not needed for early running (low data volumes)

25
Concerns RCF Manpower
  • Understaffing directly impacts:
  • Depth of support/knowledge base in crucial technologies, e.g. AFS, HPSS
  • Level and quality of user and experiment-specific support
  • Scope of RCF participation in software: much less central support/development effort in common software than at other labs (FNAL, SLAC)
  • e.g. ROOT used by all four experiments, but no
    RCF involvement
  • Exacerbated by very tight manpower within the
    experiment software efforts
  • Some generic software development supported by
    LDRD (NOVA project of STAR/ATLAS group)
  • The existing overextended staff is getting the
    essentials done, but the data flood is still to
    come
  • Concerns over RCF understaffing recently
    increased with departure of Tim Sailer

26
Concerns Computer/Network Security
  • Careful balance required between ensuring
    security and providing a productive and capable
    development and production environment
  • Not yet clear whether we are in balance or have
    already strayed to an unproductive environment
  • Unstable offsite connections, broken farm functionality, database configuration gymnastics, farm (even the interactive part) cut off from the world, limited access to our data disks
  • Experiencing difficulties, and expecting new ones, particularly from the private subnet configuration unilaterally implemented by RCF
  • Need should be (re)evaluated in light of the new lab firewall
  • RCF security closely coupled to overall lab computer/network security; a coherent site-wide plan, as non-intrusive as possible, is needed
  • We are still recovering from the knee-jerk "slam the doors" response of the lab to the August incident
  • Punching holes in the firewall to enable work to get done
  • I now regularly use PDSF at NERSC when offsite to avoid being tripped up by BNL security

27
Other Concerns
  • HPSS transfer failures
  • During MDC2, in certain periods up to 20% of file transfers to HPSS failed dangerously
  • transfers seem to succeed: no errors, and the file is seemingly visible in HPSS with the right size
  • but on reading we find the file not readable
  • John Riordan has a list of errors seen during reading
  • In reconstruction we can guard against this (a sketch follows this list), but it would be a much more serious problem for DAQ data; we cannot afford to read DAQ data back from HPSS to check its integrity
  • Continuing networking disruptions
  • A regular problem in recent months: the network dropping out or very slow for unknown/unannounced reasons
  • If unintentional, bad network management
  • If intentional, bad network management
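
As noted in the HPSS bullets above, reconstruction output can be guarded by verifying a transfer after the fact. A minimal sketch of such a read-back check follows, assuming the archived copy has already been staged back to local disk by whatever retrieval tool production uses; it illustrates the idea only, not the actual STAR guard.

    // Hypothetical read-back guard against silent HPSS transfer failures:
    // checksum the original file and a staged-back copy, and compare.
    #include <cstdio>

    // Simple byte-wise checksum; a real guard might use a stronger digest.
    unsigned long checksumFile(const char* path) {
        std::FILE* f = std::fopen(path, "rb");
        if (!f) return 0UL;
        unsigned long sum = 0UL;
        int c;
        while ((c = std::fgetc(f)) != EOF)
            sum = sum * 131UL + (unsigned long)(unsigned char)c;
        std::fclose(f);
        return sum;
    }

    // Returns true only if the staged-back copy matches the original byte stream.
    bool transferLooksGood(const char* originalPath, const char* readBackPath) {
        unsigned long a = checksumFile(originalPath);
        unsigned long b = checksumFile(readBackPath);
        return a != 0UL && a == b;
    }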

28
Public Information and Documentation Needed
  • Clear list of the services RCF provides, the level of support of these services, the resources allocated to each experiment, and the personnel responsible for support:
  • - rcas: LSF, ....
  • - rcrs: CRS software, ....
  • - AFS (stability, home directories, ...)
  • - Disks (inside/outside access)
  • - HPSS
  • - experiment DAQ/online interface
  • Web-based information is very incomplete
  • e.g. information on planned facilities for year one and after
  • largely a restatement of the first point
  • General communication
  • RCF still needs improvement in general user communication and responsiveness

29
Outsourcing Computing in STAR
  • Broad usage of STAR software at RCF, on BNL-STAR machines, and remotely; STAR environment setup counts since early August:
  • RCF: 118,571
  • BNL Sun: 33,508
  • BNL Linux: 13,418
  • Desktop: 6,038
  • HP: 801
  • LBL: 29,308
  • Rice: 12,707
  • Indiana: 19,852
  • Non-RCF usage currently comparable to RCF usage; good distributed computing support is essential in STAR
  • Enabled by the AFS-based environment; AFS is an indispensable tool
  • But inappropriate for data access usage
  • Agreement reached with RCF for read-only access to RCF NFS data disks from STAR BNL computers; seems to be working well
  • New BNL-STAR facilities
  • 6 dual 500MHz/18GB (2 arrived), 120GB disk
  • For software development, software and OS/compiler testing, online monitoring, services (web, DB, ...)
  • Supported and managed by STAR personnel
  • Supports STAR environment for Linux desktop boxes

30
STAR Offsite Computing PDSF
  • pdsf at LBNL/NERSC
  • Virtually identical configuration to RCF
  • Intel/Linux farm, limited Sun/Solaris, HPSS-based data archiving
  • Current (10/99) scale relative to STAR RCF: CPU 50% (1200 Si95), disk 85% (2.5TB)
  • Long-term goal: resources equal to STAR's share of RCF
  • Consistent with the long-standing plan that RCF hosts 50% of the experiments' computing facilities, with simulation and some analysis offsite
  • Ramp-up currently being planned
  • Other NERSC resources: T3Es a major source of simulation cycles
  • 210,000 hours allocated in FY00, one of the larger allocations in terms of CPU and storage
  • Focus in future will be on PDSF; no porting to the next MPP generation

31
STAR Offsite Computing PSC
  • Cray T3Es at Pittsburgh Supercomputing Center
  • STAR Geant3 based simulation used at PSC to
    generate 4TB of simulated data in support of
    Mock Data Challenges and software development
  • Supported by local CMU group
  • Recently retired when our allocation ran out and
    could not be renewed
  • Increasing reliance on PSC

32
STAR Offsite Computing Universities
  • Physics analysis computing at home institutions
  • Processing of micro-DSTs and DST subsets
  • Software development
  • Primarily based on small Linux clusters
  • Relatively small data volumes: aggregate total of 10TB/yr
  • Data transfer needs of US institutes should be met by the network
  • Overseas institutes will rely on tape based
    transfers
  • Existing self-service scheme will probably
    suffice
  • Some simulation production at universities
  • Rice, Dubna

33
Residual Needs
  • Data transfer to PDSF and other offsite
    institutes
  • Existing self-service DLT probably satisfactory
    for non-PDSF tape needs, but little experience to
    date
  • 100GB/day network transfer rate today adequate
    for PDSF/NERSC data transfer
  • Future PDSF transfer needs (network, tape) to be
    quantified once PDSF scale-up is better understood

34
Conclusions STAR at RCF
  • Overall RCF is an effective facility for STAR
    data processing and management
  • Sound choices in overall architecture, hardware,
    software
  • Well aligned with the HENP community
  • Community tools and expertise easily exploited
  • Valuable synergies with non-RHIC programs,
    notably ATLAS
  • Production stress tests have been successful and
    instructive
  • On schedule: facilities have been there when needed
  • RCF interacts effectively and consults
    appropriately for the most part, and is generally
    responsive to input
  • Weak points are security issues and interactions
    with general users (as opposed to experiment
    liaisons and principals)
  • Mock Data Challenges have been highly effective
    in exercising, debugging and optimizing RCF
    production facilities as well as our software
  • Based on status to date, we expect STAR and RCF
    to be ready for whatever RHIC Year 1 throws at
    us.