1
Advanced software methods for physics analysis
  • Luca Lista
  • INFN Napoli

2
Introduction
  • Unprecedented data-analysis complexity is being
    experienced in BaBar today, and will be at the
    LHC in the future
  • Huge data samples
  • Large number of analyses
  • High level of analysis complexity
  • Analysis software and data model must have a high
    level of reliability
  • Provide the user community with a set of
    powerful, generic, simple-to-use tools to
    analyze data
  • Core software technicalities must be hidden from
    the end user behind well defined interfaces
  • Code reuse must be pursued to permit users to
    share what has been developed
  • A good Object Oriented architecture is a solution
  • N.B. Good is more important than OO here

3
Outline
  • Analysis Data Model
  • BaBar experience with the new Analysis Model
  • Analysis toolkits
  • An example of fitting and toy Monte Carlo
  • Conclusions

4
BaBar experience
  • BaBar has collected up to now ≈200×10⁶ BB pairs
  • Physics events are organized according to a well
    established model with different components at
    increasing levels of detail
  • Tag ( boolean bits, floats and int. ) for fast
    event filtering
  • Aod ( Micro-DST ) containing physics-level
    objects
  • Esd ( Mini-DST ) intermediate level
    reconstruction objects
  • Reconstructed tracks and neutral objects are
    accessible to the users by means of the
    information provided by BtaCandidates, which
    represent particle candidates
  • Event store technology originally based on
    Objectivity
  • It proved to be unfit for analysis purposes
  • Almost every physics analysis is based on Tag and
    Aod levels.
  • ROOT I/O was supported early on for those
    components

5
The original Analysis Model
  • The physics code runs in the BaBar Framework
  • Performs particle identification and builds
    intermediate composite particles (π0, KS, D,
    resonances)
  • Composite particles are built by means of a
    standard (i.e. validated) and configurable set of
    tools
  • This task is intrinsically very expensive in
    terms of CPU time because of the high
    combinatorics involved.
  • The output of the process (containers of
    composite particle candidates) was available
    only at the transient level
  • Very often physics analyses take advantage of an
    intermediate step of data reduction (skimming)
  • Performs further data reduction on individual
    analysis basis
  • Eventually manage the production of large
    PAW/ROOT n-tuples which contain specific
    analysis level physics variables, composite
    candidates kinematics and vertexing

6
Old model achievements
  • This model has been so far very successful in
    producing high quality physics results
  • Submitted more than 90 physics papers, which
    include discoveries!
  • But experts believe that it wouldn't scale well
    to 5x the current data set
  • Moreover, analysis at the n-tuple level has the
    disadvantage that existing (and validated) code
    is lost
  • This leads to replication of code
  • Sharing analysis tools and strategies is
    challenging

7
The new Analysis Model
  • Makes the Event Store content configurable
  • Support for the storage of composite candidates
  • More flexibility: support addition of custom
    user data
  • Expand the role of Skimming (collections of
    pre-selected events)
  • Rewrite just customized components
  • Borrow components from parent collections
  • Provide both pointer-copy and full deep-copy for
    light skims
  • Better reading performance, easy off-site
    portability
  • The possible scenario
  • Analysis working groups may instrument their
    skims with high level added data and specific
    composite candidates
  • Final users may produce their own reduced skims
    (re-skimming) with their options of user added
    data and composite candidates depending on their
    actual needs
  • Frequent centralized skimming to reduce the delay
    for availability of improvements, new skims, etc.
  • Interactive access to skim data (ROOT) is
    supported for analysis!
  • Production of large n-tuples becomes less
    appealing!

8
Composite particles candidates
  • PAW n-tuples were used as an additional storage
    format to cope with the lack of needed
    functionality
  • Analysts used to store composite candidates and
    high-level physics quantities in the n-tuples
  • Often replicating most of the Event Store
    content.
  • Every analysis working group has its own large
    heterogeneous PAW/ROOT files (disk space
    inefficient) and formats
  • With the new model, the candidate particles'
    genealogy, the fit algorithm and the constraints
    are written to the DST
  • This design choice trades some reading
    performance for disk-space savings
  • When reading back, all the persistent candidates
    are translated into transient objects
  • The performance issue of re-fitting all the
    transient candidates may be mitigated by
    supporting dynamic data access
  • Also known as Reconstruction on demand (a
    minimal sketch of the idea follows)
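
A minimal illustrative sketch of the caching idea behind
reconstruction on demand (hypothetical class, not the actual
BaBar code): the expensive re-fit runs only the first time a
quantity is requested, and the result is cached afterwards.

  // Hypothetical sketch: run the costly computation only on
  // first access, then return the cached result.
  struct FitResult { double mass; };        // placeholder result type

  class OnDemandCandidate {
  public:
    OnDemandCandidate() : _fitted( false ) { }
    const FitResult & fit() const {
      if ( ! _fitted ) {                    // first access: run the fit
        _result = expensiveRefit();
        _fitted = true;
      }
      return _result;                       // later accesses: cached
    }
  private:
    FitResult expensiveRefit() const {      // stands in for the real re-fit
      FitResult r; r.mass = 5.279; return r;
    }
    mutable FitResult _result;
    mutable bool _fitted;
  };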

9
User added data - requirements
  • Support for storing user-defined variables
  • Widely used physics analysis variables in BaBar
  • mES, ΔE, ΔmD*,D, cosθB,Dl, ...
  • Guarantee flexibility and uniformity in the
    interface
  • Support for native types
  • float, int, ...
  • As well as for more complex objects
  • ThreeVector, LorentzVector, ...
  • User data should refer to analysis objects or be
    global to the entire event
  • Non-intrusive design
  • To extend functionality (the good old
    Open/Closed OO principle)
  • Simplicity from the point of view of the user
  • Technological details are hidden as much as
    possible

10
User Added Data Interface
  • The requirements are fulfilled by systematic use
    of Generic Programming (C++ templates)
  • User variables of generic type
  • Global event data
  • Data associable to generic objects
  • Specializations for BtaCandidate case (the BaBar
    particle candidate object)
  • All native and most common structured variable
    types, as well as BtaCandidate objects, can be
    stored in the DST
  • The user has to deal with only a few concepts
    and classes (a minimal illustrative sketch
    follows this list)
  • Variables by means of the UsrVariable<T>
    template class
  • Blocks aggregating variable values and
    associations
  • UsrEventBlock class to aggregate quantities
    relevant to the whole event
  • UsrCandBlock class to group variables associated
    to candidates
  • static put/get functions to interact with the
    event store
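
As a minimal illustrative sketch (assumed structure, not the
actual BaBar implementation), a UsrVariable<T> template can
simply pair a variable name with a value of generic type T,
which is what makes the interface uniform across types:

  #include <string>

  // Hypothetical sketch of a named, generically typed user variable.
  template <typename T>
  class UsrVariable {
  public:
    explicit UsrVariable( const std::string & name ) :
      _name( name ), _value() { }
    UsrVariable & operator=( const T & v ) { _value = v; return *this; }
    const std::string & name() const { return _name; }
    const T & value() const { return _value; }
  private:
    std::string _name;   // label used to identify the variable
    T _value;            // payload: float, int, vectors, ...
  };

  // usage: UsrVariable<float> mES( "mES" ); mES = 5.279f;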

11
User defined candidate variables
At the beginning of the job, declare the variables:

  UsrVariable<float> mES( "mES" );
  UsrVariable<float> deltaE( "deltaE" );
  UsrCandBlock B0Data;

Analysis groups writing data:

  // for each BtaCandidate cand
  bool ok = true;
  mES = ...; deltaE = ...;   // compute the values
  ok = B0Data.put( cand, mES );
  ok = B0Data.put( cand, deltaE );

Analysts reading data:

  // for each BtaCandidate cand
  bool found = B0Data.get( cand, mES );      // get candidate mES
  found = B0Data.get( cand, deltaE );        // get candidate deltaE
12
Status of the new Analysis Model
  • The model is completed and in production
  • Physics results for the next conferences are
    being produced using the new Analysis Model

13
A toolkit for Fit and toy Monte Carlo
  • The toolkit provides
  • a language to describe and model complex
    parametric fit problems in C++
  • utilities to study the frequentist properties of
    the fit
  • Not intended to provide new mathematical
    algorithms
  • The underlying minimization engine is Minuit
  • Interest on this tool from
  • BaBar, GEANT4, LCG-PI (to be released)

14
Main functionalities
  • Description of Probability Distribution Functions
    (PDF)
  • most common PDFs provided (Gaussian, Poisson,
    etc.)
  • random number generators for each provided PDF
  • utilities to combine and manipulate PDFs
  • Manipulation of symbolic expressions
  • simplifies the definition of PDF models and fit
    functions
  • Fitter tools
  • different Unbinned Maximum Likelihood (UML)
    fitters and Chi-square fitter supported
  • Toy Monte Carlo
  • utility to generate random data samples to
    validate the fit results (pull distribution, fit
    bias estimate, etc.)
  • User-defined components can be easily plugged-in

15
Design choices
  • The code is optimized for speed
  • Toy Monte Carlo studies of complex fits are very
    CPU intensive
  • Speed can be achieved without losing good OO
    design
  • avoid virtual functions where not necessary
  • using template generic programming (a generic
    illustration follows this list)
  • the Boost C++ library provides powerful tools
  • Metaprogramming permits type manipulations at
    compile time
  • Users don't see these technical details in the
    interface
  • External package dependencies are well isolated
  • Random number generator engines (ROOT, CLHEP, ...)
  • Minuit wrapper (ROOT, ...)
  • Other minimizers may be adopted (NAG, ...)
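
As a generic illustration (not code from the toolkit itself)
of how template generic programming avoids virtual-function
overhead: the concrete PDF type is a template parameter, so
operator() calls are resolved at compile time and can be
inlined in the inner loop.

  #include <cmath>

  struct Gaussian {
    Gaussian( double m, double s ) : mean( m ), sigma( s ) { }
    double operator()( double x ) const {
      double z = ( x - mean ) / sigma;
      return std::exp( -0.5 * z * z ) / ( sigma * std::sqrt( 2 * M_PI ) );
    }
    double mean, sigma;
  };

  // The likelihood is templated on the PDF type: no virtual dispatch.
  template <typename PDF>
  double negLogLikelihood( const PDF & pdf, const double * x, int n ) {
    double nll = 0;
    for ( int i = 0; i < n; ++i ) nll -= std::log( pdf( x[ i ] ) );
    return nll;
  }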

16
PDF interface
A PDF implements the () operator: P = f( x, y, ... ).
Users can define new PDFs respecting this interface (a
further example is sketched after the code below).

  struct Flat {
    Flat( double a, double b ) :
      min( a ), max( b ) { }
    double operator()( double x ) const {
      return ( x < min || x > max ?
               0 :
               1 / ( max - min ) );
    }
    double min, max;
  };

  struct Poissonian {
    Poissonian( double m ) :
      mean( m ) { }
    double operator()( int n ) const {
      return ( exp( - mean ) *
               pow( mean, n ) /
               factorial( n ) );
    }
    double mean;
  };
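
A further illustrative example (not part of the toolkit) of a
user-defined PDF respecting the same operator() interface: an
exponential distribution over [0, ∞).

  #include <cmath>

  // Illustrative user-defined PDF: exponential with mean lifetime tau,
  // normalized to unity over [0, infinity).
  struct Exponential {
    Exponential( double t ) : tau( t ) { }
    double operator()( double x ) const {
      return ( x < 0 ? 0 : std::exp( - x / tau ) / tau );
    }
    double tau;
  };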

17
Random number generators
Implements the generate method: r.generate( x, y, ... )

  template< typename Generator = DefaultGen >
  struct RandomGenerator< Flat, Generator > {
    RandomGenerator( const Flat & pdf ) :
      _min( pdf.min ), _max( pdf.max ) { }
    void generate( double & x ) const {
      x = Generator::shootFlat( _min, _max );
    }
  private:
    const double _min, _max;
  };
  • Users can define new generators with their
    preferred method
  • Numerical implementations are provided
  • trapezoidal PDF sampling
  • hit-or-miss technique (a generic sketch of this
    technique follows the macros below)

RANDOM_GENERATOR_SAMPLE(MyPdf, Bins, Min, Max)
RANDOM_GENERATOR_HITORMISS(MyPdf, Min, Max, fMax)
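
For reference, a generic sketch of the hit-or-miss
(acceptance-rejection) technique named above; this is the
general algorithm, not the toolkit's actual implementation
behind RANDOM_GENERATOR_HITORMISS, and uniform() stands in
for whatever uniform deviate engine is configured.

  #include <cstdlib>

  // Uniform deviate in [0, 1); placeholder for the configured engine.
  double uniform() { return std::rand() / ( RAND_MAX + 1.0 ); }

  // Draw x uniformly in [min, max] and accept it with probability
  // pdf(x) / fMax, where fMax bounds the PDF from above.
  template <typename PDF>
  double generateHitOrMiss( const PDF & pdf,
                            double min, double max, double fMax ) {
    while ( true ) {
      double x = min + ( max - min ) * uniform();
      if ( fMax * uniform() < pdf( x ) ) return x;
    }
  }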
18
Combining PDFs
  Argus shoulder( 5.20, 5.28, -0.1 );
  Gaussian peak( 5.28, 0.05 );
  typedef Mixture<Gaussian, Argus> Mix;
  Mix pdf( peak, shoulder, 0.1 );
  RandomGenerator<Mix> rnd;
  double x;
  rnd.generate( x );

  Gaussian sigX( 5.28, 0.05 );
  Gaussian sigY( 0, 0.015 );
  typedef Independent<Gaussian, Gaussian> SigXY;
  RandomGenerator<SigXY> rndXY;
  double x, y;
  rndXY.generate( x, y );
  • Transformation of variables is also supported
  • Random variables are generated in the original
    coordinate system, then transformed (a generic
    sketch follows below)

Define the relevant language of the problem in
C++!
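
A generic sketch of the variable-transformation idea
(illustrative names, not the toolkit's API): generate in the
original Cartesian coordinates with a generator like rndXY
above, then transform to polar coordinates.

  #include <cmath>

  // Generate (x, y) in the original coordinate system, then
  // transform to (r, phi).
  template <typename Generator>
  void generatePolar( const Generator & gen, double & r, double & phi ) {
    double x, y;
    gen.generate( x, y );
    r   = std::sqrt( x * x + y * y );
    phi = std::atan2( y, x );
  }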
19
Fit PDF parameters and run Toy MC
  const int sig = 100;
  double mean = 0, sigma = 1;
  Gaussian pdf( mean, sigma );
  Likelihood<Gaussian> like( pdf );
  UMLParameterFitter<Likelihood<Gaussian> > fitter( like );
  fitter.addParameter( "mean", &pdf.mean );
  fitter.addParameter( "sigma", &pdf.sigma );

  Poissonian num( sig );   // alternative: Constant
  Gaussian pdfExp( mean, sigma );
  Experiment<Poissonian, Gaussian> experiment( num, pdfExp );
  for ( int i = 0; i < 50000; i++ ) {
    Sample<Likelihood::types> sample;
    experiment.generate( sample );
    double par[ 2 ] = { mean, sigma }, err[ 2 ] = { 1, 1 }, logLike;
    ...
  }

20
Parameter fit Results (Pulls)
There is a bias (as expected): the variance is estimated
as σ² = (1/n) Σᵢ (xᵢ − μ)², rather than the unbiased
(1/(n−1)) Σᵢ (xᵢ − μ)².
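
For completeness, the standard results behind these pull
distributions and the expected bias (textbook statistics, not
taken from the slides): the pull of a fitted parameter, and
the expectation of the maximum-likelihood variance estimator.

  \[
    \mathrm{pull} = \frac{\hat\theta - \theta_{\mathrm{true}}}{\hat\sigma_{\hat\theta}},
    \qquad
    \hat\sigma^2_{\mathrm{ML}} = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar x)^2,
    \qquad
    E\left[\hat\sigma^2_{\mathrm{ML}}\right] = \frac{n-1}{n}\,\sigma^2 .
  \]

For an unbiased fit with correct errors the pulls follow a
standard normal distribution, while the ML estimate of the
variance is low by a factor (n−1)/n, which is the bias seen
here.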
21
Combined Yield and parameter fit
  const int sig = 10, bkg = 5;
  typedef Poissonian Fluctuation;
  Fluctuation fluctuationSig( sig ),
              fluctuationBkg( bkg );
  typedef Independent< Gaussian,
                       Gaussian > PdfSig;
  typedef Independent< Flat,
                       Flat > PdfBkg;
  Gaussian g1( 0, 1 ), g2( 0, 0.5 );
  Flat f1( -5, 5 ), f2( -5, 5 );
  PdfSig pdfSig( g1, g2 );
  PdfBkg pdfBkg( f1, f2 );
  typedef Experiment< Fluctuation,
                      PdfSig > ToySig;
  typedef Experiment< Fluctuation,
                      PdfBkg > ToyBkg;
  ToySig toySig( fluctuationSig, pdfSig );
  ToyBkg toyBkg( fluctuationBkg, pdfBkg );
  // combined toy, as in the yield-fit example
  Experiment2< ToySig, ToyBkg > toy( toySig, toyBkg );

  Gaussian G1( 0, 1 );
  PdfSig pdfSig1( G1, g2 );
  Likelihood like( pdfSig1, pdfBkg );
  UMLYieldAndParameterFitter<Likelihood> fitter( like );
  fitter.addParameter( "mean", &G1.mean );
  double pull1, pull2, pull3;
  for ( int i = 0; i < 50000; i++ ) {
    Sample< Likelihood::types > sample;
    toy.generate( sample );
    double s[ 3 ] = { sig, bkg, 0 };
    double err[ 3 ] = { 1, 1, 1 };
    double logLike = fitter.fit( s, err, sample );
    pull1 = ( s[ 0 ] - sig ) / err[ 0 ];
    pull2 = ( s[ 1 ] - bkg ) / err[ 1 ];
    pull3 = ( s[ 2 ] - 0 ) / err[ 2 ];
  }

2D Gaussian signal over a 2D flat background;
simultaneous fit of the yields and of the
Gaussian mean.
22
Conclusions
  • Object Oriented analysis and design has been
    adopted in the last decade as software technology
    for physics applications
  • ... but many OO solutions exist for complex
    problems like physics analysis
  • Special care is needed to evaluate code
    flexibility, extensibility, interdependency, ...
  • The potential advantages of OO are fully
    exploited with a careful evaluation of the
    required use cases
  • Not always obvious, experience is important!
  • ... but more than 10 years of development have
    brought significant experience and solutions
  • OO is a mature technology in the HEP field.

23
Backup slides
24
How to improve the Analysis model ?
  • Protect existing physics level code against
    changes
  • New functionalities are added by means of a
    non-intrusive design
  • Same interface and same modus operandi as before
  • Backward compatibility with existing physics code
  • Replace Objectivity and large n-tuple production
  • Support a complete ROOT-IO persistency
  • Except detector conditions that still depend on
    Objectivity
  • Migration to ROOT in the future is foreseen
  • Support interactive access ( ROOT/CINT ) from the
    same ROOT files
  • Usable as a data inspection tool

25
The new Event Store structure
[diagram: Event Store components Mini and Micro, ROOT
persistency only]
The structure has been rationalized, keeping the same
transient interface.
26
User defined event variables
  • The interface for global event variables is the
    same, mutatis mutandis

At the beginning of the job, declare the variables:

  UsrEventBlock AWG_Tag;
  UsrVariable<int> multiplicity;
  AWG_Tag.addVariable( multiplicity );

In the event() code:

  multiplicity = some_algo();
  AWG_Tag.put( multiplicity );
  UsrIfd<UsrEventBlock>::put( event, AWG_Tag, "AWGTAG" );

This provides analysis groups with the functionality to
make their own TAGs easily.
27
UML Yield fit
  const int sig = 10, bkg = 5;
  typedef Independent< Gaussian, Gaussian > PdfSig;
  typedef Independent< Flat, Flat > PdfBkg;
  PdfSig pdfSig( Gaussian( 0, 1 ), Gaussian( 0, 0.5 ) );
  PdfBkg pdfBkg( Flat( -5, 5 ), Flat( -5, 5 ) );
  typedef ExtendedLikelihood2< PdfSig, PdfBkg > Likelihood;
  Likelihood like( pdfSig, pdfBkg );
  UMLYieldFitter< Likelihood > fitter( like );

  typedef Poissonian Fluctuation;   // alternative: Constant
  Fluctuation fluctuationSig( sig ),
              fluctuationBkg( bkg );
  typedef Experiment< Fluctuation, PdfSig > ToySig;
  typedef Experiment< Fluctuation, PdfBkg > ToyBkg;
  ToySig toySig( fluctuationSig, pdfSig );
  ToyBkg toyBkg( fluctuationBkg, pdfBkg );
  Experiment2< ToySig, ToyBkg > toy( toySig, toyBkg );

28
Yield fit Results (Pulls)
⟨s⟩ = 10, ⟨b⟩ = 5
The discrete structure is due to Poisson fluctuations at
low statistics.
29
Symbolic function package
  • Symbolic expressions make the definition of PDFs
    easier

  X x;   // declare the variable x

  // normalize using the symbolic integration at c-tor
  PdfNonParametric<X> f1( sqr( sin(x) * cos(x) ), 0, 4 * M_PI );

  // recompute the normalization every time, since
  // the parameter tau may change from call to call
  Parameter tau( 0.123 );
  PdfParametric<X> f2( x * exp( - tau * x ), 0, 10 );
30
Example of ?2 fit
  X x;
  Parameter a( 0 ), b( 1 ), c( 0 );
  Function<X> parabola( c + x * ( b + x * a ) );
  UniformPartition partition( 100, -1.0, 1.0 );
  Chi2<Function<X> > chi2( parabola, partition );
  Chi2Fitter<Chi2<Function<X> > > fitter( chi2 );
  fitter.addParameter( "a", a.ptr() );
  fitter.addParameter( "b", b.ptr() );
  fitter.addParameter( "c", c.ptr() );
  SampleErr<double> sample( partition.bins() );
  // fill the sample...
  double par[ 3 ] = { a, b, c }, err[ 3 ] = { 1, 1, 1 };
  fitter.fit( par, err, sample );
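
For reference, the quantity minimized in a binned χ² fit of
this kind is the standard least-squares sum (the general
definition, not a toolkit-specific formula), for a function f
with parameters θ fitted to measurements yᵢ with errors σᵢ at
points xᵢ:

  \[
    \chi^2(\theta) = \sum_{i=1}^{N}
      \frac{\left( y_i - f(x_i;\theta) \right)^2}{\sigma_i^2}
  \]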

31
Possible future improvement
  • Upper limit extraction based on Toy Monte Carlo
  • Support for ?2 fit with correlated errors and
    covariance matrix
  • Provide more standard PDFs
  • Crystal Ball, Chebyshev polynomials, ...
  • Managing singular PDFs
  • Dirac-delta components
  • Managing (un)folding

32
Conclusions
  • The new analysis model addresses the limitations
    of the current analysis model
  • ROOT I/O based Event Store persistency
  • Backward compatibility with existing physics code
  • Flexibility and extensibility of the Event Store
    contents and central role of skims production
  • Iterative skimming of the events with progressive
    data reduction and enrichment in specific
    information provides
  • easier off-site portability of data
  • reduced resource needs ( CPU and disk space )
  • Moreover, it favors the development of common
    tools and common analysis strategies among groups

33
Conclusion
  • We designed a new tool to model fit problems
  • Using template generic programming we obtained
  • Generality
  • User can plug-in new components (PDF,
    transformations, random generators, etc.)
  • Easy to incorporate external contributions into
    the tool
  • Light-weight
  • Most of the code is contained in header
    (include) files
  • Mild external dependencies
  • Easy to use
  • Very concise and expressive code
  • CPU Speed
  • Virtual function calls are extremely limited
  • Most of the methods are inlined
  • Interest has been expressed by
  • Geant4 Statistical testing toolkit
  • LCG/PI (LHC Computing Grid - Physics Interfaces)
  • Will focus on a release version shortly