Title: A Toolkit for Statistical Data Analysis
1A Toolkit for Statistical Data Analysis
- S. Donadio, F. Fabozzi, L. Lista, S. Guatelli, B.
Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon, P.
Viarengo
LCG Application Area Meeting, CERN, 5 May 2004
http://www.ge.infn.it/geant4/analysis/HEPstatistics
2History and background
3The motivation from Geant4
Validation of Geant4 physics models through
comparison of simulation vs experimental data or
reference databases
4Historical introduction to EDF tests
- In 1933 Kolmogorov published a short but landmark paper in the Italian journal Giornale dell'Istituto Italiano degli Attuari. He formally defined the empirical distribution function (EDF) and then asked how close it would be to the true distribution F(x), when F(x) is continuous.
- It should be noted that Kolmogorov himself regarded his paper as the solution of an interesting probability problem, following the general interest of the time, rather than as a paper on statistical methodology.
- After Kolmogorov's article, over a period of about 10 years, a number of distinguished mathematicians (Smirnov, Cramer, von Mises, Anderson, Darling, ...) laid the foundations of methods for testing fit to a distribution based on the EDF.
- The ideas in this paper have formed a platform for a vast literature, both on interesting and important probability problems and on methods of using the Kolmogorov statistics for testing fit to a distribution. This literature continues to grow strongly today and shows no sign of decreasing.
5Typical use cases in HEP
- Regression testing
- Throughout the software life-cycle
- Online DAQ
- Monitoring detector behaviour w.r.t. a reference
- Simulation validation
- Comparison with experimental data
- Reconstruction
- Comparison of reconstructed vs. expected distributions
- Physics analysis
- Comparisons of experimental distributions (ATLAS vs. CMS Higgs?)
- Comparison with theoretical distributions (data vs. Standard Model)
6Software tools
- Commercial products used by professional statisticians
- SPSS, NCSS...
- In HEP
- A lot of activity
- workshops/conferences (CERN, Durham, SLAC etc.)
- books (F. James et al., L. Lyons, R. Barlow etc.)
- sophisticated statistical algorithms applied in various data analyses
- ...but, in spite of the important role played by statistics in HEP, very limited availability of software tools for statistics in our field
- and in open-source software in general
7Let's do it ourselves...
A project to develop an open-source software
system for statistical analysis
Provide tools for the statistical comparison of distributions
Create a hub to aggregate expertise and collaborative contributions from scientists interested in statistical methods
See presentation at LCG-AA meeting, 27 November 2002
8Vision: the basics
- Have a vision for the project
- General purpose tool for statistical analysis
- Toolkit approach (choice open to users)
- Open source product
Clearly define scope, objectives
- Rigorous software process
Software quality
Flexible, extensible, maintainable system
- Build on a solid architecture
9Architectural guidelines
- The project adopts a solid architectural approach
- to offer the functionality and the quality needed by the users
- to be maintainable over a long time scale
- to be extensible, to accommodate future evolutions of the requirements
- Component-based architecture
- to facilitate re-use and integration in diverse frameworks
- Dependencies
- adopt a standard (AIDA) for the user layer
- no dependence on any specific analysis tool
- Python
- the glue for interactivity
- The approach adopted is compatible with the recommendations of the LCG Architecture Blueprint Report
10Software process
- Unified Software Development Process, specifically tailored to the project
- practical guidance and tools from the RUP
- both rigorous and lightweight
- mapping onto ISO 15504
- significant experience gained in the group from other projects
- Incremental and iterative life-cycle model
11The Goodness-of-Fit component
12User Requirements
- User requirements elicited, analysed and formally specified
- Functional (capability) and non-functional (constraint) requirements
- User Requirements Document available from the web site
Requirement traceability
- Requirements
- Design
- Implementation
- Test and test results
- Documentation
15Simple user layer
- Shields the user from the complexity of the underlying algorithms and design
- Users only deal with AIDA objects and the choice of comparison algorithm
16GoF algorithms
- Algorithms for binned distributions
- Anderson-Darling test
- Chi-squared test
- Fisz-Cramer-von Mises test
- Tiku test (Cramer-von Mises test in chi-squared approximation)
- Algorithms for unbinned distributions
- Anderson-Darling test
- Fisz-Cramer-von Mises test
- Goodman test (Kolmogorov-Smirnov test in chi-squared approximation)
- Kolmogorov-Smirnov test
- Kuiper test
- Tiku test (Cramer-von Mises test in chi-squared approximation)
17Chi-squared test
- Applies to binned distributions
- It can also be useful for unbinned distributions, but the data must first be grouped into classes
- Cannot be applied if the theoretical (expected) frequency in any class is less than 5
- When this condition is not met, one can try merging contiguous classes until the minimum theoretical frequency is reached
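The merging of contiguous classes described above can be sketched as follows. This is only an illustration under stated assumptions, not the toolkit's implementation: the function names `merge_sparse_classes` and `chi_squared` are hypothetical, the threshold of 5 is taken from the slide, and the choice to fold a leftover sparse tail into the last accepted class is one possible convention among several.

```python
def merge_sparse_classes(observed, expected, min_expected=5.0):
    """Merge contiguous classes until each has expected frequency >= min_expected.

    Hypothetical helper: classes are scanned left to right and accumulated
    until the running expected frequency reaches the threshold.
    """
    if len(observed) != len(expected):
        raise ValueError("observed and expected must have the same length")
    merged_obs, merged_exp = [], []
    acc_obs = acc_exp = 0.0
    for obs, exp in zip(observed, expected):
        acc_obs += obs
        acc_exp += exp
        if acc_exp >= min_expected:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs = acc_exp = 0.0
    # A sparse tail that never reached the threshold is folded into the
    # last accepted class (an assumption of this sketch).
    if acc_exp > 0.0 or acc_obs > 0.0:
        if merged_exp:
            merged_obs[-1] += acc_obs
            merged_exp[-1] += acc_exp
        else:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
    return merged_obs, merged_exp


def chi_squared(observed, expected):
    """Pearson chi-squared statistic: sum over classes of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Illustrative (made-up) frequencies: the first two and last two classes are sparse.
obs, exp = merge_sparse_classes([1, 3, 8, 12, 9, 4, 2],
                                [2.0, 4.0, 9.0, 10.0, 8.0, 4.0, 2.0])
statistic = chi_squared(obs, exp)
```

Note that merging reduces the number of classes, so the number of degrees of freedom used to convert the statistic into a p-value has to be reduced accordingly.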
18More sophisticated algorithms
Unbinned distributions
SUPREMUM STATISTICS
- Kolmogorov-Smirnov test
- Goodman approximation of KS test
- Kuiper test
D_mn = sup_x |F_m(x) - G_n(x)|, the largest distance between the two empirical distribution functions
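As a rough illustration of what a supremum statistic measures, the sketch below computes the two-sample Kolmogorov-Smirnov distance and the Kuiper statistic directly from the empirical distribution functions of two unbinned samples. The function names `edf` and `supremum_statistics` are hypothetical and this is not the toolkit's code; it only relies on the standard definitions D = max(D+, D-) and V = D+ + D-, where D+ and D- are the largest positive and negative differences between the two EDFs.

```python
from bisect import bisect_right


def edf(sample):
    """Return the empirical distribution function of a sample as a callable."""
    data = sorted(sample)
    n = len(data)

    def cumulative(x):
        # Fraction of observations less than or equal to x.
        return bisect_right(data, x) / n

    return cumulative


def supremum_statistics(sample_a, sample_b):
    """Return (D, V): the Kolmogorov-Smirnov distance and the Kuiper statistic."""
    f, g = edf(sample_a), edf(sample_b)
    d_plus = d_minus = 0.0
    # Both EDFs are step functions that only change at observed values,
    # so it is enough to evaluate the difference at those points.
    for x in sorted(set(sample_a) | set(sample_b)):
        diff = f(x) - g(x)
        d_plus = max(d_plus, diff)
        d_minus = max(d_minus, -diff)
    return max(d_plus, d_minus), d_plus + d_minus


d, v = supremum_statistics([1, 4], [2, 3])
print(d, v)  # prints 0.5 1.0
```

In this small example the Kuiper statistic is larger than the Kolmogorov-Smirnov distance because the two EDFs differ in both directions.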
19More powerful algorithms
TESTS CONTAINING A WEIGHTING FUNCTION
Unbinned distributions
- Cramer-von Mises test
- Anderson-Darling test
Binned distributions
- Fisz-Cramer-von Mises test
- k-sample Anderson-Darling test
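In contrast to a supremum statistic, these tests integrate the squared difference between the two EDFs over the whole range, multiplied by a weighting function: a constant weight gives the Cramer-von Mises statistic, while Anderson-Darling uses a weight that emphasises the tails. The sketch below shows only the unweighted two-sample Cramer-von Mises case for unbinned data, using the textbook form T = NM/(N+M)^2 times the sum, over the pooled sample, of the squared EDF difference. The function name is hypothetical and the code is not taken from the toolkit; the binned Fisz variant is not covered.

```python
from bisect import bisect_right


def cramer_von_mises_two_sample(sample_a, sample_b):
    """Two-sample Cramer-von Mises statistic for unbinned data (sketch).

    The squared difference between the two EDFs is summed over every
    observation of the pooled sample, so the whole range contributes.
    """
    xs, ys = sorted(sample_a), sorted(sample_b)
    n, m = len(xs), len(ys)
    total = 0.0
    for z in xs + ys:
        diff = bisect_right(xs, z) / n - bisect_right(ys, z) / m
        total += diff * diff
    return n * m / (n + m) ** 2 * total


t = cramer_von_mises_two_sample([1, 4], [2, 3])
print(t)  # prints 0.125
```

Every pooled observation enters the sum, which is the sense in which such statistics compare the distributions along the full range of x rather than at the single point of largest discrepancy.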
20Comparative documentation of tests
Test | Power | Remarks
Anderson-Darling | High | Sensitive to tails
χ² | Low | General
Fisz-Cramer-von Mises | High | Symmetric, right-skewed distributions
Goodman | Medium | Approximation of the K-S test statistic by a χ² statistic
Kolmogorov-Smirnov | Medium | Derives from Kolmogorov statistics
Kuiper | Medium | Sensitive to tails and median
Tiku | High | Converts the CvM statistic to a χ²
More about the comparative evaluation of tests can be found in the User Documentation on our web site.
The topic is still subject to research activity in the domain of statistics.
21Power of tests
The power of a test is the probability of correctly rejecting the null hypothesis, i.e. of rejecting it when it is false
In terms of power: χ² < supremum statistics tests < tests containing a weight function
- χ² loses information in a test for unbinned distributions by grouping the data into cells
- Kac, Kiefer and Wolfowitz (1955) showed that the Kolmogorov-Smirnov test requires n^(4/5) observations, compared to n observations for χ², to attain the same power
- Cramer-von Mises and Anderson-Darling statistics are expected to be superior to Kolmogorov-Smirnov's, since they compare the two distributions all along the range of x, rather than looking for a marked difference at one point
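The notion of power can be made concrete with a small simulation: generate many pairs of samples from two distributions that really differ, apply a test, and count how often the null hypothesis is rejected. The sketch below does this for the two-sample Kolmogorov-Smirnov test with the usual large-sample critical value c(alpha) * sqrt((n+m)/(n*m)), where c(alpha) = sqrt(-ln(alpha/2)/2). The Gaussian samples, the shift, the sample sizes and all function names are assumptions chosen for illustration, not results or code from the toolkit.

```python
import math
import random
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Largest absolute difference between the two empirical distribution functions."""
    xs, ys = sorted(sample_a), sorted(sample_b)
    n, m = len(xs), len(ys)
    return max(abs(bisect_right(xs, z) / n - bisect_right(ys, z) / m)
               for z in xs + ys)


def ks_rejects(sample_a, sample_b, alpha=0.05):
    """Asymptotic two-sample KS decision at significance level alpha."""
    n, m = len(sample_a), len(sample_b)
    c_alpha = math.sqrt(-0.5 * math.log(alpha / 2.0))
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_distance(sample_a, sample_b) > critical


def estimate_power(shift, n=100, trials=200, seed=1):
    """Fraction of pseudo-experiments in which the null hypothesis is rejected.

    Sample A is drawn from N(0, 1) and sample B from N(shift, 1); with
    shift = 0 the result estimates the false-rejection rate instead.
    """
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(n)]
        b = [rng.gauss(shift, 1.0) for _ in range(n)]
        if ks_rejects(a, b):
            rejections += 1
    return rejections / trials


false_rejection_rate = estimate_power(0.0)
power_for_unit_shift = estimate_power(1.0)
```

Repeating the same loop with a different test statistic in place of `ks_rejects` is one way to compare the power of two tests on a given alternative.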
23Unit test χ² (1)
EXAMPLE FROM PICCOLO BOOK (STATISTICS - page 711)
The study concerns monthly birth and death distributions (binned data)
χ² test statistic = 15.8 (expected χ² = 15.8)
Exact p-value = 0.200758 (expected p-value = 0.200757)
[Plot; x-axis: months]
24Unit test χ² (2)
EXAMPLE FROM CRAMER BOOK (MATHEMATICAL
METHODS OF STATISTICS - page 447)
The study concerns the sex distribution of
children born in Sweden in 1935
25Unit test K-S Goodman (1)
EXAMPLE FROM PICCOLO BOOK (STATISTICS - page 711)
The study concerns monthly birth and death
distributions (unbinned data)
[Plot: cumulative function vs. months]
26Unit test K-S Goodman (2)
[Plot; x-axis: body lengths]
27Unit test Kolmogorov-Smirnov (1)
28Unit test Kolmogorov-Smirnov (2)
29Example of application results
30GPL License
Latest release 30 March 2004
31User Documentation
- Download
- Installation
- User Guide
- Statistics Reference Guide
32A toolkit for modeling multi-parametric fit
problems
- F. Fabozzi, L. Lista
- INFN Napoli
- Initially developed while rewriting a Fortran fitter for BaBar analysis
- Simultaneous estimate of
- B(B± → J/ψ π±) / B(B± → J/ψ K±)
- direct CP asymmetry
- More control over the code was needed to explain a bias that appeared in the original fitter
33Requirements
- Provide tools for modeling parametric fit problems
- Unbinned Maximum Likelihood (UML, not Unified Modeling Language!) fit of
- PDF parameters
- Yields of different sub-samples
- Both, mixed
- χ² fits
- Toy Monte Carlo to study the fit properties
- Fitted parameter distributions
- Pulls, Bias, Confidence level of fit results
New components included in the Statistical Toolkit
Architecture open to extension and evolution
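A toy Monte Carlo study of the kind listed above can be sketched in a few lines. The example below assumes the simplest possible fit, chosen only for illustration: the unbinned maximum likelihood estimate of the mean of a Gaussian with known width, for which the estimate is the sample mean and its error is sigma / sqrt(N). The function names and all numerical settings are hypothetical and are not taken from the fit component itself.

```python
import math
import random
import statistics


def run_toys(true_mu=1.0, sigma=2.0, n_events=100, n_toys=500, seed=42):
    """Generate pseudo-experiments and return the fitted values and their pulls."""
    rng = random.Random(seed)
    # With a known width, the ML error on the mean is the same in every toy.
    error = sigma / math.sqrt(n_events)
    fitted, pulls = [], []
    for _ in range(n_toys):
        sample = [rng.gauss(true_mu, sigma) for _ in range(n_events)]
        mu_hat = statistics.mean(sample)  # closed-form ML estimate of the mean
        fitted.append(mu_hat)
        pulls.append((mu_hat - true_mu) / error)
    return fitted, pulls


def summarize(fitted, pulls, true_mu=1.0):
    """Bias of the fitted parameter, and mean and width of the pull distribution."""
    return {
        "bias": statistics.mean(fitted) - true_mu,
        "pull_mean": statistics.mean(pulls),
        "pull_width": statistics.stdev(pulls),
    }


fitted_values, pull_values = run_toys()
summary = summarize(fitted_values, pull_values)
```

For an unbiased fit with correctly estimated errors, the pull distribution is expected to have mean close to 0 and width close to 1; deviations from this in the toys are a hint of a bias or of mis-estimated uncertainties.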
34For LCG users
- The Statistical Toolkit is distributed with PI as an external product
- Currently the previous release, not yet the latest, is distributed
- Update foreseen
- Integration in the Savannah system for problem reporting foreseen
- Open to collaboration to facilitate the usage in the LCG community
- feedback, user requirements, suggestions are welcome, of course!
Please contact Andreas.Pfeiffer_at_cern.ch for
further information about the Statistical Toolkit
in PI distribution
35References
- Conference Proceedings
- PhyStat Conference, SLAC, 2003
- IEEE Nuclear Science Symposium, Portland, 2003
- Papers
- S. Donadio et al., A toolkit for statistical data comparison
- To be published in IEEE Trans. Nucl. Sci. (August 2004)
- More papers in preparation
- References kept up-to-date on the web site
36http://www.ge.infn.it/geant4/analysis/HEPstatistics/
Will be moved to a new area out of Geant4-INFN web (automatic re-direction)
37Acknowledgments
- Work supported and partially funded by the European Space Agency (ESA) under Contract No. 16339/02/NL/FM
- Geant4 beta testing
- P. Cirrone (INFN-LNS), S. Guatelli (INFN Genova), S. Parlati (INFN-LNGS)
- Fred James (CERN) and Louis Lyons (Oxford)
- many useful suggestions, discussions, encouragement...
38Conclusions
- A project to develop an open source, general purpose software toolkit for statistical data analysis is in progress
- to provide a product of common interest to user communities
- Rigorous software process
- to contribute to the quality of the product
- Component-based architecture, OO methods and generic programming
- to ensure openness to evolution, maintainability, ease of use
- GoF component
- Component for modeling multi-parametric fit problems
- Software released and results available
- toolkit in use for Geant4 physics validation
- incremental and iterative life-cycle