A Toolkit for Statistical Data Analysis

1
A Toolkit for Statistical Data Analysis
  • S. Donadio, F. Fabozzi, L. Lista, S. Guatelli, B.
    Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon, P.
    Viarengo

LCG Application Area Meeting, CERN, 5 May 2004
http://www.ge.infn.it/geant4/analysis/HEPstatistics
2
History and background
3
The motivation from Geant4
Validation of Geant4 physics models through
comparison of simulation vs experimental data or
reference databases
4
Historical introduction to EDF tests
  • In 1933 Kolmogorov published a short but
    landmark paper in the Italian Giornale
    dell'Istituto degli Attuari. He formally defined
    the empirical distribution function (EDF) and
    then enquired how close it would be to the true
    distribution F(x), when the latter is continuous.
  • It should be noted that Kolmogorov himself
    regarded his paper as the solution of an
    interesting probability problem, following the
    general interest of the time, rather than a paper
    on statistical methodology.
  • After Kolmogorov's article, over a period of
    about 10 years, a number of distinguished
    mathematicians (Smirnov, Cramer, von Mises,
    Anderson, Darling, ...) laid the foundations of
    methods for testing fit to a distribution based
    on the EDF.
  • The ideas in this paper have formed a platform
    for a vast literature, both on interesting and
    important probability problems and on methods of
    using the Kolmogorov statistics for testing fit
    to a distribution. The production of literature
    continues strongly today and shows no sign of
    decreasing.

5
Typical use cases in HEP
  • Regression testing
      • Throughout the software life-cycle
  • Online DAQ
      • Monitoring detector behaviour w.r.t. a
        reference
  • Simulation validation
      • Comparison with experimental data
  • Reconstruction
      • Comparison of reconstructed vs. expected
        distributions
  • Physics analysis
      • Comparisons of experimental distributions
        (ATLAS vs. CMS Higgs?)
      • Comparison with theoretical distributions
        (data vs. Standard Model)

6
Software tools
  • Commercial products used by professional
    statisticians
      • SPSS, NCSS...
  • In HEP
      • A lot of activity
          • workshops/conferences (CERN, Durham, SLAC
            etc.)
          • books (F. James et al., L. Lyons, R. Barlow
            etc.)
          • sophisticated statistical algorithms
            applied in various data analyses
      • ...but, in spite of the significant role
        played by statistics in HEP, very limited
        availability of software tools for statistics
        in our field
      • and in open-source software in general

7
Let's do it ourselves...
A project to develop an open-source software
system for statistical analysis
  • Provide tools for the statistical comparison of
    distributions
  • Create a hub to aggregate expertise and
    collaborative contributions from scientists
    interested in statistical methods
See the presentation at the LCG-AA meeting, 27
November 2002
8
Vision: the basics
  • Have a vision for the project
  • General purpose tool for statistical analysis
  • Toolkit approach (choice open to users)
  • Open source product

Clearly define scope, objectives
  • Rigorous software process

Software quality
Flexible, extensible, maintainable system
  • Build on a solid architecture

9
Architectural guidelines
  • The project adopts a solid architectural approach
      • to offer the functionality and the quality
        needed by the users
      • to be maintainable over a long time scale
      • to be extensible, to accommodate future
        evolutions of the requirements
  • Component-based architecture
      • to facilitate re-use and integration in
        diverse frameworks
  • Dependencies
      • adopt a standard (AIDA) for the user layer
      • no dependence on any specific analysis tool
  • Python
      • the glue for interactivity
  • The approach adopted is compatible with the
    recommendations of the LCG Architecture
    Blueprint Report

10
Software process
  • Unified Software Development Process,
    specifically tailored to the project
  • practical guidance and tools from the RUP
  • both rigorous and lightweight
  • mapping onto ISO 15504
  • significant experience gained in the group from
    other projects
  • Incremental and iterative life-cycle model

11
The Goodness-of-Fit component
12
User Requirements
  • User requirements elicited, analysed and formally
    specified
  • Functional (capability) and non-functional
    (constraint) requirements
  • User Requirements Document available from the web
    site

Requirement traceability
  • Requirements
  • Design
  • Implementation
  • Test and test results
  • Documentation

13
(No Transcript)
14
(No Transcript)
15
  • Simple user layer
      • Shields the user from the complexity of the
        underlying algorithms and design
      • The user deals only with AIDA objects and the
        choice of comparison algorithm
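
The idea of a thin user layer can be sketched as follows. This is only an illustration of the design, not the toolkit's actual API: the names `compare`, `ALGORITHMS`, `mean_shift` and `median_shift` are hypothetical, plain Python lists stand in for AIDA objects, and the two placeholder statistics stand in for real goodness-of-fit algorithms.

```python
from statistics import mean, median


def mean_shift(sample_a, sample_b):
    """Placeholder statistic: absolute difference of the sample means."""
    return abs(mean(sample_a) - mean(sample_b))


def median_shift(sample_a, sample_b):
    """Placeholder statistic: absolute difference of the sample medians."""
    return abs(median(sample_a) - median(sample_b))


# Hypothetical registry: the user chooses an algorithm by name only.
ALGORITHMS = {"mean-shift": mean_shift, "median-shift": median_shift}


def compare(sample_a, sample_b, algorithm):
    """Single entry point that hides how each algorithm is implemented."""
    try:
        statistic = ALGORITHMS[algorithm]
    except KeyError:
        raise ValueError(f"unknown algorithm: {algorithm!r}") from None
    return statistic(sample_a, sample_b)


distance = compare([1.0, 2.0, 3.0], [2.0, 3.0, 4.0], "mean-shift")
```

With this shape the calling code changes only the algorithm name when switching tests; the data objects and the call stay the same.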

16
GoF algorithms
  • Algorithms for binned distributions
      • Anderson-Darling test
      • Chi-squared test
      • Fisz-Cramer-von Mises test
      • Tiku test (Cramer-von Mises test in
        chi-squared approximation)
  • Algorithms for unbinned distributions
      • Anderson-Darling test
      • Fisz-Cramer-von Mises test
      • Goodman test (Kolmogorov-Smirnov test in
        chi-squared approximation)
      • Kolmogorov-Smirnov test
      • Kuiper test
      • Tiku test (Cramer-von Mises test in
        chi-squared approximation)

17
Chi-squared test
  • Applies to binned distributions
  • It can also be useful for unbinned
    distributions, but the data must first be grouped
    into classes
  • Cannot be applied if the theoretical frequency
    in any class is < 5
  • When some classes fall below this threshold, one
    can try to merge contiguous classes until the
    minimum theoretical frequency is reached
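
The merging of contiguous classes can be sketched as below. This is a minimal illustration under stated assumptions, not the toolkit's implementation: the function names are hypothetical, the greedy left-to-right merging is just one possible strategy, and the input numbers are made up.

```python
def merge_sparse_classes(observed, expected, minimum=5.0):
    """Merge contiguous classes until each expected frequency reaches `minimum`."""
    merged_obs, merged_exp = [], []
    acc_obs = acc_exp = 0.0
    for obs, exp in zip(observed, expected):
        acc_obs += obs
        acc_exp += exp
        if acc_exp >= minimum:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs = acc_exp = 0.0
    if acc_exp > 0.0:
        # Leftover tail is still too sparse: fold it into the last class.
        if merged_exp:
            merged_obs[-1] += acc_obs
            merged_exp[-1] += acc_exp
        else:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
    return merged_obs, merged_exp


def chi_squared(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E over the classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


observed = [1, 3, 10, 12, 4, 2]
expected = [2.0, 4.0, 9.0, 10.0, 5.0, 2.0]
obs, exp = merge_sparse_classes(observed, expected)
statistic = chi_squared(obs, exp)
```

Here the first two classes and the last two classes are merged, so the statistic is computed on four classes instead of six; the number of degrees of freedom has to be reduced accordingly.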

18
More sophisticated algorithms
Unbinned distributions
SUPREMUM STATISTICS
  • Kolmogorov-Smirnov test
  • Goodman approximation of KS test
  • Kuiper test

$D_{mn}$
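
Assuming Dmn denotes the usual two-sample supremum distance between the EDFs F_m and G_n of samples of sizes m and n, the Kolmogorov-Smirnov and Kuiper statistics can be sketched as below. The function names are illustrative, not the toolkit's API, and no p-value is computed.

```python
from bisect import bisect_right


def edf(sorted_sample, x):
    """Empirical distribution function: fraction of observations <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def supremum_statistics(sample_a, sample_b):
    """Return (Kolmogorov-Smirnov distance, Kuiper statistic) for two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    # Both EDFs are step functions, so the extrema occur at data points.
    diffs = [edf(a, x) - edf(b, x) for x in a + b]
    d_plus = max(0.0, max(diffs))    # largest excess of F_m over G_n
    d_minus = max(0.0, -min(diffs))  # largest excess of G_n over F_m
    ks = max(d_plus, d_minus)        # D_mn = sup |F_m - G_n|
    kuiper = d_plus + d_minus        # V_mn = D+ + D-
    return ks, kuiper


ks, kuiper = supremum_statistics([2.0, 3.0], [1.0, 4.0])
```

In this small example the two EDFs cross, so both one-sided distances are 0.5: the Kolmogorov-Smirnov distance is 0.5 while the Kuiper statistic adds them up to 1.0.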
19
More powerful algorithms
TESTS CONTAINING A WEIGHTING FUNCTION
Unbinned distributions
  • Cramer-von Mises test
  • Anderson-Darling test

Binned distributions
  • Fisz-Cramer-von Mises test
  • k-sample Anderson-Darling test
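
The role of the weighting function can be sketched for the two-sample, unbinned case. The code below assumes the standard discretised forms (weight 1 for Cramer-von Mises, weight 1 / (H (1 - H)) for Anderson-Darling, with H the EDF of the pooled sample) and assumes no tied observations; names and normalisation are illustrative and may differ from what the toolkit implements.

```python
from bisect import bisect_right


def weighted_quadratic_statistics(sample_a, sample_b):
    """Return (Cramer-von Mises, Anderson-Darling) two-sample statistics."""
    a, b = sorted(sample_a), sorted(sample_b)
    m, n = len(a), len(b)
    total = m + n
    pooled = sorted(a + b)
    cvm = ad = 0.0
    for rank, x in enumerate(pooled, start=1):
        diff_sq = (bisect_right(a, x) / m - bisect_right(b, x) / n) ** 2
        cvm += diff_sq                       # weight 1
        h = rank / total                     # pooled EDF (no ties assumed)
        if rank < total:                     # skip last point, where H = 1
            ad += diff_sq / (h * (1.0 - h))  # weight 1 / (H (1 - H))
    scale = m * n / total**2
    return scale * cvm, scale * ad


cvm, ad = weighted_quadratic_statistics([1.0, 2.0, 3.0], [4.0, 5.0, 6.0])
```

The Anderson-Darling weight grows as H approaches 0 or 1, which is why that test is more sensitive to differences in the tails.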

20
Comparative documentation of tests
Test                    Power    Notes
Anderson-Darling        High     Sensitive to tails
χ²                      Low      General
Fisz-Cramer-von Mises   High     Symmetric, right-skewed distributions
Goodman                 Medium   Approximation of K-S to χ² test statistics
Kolmogorov-Smirnov      Medium   Derives from Kolmogorov statistics
Kuiper                  Medium   Sensitive to tails and median
Tiku                    High     Converts CvM statistics to a χ²
More about the comparative evaluation of tests can
be found in the User Documentation on our web site.
The topic is still subject to research activity in
the domain of statistics.
21
Power of tests
The power of a test is the probability of
correctly rejecting the null hypothesis, i.e. of
rejecting it when it is false.
In terms of power:
χ² < supremum statistics tests < tests containing a
weight function
  • χ² loses information in a test for unbinned
    distributions by grouping the data into cells
  • Kac, Kiefer and Wolfowitz (1955) showed that the
    Kolmogorov-Smirnov test requires n^(4/5)
    observations compared to n observations for χ²
    to attain the same power
  • Cramer-von Mises and Anderson-Darling statistics
    are expected to be superior to
    Kolmogorov-Smirnov's, since they compare the two
    distributions all along the range of x, rather
    than looking for a marked difference at one point
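
The notion of power can be made concrete with a small simulation. The sketch below estimates the rejection rate of a two-sample Kolmogorov-Smirnov test on Gaussian samples, using the asymptotic 5% critical value 1.358 * sqrt((m + n) / (m n)). The sample size, number of trials, shift and seed are arbitrary choices for illustration, and the function names are hypothetical.

```python
import random
from bisect import bisect_right
from math import sqrt


def ks_distance(sample_a, sample_b):
    """Largest absolute difference between the two EDFs."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(
        abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
        for x in a + b
    )


def estimate_power(shift, size=50, trials=200, seed=1):
    """Fraction of trials in which the test rejects at the 5% level."""
    rng = random.Random(seed)
    # With m = n = size, (m + n) / (m n) reduces to 2 / size.
    critical = 1.358 * sqrt(2.0 / size)
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(size)]
        b = [rng.gauss(shift, 1.0) for _ in range(size)]
        if ks_distance(a, b) > critical:
            rejections += 1
    return rejections / trials


rate_under_null = estimate_power(0.0)  # false-rejection rate, not power
power_shifted = estimate_power(1.5)    # power against a 1.5 sigma shift
```

Replacing `ks_distance` with another statistic and its critical value allows the same loop to compare the power of different tests against a chosen alternative.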

22
(No Transcript)
23
Unit test: χ² (1)
EXAMPLE FROM PICCOLO BOOK (STATISTICS - page 711)
The study concerns monthly birth and death
distributions (binned data)
χ² test statistic: 15.8; expected χ²: 15.8
Exact p-value: 0.200758; expected p-value: 0.200757
[Plot: distributions by month]
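
As a hedged cross-check of the numbers above: if one assumes 12 degrees of freedom (an assumption, since the slide does not state the number), the χ² tail probability for the rounded statistic 15.8 comes out at about 0.2006, close to the quoted p-values. The sketch uses the closed form valid for an even number of degrees of freedom; the function name is illustrative.

```python
from math import exp


def chi2_p_value_even_dof(statistic, dof):
    """Upper-tail chi-squared probability for an even number of degrees of freedom.

    Uses P(X > x) = exp(-x/2) * sum_{k=0}^{dof/2 - 1} (x/2)^k / k!.
    """
    if dof <= 0 or dof % 2:
        raise ValueError("dof must be a positive even integer")
    half = statistic / 2.0
    term = total = 1.0
    for k in range(1, dof // 2):
        term *= half / k
        total += term
    return exp(-half) * total


# 12 degrees of freedom is an assumption made for this illustration.
p_value = chi2_p_value_even_dof(15.8, 12)
```

The small residual difference from the quoted value is consistent with the statistic being rounded to 15.8 on the slide.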
24
Unit test: χ² (2)
EXAMPLE FROM CRAMER BOOK (MATHEMATICAL
METHODS OF STATISTICS - page 447)
The study concerns the sex distribution of
children born in Sweden in 1935
25
Unit test: K-S Goodman (1)
EXAMPLE FROM PICCOLO BOOK (STATISTICS - page 711)
The study concerns monthly birth and death
distributions (unbinned data)
[Plot: cumulative function vs. months]
26
Unit test: K-S Goodman (2)
[Plot: body lengths]
27
Unit test: Kolmogorov-Smirnov (1)
28
Unit test: Kolmogorov-Smirnov (2)
29
Example of application results
30
GPL License
Latest release: 30 March 2004
31
User Documentation
  • Download
  • Installation
  • User Guide
  • Statistics Reference Guide

32
A toolkit for modeling multi-parametric fit
problems
  • F. Fabozzi, L. Lista
  • INFN Napoli
  • Initially developed while rewriting a Fortran
    fitter for BaBar analysis
  • Simultaneous estimate of
      • B(B± → J/ψ π±) / B(B± → J/ψ K±)
      • direct CP asymmetry
  • More control over the code was needed to explain
    a bias that appeared in the original fitter

33
Requirements
  • Provide tools for modeling parametric fit
    problems
  • Unbinned Maximum Likelihood (UML) fit of
      • PDF parameters
      • Yields of different sub-samples
      • Both, mixed
  • χ² fits
  • Toy Monte Carlo to study the fit properties
      • Fitted parameter distributions
      • Pulls, bias, confidence level of fit results
  • (UML here does not mean Unified Modeling
    Language)
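
The toy Monte Carlo item can be illustrated with a minimal sketch: many pseudo-experiments are generated from a known model, each one is fitted, and the pulls (fitted - true) / error are collected. The model here (a Gaussian with known width, fitted mean equal to the sample mean), the parameter values and the function name are assumptions chosen to keep the example short; they are not taken from the toolkit.

```python
import random
from math import sqrt
from statistics import mean, stdev


def run_toys(true_mean=1.0, sigma=2.0, size=100, toys=2000, seed=7):
    """Return the pulls of the fitted mean over many toy experiments."""
    rng = random.Random(seed)
    error = sigma / sqrt(size)  # uncertainty of the mean for known sigma
    pulls = []
    for _ in range(toys):
        sample = [rng.gauss(true_mean, sigma) for _ in range(size)]
        fitted = sum(sample) / size  # ML estimate of a Gaussian mean
        pulls.append((fitted - true_mean) / error)
    return pulls


pulls = run_toys()
pull_mean = mean(pulls)    # compatible with 0 if the fit is unbiased
pull_width = stdev(pulls)  # compatible with 1 if the errors are correct
```

A pull mean significantly different from zero signals a bias in the fit, and a pull width different from one signals misestimated errors.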

New components included in the Statistical Toolkit
Architecture open to extension and evolution
34
For LCG users
  • The Statistical Toolkit is distributed with PI as
    an external product
  • Currently the previous release - not the latest
    yet - is distributed
  • Update foreseen
  • Integration in the Savannah system for problem
    reporting foreseen
  • Open to collaboration to facilitate the usage in
    the LCG community
  • feedback, user requirements, suggestions are
    welcome, of course!

Please contact Andreas.Pfeiffer_at_cern.ch for
further information about the Statistical Toolkit
in PI distribution
35
References
  • Conference Proceedings
  • PhyStat Conference, SLAC, 2003
  • IEEE Nuclear Science Symposium, Portland, 2003
  • Papers
  • S. Donadio et al., A toolkit for statistical data
    comparison
  • To be published in IEEE Trans. Nucl. Sci.
    (August 2004)
  • More papers in preparation
  • References kept up-to-date on the web site

36
http://www.ge.infn.it/geant4/analysis/HEPstatistics/
Will be moved to a new area outside the Geant4-INFN
web (automatic redirection)
37
Acknowledgments
  • Work supported and partially funded by the
    European Space Agency (ESA) under Contract
    No. 16339/02/NL/FM
  • Geant4 beta testing
      • P. Cirrone (INFN-LNS), S. Guatelli (INFN
        Genova), S. Parlati (INFN-LNGS)
  • Fred James (CERN) and Louis Lyons (Oxford)
      • many useful suggestions, discussions,
        encouragement...

38
Conclusions
  • A project to develop an open source, general
    purpose software toolkit for statistical data
    analysis is in progress
  • to provide a product of common interest to user
    communities
  • Rigorous software process
  • to contribute to the quality of the product
  • Component-based architecture, OO methods,
    generic programming
  • to ensure openness to evolution, maintainability,
    ease of use
  • GoF component
  • Component for modeling multi-parametric fit
    problems
  • Software released and results available
  • toolkit in use for Geant4 physics validation
  • incremental and iterative life-cycle