1
A Toolkit for Statistical Data Analysis
  • B. Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon,
    P. Viarengo

CHEP 2004 Interlaken, 26-30 September 2004
http://www.ge.infn.it/geant4/analysis/HEPstatistics
Work supported and partially funded by the
European Space Agency (ESA) under Contract
No. 16339/02/NL/FM
2
The project
A project to develop a statistical analysis
system
Provide tools for the statistical comparison of
distributions (Goodness-of-Fit tests)

Typical use cases in HEP
  • Regression testing: throughout the software
    life-cycle
  • Online DAQ: monitoring detector behaviour w.r.t.
    a reference
  • Simulation validation: comparison with
    experimental data
  • Reconstruction: comparison of reconstructed vs.
    expected distributions
  • Physics analysis: comparisons of experimental
    distributions (ATLAS vs. CMS Higgs?) and
    comparison with theoretical distributions (data
    vs. Standard Model)

3
Software tools
  • Commercial products used by professional
    statisticians
  • SPSS, NCSS...
  • In HEP
  • A lot of activity
  • workshops/conferences (CERN, Durham, SLAC etc.)
  • books (F. James et al., L. Lyons, R. Barlow etc.)
  • sophisticated statistical algorithms applied in
    various data analyses
  • ...but, in spite of the relevant role played by
    statistics in HEP, very limited availability of
    software tools for statistics in our field
  • and in open-source software in general

4
We need it, so let's do the work ourselves...
A project to develop an open-source software
system for statistical analysis
Provide tools for the statistical comparison of
distributions
Create a hub to aggregate expertise and
collaborative contributions from scientists
interested in statistical methods
5
Vision: the basics
  • Rigorous software process

6
Architectural guidelines
  • The project adopts a solid architectural approach
  • to offer the functionality and the quality needed
    by the users
  • to be maintainable over a long time scale
  • to be extensible, to accommodate future
    evolutions of the requirements
  • Component-based architecture
  • to facilitate re-use and integration in diverse
    frameworks
  • Dependencies
  • adopt a standard (AIDA) for the user layer
  • no dependence on any specific analysis tool
  • Python
  • the glue for interactivity
  • The approach adopted is compatible with the
    recommendations of the LCG Architecture
    Blueprint Report

7
Software process
  • Unified Software Development Process, specifically
    tailored to the project
  • practical guidance and tools from the RUP
  • both rigorous and lightweight
  • mapping onto ISO 15504
  • significant experience gained in the group from
    other projects
  • Incremental and iterative life-cycle model

8
User Requirements
  • User requirements elicited, analysed and formally
    specified
  • Functional (capability) and non-functional
    (constraint) requirements
  • User Requirements Document available from the web
    site

Requirement traceability
  • Requirements
  • Design
  • Implementation
  • Test and test results
  • Documentation

11
  • Simple user layer
  • Shields the user from the complexity of the
    underlying algorithms and design
  • The user deals only with AIDA objects and the
    choice of comparison algorithm

12
GoF algorithms
  • Algorithms for binned distributions
  • Anderson-Darling Test
  • Chi-squared Test
  • Fisz-Cramer-von Mises Test
  • Tiku Test (Cramer-von Mises test in chi-squared
    approximation)
  • Algorithms for unbinned distributions
  • Anderson-Darling Test
  • Fisz-Cramer-von Mises Test
  • Goodman Test (Kolmogorov-Smirnov test in
    chi-squared approximation)
  • Kolmogorov-Smirnov Test
  • Kuiper Test
  • Tiku Test (Cramer-von Mises test in chi-squared
    approximation)

13
Chi-squared test
  • Applies to binned distributions
  • It can be useful also in case of unbinned
    distributions, but the data must be grouped into
    classes
  • Cannot be applied if the expected (theoretical)
    frequency in any class is less than 5
  • When a class falls below this minimum, one could
    try to merge contiguous classes until the minimum
    theoretical frequency is reached
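
The merging of contiguous classes described above can be sketched as follows. This is a minimal illustration in plain Python, not the toolkit's implementation: the function names `merge_sparse_classes` and `chi_squared_statistic` are hypothetical, and the threshold of 5 is taken from the slide.

```python
def merge_sparse_classes(observed, expected, min_expected=5.0):
    """Merge contiguous classes until each merged class has an
    expected frequency of at least min_expected (assumed threshold)."""
    merged_obs, merged_exp = [], []
    acc_obs = acc_exp = 0.0
    for obs, exp in zip(observed, expected):
        acc_obs += obs
        acc_exp += exp
        if acc_exp >= min_expected:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs = acc_exp = 0.0
    # A sparse tail that never reached the threshold is folded
    # into the last merged class (or kept alone if there is none).
    if acc_exp > 0.0 or acc_obs > 0.0:
        if merged_exp:
            merged_obs[-1] += acc_obs
            merged_exp[-1] += acc_exp
        else:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
    return merged_obs, merged_exp


def chi_squared_statistic(observed, expected):
    """Pearson chi-squared statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

A typical call would merge first and then compute the statistic on the merged classes; the number of degrees of freedom must then be counted on the merged classes, not the original ones.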

14
Tests based on a supremum statistic
Unbinned distributions
  • Kolmogorov-Smirnov Test
  • Goodman approximation of KS Test
  • Kuiper Test

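Test statistic: D_mn (largest distance between the
two empirical distribution functions, for sample
sizes m and n)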
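
As an illustration of supremum-type statistics, the sketch below computes the two-sample Kolmogorov-Smirnov distance (the largest absolute difference between the two empirical distribution functions) and the Kuiper distance (the sum of the largest positive and largest negative differences). It uses only the Python standard library; the function names are hypothetical and do not come from the toolkit.

```python
from bisect import bisect_right


def edf_differences(sample_a, sample_b):
    """Return F_a(x) - F_b(x) at every pooled data point, where F_a and
    F_b are the empirical distribution functions of the two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    points = sorted(set(a) | set(b))
    return [bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b)
            for x in points]


def ks_distance(sample_a, sample_b):
    """Kolmogorov-Smirnov distance: sup |F_a - F_b|."""
    return max(abs(d) for d in edf_differences(sample_a, sample_b))


def kuiper_distance(sample_a, sample_b):
    """Kuiper distance: sup (F_a - F_b) + sup (F_b - F_a).
    Both suprema are at least 0, since the functions agree at infinity."""
    diffs = edf_differences(sample_a, sample_b)
    return max(0.0, max(diffs)) + max(0.0, -min(diffs))
```

Because both empirical distribution functions are step functions, evaluating the difference at the pooled data points is enough to find the supremum.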
15
Tests containing a weighting function
Unbinned distributions
  • Cramer-von Mises Test
  • Anderson-Darling Test

Binned distributions
  • Fisz-Cramer-von Mises Test
  • k-sample Anderson-Darling Test
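
The role of the weighting function can be illustrated with the standard two-sample forms of these statistics for unbinned data: both sum the squared difference between the two empirical distribution functions over the pooled sample, with weight 1 for Cramer-von Mises and weight 1/[H(1-H)] for Anderson-Darling, where H is the pooled empirical distribution function. The sketch below assumes no tied values; the function names are hypothetical and not the toolkit's API.

```python
def weighted_edf_statistic(sample_a, sample_b, weight):
    """(m*n/N^2) * sum over pooled points of weight(H) * (F_a - F_b)^2.
    Assumes no ties between observations."""
    m, n = len(sample_a), len(sample_b)
    total = m + n
    pooled = sorted([(x, 0) for x in sample_a] + [(x, 1) for x in sample_b])
    count_a = count_b = 0
    acc = 0.0
    # The last pooled point is skipped: there F_a = F_b = 1, so it
    # contributes nothing, and the Anderson-Darling weight diverges.
    for i, (_, label) in enumerate(pooled[:-1], start=1):
        if label == 0:
            count_a += 1
        else:
            count_b += 1
        diff = count_a / m - count_b / n
        acc += weight(i / total) * diff * diff
    return m * n / total ** 2 * acc


def cramer_von_mises(sample_a, sample_b):
    """Uniform weight: all regions of x count equally."""
    return weighted_edf_statistic(sample_a, sample_b, lambda h: 1.0)


def anderson_darling(sample_a, sample_b):
    """Weight 1/[H(1-H)]: differences in the tails count more."""
    return weighted_edf_statistic(sample_a, sample_b,
                                  lambda h: 1.0 / (h * (1.0 - h)))
```

The only difference between the two functions is the weight, which is what makes Anderson-Darling more sensitive to the tails of the distributions.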

16
Comparative evaluation of tests
Anderson-Darling        High    Sensitive to tails
χ²                      Low     General
Fisz-Cramer-von Mises   High    Symmetric, right-skewed distributions
Goodman                 Medium  Approximation of K-S to a χ² test statistic
Kolmogorov-Smirnov      Medium  Derives from Kolmogorov statistics
Kuiper                  Medium  Sensitive to tails and median
Tiku                    High    Converts CvM statistics to a χ²
More about the comparative evaluation of tests in
the User Documentation on our web site.
The topic is still subject to research activity
in the domain of statistics.
17
Power of tests
The power of a test is the probability of
correctly rejecting the null hypothesis when it
is false
In terms of power
  • χ² loses information in a test for unbinned
    distributions by grouping the data into cells
  • Kac, Kiefer and Wolfowitz (1955) showed that the
    Kolmogorov-Smirnov test requires n^(4/5)
    observations compared to n observations for χ²
    to attain the same power
  • Cramer-von Mises and Anderson-Darling statistics
    are expected to be superior to Kolmogorov-Smirnov's,
    since they compare the two distributions all
    along the range of x, rather than looking for a
    marked difference at one point
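
The notion of power can be made concrete with a small Monte Carlo study: generate many pairs of samples from two distributions that really differ, and count how often the test rejects the null hypothesis. The sketch below does this for the two-sample Kolmogorov-Smirnov test, using the asymptotic Kolmogorov distribution for the p-value. All choices here (Gaussian samples, sample size, number of trials, significance level, function names) are assumptions made for illustration, not taken from the talk.

```python
import math
import random
from bisect import bisect_right


def ks_distance(sample_a, sample_b):
    """Largest absolute difference between the two empirical
    distribution functions, evaluated at the pooled data points."""
    a, b = sorted(sample_a), sorted(sample_b)
    return max(abs(bisect_right(a, x) / len(a) - bisect_right(b, x) / len(b))
               for x in a + b)


def kolmogorov_survival(lam):
    """Asymptotic P(sqrt(mn/(m+n)) * D > lam), from the alternating
    Kolmogorov series; below 0.3 the value is 1 to within about 1e-5."""
    if lam < 0.3:
        return 1.0
    total = 0.0
    for k in range(1, 101):
        term = math.exp(-2.0 * k * k * lam * lam)
        total += term if k % 2 == 1 else -term
        if term < 1e-16:
            break
    return min(1.0, max(0.0, 2.0 * total))


def estimate_power(shift, size=50, trials=200, alpha=0.05, seed=1):
    """Fraction of trials in which samples from N(0,1) and N(shift,1)
    are declared different at significance level alpha."""
    rng = random.Random(seed)
    scale = math.sqrt(size * size / (2.0 * size))
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(size)]
        b = [rng.gauss(shift, 1.0) for _ in range(size)]
        if kolmogorov_survival(scale * ks_distance(a, b)) < alpha:
            rejections += 1
    return rejections / trials
```

With `shift=0.0` the two parent distributions coincide, so the returned fraction estimates the false-rejection rate rather than the power; increasing `shift` or `size` should raise the estimated power.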

Talk at IEEE NSS, Rome, 16-22 October 2004;
paper submitted for publication November 2004
19
Unit test: χ²
Test from PICCOLO BOOK (STATISTICS - page 711)
Exact p-value: 0.200758   Expected p-value: 0.200757
χ² test statistic: 15.8   Expected χ²: 15.8
Binned data
Test from CRAMER BOOK (MATHEMATICAL METHODS OF
STATISTICS - page 447)
Exact p-value: 0   Expected p-value: 0
χ² test statistic: 123.203   Expected χ²: 123.203
20
Unit test: K-S Goodman
Test from PICCOLO BOOK (STATISTICS - page 711)
χ² test statistic: 3.9   Expected χ²: 3.9
Exact p-value: 0.140974   Expected p-value: 0.140991
Test from LANDENNA BOOK (NONPARAMETRIC TESTS
BASED ON FREQUENCIES - page 287)
χ² test statistic: 1.5   Expected χ²: 1.5
Exact p-value: 0.472367   Expected p-value: 0.472367
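
For reference, Goodman's approximation as it is usually stated converts the Kolmogorov-Smirnov distance D between samples of sizes m and n into χ² = 4 D² mn/(m+n), compared against a chi-squared distribution with 2 degrees of freedom, whose upper-tail probability is exp(-χ²/2). The sketch below assumes that textbook form; the function names are hypothetical and the toolkit's own code may differ. Under this assumption, the second reference case above (χ² = 1.5, p-value 0.472367) is reproduced.

```python
import math


def goodman_chi_squared(distance, m, n):
    """Goodman approximation: map a K-S distance to a chi-squared
    statistic with 2 degrees of freedom (textbook form, assumed)."""
    return 4.0 * distance ** 2 * m * n / (m + n)


def chi_squared_p_value_2dof(chi2):
    """Upper-tail probability of a chi-squared variable with 2 dof."""
    return math.exp(-chi2 / 2.0)


def check_against_reference(computed, expected, tolerance=1e-6):
    """Unit-test style comparison of a computed value with a book value."""
    return abs(computed - expected) <= tolerance
```

A unit test in this style would call `check_against_reference(chi_squared_p_value_2dof(1.5), 0.472367)` and require the result to be true.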
21
Unit test: Kolmogorov-Smirnov
Test from LANDENNA BOOK (NONPARAMETRIC TESTS
BASED ON FREQUENCIES - page 318-325)
D test statistic: 0.65   Expected D: 0.65
Cumulative
Exact p-value: 2 × 10^-19   Expected p-value: 8 × 10^-19
This is just a sample of the test process and
results!
22
GPL License
Feedback from users is welcome!
23
User Documentation
  • Download
  • Installation
  • User Guide
  • Statistics Reference Guide

24
Example of application results
Validation of Geant4 physics models w.r.t. NIST
reference
ESA Bepi Colombo mission to Mercury: test beam
at Bessy
Kolmogorov-Smirnov Test
Data range          Distance   p-value
-84 to -60 mm       0.38       0.23
-59 to -48 mm       0.27       0.90
-47 to 47 mm        0.43       0.19
48 to 59 mm         0.30       0.82
60 to 84 mm         0.40       0.10
Dosimetry at IST Cancer Inst.: Monte Carlo and
experimental data
Distance interval   Distance   Significance level
-84 to -60 mm       0.385      0.23
-59 to -48 mm       0.27       0.90
-47 to 47 mm        0.43       0.19
48 to 59 mm         0.30       0.82
60 to 84 mm         0.40       0.10
25
A toolkit for modeling multi-parametric fit
problems
  • F. Fabozzi, L. Lista
  • INFN Napoli
  • Initially developed while rewriting a FORTRAN
    fitter for BaBar analysis
  • Simultaneous estimate of
  • B(B± → J/ψ π±) / B(B± → J/ψ K±)
  • direct CP asymmetry
  • More control over the code was needed to explain
    a bias that appeared in the original fitter

New components included in the Statistical
Toolkit: Toy Monte Carlo, PDF modelling, Maximum
Likelihood fits
Architecture open to extension and evolution
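
To show how the three new components fit together, here is a minimal sketch of a toy Monte Carlo study: pseudo-data are generated from a PDF, a maximum likelihood fit is run on each toy sample, and the average fitted value is compared with the true one to look for a bias. The exponential PDF, the golden-section minimiser and all function names are assumptions chosen for illustration; they are not the API of the toolkit presented here.

```python
import math
import random


def exponential_nll(rate, count, total):
    """Negative log-likelihood of an exponential PDF rate*exp(-rate*x)
    for `count` events whose values sum to `total`."""
    return -(count * math.log(rate) - rate * total)


def fit_rate(data, low=1e-3, high=100.0, tolerance=1e-8):
    """Maximum likelihood fit of the rate by golden-section search
    (valid here because the NLL is convex in the rate)."""
    count, total = len(data), sum(data)
    ratio = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = low, high
    while b - a > tolerance:
        c = b - ratio * (b - a)
        d = a + ratio * (b - a)
        if exponential_nll(c, count, total) < exponential_nll(d, count, total):
            b = d
        else:
            a = c
    return (a + b) / 2.0


def toy_study(true_rate, events, toys, seed=7):
    """Generate `toys` pseudo-experiments of `events` events each and
    return the mean fitted rate, to be compared with true_rate."""
    rng = random.Random(seed)
    fitted = []
    for _ in range(toys):
        data = [rng.expovariate(true_rate) for _ in range(events)]
        fitted.append(fit_rate(data))
    return sum(fitted) / toys
```

For this particular PDF the maximum likelihood estimate is known analytically (number of events divided by their sum) and is slightly biased upwards for small samples, which is the kind of effect a toy study is designed to reveal.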
26
Feel free to contact us!
27
Conclusions
  • A project to develop an open source, general
    purpose software toolkit for statistical data
    analysis is in progress
  • to provide a product of common interest to user
    communities
  • Rigorous software process
  • to contribute to the quality of the product
  • Component-based architecture, OO methods,
    generic programming
  • to ensure openness to evolution, maintainability,
    ease of use
  • GoF component
  • Component for modeling multi-parametric fit
    problems
  • Software released and application results
    available
  • toolkit in use for Geant4 physics validation and
    in experiments
  • paper published in IEEE Trans. Nucl. Sci., 3
    October 2004

Thanks to Fred James (CERN) and Louis Lyons
(Oxford) for many useful suggestions,
discussions and encouragement.