An update on the Goodness of Fit Statistical Toolkit


Slides: 21

Provided by: mariagr

1
An update on the Goodness of Fit Statistical
Toolkit

B. Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon,
P. Viarengo

4th Geant4 Space Users Workshop
http://www.ge.infn.it/geant4/analysis/HEPstatistics
http://www.ge.infn.it/statisticaltoolkit
2
Goodness of Fit testing
Goodness-of-fit testing is the mathematical
foundation for the comparison of data
distributions

Regression testing
Throughout the software life-cycle
Online DAQ
Monitoring detector behaviour w.r.t. a reference
Simulation validation
Comparison with experimental data
Reconstruction
Comparison of reconstructed vs. expected
distributions
Physics analysis
Comparison with theoretical distributions
Comparisons of experimental distributions

Use cases in experimental physics
3
(No Transcript)
4
Software process guidelines

Adopt a process
software quality
Unified Process, specifically tailored to the
project
practical guidance and tools from the RUP
both rigorous and lightweight
mapping onto ISO 15504 (and CMM)
Incremental and iterative life-cycle
1st cycle: 2-sample GoF tests
1-sample GoF in preparation

5
Architectural guidelines

The project adopts a solid architectural approach
to offer the functionality and the quality needed
by the users
to be maintainable over a large time scale
to be extensible, to accommodate future
evolutions of the requirements
Component-based architecture
to facilitate re-use and integration in diverse
frameworks
layer architecture pattern
core component for statistical computation
independent components for interface to user
analysis environments
Dependencies
no dependence on any specific analysis tool
can be used by any analysis tools, or together
with any analysis tools
offer a (HEP) standard (AIDA) for the user layer

6
(No Transcript)
7
The algorithms are specialised for the kind of
distribution (binned/unbinned)
8
GoF algorithms in the Statistical Toolkit
TWO-SAMPLE PROBLEM

Binned distributions
Anderson-Darling test
Anderson-Darling approximated test
Chi-squared test
Fisz-Cramer-von Mises test
Tiku test (Cramer-von Mises test in chi-squared
approximation)
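
As an illustration of a binned two-sample comparison, here is a minimal sketch of a chi-squared statistic for two histograms with identical binning. It is not taken from the Statistical Toolkit: the function name and the scaling used for unequal totals are assumptions made for this example.

```python
import math


def chi2_two_sample_binned(counts_a, counts_b):
    """Chi-squared statistic for two histograms with identical binning.

    Hypothetical helper, not the toolkit's API. Bins that are empty in
    both histograms are skipped. Returns (statistic, number_of_bins_used).
    """
    if len(counts_a) != len(counts_b):
        raise ValueError("histograms must have the same number of bins")
    total_a, total_b = sum(counts_a), sum(counts_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("histograms must not be empty")

    # Scale factors so that histograms with different totals are comparable.
    scale_a = math.sqrt(total_b / total_a)
    scale_b = math.sqrt(total_a / total_b)

    chi2 = 0.0
    bins_used = 0
    for a, b in zip(counts_a, counts_b):
        if a == 0 and b == 0:
            continue  # no information in this bin
        chi2 += (scale_a * a - scale_b * b) ** 2 / (a + b)
        bins_used += 1
    return chi2, bins_used


statistic, bins_used = chi2_two_sample_binned([12, 30, 25, 8], [10, 33, 22, 11])
print(f"chi2 = {statistic:.3f} over {bins_used} bins")
```

The statistic is zero when one histogram is an exact multiple of the other and grows as the shapes diverge.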

It is the most complete software for the
comparison of two distributions, even among
commercial/professional statistics tools. It
provides all 2-sample GoF algorithms based on the
empirical distribution function (EDF) existing in
the statistics literature

Unbinned distributions
Anderson-Darling test
Anderson-Darling approximated test
Cramer-von Mises test
Generalised Girone test
Goodman test (Kolmogorov-Smirnov test in
chi-squared approximation)
Kolmogorov-Smirnov test
Kuiper test
Tiku test (Cramer-von Mises test in chi-squared
approximation)
Weighted Kolmogorov-Smirnov test (2 flavours)
Weighted Cramer-von Mises test
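
Several of the unbinned tests listed above are built on the empirical distribution functions of the two samples. As an illustration, here is a minimal sketch of the two-sample Kolmogorov-Smirnov distance D. The function name is hypothetical and the code is not the toolkit's implementation.

```python
def ks_two_sample_statistic(sample_a, sample_b):
    """Largest distance between the two empirical distribution functions.

    Hypothetical helper written for illustration only.
    """
    if not sample_a or not sample_b:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    distance = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        # Advance both samples past every value <= x (this handles ties).
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        # i/n and j/m are the two empirical distribution functions at x.
        distance = max(distance, abs(i / n - j / m))
    return distance


print(ks_two_sample_statistic([0.1, 0.4, 0.7, 0.9], [0.2, 0.5, 0.6, 1.3]))
```

Once one sample is exhausted its distribution function equals 1 and the remaining differences can only shrink, so the loop can stop there.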

9
User Layer

Simple user layer
Shields the user from the complexity of the
underlying algorithms and design
Only deals with the user's analysis objects and
choice of comparison algorithm
First release: user layer for AIDA analysis
objects
LCG Architecture Blueprint, Geant4 requirement
Second release: added user layer for ROOT
analysis objects
in response to user requirements

10
Which test to use?

Do we really need such a wide collection of GoF
tests? Why?
Which is the most appropriate test to compare two
distributions?
How good is a test at recognizing truly
equivalent distributions and rejecting non-equivalent ones?
The choice of the most suitable GoF test can be
performed on the basis of two different criteria
Computational performance
Statistical performance (power)

11
A) Performance of the GoF tests
AVERAGE CPU TIME                        Binned distributions   Unbinned distributions
Anderson-Darling                        (0.69 ± 0.01) ms       (16.9 ± 0.2) ms
Anderson-Darling (approximated)         (0.60 ± 0.01) ms       (16.1 ± 0.2) ms
Chi-squared                             (0.55 ± 0.01) ms       -
Cramer-von Mises                        (0.44 ± 0.01) ms       (16.3 ± 0.2) ms
Generalised Girone                      -                      (15.9 ± 0.2) ms
Goodman                                 -                      (11.9 ± 0.1) ms
Kolmogorov-Smirnov                      -                      (8.9 ± 0.1) ms
Kuiper                                  -                      (12.1 ± 0.1) ms
Tiku                                    (0.69 ± 0.01) ms       (16.7 ± 0.2) ms
Watson                                  -                      (14.2 ± 0.1) ms
Weighted Kolmogorov-Smirnov (AD)        -                      (14.0 ± 0.1) ms
Weighted Kolmogorov-Smirnov (Buning)    -                      (14.0 ± 0.1) ms
Weighted Cramer-von Mises               -                      (14.0 ± 0.1) ms
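
Timings of this kind (a mean with its uncertainty) can be collected with a small harness like the sketch below. The slides do not describe the measurement set-up, so everything here is an assumption: the workload `toy_comparison` is a hypothetical stand-in for a GoF test, and wall-clock time from `time.perf_counter` is used as a proxy for CPU time.

```python
import random
import statistics
import time


def toy_comparison(sample_a, sample_b):
    """Hypothetical stand-in workload: largest gap between the sorted samples."""
    return max(abs(x - y) for x, y in zip(sorted(sample_a), sorted(sample_b)))


def average_time_ms(func, args, repetitions=50):
    """Return (mean, standard error of the mean) of the run time in ms."""
    timings_ms = []
    for _ in range(repetitions):
        start = time.perf_counter()
        func(*args)
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    mean = statistics.mean(timings_ms)
    std_error = statistics.stdev(timings_ms) / len(timings_ms) ** 0.5
    return mean, std_error


rng = random.Random(1)
sample_a = [rng.gauss(0.0, 1.0) for _ in range(1000)]
sample_b = [rng.gauss(0.0, 1.0) for _ in range(1000)]
mean_ms, err_ms = average_time_ms(toy_comparison, (sample_a, sample_b))
print(f"({mean_ms:.3f} +/- {err_ms:.3f}) ms")
```

The numbers depend on the machine and the sample size, so they are not comparable with the table above.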
12
B) Power of GoF tests
The power of a test is the probability of
rejecting the null hypothesis correctly
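
The definition above can be turned into a simple Monte Carlo estimate: generate many pairs of samples from two distributions that really differ, and count how often the test rejects. The sketch below does this for the Kolmogorov-Smirnov test with its asymptotic 5% critical value. The Gaussian alternatives, sample sizes and function names are all assumptions chosen for illustration, not the set-up of the study described in the slides.

```python
import math
import random


def ks_statistic(sample_a, sample_b):
    """Two-sample Kolmogorov-Smirnov distance (illustrative helper)."""
    a, b = sorted(sample_a), sorted(sample_b)
    n, m = len(a), len(b)
    i = j = 0
    distance = 0.0
    while i < n and j < m:
        x = min(a[i], b[j])
        while i < n and a[i] <= x:
            i += 1
        while j < m and b[j] <= x:
            j += 1
        distance = max(distance, abs(i / n - j / m))
    return distance


def ks_rejects(sample_a, sample_b, c_alpha=1.358):
    """Reject H0 at the 5% level using the asymptotic critical value."""
    n, m = len(sample_a), len(sample_b)
    critical = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(sample_a, sample_b) > critical


def estimate_rejection_rate(shift, size=100, trials=200, seed=12345):
    """Fraction of trials in which N(0,1) and N(shift,1) samples are told apart."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(size)]
        b = [rng.gauss(shift, 1.0) for _ in range(size)]
        if ks_rejects(a, b):
            rejections += 1
    return rejections / trials


power = estimate_rejection_rate(shift=1.0)        # H0 is false: this is the power
false_alarm = estimate_rejection_rate(shift=0.0)  # H0 is true: type I error rate
print(f"power = {power:.2f}, false alarm rate = {false_alarm:.2f}")
```

Repeating the estimate for other pairs of distributions and other sample sizes gives the kind of comparison between tests that the slide refers to.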

Systematic study of all existing GoF tests in
progress
made possible by the extensive collection of
tests in the Statistical Toolkit
Power of the GoF tests evaluated in a variety of
alternative situations
No clear winner: the statistical performance of a
test depends on the features of the distributions
to be compared (skewness and tailweight) and on
the sample size
Practical recommendations
first classify the type of the distributions in
terms of skewness and tailweight
then choose the most appropriate test for that
type of distribution, identifying the best test by
means of the proposed quantitative model
Topic still subject to research activity in the
domain of statistics

General recipe
p < 0.0001
13
Examples of practical applications
14
Statistical Toolkit Usage

Geant4 physics validation
rigorous approach: quantitative evaluation of
Geant4 physics models with respect to established
reference data
see for instance K. Amako et al., "Comparison of
Geant4 electromagnetic physics models against the
NIST reference data", IEEE Trans. Nucl. Sci. 52-4
(2005) 910-918
LCG Simulation Validation project
see for instance A. Ribon, "Testing Geant4 with a
simplified calorimeter setup",
http://www.ge.infn.it/geant4/events/july2005
CMS
validation of new histograms w.r.t. reference
ones in OSCAR Validation Suite
Usage also in space science, medicine,
statistics, etc.

15
Validation of Geant4 e.m. physics models vs. NIST
reference data
[Figures: experimental set-up; electron stopping power; p-value stability study. Legend: Geant4 LowE Penelope, Geant4 Standard, Geant4 LowE EEDL, NIST - XCOM]
χ² test (to include data uncertainties in the
computation of the test statistic value)
[Plot: p-value vs. Z for Geant4 LowE Penelope, Geant4 Standard and Geant4 LowE EEDL, with the H0 rejection area marked]
The three Geant4 models are equivalent
16
Validation of Geant4 Atomic Relaxation vs NIST
reference data
Fluorescence - Shell-start 3

Shell-end   Kolmogorov-Smirnov D   p-value
10          0.0192                 1
11          0.0175                 1
13          0.0250                 1
14          0.0256                 1
18          0.0294                 1
19          0.0312                 1
21          0.1429                 0.997085
22          0.0588                 1

[Plot legend: Geant4, NIST]
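
A p-value such as those tabulated above is obtained from the distance D and the two sample sizes. The slides do not say how the toolkit computes it. The sketch below uses the standard asymptotic Kolmogorov distribution as an assumption, and the sample sizes in the example call are hypothetical.

```python
import math


def ks_asymptotic_p_value(distance, n, m, terms=100):
    """Asymptotic p-value of the two-sample Kolmogorov-Smirnov test.

    Uses Q(lam) = 2 * sum_{k>=1} (-1)^(k-1) * exp(-2 k^2 lam^2)
    with lam = D * sqrt(n*m / (n+m)). Illustrative helper only.
    """
    lam = distance * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series converges slowly here and Q is 1 to within 1e-12.
        return 1.0
    total = 0.0
    for k in range(1, terms + 1):
        total += (-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
    return min(1.0, max(0.0, 2.0 * total))


# Hypothetical sample sizes, chosen only to exercise the function.
for d in (0.02, 0.15, 0.30):
    print(d, ks_asymptotic_p_value(d, n=50, m=50))
```

Small distances give p-values close to 1, which is the pattern seen in the table. For very small samples an exact calculation would be preferable to this asymptotic form.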
17
Validation of Geant4 electromagnetic and hadronic
models against proton data

Low Energy EM ICRU49 p, ions
Low Energy EM Livermore γ, e-
Standard EM e
HadronElastic with BertiniElastic
Bertini Inelastic

LowE EM ICRU49
BertiniElastic
Bertini Inelastic
0.5 M events
p-value p-value p-value
CvM KS AD
Left branch 0.977
Right branch 0.985
Whole curve 0.994
CvM: Cramer-von Mises test; KS: Kolmogorov-Smirnov
test; AD: Anderson-Darling test
[Plot legend: Geant4, experimental data; axis unit: mm]
18
Test beam at Bessy Bepi-Colombo mission
χ² not appropriate (< 5 entries in some bins,
physical information would be lost if rebinned)
Experimental measurements are comparable with
Geant4 simulations
Anderson-Darling: Ac (95%) = 0.752
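
The low-count criterion quoted on this slide for the χ² test is easy to check before choosing a test. The sketch below is a hypothetical helper written for illustration. The threshold of 5 entries comes from the slide, while the function names and the example histogram are assumptions.

```python
def bins_below_threshold(counts, minimum=5):
    """Indices of the histogram bins holding fewer than `minimum` entries."""
    return [index for index, count in enumerate(counts) if count < minimum]


def chi2_is_appropriate(counts, minimum=5):
    """True when every bin reaches the minimum number of entries."""
    return not bins_below_threshold(counts, minimum)


histogram = [12, 3, 7, 0, 25]  # made-up bin contents
print(bins_below_threshold(histogram))
print(chi2_is_appropriate(histogram))
```

When the check fails and rebinning would discard physical information, a test that does not depend on the bin contents being large, such as Anderson-Darling on this slide, is the alternative.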
19
Comparison of alternative vehicle concepts in
human missions to Mars
Reference: rigid structures as in the ISS (2 - 4
cm Al)

Kolmogorov-Smirnov test
Multi-layer 10 cm water equivalent to 4 cm Al
Multi-layer 5 cm water equivalent to 2.15 cm
Al

Energy deposited in phantom (MeV)

Shielding material   EM            Bertini        Binary
ML 5 cm water        73.5 ± 0.3    130.2 ± 0.5    119.3 ± 0.4
ML 10 cm water       71.9 ± 0.3    128.0 ± 0.5    117.3 ± 0.5
4 cm Al              72.9 ± 0.3    127.5 ± 0.5    117.0 ± 0.4
2.15 cm Al           73.9 ± 0.3    130.5 ± 0.5    119.3 ± 0.5
Inflatable habitat vs a conventional rigid
habitat
An inflatable habitat exhibits a shielding
capability equivalent to a conventional rigid one
20
Conclusions

A novel, complete software toolkit for
statistical analysis is being developed
all the two-sample GoF tests available in the
statistical domain, plus the chi-squared test
rigorous architectural design
rigorous software process
It is the most complete software for the
comparison of two distributions, even among
commercial/professional statistics tools.
A systematic study of the power of GoF tests is
in progress
unexplored area of research
Application in various domains
Geant4, HEP, space science, medicine
Feedback and suggestions are very much
appreciated