A Toolkit for Statistical Data Analysis

1
A Toolkit for Statistical Data Analysis

S. Donadio, F. Fabozzi, L. Lista, S. Guatelli, B.
Mascialino, A. Pfeiffer, M.G. Pia, A. Ribon, P.
Viarengo

LCG Application Area Meeting CERN, 5 May 2004
http://www.ge.infn.it/geant4/analysis/HEPstatistics
2
History and background
3
The motivation from Geant4
Validation of Geant4 physics models through
comparison of simulation vs experimental data or
reference databases
4
Historical introduction to EDF tests

In 1933 Kolmogorov published a short but
landmark paper in the Italian journal Giornale
dell'Istituto degli Attuari. He formally defined
the empirical distribution function (EDF) and
then asked how close it would be to the true
distribution F(x), when this is continuous.
It should be noted that Kolmogorov himself
regarded his paper as the solution of an
interesting probability problem, following the
general interest of the time, rather than a paper
on statistical methodology.
After Kolmogorov's article, over a period of about
10 years, a number of distinguished mathematicians
(Smirnov, Cramer, von Mises, Anderson, Darling, ...)
laid the foundations of methods for testing fit
to a distribution based on the EDF.
The ideas in Kolmogorov's paper have formed a
platform for a vast literature, both on interesting
and important probability problems and on methods
of using the Kolmogorov statistics for testing fit
to a distribution. This literature continues to
grow strongly today and shows no sign of
decreasing.

5
Typical use cases in HEP

Regression testing
Throughout the software life-cycle
Online DAQ
Monitoring detector behaviour w.r.t. a reference
Simulation validation
Comparison with experimental data
Reconstruction
Comparison of reconstructed vs. expected
distributions
Physics analysis
Comparisons of experimental distributions (ATLAS
vs. CMS Higgs?)
Comparison with theoretical distributions (data
vs. Standard Model)

6
Software tools

Commercial products used by professional
statisticians
SPSS, NCSS...
In HEP
A lot of activity
workshops/conferences (CERN, Durham, SLAC etc.)
books (F. James et al., L. Lyons, R. Barlow etc.)
sophisticated statistical algorithms applied in
various data analyses
...but, in spite of the relevant role played by
statistics in HEP, very limited availability of
software tools for statistics in our field
and in open-source software in general

7
Let's do it ourselves...
A project to develop an open-source software
system for statistical analysis
Provide tools for the statistical comparison of
distributions
Create a hub to aggregate expertise and
collaborative contributions from scientists
interested in statistical methods
see presentation at LCG-AA meeting, 27 November
2002
8
Vision: the basics

Have a vision for the project
General purpose tool for statistical analysis
Toolkit approach (choice open to users)
Open source product

Clearly define scope, objectives

Rigorous software process

Software quality
Flexible, extensible, maintainable system

Build on a solid architecture

9
Architectural guidelines

The project adopts a solid architectural approach
to offer the functionality and the quality needed
by the users
to be maintainable over a large time scale
to be extensible, to accommodate future
evolutions of the requirements
Component-based architecture
to facilitate re-use and integration in diverse
frameworks
Dependencies
adopt a standard (AIDA) for the user layer
no dependence on any specific analysis tool
Python
the glue for interactivity
The approach adopted is compatible with the
recommendations of the LCG Architecture
Blueprint Report

10
Software process

Unified Software Development Process, specifically
tailored to the project
practical guidance and tools from the RUP
both rigorous and lightweight
mapping onto ISO 15504
significant experience gained in the group from
other projects
Incremental and iterative life-cycle model

11
The Goodness-of-Fit component
12
User Requirements

User requirements elicited, analysed and formally
specified
Functional (capability) and non-functional
(constraint) requirements
User Requirements Document available from the web
site

Requirement traceability

Requirements
Design
Implementation
Test and test results
Documentation

13
(No Transcript)
14
(No Transcript)
15

Simple user layer
Shields the user from the complexity of the
underlying algorithms and design
Only deal with AIDA objects and choice of
comparison algorithm

16
GoF algorithms

Algorithms for binned distributions
Anderson-Darling test
Chi-squared test
Fisz-Cramer-von Mises test
Tiku test (Cramer-von Mises test in chi-squared
approximation)
Algorithms for unbinned distributions
Anderson-Darling test
Fisz-Cramer-von Mises test
Goodman test (Kolmogorov-Smirnov test in
chi-squared approximation)
Kolmogorov-Smirnov test
Kuiper test
Tiku test (Cramer-von Mises test in chi-squared
approximation)

17
Chi-squared test

Applies to binned distributions
It can be useful also in case of unbinned
distributions, but the data must be grouped into
classes
Cannot be applied if the theoretical (expected)
frequency in any class is < 5
When a class falls below this minimum, one could
try to merge contiguous classes until the minimum
theoretical frequency is reached
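
The merging rule above can be sketched as follows. This is only an illustration under stated assumptions, not the toolkit's implementation: the function names `merge_sparse_bins` and `chi_squared`, the parameter `min_expected` and the example frequencies are all hypothetical. Contiguous classes are merged from left to right until each merged class reaches the minimum expected frequency, and the Pearson statistic is then computed on the merged classes.

```python
def merge_sparse_bins(observed, expected, min_expected=5.0):
    """Merge contiguous classes until every merged class has an
    expected frequency of at least min_expected (hypothetical helper)."""
    merged_obs, merged_exp = [], []
    acc_obs, acc_exp = 0.0, 0.0
    for o, e in zip(observed, expected):
        acc_obs += o
        acc_exp += e
        if acc_exp >= min_expected:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
            acc_obs, acc_exp = 0.0, 0.0
    # A sparse remainder at the right edge is folded into the last class.
    if acc_exp > 0.0:
        if merged_exp:
            merged_obs[-1] += acc_obs
            merged_exp[-1] += acc_exp
        else:
            merged_obs.append(acc_obs)
            merged_exp.append(acc_exp)
    return merged_obs, merged_exp


def chi_squared(observed, expected):
    """Pearson statistic: sum over classes of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


# Illustrative frequencies (not taken from the presentation).
observed = [1, 3, 12, 20, 9, 4, 1]
expected = [2.0, 4.0, 10.0, 18.0, 10.0, 4.0, 2.0]
obs_m, exp_m = merge_sparse_bins(observed, expected)
stat = chi_squared(obs_m, exp_m)
```

Merging reduces the number of classes, so the number of degrees of freedom used to convert the statistic into a p-value has to be reduced accordingly.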

18
More sophisticated algorithms
Unbinned distributions
SUPREMUM STATISTICS

Kolmogorov-Smirnov test
Goodman approximation of KS test
Kuiper test

$D_{mn}$
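
As a minimal sketch of the supremum statistics listed on this slide, assuming the standard two-sample definitions: the Kolmogorov-Smirnov distance is the largest absolute difference between the two empirical distribution functions, and the Kuiper statistic adds the largest positive and the largest negative difference. The function names `edf` and `supremum_statistics` are hypothetical and do not come from the toolkit.

```python
from bisect import bisect_right


def edf(sorted_sample, x):
    """Empirical distribution function: fraction of observations <= x."""
    return bisect_right(sorted_sample, x) / len(sorted_sample)


def supremum_statistics(sample_a, sample_b):
    """Return (D, V): the two-sample Kolmogorov-Smirnov distance
    D = sup |F - G| and the Kuiper statistic V = D+ + D-."""
    a, b = sorted(sample_a), sorted(sample_b)
    d_plus = d_minus = 0.0
    # Both EDFs are step functions, so the extrema occur at data points.
    for x in a + b:
        diff = edf(a, x) - edf(b, x)
        d_plus = max(d_plus, diff)
        d_minus = max(d_minus, -diff)
    return max(d_plus, d_minus), d_plus + d_minus


# Illustrative unbinned samples (not taken from the presentation).
d_mn, kuiper_v = supremum_statistics([1.0, 2.0, 3.0, 4.0],
                                     [2.5, 3.5, 4.5, 5.5])
```

Here one sample lies entirely to the left of the crossing region, so the negative difference is zero and both statistics equal 0.5.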
19
More powerful algorithms
TESTS CONTAINING A WEIGHTING FUNCTION
Unbinned distributions

Cramer-von Mises test
Anderson-Darling test

Binned distributions

Fisz-Cramer-von Mises test
k-sample Anderson-Darling test
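
The role of the weighting function can be illustrated with a short sketch, assuming the usual two-sample quadratic statistic: the squared difference between the two EDFs is summed over the pooled sample with a weight that depends on the pooled EDF H. A constant weight gives the Cramer-von Mises statistic; the weight 1 / (H (1 - H)) gives the Anderson-Darling statistic, which emphasises the tails. The function name `weighted_quadratic_statistic` is hypothetical, and the sketch assumes unbinned data without ties.

```python
from bisect import bisect_right


def weighted_quadratic_statistic(sample_a, sample_b, weight):
    """Two-sample statistic (m n / N^2) * sum_i w(H_i) * (F_i - G_i)^2,
    summed over the pooled sample (hypothetical helper, no ties assumed)."""
    a, b = sorted(sample_a), sorted(sample_b)
    m, n = len(a), len(b)
    big_n = m + n
    pooled = sorted(a + b)
    total = 0.0
    # The last pooled point is skipped: there F = G = 1, so the term is
    # zero, and the Anderson-Darling weight is undefined at H = 1.
    for i, x in enumerate(pooled[:-1], start=1):
        f = bisect_right(a, x) / m
        g = bisect_right(b, x) / n
        h = i / big_n  # pooled EDF at the i-th ordered point
        total += weight(h) * (f - g) ** 2
    return m * n / big_n ** 2 * total


# Illustrative unbinned samples (not taken from the presentation).
sample_a = [1.0, 2.0, 3.0, 4.0]
sample_b = [2.5, 3.5, 4.5, 5.5]
cvm = weighted_quadratic_statistic(sample_a, sample_b, lambda h: 1.0)
ad = weighted_quadratic_statistic(
    sample_a, sample_b, lambda h: 1.0 / (h * (1.0 - h)))
```

Unlike the supremum statistics, every point of the pooled sample contributes to the result, not only the point of largest discrepancy.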

20
Comparative documentation of tests
Anderson-Darling         High     Sensitive to tails
χ²                       Low      General
Fisz-Cramer-von Mises    High     Symmetric, right-skewed distributions
Goodman                  Medium   Approximation of K-S to χ² test statistics
Kolmogorov-Smirnov       Medium   Derives from Kolmogorov statistics
Kuiper                   Medium   Sensitive to tails and median
Tiku                     High     Converts CvM statistics to a χ²
More about a comparative evaluation of tests in
the User Documentation on our web site
Topic still subject to research activity in the
domain of statistics
21
Power of tests
In terms of power
The power of a test is the probability of
correctly rejecting the null hypothesis when it
is false
χ² < Supremum statistics tests < Tests containing
a weight function

χ² loses information in a test for unbinned
distributions by grouping the data into cells
Kac, Kiefer and Wolfowitz (1955) showed that the
Kolmogorov-Smirnov test requires $n^{4/5}$
observations compared to n observations for χ²
to attain the same power
Cramer-von Mises and Anderson-Darling statistics
are expected to be superior to Kolmogorov-Smirnov's,
since they compare the two distributions all
along the range of x, rather than looking for a
marked difference at one point
22
(No Transcript)
23
Unit test χ² (1)
EXAMPLE FROM PICCOLO BOOK (STATISTICS - page 711)
The study concerns monthly birth and death
distributions (binned data)
χ² test statistic = 15.8; expected χ² = 15.8
Exact p-value = 0.200758; expected p-value = 0.200757
(Plot; x-axis: Months)
24
Unit test χ² (2)
EXAMPLE FROM CRAMER BOOK (MATHEMATICAL
METHODS OF STATISTICS - page 447)
The study concerns the sex distribution of
children born in Sweden in 1935
25
Unit test K-S Goodman (1)
EXAMPLE FROM PICCOLO BOOK (STATISTICS - page 711)
The study concerns monthly birth and death
distributions (unbinned data)
(Plot; y-axis: Cumulative Function, x-axis: Months)
26
Unit test K-S Goodman (2)
(Plot; x-axis: Body lengths)
27
Unit test Kolmogorov-Smirnov(1)
28
Unit test Kolmogorov-Smirnov (2)
29
Example of application results
30
GPL License
Latest release 30 March 2004
31
User Documentation

Download
Installation
User Guide
Statistics Reference Guide

32
A toolkit for modeling multi-parametric fit
problems

F. Fabozzi, L. Lista
INFN Napoli

Initially developed while rewriting a Fortran
fitter for BaBar analysis
Simultaneous estimate of
B(B± → J/ψ π±) / B(B± → J/ψ K±)
direct CP asymmetry
More control over the code was needed to explain a
bias that appeared in the original fitter

33
Requirements

Provide tools for modeling parametric fit
problems
Unbinned Maximum Likelihood (UML) fit of
PDF parameters
Yields of different sub-samples
Both, mixed
χ² fits
Toy Monte Carlo to study the fit properties
Fitted parameter distributions
Pulls, bias, confidence level of fit results
(UML here does not mean Unified Modeling Language)

New components included in the Statistical
Toolkit
Architecture open to extension and evolution
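
A toy Monte Carlo study of pulls, as listed in the requirements above, can be sketched as follows. This is an assumption-laden illustration and not the toolkit's API: it uses an exponential lifetime model, for which the unbinned maximum-likelihood estimate is the sample mean with uncertainty tau_hat / sqrt(n). The names `fit_lifetime` and `toy_pulls` and all numerical settings are hypothetical.

```python
import math
import random
import statistics


def fit_lifetime(times):
    """Unbinned maximum-likelihood fit of an exponential lifetime.
    For p(t) = exp(-t / tau) / tau the estimate is the sample mean,
    with uncertainty tau_hat / sqrt(n) from the likelihood curvature."""
    tau_hat = sum(times) / len(times)
    return tau_hat, tau_hat / math.sqrt(len(times))


def toy_pulls(tau_true=1.5, n_events=200, n_toys=500, seed=7):
    """Generate toy experiments and return the pull of each fit:
    (fitted value - true value) / fitted uncertainty."""
    rng = random.Random(seed)
    pulls = []
    for _ in range(n_toys):
        times = [rng.expovariate(1.0 / tau_true) for _ in range(n_events)]
        tau_hat, sigma = fit_lifetime(times)
        pulls.append((tau_hat - tau_true) / sigma)
    return pulls


pulls = toy_pulls()
pull_mean = statistics.mean(pulls)    # near 0 if the fit is unbiased
pull_width = statistics.stdev(pulls)  # near 1 if the errors are correct
```

A pull mean far from zero would point to a bias in the fitter, and a pull width far from one to wrongly estimated uncertainties.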
34
For LCG users

The Statistical Toolkit is distributed with PI as
an external product
Currently the previous release - not the latest
yet - is distributed
Update foreseen
Integration in the Savannah system for problem
reporting foreseen
Open to collaboration to facilitate the usage in
the LCG community
feedback, user requirements, suggestions are
welcome, of course!

Please contact Andreas.Pfeiffer_at_cern.ch for
further information about the Statistical Toolkit
in PI distribution
35
References

Conference Proceedings
PhyStat Conference, SLAC, 2003
IEEE Nuclear Science Symposium, Portland, 2003
Papers
S. Donadio et al., A toolkit for statistical data
comparison
To be published in IEEE Trans. Nucl. Sci.
(August 2004)
More papers in preparation
References kept up-to-date on the web site

36
http://www.ge.infn.it/geant4/analysis/HEPstatistics/
Will be moved to a new area out of Geant4-INFN
web (automatic re-direction)
37
Acknowledgments

Work supported and partially funded by the
European Space Agency (ESA) under Contract
No. 16339/02/NL/FM
Geant4 beta testing
P. Cirrone (INFN-LNS), S. Guatelli (INFN Genova),
S. Parlati (INFN-LNGS)
Fred James (CERN) and Louis Lyons (Oxford)
many useful suggestions, discussions,
encouragement...

38
Conclusions

A project to develop an open source, general
purpose software toolkit for statistical data
analysis is in progress
to provide a product of common interest to user
communities
Rigorous software process
to contribute to the quality of the product
Component-based architecture, OO methods and
generic programming
to ensure openness to evolution, maintainability,
ease of use
GoF component
Component for modeling multi-parametric fit
problems
Software released and results available
toolkit in use for Geant4 physics validation
incremental and iterative life-cycle