Title: Introduction to High Energy Physics Data Analysis Software

1. Introduction to High Energy Physics Data Analysis Software
Pere Mato (CERN/PH-SFT)
Very much from a Large Hadron Collider (LHC) perspective
2. Talk Outline
- Physics signatures and rates
- Data processing and datasets
- Software structure and frameworks
- Software components and domains
- Usage of third-party software
- Summary
3. Example: The ATLAS Detector
- The ATLAS collaboration:
  - 2000 physicists from 150 universities and labs in 34 countries
  - distributed resources
  - remote development
- The ATLAS detector:
  - is 26 m long
  - stands 20 m high
  - weighs 7000 tons
  - has 200 million read-out channels
4. ATLAS Physics Signatures and Event Rates
- LHC pp collisions at √s = 14 TeV
  - Bunches cross at 40 MHz
  - σ(inelastic) ≈ 80 mb
  - at high luminosity, >> 1 pp collision per crossing
  - ~10^9 collisions per second
- Study different physics channels, each with its own signature, e.g.
  - Higgs
  - Supersymmetry
  - B physics
- Interesting physics events are buried in backgrounds of uninteresting
  physics events (1 in 10^5 to 10^9 of recorded events)
5. HEP Processing Stages and Datasets
(Flow diagram) The detector produces raw data; the event filter (selection and
reconstruction) feeds event reconstruction, which writes processed data as
Event Summary Data (ESD); batch physics analysis extracts Analysis Object Data
(AOD) by physics topic, which is the input to individual physics analysis.
Event simulation feeds the same chain with simulated raw data.
6. Data and Algorithms
- HEP data is mainly organized in Events (particle collisions)
  - Simulation, Reconstruction and Analysis programs process one Event at a time
  - Events are fairly independent of each other
  - Trivially parallel processing
- Event-processing programs are composed of a number of Algorithms that select
  and transform raw Event data into new processed Event data and statistics
  - Algorithms are mainly developed by Physicists
- Algorithms may require additional detector conditions data (e.g.
  calibrations, geometry, environmental parameters, etc.)
- Statistical data (histograms, distributions, etc.) are typically the final
  data-processing results
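The event-loop model above can be sketched in a few lines of plain Python. This is an illustration only: the `Algorithm` base class, `SelectHighPt`, and the transverse-momentum cut are hypothetical stand-ins, not any experiment's actual API.

```python
# Sketch of an event-processing loop: a chain of Algorithms transforms each
# Event in turn; statistics accumulate as the final result.

class Algorithm:
    """Base class: each algorithm selects or transforms event data."""
    def execute(self, event):
        raise NotImplementedError

class SelectHighPt(Algorithm):
    """Keep only tracks above a (hypothetical) transverse-momentum cut."""
    def __init__(self, pt_min):
        self.pt_min = pt_min
    def execute(self, event):
        event["good_tracks"] = [t for t in event["tracks"] if t["pt"] > self.pt_min]
        return event

class FillHistogram(Algorithm):
    """Accumulate statistics -- typically the final processing result."""
    def __init__(self):
        self.counts = []
    def execute(self, event):
        self.counts.append(len(event["good_tracks"]))
        return event

def run(events, algorithms):
    # Events are independent of each other, so this loop is trivially
    # parallelizable across events.
    for event in events:
        for alg in algorithms:
            event = alg.execute(event)

events = [{"tracks": [{"pt": 1.2}, {"pt": 25.0}]},
          {"tracks": [{"pt": 40.0}, {"pt": 60.0}]}]
hist = FillHistogram()
run(events, [SelectHighPt(pt_min=20.0), hist])
print(hist.counts)  # per-event multiplicity of selected tracks
```

Real frameworks add scheduling, data stores and configuration around exactly this pattern.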
7. Data Hierarchy
- RAW (2 MB/event): triggered events recorded by the DAQ; detector
  digitisation; ~10^9 events/yr at 2 MB/event ≈ 2 PB/yr
- ESD (100 kB/event): reconstructed information; pattern-recognition output
  such as clusters and track candidates
- AOD (10 kB/event): analysis information; physical information such as
  transverse momentum, association of particles, jets, (best) identification
  of particles
- TAG (1 kB/event): classification information relevant for fast event
  selection
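The per-tier yearly volumes follow directly from the quoted per-event sizes; a quick back-of-envelope check, assuming 10^9 events per year:

```python
# Yearly data volume per tier = events/year * bytes/event (sizes from the
# hierarchy above; 1 TB taken as 10**12 bytes).

EVENTS_PER_YEAR = 10**9
sizes = {"RAW": 2_000_000, "ESD": 100_000, "AOD": 10_000, "TAG": 1_000}  # bytes/event

volumes_tb = {tier: size * EVENTS_PER_YEAR / 1e12 for tier, size in sizes.items()}
print(volumes_tb)  # RAW: 2000 TB (= 2 PB/yr), ESD: 100 TB, AOD: 10 TB, TAG: 1 TB
```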
8. Software Organization
- Applications: built on top of frameworks, implementing the required
  algorithms (Reconstruction, Simulation, High-Level Triggers, Analysis)
- Frameworks / Toolkits: one framework for basic services, plus various
  specialized frameworks (detector description, visualization, persistency,
  interactivity, simulation, etc.)
- Foundation Libraries: a series of widely used basic libraries (STL, CLHEP,
  GSL, etc.)
9. Software Frameworks
- Experiments develop Software Frameworks
  - General architecture of the event-processing applications
  - To achieve coherency and to facilitate software re-use
  - Hide technical details from the end-user Physicists (the providers of the
    Algorithms)
- Applications are developed by customizing the Framework
  - By composing elemental Algorithms into complete applications
  - Using third-party components wherever possible and configuring them
- Example: the Gaudi framework (C++), in use by LHCb and ATLAS
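The "customize the framework" idea can be sketched conceptually: the framework owns common services and the event loop, and the physicist only registers algorithms. This is a hypothetical Python miniature, not the Gaudi API (Gaudi is C++ with job-options configuration files).

```python
# Sketch: the framework hides technicalities (services, event loop); the
# user composes an application by registering algorithms.

class MessageService:
    """A shared service provided by the framework."""
    def log(self, who, text):
        print(f"[{who}] {text}")

class AppMgr:
    """Framework application manager: owns services and drives the loop."""
    def __init__(self):
        self.msg = MessageService()
        self.algorithms = []
    def add_algorithm(self, name, func):
        self.algorithms.append((name, func))
    def run(self, events):
        for event in events:
            for name, func in self.algorithms:
                func(event)
                self.msg.log(name, "processed one event")

# Composing an application = configuration, not framework changes.
app = AppMgr()
app.add_algorithm("Tracking", lambda ev: ev.setdefault("tracks", []))
app.add_algorithm("Vertexing", lambda ev: ev.setdefault("vertices", []))
events = [{}]
app.run(events)
print(events[0])
```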
10. Software Components
- Foundation Libraries
- Basic types
- Utility libraries
- System isolation libraries
- Mathematical Libraries
- Special functions
- Minimization, Random Numbers
- Data Organization
- Event Data
- Event Metadata (Event collections)
- Detector Conditions Data
- Data Management Tools
- Object Persistency
- Data Distribution and Replication
- Simulation Toolkits
- Event generators
- Detector simulation
- Statistical Analysis Tools
- Histograms, N-tuples
- Fitting
- Interactivity and User Interfaces
- GUI
- Scripting
- Interactive analysis
- Data Visualization and Graphics
- Event and Geometry displays
- Distributed Applications
- Parallel processing
- Grid computing
11. Components and Domains
12. Event Data
- Complex data models
  - 500 structure types
  - References to describe relationships between event objects
    (unidirectional)
  - Need to support transparent navigation
- Need ultimate resolution on selected events
  - need to run specialised algorithms
  - work interactively
- Not affordable if uncontrolled
13. HEP Metadata - Event Collections
- Bookkeeping
- Event tag collection (table of per-event attributes):
  Tag 1: 5, 0.3; Tag 2: 2, 1.2; ...; Tag M: 8, 3.1
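The point of an event tag collection is that selections run over the small tag records first, and only matching events need full data access. A sketch with hypothetical attribute names:

```python
# Sketch of TAG-based fast event selection: query the compact ~1 kB/event
# summary records; fetch full RAW/ESD data only for the matches.
# (Attribute names n_muons / max_pt are illustrative, not a real schema.)

tags = [
    {"event": 101, "n_muons": 0, "max_pt": 12.0},
    {"event": 102, "n_muons": 2, "max_pt": 45.0},
    {"event": 103, "n_muons": 1, "max_pt": 80.0},
]

# The query touches only the tags, never the multi-MB event records.
selected = [t["event"] for t in tags if t["n_muons"] >= 1 and t["max_pt"] > 40.0]
print(selected)  # only these event ids need full data access
```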
14. Detector Conditions Data
- Reflects changes in the state of the detector with time
- Event Data cannot be reconstructed or analyzed without it
- Versioning
- Tagging
- Ability to extract slices of the data required to run with a job
- Long life-time
(Figure: conditions data versions as a function of time, with a "Tag1"
definition selecting one version per time interval)
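Conditions data is typically addressed by an interval of validity (IOV): each version of a quantity is valid over a time range, and a tag pins one consistent set of versions for reproducible reprocessing. A minimal sketch, with an invented data layout:

```python
# Sketch of interval-of-validity (IOV) lookup for conditions data.
# The dict layout and quantity names are hypothetical illustrations.

conditions = {  # quantity -> list of (valid_from, valid_until, payload)
    "calibration": [(0, 100, {"gain": 1.00}),
                    (100, 200, {"gain": 1.02})],
}

def lookup(quantity, time):
    """Return the payload whose validity interval contains `time`."""
    for start, end, payload in conditions[quantity]:
        if start <= time < end:
            return payload
    raise KeyError(f"no {quantity} valid at t={time}")

print(lookup("calibration", 150))
```

Versioning would add a third axis (several payloads per interval), with a tag choosing one version per interval, as in the figure above.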
15. LHC Data Management Requirements
- Increasing focus on maintainability and change management for core software,
  due to the long LHC lifetime
  - anticipate changes in technology
  - adapt quickly to changes in environment and physics focus
- Common solutions will considerably simplify the deployment and operation of
  data management in centres distributed worldwide
  - Common persistency framework (POOL project)
  - Interactive data analysis framework (ROOT project)
- Strong involvement of the experiments from the beginning is required to
  provide requirements
  - some experimentalists participate directly in POOL
  - some work with software providers on integration in experiment frameworks
16. Common Persistency Framework (POOL)
- Provides persistency for C++ transient objects
- Supports transparent navigation between objects across file and technology
  boundaries
  - without requiring the user to explicitly open files or database
    connections
- Follows a technology-neutral approach
  - Abstract component C++ interfaces
  - Insulates experiment software from concrete implementations and
    technologies
- Hybrid technology approach combining
  - Streaming technology for complex C++ objects (event data)
    - event data is typically write-once, read-many (concurrent access is
      simple)
  - Transaction-safe Relational Database (RDBMS) services
    - for catalogs, collections and other metadata
- Allows data to be stored in a distributed and grid-enabled fashion
  - Integrated with an external File Catalog that keeps track of the physical
    location of files, allowing files to be moved or replicated
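The file-catalog idea is what makes navigation location-independent: user code addresses objects through a logical name plus an in-file identifier, and only the catalog maps logical to physical. A hypothetical sketch (not the POOL API):

```python
# Sketch of logical-to-physical file resolution via a catalog: moving or
# replicating a file only updates the catalog, never the user code.
# (Token format and paths are invented for illustration.)

catalog = {"lfn:run1234/events.root": "/data/site-a/events.root"}

def resolve(token):
    """Split 'logical-file-name#object-id' and resolve the physical file."""
    lfn, _, object_id = token.partition("#")
    return catalog[lfn], object_id

# Replicating the file to another site changes only the catalog entry:
catalog["lfn:run1234/events.root"] = "/data/site-b/events.root"
print(resolve("lfn:run1234/events.root#event-42"))
```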
17. Simulation
- Event Generators
  - Programs that generate high-energy physics events following the theory
    and models for a number of physics aspects
- Specialized Particle Decay Packages
  - Simulation of particle decays using the latest experimental data
- Detector Simulation
  - Simulation of the passage of particles through matter and electromagnetic
    fields
  - Detailed geometry and material descriptions
  - Extensive list of physics processes based on theory, data or
    parameterization
- Detector Responses
  - Simulation of the detecting devices and corresponding electronics
18. Distributed Analysis
- Analysis will be performed with a mix of official experiment software and
  private user code
  - How can we make sure that the user code can execute and provide a correct
    result wherever it lands?
- Input datasets are not necessarily known a priori
- Possibly very sparse data access patterns when only very few events match
  the query
- Large numbers of people submitting jobs concurrently and in an
  uncoordinated fashion, resulting in a chaotic workload
- Wide range of user expertise
- Need for interactivity: requirements on system response time rather than
  throughput
- Ability to suspend an interactive session and resume it later, in a
  different location
19. Data Analysis: The Spectrum
- From Batch Physics Analysis
  - Runs on the complete data set (from TB to PB)
  - Reconstruction of non-visible particles from decay products
  - Classification of events based on physical properties
  - Several non-exclusive data streams with summary information (event tags)
    (from GB to TB)
  - Costly operation
- To Interactive Physics Analysis
  - Final event selection and refinements (few GB)
  - Histograms, N-tuples, fitting models
  - Data visualization
  - Scripting and GUI
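The interactive end of the spectrum boils down to a final selection on a small n-tuple followed by histogramming. A plain-Python sketch (no ROOT; the `mass` column and cut values are invented for illustration):

```python
# Sketch: final event selection on an n-tuple, then a simple fixed-bin
# histogram -- the kind of step done interactively on a few GB of data.

ntuple = [{"mass": 89.1}, {"mass": 91.3}, {"mass": 250.0}, {"mass": 90.7}]

# Final selection: keep candidates in a (hypothetical) mass window.
selected = [row["mass"] for row in ntuple if 80.0 < row["mass"] < 100.0]

# Histogram over [80, 100) with 4 bins of width 5.
bins = [0, 0, 0, 0]
for m in selected:
    bins[int((m - 80.0) // 5.0)] += 1
print(bins)
```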
20. Experiment-Specific Analysis Frameworks
- Development of Event models and high-level analysis tools specific to the
  experiment's physics goals
- Example: DaVinci (LHCb)
21. ROOT
- The ROOT system is an Object-Oriented framework for large-scale data
  analysis, written in C++
- It includes, among others
  - Efficient object persistency facilities
  - A C++ interpreter
  - Advanced statistical analysis (multi-dimensional histogramming, fitting,
    minimization, cluster-finding algorithms) and visualization tools
- The user interacts with ROOT via a graphical user interface, the command
  line, or batch scripts
- The project started in 1995 and is now a very mature system used by many
  physicists worldwide
22. ROOT Packages
23. ROOT Graphics
24. ROOT GUI
25. ROOT: Self-describing Files
- A dictionary for the persistent classes is written to the file when the
  file is closed
- ROOT files can be read by foreign readers (e.g. JavaRoot)
- Support for backward and forward compatibility
  - Automatic schema evolution
  - Files created in 2003 must be readable in 2015
- Classes (data objects) for all objects in a file can be regenerated
26. ROOT: Basic Data Types
- Histograms
  - 1D, 2D, 3D and functions
- Ntuples
  - support PAW-like ntuples
  - PAW ntuples/histograms can be imported
- Trees
  - Extension of Ntuples to Objects
  - Collection of branches (each branch has its own buffer)
  - Can read a partial Event
  - Can have several Trees in parallel
- Chains
  - Collections of Trees
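The Tree/Chain relationship can be sketched conceptually: a Tree holds numbered entries (in ROOT, split into per-branch buffers), and a Chain presents several Trees as one continuous entry sequence. This is plain Python mimicking the idea, not the ROOT C++ API:

```python
# Conceptual sketch: a Chain forwards a global entry number to whichever
# Tree in the collection actually holds that entry.

class Tree:
    def __init__(self, entries):
        self.entries = entries          # in ROOT: per-branch buffers
    def get_entry(self, i):
        return self.entries[i]

class Chain:
    """A collection of Trees viewed as one long sequence of entries."""
    def __init__(self, trees):
        self.trees = trees
    def get_entry(self, i):
        for tree in self.trees:         # locate the tree holding entry i
            if i < len(tree.entries):
                return tree.get_entry(i)
            i -= len(tree.entries)
        raise IndexError(i)

chain = Chain([Tree(["e0", "e1"]), Tree(["e2", "e3", "e4"])])
print(chain.get_entry(3))  # read transparently from the second tree
```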
27. ROOT: Memory <--> Tree
(Figure: a Tree T with entries 0-18; T.GetEntry(6) reads entry 6 from the
Tree into memory, and T.Fill() appends the current in-memory event as a new
entry.)
28. Data Visualization
- Experiments develop interactive Event and Geometry display programs
  - Help to develop detector geometries
  - Help to develop pattern-recognition algorithms
  - Interactive data analysis and data presentation
- Ingredients
  - GUI
  - 3-D graphics
  - Scripting
- Example: ORCA visualization in CMS (IGUANA)
29. Used External Products
- Statistical Analysis Tools
  - ROOT, GSL, ...
- Interactivity and User Interfaces
  - Qt, Python, ROOT, ...
- Data Visualization and Graphics
  - Coin, OpenGL, ...
- Distributed Applications
  - PROOF, Globus, EDG, ...
- Foundation Libraries
  - STL, Boost, CLHEP, Zlib, ...
- Mathematical Libraries
  - NAG C, GSL, CLHEP, ...
- Data Organization
  - Oracle, MySQL, Xerces-C, ...
- Data Management Tools
  - ROOT, Oracle, MySQL, EDG, ...
- Simulation Toolkits
  - Pythia, Herwig, Geant4, Fluka, ...
30. Summary
- HEP applications are characterized by
  - The amount and complexity of the data
  - The large size and geographically dispersed nature of the collaborations
  - Most of the algorithmic software being written by Physicists
  - Expected long lifetimes
- Development of Software Frameworks
  - Ensure coherency in the Event data-processing applications
  - Make the life of Physicists easier by hiding most of the technicalities
  - Withstand technology changes
- A variety of different software domains and expertise is required
  - Data Management, Simulation, Interactive Visualization, Distributed
    Computing, etc.
- Extensive use of third-party generic software
  - Open-source products favored
31. The LHC Detectors