Title: Anaphe OO Libraries for Data Analysis using C++ and Python
1 Anaphe OO Libraries for Data Analysis using C++ and Python
Andreas Pfeiffer CERN IT/API andreas.pfeiffer_at_cern.ch
2 Outline
- Motivation
- LHC computing challenge
- Anaphe Components
- C++
- Lizard Interactive Data Analysis
- Python
- Software quality control
- Summary
3
4 LHC
[Figure labels: The Alps; Interaction Points; 100 m deep; 27 km circumference]
5 The Large Hadron Collider
- A completely new particle collider (start-up in 2006)
- the largest superconductor installation in the world
- A collision will take place every 25 nanoseconds
- But only one in a billion will be interesting
- And only one in a trillion will be really interesting!!!
- Real-time data filtering: Petabytes per second to Gigabytes per second
- Accumulated data: Petabytes per year
- Data mining by thousands of geographically dispersed scientists in hundreds of teams
6 LHC Computing Challenge
- 4 experiments will create a huge amount of data
- > 1 PetaByte/year for each experiment!
- 10^15 Bytes
- 1,000 TeraBytes
- 20,000 Redwood tapes
- 100,000 dual-sided DVD-RAM disks
- 1,500,000 sets of the Encyclopaedia Britannica (w/o photos)
- Need lots of CPU power to reconstruct/analyse
- about 1000 PC boxes per experiment (2005 ones!)
- 40,000 of today's boxes (dual P-III 800 MHz)
- complex data models
- reconstruction s/w is also used for online filtering
- needs high quality s/w in order not to waste beam time
7 Lifetime of LHC software: 25 yrs
[Timeline figure; label: WWW]
Thanks to Dino Ferrero Merlino (IT)
8 Technology (R)Evolution
- 10 yrs major cycle length (HW, SW, OS)
- 12 evolutionary changes in the market
- 1 revolutionary change
- towards greater diversity
- don't forget changes of requirements
- Consequences
- s/w written today most probably will be rewritten tomorrow
- we must anticipate changes
9 Anaphe: what it is
- Modular (OO/C++) replacement of CERNLIB functionality for use in HEP experiments
- memory management
- I/O
- foundation classes
- histogramming
- minimizing/fitting
- visualization
- interactive data analysis
- Trying to use standards wherever possible
- Trying to re-use existing class libraries
- This talk will not cover detector simulation (GEANT-4)
10 Anaphe Components
11 Use of Components with Abstract Interfaces
- User Code uses only Interface classes
- IHistogram1D* hist = histoFactory->create1D("track quality", 100, 0., 10.);
- Actual implementations are selected at run-time
- loading of shared libraries
- No change at all to user code but keep freedom to choose implementation
[Diagram label: Histo-Impl. 2]
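The pattern on this slide (user code written against an interface, implementation chosen at run time) can be sketched in Python as follows. This is only an illustration under stated assumptions: the class names DenseHistogram1D, SparseHistogram1D and HistoFactory, the method sum_bin_heights, and the "dense"/"sparse" selection keys are hypothetical and are not the AIDA or Anaphe API; only IHistogram1D, create1D and the "track quality" arguments are taken from the slide.

```python
from abc import ABC, abstractmethod


class IHistogram1D(ABC):
    """Abstract interface: the only histogram type user code refers to."""

    @abstractmethod
    def fill(self, x, weight=1.0):
        """Add one entry at x."""

    @abstractmethod
    def entries(self):
        """Number of fill() calls, including out-of-range ones."""

    @abstractmethod
    def sum_bin_heights(self):
        """Total weight that landed inside the axis range."""


class _FixedBins(IHistogram1D):
    """Shared bin arithmetic for the two hypothetical implementations."""

    def __init__(self, title, nbins, lo, hi):
        self.title, self.nbins, self.lo, self.hi = title, nbins, lo, hi
        self._entries = 0

    def _index(self, x):
        if not self.lo <= x < self.hi:
            return None
        width = (self.hi - self.lo) / self.nbins
        return min(int((x - self.lo) / width), self.nbins - 1)

    def entries(self):
        return self._entries


class DenseHistogram1D(_FixedBins):
    """Hypothetical implementation 1: one list slot per bin."""

    def __init__(self, title, nbins, lo, hi):
        super().__init__(title, nbins, lo, hi)
        self._bins = [0.0] * nbins

    def fill(self, x, weight=1.0):
        self._entries += 1
        i = self._index(x)
        if i is not None:
            self._bins[i] += weight

    def sum_bin_heights(self):
        return sum(self._bins)


class SparseHistogram1D(_FixedBins):
    """Hypothetical implementation 2: only non-empty bins are stored."""

    def __init__(self, title, nbins, lo, hi):
        super().__init__(title, nbins, lo, hi)
        self._bins = {}

    def fill(self, x, weight=1.0):
        self._entries += 1
        i = self._index(x)
        if i is not None:
            self._bins[i] = self._bins.get(i, 0.0) + weight

    def sum_bin_heights(self):
        return sum(self._bins.values())


class HistoFactory:
    """Chooses the implementation from a name known only at run time."""

    _implementations = {"dense": DenseHistogram1D, "sparse": SparseHistogram1D}

    def __init__(self, implementation):
        self._cls = self._implementations[implementation]

    def create1D(self, title, nbins, lo, hi):
        return self._cls(title, nbins, lo, hi)


def user_code(histo_factory):
    """User code: sees IHistogram1D only, never a concrete class."""
    hist = histo_factory.create1D("track quality", 100, 0.0, 10.0)
    for x in (0.55, 2.55, 2.56, 11.0):
        hist.fill(x)
    return hist
```

Calling user_code(HistoFactory("dense")) or user_code(HistoFactory("sparse")) runs the same user code against either implementation; in Anaphe the selection is made by loading a shared library rather than by a dictionary lookup.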
12 The AIDA project
- AIDA project (Abstract Interfaces for Data Analysis) was initiated at the HepVis99 workshop in Orsay
- Presently active: mainly developers from existing packages
- Tony Johnson (JAS)
- Andreas Pfeiffer (Lizard/Anaphe)
- Guy Barrand (OpenScientist)
- Mark Dönszelmann (Wired)
- Developers from LHCb/Gaudi
- more on AIDA tomorrow ...
13 Layered Approach
- Basic functionalities (histograms, fitting, etc.) are available as individual C++ class libraries.
- Easy replacing of one part without throwing away everything
- Objectivity/DB to provide persistence
- HepODBMS library (insulating layer, tags)
- Histogram library (HTL)
- Fitting libraries (Gemini, HepFitting)
- Graphics libraries (Qt, Qplotter)
- Insulate components through Abstract Interfaces
- wrapper layer to implement Interfaces in terms of existing libs
- Apply s/w quality control tools
- code checking, testing
14 Anaphe Components Overview
15 Anaphe Internals: Abstract Interfaces
16 Anaphe components
17 Basic 3D Graphics Libraries
- OpenGL (basic graphics)
- De-facto industry standard for basic 3D graphics
- Used in CAD/CAE, games, VR, medical imaging
- OpenInventor (scene mgmt.)
- OO 3D toolkit for graphics
- Cubes, polygons, text, materials
- Cameras, lights, picking
- 3D viewers/editors,animation
- Based on OpenGL/MesaGL
18 2D Graphics libraries
- Qt
- multi-platform C++ GUI toolkit
- C++ class library, not a wrapper around C libs
- superset of Motif and MFC
- available on Unix and MS Windows
- no change for developer
- commercial but with public domain version
- www.troll.no
- Qplotter
- add-on functionality for HEP
- HIGZ/HPLOT
19 Mathematical Libraries
- NAG (Numerical Algorithms Group) C Library
- Covers a broad range of functionality
- Linear algebra
- differential equations
- quadrature, etc.
- Special functions of CERNLIB added to Mark-6 release
- mostly for theory and accelerator
- Quality assurance
- extensive testing done by NAG
- www.nag.com
20 CLHEP - foundation classes
- HEP foundation class library
- Random number generators
- Physics vectors
- 3- and 4- vectors
- Geometry
- Linear algebra
- System of units
- more packages recently added
- will continue to evolve
- wwwinfo.cern.ch/asd/lhc/clhep/
21 Histograms: the HTL package
- Histograms are the basic tool for physics analysis
- Statistical information of density distributions
- Histogram Template Library (HTL)
- design based on C++ templates
- Modular: separation between sampling and display
- Extensible: open for user defined binning systems
- Flexible: support transient/persistent at the same time
- Open: large use of abstract interfaces
- recent addition: 3D histograms
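The two design points "separation between sampling and display" and "open for user defined binning systems" can be illustrated with a small Python sketch. HTL itself achieves this with C++ templates; the classes EvenBinning, VariableBinning and Histogram and the function render_text below are hypothetical names chosen for the illustration, not HTL types.

```python
import bisect


class EvenBinning:
    """Equal-width bins on [lo, hi)."""

    def __init__(self, nbins, lo, hi):
        self.nbins, self.lo, self.hi = nbins, lo, hi

    def index(self, x):
        if not self.lo <= x < self.hi:
            return None
        width = (self.hi - self.lo) / self.nbins
        return min(int((x - self.lo) / width), self.nbins - 1)


class VariableBinning:
    """User-defined binning: arbitrary increasing bin edges."""

    def __init__(self, edges):
        self.edges = sorted(edges)
        self.nbins = len(self.edges) - 1

    def index(self, x):
        if not self.edges[0] <= x < self.edges[-1]:
            return None
        return bisect.bisect_right(self.edges, x) - 1


class Histogram:
    """Sampling only: counts weights per bin, knows nothing about display."""

    def __init__(self, binning):
        self.binning = binning
        self.counts = [0.0] * binning.nbins

    def fill(self, x, weight=1.0):
        i = self.binning.index(x)
        if i is not None:
            self.counts[i] += weight


def render_text(hist):
    """Display only: one row of '#' per bin, read from the sampled counts."""
    return ["#" * int(round(c)) for c in hist.counts]
```

Any object offering nbins and index(x) can serve as a binning system, so a new binning scheme needs no change to Histogram or to the display code.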
22 Fitting and Minimization
- Fitting and Minimization Library (FML)
- common OO interface
- NAG-C, MINUIT
- based on Abstract Interfaces
- IVector, IModelFunction, ...
- fitting as a special case of minimization
- minimize distance between data and model
- replacement for HepFitting (and Gemini)
- Gemini
- common interface to minimizer engine
- very thin layer
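"Fitting as a special case of minimization" can be made concrete with a short sketch: a generic one-dimensional minimizer that knows nothing about data, and a fit that is just that minimizer applied to a chi-square distance between data and model. The golden-section search stands in for a real engine such as MINUIT or NAG-C, and the function names minimize_1d and fit_slope, the model y = slope * x, and the search interval are assumptions made for the example.

```python
import math


def minimize_1d(f, lo, hi, tol=1e-8):
    """Golden-section search for the minimum of a unimodal f on [lo, hi]."""
    inv_phi = (math.sqrt(5.0) - 1.0) / 2.0
    a, b = lo, hi
    while b - a > tol:
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
        if f(c) < f(d):
            b = d  # minimum lies in [a, d]
        else:
            a = c  # minimum lies in [c, b]
    return (a + b) / 2.0


def fit_slope(xs, ys, sigmas, lo=-100.0, hi=100.0):
    """Fit y = slope * x by minimizing chi-square over the slope."""

    def chi2(slope):
        return sum(((y - slope * x) / s) ** 2
                   for x, y, s in zip(xs, ys, sigmas))

    return minimize_1d(chi2, lo, hi)
```

For unit errors the result can be checked against the closed-form least-squares slope sum(x*y) / sum(x*x).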
23 Opening bracket: Persistency
24 Object persistency: two concepts, serial and page I/O
- Sequential access to objects (streaming)
- good in networking context or serial writes to file(s)
- much like good old Fortran
- often perceived to be simpler to implement (<<, >>)
- Navigational access to objects (buffered)
- I/O on demand for complex data models
- location transparent (for user) access to object
- typically by de-referencing of a smart pointer
- optimized for (random) disk access (disks deliver pages)
- sequential write to file(s) still ok
- Both concepts need to take care of changes of the internal structure of the objects (schema evolution)
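The difference between the two access styles can be sketched in a few lines of Python. ObjectStore and Ref are hypothetical stand-ins (an in-memory dictionary plays the role of the disk, and a read counter makes the I/O visible); they are not HepODBMS or Objectivity/DB classes.

```python
class ObjectStore:
    """Hypothetical stand-in for a persistent store; counts reads."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.reads = 0

    def load(self, oid):
        self.reads += 1
        return self._objects[oid]

    def stream(self):
        """Sequential access: yields every object in storage order."""
        for oid in self._objects:
            yield self.load(oid)


class Ref:
    """Smart-pointer-like handle: I/O happens on first de-reference."""

    def __init__(self, store, oid):
        self._store, self._oid = store, oid
        self._loaded, self._target = False, None

    def deref(self):
        if not self._loaded:
            self._target = self._store.load(self._oid)
            self._loaded = True
        return self._target
```

Holding a Ref costs no I/O; only the objects that the analysis actually de-references are read, whereas stream() reads everything in order.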
25 Architectural Issue: Persistency (Object-I/O)
- Brings a completely new quality into the design
- Objects now have a lifetime
- don't delete until you really are sure you want to
- persistency is a kind of intended memory leak
- would like to see no difference between memory and disk
- Layout of objects may change during (extended) life
- schema evolution
- additions/deletions of attributes
- changes of inheritance relations
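A minimal sketch of schema evolution for additions and deletions of attributes, assuming each stored record carries a version number. The field names chi2 and debug_flag, the schema_version key, and the three versions are invented for the illustration; an object database handles this inside the storage layer rather than in user code.

```python
CURRENT_VERSION = 3


def upgrade(record):
    """Return a copy of a stored record converted to the current layout."""
    record = dict(record)
    version = record.get("schema_version", 1)
    if version == 1:
        # v2 added the attribute 'chi2'; old objects get a neutral default.
        record.setdefault("chi2", None)
        version = 2
    if version == 2:
        # v3 deleted the attribute 'debug_flag'.
        record.pop("debug_flag", None)
        version = 3
    record["schema_version"] = version
    return record
```

Objects written years apart can then be read by the same application code, because every record is brought to the current layout on the way in.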
26 Architectural Issue: Persistency (Object-I/O) (II)
- Objects can be placed (clustering)
- de-coupling of logical and physical view of data
- Special care needed to ensure consistency in data set
- avoid reading group of objects (tracks, events, ...) for which writing/updating is not (yet) complete
- clean up if only part of the objects are written
- typically taken care of by using transactions
- Complications possible in distributed computing
- need to protect disk access now like memory access in past (Segmentation violation)
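How a transaction cleans up a partially written group of objects can be shown with SQLite from the Python standard library. SQLite is used here purely as a stand-in for the object database; the table layout and the functions open_store and write_event (including its fail_after switch, which simulates a crash) are assumptions made for the example.

```python
import sqlite3


def open_store():
    """Create an in-memory store with one table per object type."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events(id INTEGER PRIMARY KEY)")
    conn.execute("CREATE TABLE tracks(event_id INTEGER, pt REAL)")
    return conn


def write_event(conn, event_id, track_pts, fail_after=None):
    """Write an event and all its tracks in one transaction."""
    with conn:  # commits on success, rolls back if an exception escapes
        conn.execute("INSERT INTO events(id) VALUES (?)", (event_id,))
        for i, pt in enumerate(track_pts):
            if fail_after is not None and i == fail_after:
                raise RuntimeError("simulated crash while writing tracks")
            conn.execute(
                "INSERT INTO tracks(event_id, pt) VALUES (?, ?)",
                (event_id, pt),
            )


def count(conn, table):
    """Number of rows currently visible in the given table."""
    return conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
```

If write_event fails half-way, neither the event nor its already inserted tracks remain visible, so a reader never sees an incomplete group.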
27 Physical Model and Logical Model
- Physical model may be changed to optimise performance
- Existing applications continue to work transparently!
28 Object Model
Thanks to Vincenzo Innocente (CMS)
29 Physical clustering
Thanks to Vincenzo Innocente (CMS)
30 Closing bracket: Persistency
31 Tags, Ntuples and Events
- Tags: a special kind of Ntuple
- Always associated with an underlying persistent store
- Tags may be used to store ntuple-like data
- extracted from all over the event
- minPt, maxEmiss, nJets, nMuon, trigger, ...
- Main use: speedup of data selection for analysis
- Tag simplifies selection without losing complexity
- Events: more complex than a tree structure (CWN)
- lots of cross-references between classes, containers
- Association from the Tag to the Event may be used to navigate to any other part of the Event
- even from an interactive visualization program
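The tag idea (select on a small summary, then follow the association to the full event) might look like this in Python. The attribute names nJets and minPt come from the slide; the Event and Tag classes, the event_id association, the dictionary used as event store, and select_events are hypothetical simplifications.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Full event: arbitrarily complex in reality, a track list here."""
    event_id: int
    track_pts: list = field(default_factory=list)


@dataclass
class Tag:
    """Small ntuple-like summary with an association to its event."""
    nJets: int
    minPt: float
    event_id: int


def select_events(tags, event_store, cut):
    """Apply the cut to tags only, then navigate to the selected events."""
    return [event_store[tag.event_id] for tag in tags if cut(tag)]
```

Only events whose tag passes the cut are ever touched, which is where the selection speed-up comes from.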
32 AIDA compliance of Anaphe
- Presently (Anaphe 3.x) only AIDA 1.0 compliant
- Plan to implement AIDA 2.2 Interfaces by end 2001 (Anaphe 4.x)
- initially as wrappers to existing interfaces/packages
- Will maintain 3.x for some time
- ensures stability for users
- Development will concentrate on 4.x
- while AIDA will evolve further
- Similar time schedule as JAS (Tony Johnson)
- OpenScientist (Guy Barrand) already there
33 Lizard: a tool for Interactive Data Analysis
34 Interactive Data Analysis
- Aim: OO replacement for PAW (at least)
- analysis of ntuple-like data (Tags, Ntuples, ...)
- visualisation of data (Histograms, scatter-plot, Vectors)
- fitting of histograms (and other data)
- access to experiment specific data/code
- Maximize flexibility and re-use
- Foresee customization/integration
- allow use from within experiments' s/w
- Plan for extensions
- code for now, design for the future
- Ensure maintainability
- use of s/w quality control tools
35 Lizard
- An AIDA-compliant interactive analysis tool
- Python scripting
- Visualization with Qt
- HTL histograms (via AIDA)
- Persistency with Objectivity
- Fitting with NAG Libraries (or Minuit)
- Components available as shared libraries
- independent of the scripting language
- can also be used in C++ programs (Geant4)
36 Scripting - why
- Typical use of scripting is quite different from programming (reconstruction, analysis, ...)
- history: go back to where I was before
- repetition/looping: with modifiable parameters
- avoid "one size fits all" or using a power-tool as a hammer
- rapid prototyping in scripting language
- quick turn-around times
- performance critical code in core language
- exploit richer set of features/functionality (e.g. templates in C++)
- scripting languages usually less susceptible to changes than mainstream languages
- potentially longer lifetimes
37 Python - why
- Python: OO (scripting) language
- no strange !-variables
- sensitive to indentation
- Easier for users
- as Java
- Lots of user supplied modules available and ready for use
- scientific, numerics, graphics, GUI, network, OS, games, DBs, ...
- example: http://www.vex.net/parnassus/
- Parnassus Totals: 1173 items in 49 categories.
- Also usable in Java (Jython)
- used in JAS for scripting
- minimize changes needed within AIDA compliant environments
38 Python - how
- SWIG to (semi-)automatically create connection to chosen scripting language
- allows flexibility to choose amongst several scripting languages
- Python, Perl, Tcl, Guile, Ruby, (Java)
- Very easy to use
- swig -c++ -python -shadow -c myClass.h
- create shared lib from myClass.cpp and myClass_wrap.c
- start python and import myClass.h to use it
- Very easy to extend
- simply inherit from swiggified class in python
- modifications can later be fed back into C++
- performance, type safety, special language features (templates), ...
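The "simply inherit from swiggified class in python" step can be sketched as below. Because no SWIG-generated module is available here, MyClass is a plain-Python stand-in for the shadow class that SWIG would generate from myClass.h; its methods sum and size, and the subclass MyClassWithMean, are invented for the illustration.

```python
class MyClass:
    """Stand-in for a SWIG shadow class wrapping the C++ myClass.

    In a real session this class would be imported from the generated
    module instead of being defined in Python.
    """

    def __init__(self, values):
        self._values = list(values)

    def sum(self):
        return sum(self._values)

    def size(self):
        return len(self._values)


class MyClassWithMean(MyClass):
    """Python-side extension: prototype a new method by inheritance."""

    def mean(self):
        n = self.size()
        return self.sum() / n if n else 0.0
```

Once such a prototype has proven useful, the new method can be moved into the C++ class and the wrapper regenerated, as the last bullets of the slide suggest.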
39 PAW -> Lizard translation
- Ntuple projection: Lizard
- lizard --useHBook
- :-) nt = ntm.findNtuple("higgscand.hbk::cands")
- :-) nplot1D(nt, "mass", "quality==5 && cut > 198")
- Ntuple projection: PAW
- pawX11
- paw> h/file 1 higgscand.hbk
- paw> nt/pl 10.mass quality=5.and.cut>198
- Assuming file higgscand.hbk contains ntuple with number 10 and title cands
[Slide annotation: Any valid C++ expression]
40 Example script (ntuple)
# get list of names of all tuples from tuplemanager
ntm.listTuples()
# retrieve tuple by name
nt1 = ntm.findNtuple("Charm1")
# create 1D histos to project into
h1 = hm.create1D(10, "mass", 100, 0., 5000.)
h2 = hm.create1D(20, "mass for pt1>10", 100, 0., 5000.)
# project the attribute "MASS" into histo h1 without cut ("")
nt1.project1D(h1, "", "MASS")
# project the attribute "MASS" into histo h2 with cut ("PT1>10")
nt1.project1D(h2, "PT1>10", "MASS")
41(No Transcript)
42 Lizard: History and Present Status
- Started after CHEP-2000
- Full version out since June 2001
- PAW-like analysis functionality, plus
- on-demand loading of compiled code using shared libraries
- gives full access to experiment's analysis code and data
- based on Abstract Interfaces
- flexible and extensible
- License free version since Sep. 2001
- HBook for RWNtuples and Histogram storage
- Minuit as minimizer engine
43 Users and Collaborations
- AIDA spoken here!
- IGUANA (CMS visualization)
- GAUDI (LHCb/HARP) framework
- ATHENA (Atlas) framework
- Analyzer modules in Geant 4
- JAS
- Open Scientist
- you?
44
45 Software quality control
- Using tools for testing/checking has started
- Insure, CodeWizard
- Package dependencies: Ignominy
- Set of perl and shell scripts by Lassi Tuura (CMS)
- Ignominy scans
- Make dependency data produced by the compilers (.d files)
- Source code for includes (resolved against the ones actually seen)
- Shared library dependencies (ldd output)
- Defined and required symbols (nm output)
- And maps
- Source code and binaries into packages
- include dependencies into package dependencies
- Unresolved/defined symbols into package dependencies
ignominy: dishonour, disgrace, shame; infamy; the condition of being in disgrace, etc. (Oxford English Dictionary)
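The mapping step described on this slide, from file-level include dependencies to package-level dependencies, can be sketched as follows. This is not Ignominy's code (which is a set of perl and shell scripts); the function package_dependencies and the shape of its two inputs are assumptions made for the illustration.

```python
def package_dependencies(file_to_package, file_includes):
    """Lift file-level include edges to package-level dependency edges.

    file_to_package: {file name: package name}
    file_includes:   {file name: iterable of included file names}
    Includes of files outside any known package are ignored.
    """
    deps = {}
    for src, includes in file_includes.items():
        src_pkg = file_to_package[src]
        deps.setdefault(src_pkg, set())
        for inc in includes:
            inc_pkg = file_to_package.get(inc)
            if inc_pkg is not None and inc_pkg != src_pkg:
                deps[src_pkg].add(inc_pkg)
    return deps
```

The resulting package graph is the input from which dependency diagrams and coupling metrics can be derived.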
46 Ignominy Analysis of Anaphe
- Distribution of tools and utilities for LHC era physics
- Combination of commercial, free and HEP software
- Claims to be a toolkit
- Seems to live up to its toolkit claims
- Good work on modularity
- Clean design is evident in many places
- Dependency diagrams often split naturally into functional units
Thanks to Lassi Tuura (CMS)
47 Package Metrics
- Size: total amount of source code (not normalised across projects!)
- ACD: average component dependency (≈ libraries linked in)
- CCD: cumulative component dependency, sum of single-package component dependencies over whole release
- NCCD: measure of CCD compared to a balanced binary tree
- A good toolkit's NCCD will be close to 1.0
- < 1.0: structure is flatter than a binary tree (≈ independent packages)
- > 1.0: structure is more strongly coupled (vertical or cyclic)
- Aim: NCCD ≈ 1 for given software/functionality
Thanks to Lassi Tuura (CMS)
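The metrics can be computed from a package dependency graph as sketched below. The sketch assumes the usual definitions from Lakos ("Large-Scale C++ Software Design"), which the slide wording appears to follow: each package's component dependency counts the package itself plus everything it reaches transitively, and the CCD of a balanced binary tree with N nodes is (N + 1) * log2(N + 1) - N. The function names and the dictionary-based graph format are hypothetical.

```python
import math


def reachable(graph, start):
    """All packages needed by start, including start itself."""
    seen, stack = set(), [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(graph.get(node, ()))
    return seen


def metrics(graph):
    """Return CCD, ACD and NCCD for {package: [direct dependencies]}.

    Every package must appear as a key, even if it depends on nothing.
    """
    n = len(graph)
    ccd = sum(len(reachable(graph, pkg)) for pkg in graph)
    ccd_balanced_tree = (n + 1) * math.log2(n + 1) - n
    return {"CCD": ccd, "ACD": ccd / n, "NCCD": ccd / ccd_balanced_tree}
```

A perfect binary tree of seven packages gives NCCD = 1.0, seven independent packages give a value below 1, and seven packages in a dependency cycle give a value well above 1, matching the three cases listed on the slide.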
48 Metrics: NCCD vs Cycles
- NCCD (spaghetti index)
- ≈ 1.0: good toolkit
- < 1.0: indep. packages
- > 1.0: strongly-coupled
[Plot: NCCD vs cycles; points for ATLAS, ROOT, ORCA, G4, COBRA, Anaphe, IGUANA; regions labelled Toolkits and Frameworks; annotation: Includes Fortran]
Thanks to Lassi Tuura (CMS)
49 Future enhancements
- Access to other implementations of components
- HBOOK CWNtuples
- Reading of ROOT (> V3.0) files
- similar to Tony Johnson's (Java) RootIO package
- AIDA Ntuple/Histo store
- optimized for Ntuples, Histograms as (compressed) XML
- Communication with Java tools/packages (JAS, Wired)
- via AIDA
- Adding other scripting languages
- Perl, Tcl, cint?
50 Challenge: Distributed Computing
- Motivation
- move code to data
- parallel analysis
- Techniques
- services via AI (Abstract Interfaces)
- late binding
- plug-in architecture
- End-user (Lizard)
- look-and-feel of local analysis
- R&D started and first prototype available soon
- CORBA based
51 Summary
- The architecture of Anaphe shows some important items for flexible and modular data analysis
- weak coupling between components through use of Abstract Interfaces
- basic functionality is covered by individual C++ class libraries
- emphasis on usability and maintainability
- Major criteria are flexibility, extensibility and interoperability
- recent example: GEANT-4 examples (based on AIDA)
- Lizard is an Interactive Data Analysis Tool based on Anaphe components and the Python scripting language (through SWIG)
- Lizard is young but has a very solid base in mature Anaphe libraries
- real plug-in structure
- Software quality control is important
- tools help to optimize dependencies / minimize maintenance effort
52 More information
- cern.ch/Anaphe
- cern.ch/Anaphe/Lizard
- aida.freehep.org/
- cern.ch/DB
- wwwinfo.cern.ch/asd/lhc/clhep/
53
54 Analysis of Geant4
- Fairly large C++ project
- Very fine-grained (and multi-level) package structuring
- Seems quite clean from the preliminary analysis
- Fine package subdivision helps in many ways but makes analysis and code understanding more complicated
- One subsystem seems strongly coupled and needs attention
- Need to study the use of the internal command system
Thanks to Lassi Tuura (CMS)
55 Analysis of ROOT
- ROOT developers have done a formidable job of breaking binary (shared library) dependencies, but
- For example: by static analysis, nothing seems to use the postscript package directly (no incoming dependencies), but there is this code
- void TPad::Print(const char *filename, Option_t *option)
- TVirtualPS *psave = gVirtualPS;
- if (gROOT->LoadClass("TPostScript","Postscript")) return;
- gROOT->ProcessLineFast("new TPostScript()");
- gVirtualPS->Open(psname,pstype);
- gVirtualPS->SetBit(kPrintingPS);
- Taking these and global objects into account makes the dependency diagrams very different
- Sign of fast growth? Need a next evolutionary step?
- So coherent that replacing parts could get painful
Thanks to Lassi Tuura (CMS)
56 Analysis of ROOT
[Dependency diagrams; labels: Binary; Source; Logical; Real; Binary only]
Thanks to Lassi Tuura (CMS)
57 Metrics: NCCD vs ACD
[Plot; points for ATLAS, ROOT, ORCA, G4, COBRA, IGUANA, Anaphe; regions labelled Toolkits and Frameworks]
Thanks to Lassi Tuura (CMS)
58 Metrics: NCCD vs Size
[Plot; points for ATLAS, ROOT, ORCA, G4, COBRA, IGUANA, Anaphe; regions labelled Toolkits and Frameworks]
Thanks to Lassi Tuura (CMS)
59 Metrics: NCCD vs AID
[Plot; points for ATLAS, ROOT, ORCA, COBRA, G4, Anaphe, IGUANA; regions labelled Toolkits and Frameworks]
Thanks to Lassi Tuura (CMS)
60 Metrics: Packages vs Size
[Plot; points for ATLAS, ORCA, G4, COBRA, IGUANA, Anaphe, ROOT; regions labelled Toolkits and Frameworks]
Thanks to Lassi Tuura (CMS)
61 Metrics: Packages vs Size
[Plot; points for ATLAS, ORCA, G4, COBRA, IGUANA, ROOT, Anaphe; regions labelled Toolkits and Frameworks]
Thanks to Lassi Tuura (CMS)
62 Scripting in Lizard
[Diagram labels: User; Python; Controller; Shadow classes (automatically generated by SWIG); C++ interfaces (AIDA Interfaces); C++ implementations (Anaphe implementations)]
63 Software life cycle for LHC expts.
- LHC starts 2006
- at least 10 yr of running
- additionally at least 5 yr of data analysis
64 Lifetime of LHC software: 25 yrs