Anaphe: OO Libraries for Data Analysis using C++ and Python


1
Anaphe: OO Libraries for Data Analysis using C++
and Python
Andreas Pfeiffer, CERN IT/API, andreas.pfeiffer_at_cern.ch
2
Outline
  • Motivation
  • LHC computing challenge
  • Anaphe Components
  • C++
  • Lizard Interactive Data Analysis
  • Python
  • Software quality control
  • Summary

3
  • LHC Computing challenge

4
LHC
[Figure labels: The Alps; Interaction Points; 100 m deep; 27 km circumference]
5
The Large Hadron Collider
  • A completely new particle collider (start-up in
    2006)
  • the largest superconductor installation in the
    world
  • A collision will take place every 25 nanoseconds
  • But only one in a billion will be interesting
  • And only one in a trillion will be really
    interesting !!!
  • Real-time data filtering: Petabytes per second
    to Gigabytes per second
  • Accumulated data: Petabytes per year
  • Data mining by thousands of geographically
    dispersed scientists in hundreds of teams

6
LHC Computing Challenge
  • 4 experiments will create a huge amount of data
  • >1 PetaByte/year for each experiment !
  • 10^15 Bytes
  • 1,000 TeraBytes
  • 20,000 Redwood tapes
  • 100,000 dual-sided DVD-RAM disks
  • 1,500,000 sets of the Encyclopaedia Britannica
    (w/o photos)
  • Need lots of CPU power to reconstruct/analyse
  • about 1000 PC boxes per experiment (2005 ones !)
  • 40,000 of today's boxes (dual P-III 800 MHz)
  • complex data models
  • reconstruction s/w is also used for online
    filtering
  • needs high quality s/w in order not to waste beam
    time

7
Lifetime of LHC software 25 yrs
WWW
Thanks to Dino Ferrero Merlino(IT)
8
Technology (R)Evolution
  • 10 yrs major cycle length (HW,SW,OS)
  • 12 evolutionary changes in the market
  • 1 revolutionary change
  • towards greater diversity
  • don't forget changes of requirements
  • Consequences
  • s/w written today most probably will be rewritten
    tomorrow
  • we must anticipate changes

9
Anaphe: what it is
  • Modular (OO/C++) replacement of CERNLIB
    functionality for use in HEP experiments
  • memory management
  • I/O
  • foundation classes
  • histogramming
  • minimizing/fitting
  • visualization
  • interactive data analysis
  • Trying to use standards wherever possible
  • Trying to re-use existing class libraries
  • This talk will not cover detector simulation
    (GEANT-4)

10
Anaphe Components
11
Use of Components with Abstract Interfaces
  • User Code uses only Interface classes
  • IHistogram1D *hist = histoFactory->create1D("track quality", 100, 0., 10.)
  • Actual implementations are selected at run-time
  • loading of shared libraries
  • No change at all to user code but keep freedom
    to choose implementation
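
The idea on this slide can be sketched in Python. This is only an illustration under stated assumptions: the interface name IHistogram1D and the create1D call are taken from the slide, while ListHistogram1D, SparseHistogram1D, HistoFactory and the IMPLEMENTATIONS registry are hypothetical stand-ins (a dictionary lookup replaces the loading of a shared library).

```python
from abc import ABC, abstractmethod


class IHistogram1D(ABC):
    """Interface seen by user code (name borrowed from the slide)."""

    @abstractmethod
    def fill(self, x, weight=1.0): ...

    @abstractmethod
    def entries(self): ...


def _bin_index(x, nbins, lo, hi):
    """Return the bin index for x, or None if x is outside [lo, hi)."""
    if not (lo <= x < hi):
        return None
    return min(int((x - lo) / (hi - lo) * nbins), nbins - 1)


class ListHistogram1D(IHistogram1D):
    """Hypothetical implementation 1: dense list of bin contents."""

    def __init__(self, title, nbins, lo, hi):
        self.title, self.nbins, self.lo, self.hi = title, nbins, lo, hi
        self.bins = [0.0] * nbins
        self.n = 0

    def fill(self, x, weight=1.0):
        self.n += 1
        i = _bin_index(x, self.nbins, self.lo, self.hi)
        if i is not None:
            self.bins[i] += weight

    def entries(self):
        return self.n


class SparseHistogram1D(IHistogram1D):
    """Hypothetical implementation 2: only non-empty bins are stored."""

    def __init__(self, title, nbins, lo, hi):
        self.title, self.nbins, self.lo, self.hi = title, nbins, lo, hi
        self.bins = {}
        self.n = 0

    def fill(self, x, weight=1.0):
        self.n += 1
        i = _bin_index(x, self.nbins, self.lo, self.hi)
        if i is not None:
            self.bins[i] = self.bins.get(i, 0.0) + weight

    def entries(self):
        return self.n


# Stand-in for run-time loading of shared libraries (an assumption).
IMPLEMENTATIONS = {"list": ListHistogram1D, "sparse": SparseHistogram1D}


class HistoFactory:
    """Picks the implementation by name at run time."""

    def __init__(self, impl_name):
        self._cls = IMPLEMENTATIONS[impl_name]

    def create1D(self, title, nbins, lo, hi):
        return self._cls(title, nbins, lo, hi)


def analyse(factory):
    """User code: touches only the interface, never a concrete class."""
    hist = factory.create1D("track quality", 100, 0.0, 10.0)
    for x in (0.5, 2.5, 2.6, 11.0):
        hist.fill(x)
    return hist.entries()
```

Calling analyse(HistoFactory("list")) or analyse(HistoFactory("sparse")) runs the same user code against either implementation.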

12
The AIDA project
  • AIDA project (Abstract Interfaces for Data
    Analysis) was initiated at the HepVis99 workshop
    in Orsay
  • Presently active: mainly developers of existing
    packages
  • Tony Johnson (JAS)
  • Andreas Pfeiffer (Lizard/Anaphe)
  • Guy Barrand (OpenScientist)
  • Mark Dönszelmann (Wired)
  • Developers from LHCb/Gaudi
  • more on AIDA tomorrow ...

13
Layered Approach
  • Basic functionalities (histograms, fitting,
    etc.) are available as individual C++ class
    libraries.
  • Easy to replace one part without throwing away
    everything
  • Objectivity/DB to provide persistence
  • HepODBMS library (insulating layer, tags)
  • Histogram library (HTL)
  • Fitting libraries (Gemini, HepFitting)
  • Graphics libraries (Qt, Qplotter)
  • Insulate components through Abstract Interfaces
  • wrapper layer to implement Interfaces in terms
    of existing libs
  • Apply s/w quality control tools
  • code checking, testing

14
Anaphe Components Overview
15
Anaphe Internals Abstract Interfaces
16
Anaphe components
17
Basic 3D Graphic Libraries
  • OpenGL (basic graphics)
  • De-facto industry standard for basic 3D graphics
  • Used in CAD/CAE, games, VR, medical imaging
  • OpenInventor (scene mgmt.)
  • OO 3D toolkit for graphics
  • Cubes, polygons, text, materials
  • Cameras, lights, picking
  • 3D viewers/editors, animation
  • Based on OpenGL/MesaGL

18
2D Graphics libraries
  • Qt
  • multi-platform C++ GUI toolkit
  • C++ class library, not a wrapper around C libs
  • superset of Motif and MFC
  • available on Unix and MS Windows
  • no change for developer
  • commercial but with public domain version
  • www.troll.no
  • Qplotter
  • add-on functionality for HEP
  • HIGZ/HPLOT

19
Mathematical Libraries
  • NAG (Numerical Algorithms Group) C Library
  • Covers a broad range of functionality
  • Linear algebra
  • differential equations
  • quadrature, etc.
  • Special functions of CERNLIB added to Mark-6
    release
  • mostly for theory and accelerator
  • Quality assurance
  • extensive testing done by NAG
  • www.nag.com

20
CLHEP - foundation classes
  • HEP foundation class library
  • Random number generators
  • Physics vectors
  • 3- and 4- vectors
  • Geometry
  • Linear algebra
  • System of units
  • more packages recently added
  • will continue to evolve
  • wwwinfo.cern.ch/asd/lhc/clhep/

21
Histograms: the HTL package
  • Histograms are the basic tool for physics
    analysis
  • Statistical information of density distributions
  • Histogram Template Library (HTL)
  • design based on C++ templates
  • Modular: separation between sampling and
    display
  • Extensible: open for user-defined binning
    systems
  • Flexible: supports transient/persistent at the
    same time
  • Open: extensive use of abstract interfaces
  • recent addition: 3D histograms
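
A minimal Python sketch of two of the design points above: sampling kept separate from display, and a histogram open to user-defined binning. None of these names come from HTL; UniformBinning, VariableBinning, Histogram1D and render_text are hypothetical and only illustrate the idea.

```python
import bisect


class UniformBinning:
    """Equal-width bins over [lo, hi)."""

    def __init__(self, nbins, lo, hi):
        self.nbins, self.lo, self.hi = nbins, lo, hi

    def index(self, x):
        if not (self.lo <= x < self.hi):
            return None
        i = int((x - self.lo) / (self.hi - self.lo) * self.nbins)
        return min(i, self.nbins - 1)


class VariableBinning:
    """User-defined bin edges; a value on an edge goes to the upper bin."""

    def __init__(self, edges):
        self.edges = sorted(edges)
        self.nbins = len(self.edges) - 1

    def index(self, x):
        if not (self.edges[0] <= x < self.edges[-1]):
            return None
        return bisect.bisect_right(self.edges, x) - 1


class Histogram1D:
    """Sampling only: knows how to fill, not how to draw."""

    def __init__(self, binning):
        self.binning = binning
        self.contents = [0.0] * binning.nbins
        self.out_of_range = 0.0

    def fill(self, x, weight=1.0):
        i = self.binning.index(x)
        if i is None:
            self.out_of_range += weight
        else:
            self.contents[i] += weight


def render_text(hist):
    """Display only: one line per bin, one star per unit of content."""
    return "\n".join(
        "%d %s" % (i, "*" * int(c)) for i, c in enumerate(hist.contents)
    )
```

Any object with an nbins attribute and an index(x) method can serve as a binning, so a new binning system needs no change to Histogram1D or to the display code.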

22
Fitting and Minimization
  • Fitting and Minimization Library (FML)
  • common OO interface
  • NAG-C, MINUIT
  • based on Abstract Interfaces
  • IVector, IModelFunction, ...
  • fitting as a special case of minimization
  • minimize distance between data and model
  • replacement for HepFitting (and Gemini)
  • Gemini
  • common interface to minimizer engine
  • very thin layer
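
"Fitting as a special case of minimization" can be shown with a small Python sketch. It does not use FML, NAG-C or MINUIT: the chi-square objective is a standard definition, and a textbook golden-section search stands in for the minimizer engine. The names chi2, golden_section_minimize and fit_slope are hypothetical.

```python
def chi2(params, xs, ys, sigmas, model):
    """Distance between data and model: sum of squared, weighted residuals."""
    return sum(
        ((y - model(x, params)) / s) ** 2 for x, y, s in zip(xs, ys, sigmas)
    )


def golden_section_minimize(f, lo, hi, tol=1e-9):
    """Generic 1D minimizer for a unimodal function on [lo, hi]."""
    inv_phi = (5 ** 0.5 - 1) / 2
    a, b = lo, hi
    c = b - inv_phi * (b - a)
    d = a + inv_phi * (b - a)
    while abs(b - a) > tol:
        if f(c) < f(d):
            b = d
        else:
            a = c
        c = b - inv_phi * (b - a)
        d = a + inv_phi * (b - a)
    return (a + b) / 2


def fit_slope(xs, ys, sigmas):
    """Fit y = a * x by handing the chi-square to the generic minimizer."""
    def model(x, params):
        return params[0] * x

    def objective(a):
        return chi2((a,), xs, ys, sigmas, model)

    return golden_section_minimize(objective, -100.0, 100.0)
```

The minimizer knows nothing about data or models; the fit is just one particular objective function passed to it, which is the layering the slide describes.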

23
  • Opening bracket
  • Persistency

24
Object persistency: two concepts, serial and page
I/O
  • Sequential access to objects (streaming)
  • good in networking context or serial writes to
    file(s)
  • much like good old Fortran
  • often perceived to be simpler to implement
    (<<, >>)
  • Navigational access to objects (buffered)
  • I/O on demand for complex data models
  • location transparent (for user) access to object
  • typically by de-referencing of a smart pointer
  • optimized for (random) disk access (disks deliver
    pages)
  • sequential write to file(s) still ok
  • Both concepts need to take care about changes of
    the internal structure of the objects (schema
    evolution)
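
The two access styles can be contrasted in a short Python sketch. It is a toy under stated assumptions: pickle stands in for the object streaming, an in-memory dictionary stands in for disk pages, and ObjectStore and Ref are hypothetical names (Ref plays the role of the smart pointer).

```python
import io
import pickle


def stream_write(objects, fileobj):
    """Sequential access: objects are written one after another."""
    for obj in objects:
        pickle.dump(obj, fileobj)


def stream_read(fileobj):
    """Sequential access: objects come back in the order written."""
    while True:
        try:
            yield pickle.load(fileobj)
        except EOFError:
            return


class ObjectStore:
    """Toy store: object id -> serialized bytes; counts actual loads."""

    def __init__(self):
        self._pages = {}
        self.loads = 0

    def put(self, oid, obj):
        self._pages[oid] = pickle.dumps(obj)
        return Ref(self, oid)

    def load(self, oid):
        self.loads += 1
        return pickle.loads(self._pages[oid])


class Ref:
    """Smart-pointer stand-in: I/O happens on first dereference only."""

    def __init__(self, store, oid):
        self._store, self._oid = store, oid
        self._loaded, self._obj = False, None

    def deref(self):
        if not self._loaded:
            self._obj = self._store.load(self._oid)
            self._loaded = True
        return self._obj
```

With the stream, reaching the last object means reading everything before it; with the references, only the objects that are actually dereferenced are loaded.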

25
Architectural Issue: Persistency (Object-I/O)
  • Brings a completely new quality into the design
  • Objects now have a lifetime
  • don't delete until you really are sure you want
    to
  • persistency is a kind of intended memory leak
  • would like to see no difference between memory
    and disk
  • Layout of objects may change during (extended)
    life
  • schema evolution
  • additions/deletions of attributes
  • changes of inheritance relations

26
Architectural Issue: Persistency (Object-I/O)
(II)
  • Objects can be placed (clustering)
  • de-coupling of logical and physical view of data
  • Special care needed to ensure consistency in data
    set
  • avoid reading group of objects (tracks,
    events,...) for which writing/updating is not
    (yet) complete
  • clean up if only part of the objects are written
  • typically taken care of by using transactions
  • Complications possible in distributed computing
  • need to protect disk access now like memory
    access in past (Segmentation violation)

27
Physical Model and Logical Model
  • Physical model may be changed to optimise
    performance
  • Existing applications continue to work
    transparently !

28
Object Model
Thanks to Vincenzo Innocente (CMS)
29
Physical clustering
Thanks to Vincenzo Innocente (CMS)
30
  • Closing bracket
  • Persistency

31
Tags, Ntuples and Events
  • Tags - a special kind of Ntuple
  • Always associated with an underlying persistent
    store
  • Tags may be used to store ntuple-like data
  • extracted from all over the event
  • minPt, maxEmiss, nJets, nMuon, trigger, ...
  • Main use: speed up data selection for analysis
  • Tag simplifies selection without losing
    complexity
  • Events more complex than a tree structure (CWN)
  • lots of cross-references between classes,
    containers
  • Association from the Tag to the Event may be used
    to navigate to any other part of the Event
  • even from an interactive visualization program

32
AIDA compliance of Anaphe
  • Presently (Anaphe 3.x) only AIDA 1.0 compliant
  • Plan to implement AIDA 2.2 Interfaces by end 2001
    (Anaphe 4.x)
  • initially as wrappers to existing
    interfaces/packages
  • Will maintain 3.x for some time
  • ensures stability for users
  • Development will concentrate on 4.x
  • while AIDA will evolve further
  • Similar time schedule to JAS (Tony Johnson)
  • OpenScientist (Guy Barrand) already there

33
  • Lizard: a tool for Interactive Data Analysis

34
Interactive Data Analysis
  • Aim: OO replacement for PAW (at least)
  • analysis of ntuple-like data (Tags,
    Ntuples, ...)
  • visualisation of data (Histograms, scatter-plot,
    Vectors)
  • fitting of histograms (and other data)
  • access to experiment specific data/code
  • Maximize flexibility and re-use
  • Foresee customization/integration
  • allow use from within experiments s/w
  • Plan for extensions
  • code for now, design for the future
  • Ensure maintainability
  • use of s/w quality control tools

35
Lizard
  • An AIDA-compliant interactive analysis tool
  • Python scripting
  • Visualization with Qt
  • HTL histograms (via AIDA)
  • Persistence with Objectivity
  • Fitting with NAG Libraries (or Minuit)
  • Components available as shared libraries
  • independent of the scripting language
  • can also be used in C++ programs (Geant4)

36
Scripting - why
  • Typical use of scripting is quite different from
    programming (reconstruction, analysis, ...)
  • history: go back to where I was before
  • repetition/looping - with modifiable parameters
  • avoid "one size fits all" or using a power tool as
    a hammer
  • rapid prototyping in scripting language
  • quick turn-around times
  • performance critical code in core language
  • exploit richer set of features/functionality
    (e.g. templates in C++)
  • scripting languages usually less susceptible to
    changes than mainstream languages
  • potentially longer lifetimes

37
Python - why
  • Python - OO (scripting) language
  • no strange !-variables
  • sensitive to indentation
  • Easier for users
  • as Java
  • Lots of user supplied modules available and ready
    for use
  • scientific, numerics, graphics, GUI, network, OS,
    games, DBs, ...
  • example: http://www.vex.net/parnassus/
  • Parnassus totals: 1173 items in 49 categories.
  • Also usable in Java (Jython)
  • used in JAS for scripting
  • minimize changes needed within AIDA compliant
    environments

38
Python - how
  • SWIG to (semi-) automatically create connection
    to chosen scripting language
  • allows flexibility to choose amongst several
    scripting languages
  • Python, Perl, Tcl, Guile, Ruby, (Java)
  • Very easy to use
  • swig -c++ -python -shadow -c myClass.h
  • create shared lib from myClass.cpp and
    myClass_wrap.c
  • start python and import myClass to use it
  • Very easy to extend
  • simply inherit from swiggified class in python
  • modifications can later be fed back into C++
  • performance, type safety, special language
    features (templates), ...

39
PAW -> Lizard translation
  • Ntuple projection: Lizard
  • lizard --useHBook
  • :-) nt = ntm.findNtuple("higgscand.hbkcands")
  • :-) nplot1D(nt, "mass", "quality==5 && cut > 198")
  • Ntuple projection: PAW
  • pawX11
  • paw> h/file 1 higgscand.hbk
  • paw> nt/pl 10.mass quality=5.and.cut>198
  • Assuming file higgscand.hbk contains ntuple with
    number 10 and title cands

Any valid C++ expression
40
Example script (ntuple)
# get list of names of all tuples from tuplemanager
ntm.listTuples()
# retrieve tuple by name
nt1 = ntm.findNtuple("Charm1")
# create 1D histos to project into
h1 = hm.create1D(10, "mass", 100, 0., 5000.)
h2 = hm.create1D(20, "mass for pt1>10", 100, 0., 5000.)
# project the attribute "MASS" into histo h1 without cut ("")
nt1.project1D(h1, "", "MASS")
# project the attribute "MASS" into histo h2 with cut ("PT1>10")
nt1.project1D(h2, "PT1>10", "MASS")
42
Lizard History and Present Status
  • Started after CHEP-2000
  • Full version out since June 2001
  • PAW like analysis functionality plus
  • on-demand loading of compiled code using shared
    libraries
  • gives full access to the experiment's analysis code
    and data
  • based on Abstract Interfaces
  • flexible and extensible
  • License free version since Sep. 2001
  • HBook for RWNtuples and Histogram storage
  • Minuit as minimizer engine

43
Users and Collaborations
  • AIDA spoken here!
  • IGUANA (CMS visualization)
  • GAUDI (LHCb/HARP) framework
  • ATHENA (Atlas) framework
  • Analyzer modules in Geant 4
  • JAS
  • Open Scientist
  • you?

44
  • Software quality control

45
Software quality control
  • Using tools for testing/checking has started
  • Insure, CodeWizard
  • Package dependencies: Ignominy
  • Set of perl and shell scripts by Lassi Tuura
    (CMS)
  • Ignominy scans:
  • Make dependency data produced by the compilers
    (.d files)
  • Source code for includes (resolved against the
    ones actually seen)
  • Shared library dependencies (ldd output)
  • Defined and required symbols (nm output)
  • And maps:
  • Source code and binaries into packages
  • include dependencies into package dependencies
  • Unresolved/defined symbols into package
    dependencies
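
One of the mapping steps above, turning include dependencies into package dependencies, can be sketched in Python. This is not Ignominy's code (which is a set of perl and shell scripts); the function names, the rule that the first path component is the package, and the file names in the usage note are all hypothetical.

```python
def package_of(path):
    """Assumption: the first path component names the package."""
    return path.split("/", 1)[0]


def package_dependencies(file_includes):
    """Collapse file-level includes into package-level dependencies.

    file_includes maps a source file to the list of headers it includes.
    The result maps each package to the set of other packages it needs.
    """
    deps = {}
    for src, includes in file_includes.items():
        src_pkg = package_of(src)
        deps.setdefault(src_pkg, set())
        for inc in includes:
            inc_pkg = package_of(inc)
            if inc_pkg != src_pkg:  # includes inside a package do not count
                deps[src_pkg].add(inc_pkg)
    return deps
```

For example, a made-up input such as {"Lizard/Plot.cpp": ["HTL/Histo1D.h", "Qt/qwidget.h"]} gives a Lizard package that depends on HTL and Qt.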

ignominy: dishonour, disgrace, shame; infamy; the
condition of being in disgrace, etc. (Oxford
English Dictionary)
46
Ignominy Analysis of Anaphe
  • Distribution of tools and utilities for LHC era
    physics
  • Combination of commercial, free and HEP software
  • Claims to be a toolkit
  • Seems to live up to its toolkit claims
  • Good work on modularity
  • Clean design is evident in many places
  • Dependency diagrams often split naturally into
    functional units

Thanks to Lassi Tuura (CMS)
47
Package Metrics
  • Size: total amount of source code (not
    normalised across projects!)
  • ACD: average component dependency (libraries
    linked in)
  • CCD: cumulative component dependency,
    sum of single-package component dependencies over
    whole release
  • NCCD: measure of CCD compared to a balanced
    binary tree
  • A good toolkit's NCCD will be close to 1.0
  • < 1.0: structure is flatter than a binary tree
    (independent packages)
  • > 1.0: structure is more strongly coupled
    (vertical or cyclic)
  • Aim: NCCD ≈ 1 for given software/functionality
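
The slide gives no formulas, so the following Python sketch assumes the usual convention from Lakos's "Large-Scale C++ Software Design": each package counts itself in its own dependency, and the reference CCD of a balanced binary tree with N packages is (N + 1) * log2(N + 1) - N. The function names are hypothetical.

```python
import math


def reachable(deps, start):
    """All packages needed by start, including start itself."""
    seen = set()
    stack = [start]
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(deps.get(node, ()))
    return seen


def metrics(deps):
    """Return (CCD, ACD, NCCD) for a package dependency graph.

    deps maps every package to the packages it depends on directly;
    every package must appear as a key, even with no dependencies.
    """
    n = len(deps)
    ccd = sum(len(reachable(deps, pkg)) for pkg in deps)
    acd = ccd / n
    ccd_balanced_tree = (n + 1) * math.log2(n + 1) - n
    return ccd, acd, ccd / ccd_balanced_tree
```

Under these assumptions seven fully independent packages give an NCCD below 1, a seven-package balanced binary tree gives exactly 1, and a seven-package chain gives a value above 1, matching the reading of the index on the slide.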

Thanks to Lassi Tuura (CMS)
48
Metrics: NCCD vs Cycles
  • NCCD (spaghetti index)
  • ≈ 1.0: good toolkit
  • < 1.0: independent packages
  • > 1.0: strongly coupled

[Plot; points labelled ATLAS, ROOT, ORCA, G4, COBRA, Anaphe, IGUANA; regions: Toolkits, Frameworks; annotation: includes Fortran]
Thanks to Lassi Tuura (CMS)
49
Future enhancements
  • Access to other implementations of components
  • HBOOK CWNtuples
  • Reading of ROOT (> V3.0) files
  • similar to Tony Johnson's (Java) RootIO package
  • AIDA Ntuple/Histo store
  • optimized for Ntuples, Histograms as (compressed)
    XML
  • Communication with Java tools/packages (JAS,
    Wired)
  • via AIDA
  • Adding other scripting languages
  • Perl, Tcl, CINT?

50
Challenge: Distributed Computing
  • Motivation
  • move code to data
  • parallel analysis
  • Techniques
  • services via AI
  • late binding
  • plug-in architecture
  • End-user (Lizard)
  • look-and-feel of local analysis
  • R&D started and first prototype available soon
  • CORBA based


51
Summary
  • The architecture of Anaphe shows some important
    items for flexible and modular data analysis
  • weak coupling between components through use of
    Abstract Interfaces
  • basic functionality is covered by individual C++
    class libraries
  • emphasis on usability and maintainability
  • Major criteria are flexibility, extensibility and
    interoperability
  • recent example: GEANT-4 examples (based on AIDA)
  • Lizard is an Interactive Data Analysis Tool based
    on Anaphe components and the Python scripting
    language (through SWIG)
  • Lizard is young but has a very solid base in mature
    Anaphe libraries
  • real plug-in structure
  • Software quality control is important
  • tools help to optimize dependencies / minimize
    maintenance effort

52
More information
  • cern.ch/Anaphe
  • cern.ch/Anaphe/Lizard
  • aida.freehep.org/
  • cern.ch/DB
  • wwwinfo.cern.ch/asd/lhc/clhep/

53
  • Additional slides

54
Analysis of Geant4
  • Fairly large C++ project
  • Very fine-grained (and multi-level) package
    structuring
  • Seems quite clean from the preliminary analysis
  • Fine package subdivision helps in many ways but
    makes analysis and code understanding more
    complicated
  • One subsystem seems strongly coupled and needs
    attention
  • Need to study the use of the internal
    command system

Thanks to Lassi Tuura (CMS)
55
Analysis of ROOT
  • ROOT developers have done a formidable job of
    breaking binary (shared library) dependencies,
    but ...
  • For example: by static analysis, nothing seems to
    use the postscript package directly (no incoming
    dependencies), but there is this code:
  • void TPad::Print(const char *filename, Option_t
    *option)
  • TVirtualPS *psave = gVirtualPS;
  • if (gROOT->LoadClass("TPostScript","Postscript"))
    return;
  • gROOT->ProcessLineFast("new TPostScript()");
  • gVirtualPS->Open(psname,pstype);
  • gVirtualPS->SetBit(kPrintingPS);
  • Taking these and global objects into account
    makes the dependency diagrams very different
  • Sign of fast growth? Need a next evolutionary
    step?
  • So coherent that replacing parts could get
    painful

Thanks to Lassi Tuura (CMS)
56
Analysis of ROOT
[Dependency diagrams; panel labels: Binary, Source, Logical, Real; Binary only]
Thanks to Lassi Tuura (CMS)
57
Metrics: NCCD vs ACD
[Plot; points labelled ATLAS, ROOT, ORCA, G4, COBRA, IGUANA, Anaphe; regions: Toolkits, Frameworks]
Thanks to Lassi Tuura (CMS)
58
Metrics: NCCD vs Size
[Plot; points labelled ATLAS, ROOT, ORCA, G4, COBRA, IGUANA, Anaphe; regions: Toolkits, Frameworks]
Thanks to Lassi Tuura (CMS)
59
Metrics: NCCD vs AID
[Plot; points labelled ATLAS, ROOT, ORCA, COBRA, G4, Anaphe, IGUANA; regions: Toolkits, Frameworks]
Thanks to Lassi Tuura (CMS)
60
Metrics: Packages vs Size
[Plot; points labelled ATLAS, ORCA, G4, COBRA, IGUANA, Anaphe, ROOT; regions: Toolkits, Frameworks]
Thanks to Lassi Tuura (CMS)
61
Metrics: Packages vs Size
[Plot; points labelled ATLAS, ORCA, G4, COBRA, IGUANA, ROOT, Anaphe; regions: Toolkits, Frameworks]
Thanks to Lassi Tuura (CMS)
62
Scripting in Lizard
[Diagram labels: User; Python; Controller; Shadow classes (automatically generated by SWIG); C++ interfaces (AIDA Interfaces); C++ implementations (Anaphe implementations)]
63
Software life cycle for LHC expts.
  • LHC starts 2006
  • at least 10 yr of running
  • additionally at least 5 yr of data analysis

64
Lifetime of LHC software 25 yrs