R in High Energy Physics (A somewhat personal account)
1
R in High Energy Physics (A somewhat personal account)
Adam Lyon
Fermi National Accelerator Laboratory, Computing Division - DØ Experiment
PHYSTAT Workshop on Statistical Software, MSU - March 2004
  • Outline
  • Some Background
  • Why is R interesting to us?
  • Some non-analysis examples
  • Using R in HEP
  • Some thoughts on where this can go

2
Some background on me
  • Graduate student on DØ (400 person Fermilab
    experiment)
  • Marc Paterno and I were some of the first to use
    C++ for analysis at DØ (days of PAW)
  • and among the first at DØ to use Bayesian
    statistics for limit calculation
  • Postdoc on CLEO (Cornell) (200 person
    experiment)
  • Used PAW, ROOT, and Mathematica for several
    analyses
  • Involved in experiment's transition to C++
  • Back to DØ (now 700 person Fermilab experiment)
    as an associate scientist in Computing Division
  • Used R for non-HEP analysis applications
  • Pondering (with Marc Paterno and Jim Kowalkowski,
    also of FNAL/CD) how R can be made useful in HEP
    analyses

3
First use of R
  • Marc (C++ and statistics expert, trouble maker)
    came across R and showed it to Jim and me.
  • Looked neat but didn't have any reason to use it
    until
  • Monitoring of DØ's Data Handling System
  • DØ has 601 Terabytes of data on tape
  • SAM (DØ/CDF joint project) is our
  • File storage system (knows where all files live)
  • File delivery system (gets those files to you
    worldwide)
  • File cataloging system (stores meta-data for file
    cataloging)
  • Analysis bookkeeping system (remembers what you
    did)

4
Data Handling at DØ
  • SAM typically delivers 150 TB of data to users
    per month
  • It's perhaps a 0th generation GRID
  • SAM is a very complicated system
  • No monitoring except for huge dumps of text and
    log files
  • Monitoring was sorely needed -- lots of things
    can go wrong
  • Usage statistics were needed for future planning
    and discovery of bottlenecks

5
samTV
6
Monitoring with R
  • Turn a text file like this (from parsing big log
    files)

    station  procId   time        event            fromStation   dur
    cabsrv1  2983599  1074593577  OpenFile         enstore       9343
    cabsrv1  2983599  1074604748  RequestNextFile  NA            11171
    cabsrv1  2983599  1074609598  OpenFile         enstore       4850
    cabsrv1  2983599  1074620392  RequestNextFile  NA            10794
    cabsrv1  2983599  1074620392  OpenFile         fnal-cabsrv1  0
    cabsrv1  2983599  1074631505  RequestNextFile  NA            11113
    cab      3085189  1076666381  OpenFile         cab           415
    cab      3085189  1076673379  RequestNextFile  NA            6998
    cab      3085189  1076673379  OpenFile         cab           0
    cab      3085189  1076680426  RequestNextFile  NA            7047
    cab      3085189  1076680753  OpenFile         enstore       327
    cab      3085189  1076687836  RequestNextFile  NA            7083
    cab      3085189  1076687836  OpenFile         enstore       0
    cab      3085189  1076694821  RequestNextFile  NA            6985
    cab      3085189  1076695114  OpenFile         cab           293
    cab      3085189  1076702701  RequestNextFile  NA            7587
    cab      3085189  1076702701  OpenFile         enstore       0

7
Into plots like this
  • R code

    library(lattice)
    data <- read.table("data.dat", head=T)
    w <- data[data$event == "OpenFile", ]
    w$min <- w$dur/60.0
    bwplot( fromStation ~ min | station,
            data=w, subset=(min < 60),
            xlab="Minutes",
            main="Wait Time for " )

8
Box and Whisker Plots
9
Why is R interesting to us?
  • Seems to be the "State of the Art" in statistics
  • Enormous library of user contributed add-on
    packages
  • Huge number of statistical tests, fitting,
    smoothing,
  • More advanced stuff too: genetic algorithms,
    support vector machines, kriging (would have been
    useful for my thesis!)
  • Advanced graphics based on William Cleveland's
    Visualizing Data
  • SQL (MySQL, Oracle, SQLite, Postgres, ODBC), XML
  • Hooks to COM and CORBA
  • Interfaces for Python, Perl, Tk (GUIs), Java
  • Pretty easy interface to C, C++, Fortran
  • Some nice conveniences (R can save its state)
  • It's multiplatform
  • It's free!

10
The R Language
  • "Not unlike S"
  • Author (John Chambers) received 1998 ACM Software
    System Award
  • The ACM's citation notes that Dr. Chambers' work
    "will forever alter the way people analyze,
    visualize, and manipulate data . . . S is an
    elegant, widely accepted, and enduring software
    system, with conceptual integrity, thanks to the
    insight, taste, and effort of John Chambers."
  • (http://www.acm.org/announcements/ss99.html)
  • I guess he did good!

11
Why is the R/S Language Interesting?
  • "Programming with Data"
  • The fundamental purpose of the language (as I see
    it) is to provide general tools for efficient
    data manipulation and analysis while allowing
    extensions to those tools to be programmed
    easily.
  • Has a specific purpose. You wouldn't write your
    online data acquisition system in R/S. But
    analyzing output from online monitoring is
    certainly a good task for it.
  • R/S is a functional language
  • vectorized functions, apply, lazy evaluation
  • R/S is an object oriented language (but with a
    functional bent)
  • Functions with the same name are dispatched based
    on argument types (has notions of inheritance and
    other OO features)
  • Is R/S ideal? Don't know, but we've been very
    surprised by how some complicated tasks can be
    accomplished with astonishingly simple code
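As a small illustration of that functional-with-an-OO-bent flavor (the values below are invented for illustration, not from the talk): vectorized comparisons replace element loops, and a generic function like summary() dispatches on the class of its argument.

```r
# Vectorized selection: no explicit loop over elements
pt <- c(31.2, 47.5, 18.9, 26.0)      # made-up electron pt values
good <- pt > 25.0                    # boolean vector: TRUE TRUE FALSE TRUE
mean(pt[good])                       # mean of the passing values

# OO with a functional bent: the generic summary() picks a
# method based on the class of its argument
summary(pt)                          # numeric method
summary(factor(c("e", "mu", "e")))   # factor method: counts per level
```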

12
Some non-analysis examples
  • samTV: Plot the mean wait times by file source
    for each SAM station

    > nrow(w)
    [1] 399135
    > w[1:2, ]
      station  procId       time    event fromStation  dur   min
    1 cabsrv1 2983599 1074593577 OpenFile     enstore 9343 155.7
    2 cabsrv1 2983599 1074609598 OpenFile     enstore 4850  80.8
    > w.means <- aggregate(w$min,
                           list(station=w$station,
                                src=w$fromStation),
                           mean)
    > w.means[1:2, ]
      station src           x
    1     cab cab 6.861695109
    2 cabsrv1 cab 8.171100917

2.2 seconds
13
samTV cont'd
  • > dotplot( src ~ x | station, data=w.means,
        scales=list(cex=1.3),
        main=list("Mean Process Wait Times", cex=1.5),
        xlab=list("Wait time (minutes)", cex=1.5),
        cex=1.7,
        par.strip.text=list(cex=1.7) )

14
Non-analysis Examples
  • We've found R to be great for slogging through
    text files and database query results to make
    extremely useful and pretty plots

15
Non-analysis applications
  • Performance of DB server middleware (Marc
    Paterno)
  • Data transfer speed vs. data size for two
    different servers
  • Fit to model of startup time plus constant
    throughput

mod.pollux <- nls( speed ~ alpha*(1 - alpha*beta/(alpha*beta + mb)),
                   data=client.pollux,
                   start=c(alpha=2.0, beta=0.50), trace=T )
16
What have we learned so far?
  • There seems to be an "R way"
  • Do it the functional way!
  • Use the apply commands and vectorized functions
    instead of for loops
  • Higher order functions
  • One of R's strengths is its user contributions
  • but this means some functionality is repeated
    (e.g. three histogram functions -- albeit each
    serves a slightly different purpose)
  • The learning curve is long (R can do lots!)
  • But there are extensive manuals, online
    documentation, and published books and papers
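A sketch of the "R way" contrast described above (durations and station names invented for illustration): the loop and the vectorized line compute the same thing, and sapply over split groups is the higher-order idiom for per-group summaries.

```r
dur <- c(9343, 11171, 4850, 10794)          # durations in seconds (invented)
station <- c("cab", "cab", "cabsrv1", "cabsrv1")

# The for-loop way (works, but un-R-like):
mins.loop <- numeric(length(dur))
for (i in seq_along(dur)) mins.loop[i] <- dur[i] / 60

# The R way: one vectorized expression
mins <- dur / 60

# Higher-order functions: apply mean over per-station groups
sapply(split(mins, station), mean)          # named vector of group means
```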

17
R in HEP
  • We are aware of no one using R, or any other
    statistical package, in the HEP community. Why?
  • Our needs are quite specific and
  • My Postdoc supervisor (Ed Thorndike) "Trust no
    one"
  • "Or at least trust no one outside of HEP"
  • With very few exceptions, all of our scientific
    software tools are written within the community.
    Many people write their own, reinventing lots of
    wheels
  • Most are unaware of tools from the statistics
    community and how they could apply to us
  • Many of us (including me) have little to no
    formal statistical training and had no exposure
    to statistical tools (e.g. SAS, SPSS, MATLAB, R)

18
R in HEP
  • Maybe this is changing, a little
  • Root, the most widely used HEP analysis tool, has
    TGraphSmooth, which implements the Loess smoother
    (R functions translated into C++)
  • Software is getting more complicated (we are
    doing lots more than just whipping up quick and
    dirty Fortran). Some realization that we can't do
    it all ourselves (e.g. databases, SAM uses
    consultants)
  • But one problem: our datasets tend to be huge

19
HEP datasets and R
  • R seems to want to hold everything in memory
  • (recently discovered externalVector; haven't
    tried it yet)
  • In HEP, we typically run successive skims to
    reduce the data size (601 TB down to 100s of Meg
    or a few Gig)
  • Hard trade offs between size and utility of skims
  • Usually skims are output to a more convenient
    format (e.g. Root files)
  • For example, I use a 4th generation skim with 412
    variables and 232K rows (1.9 Gig)
  • Even our last stage skims are probably too large
    for R
  • Efficient handling of large datasets is one
    reason why Root is very successful

20
Three strategies for reading HEP data in R
  • Realize that I don't need all 412 variables for
    all rows in memory at the same time
  • In fact, we usually concentrate on just a few
    variables at a time
  • Perform even further event requirements
  • If data is small enough, bring it into R
  • If we can reduce the data to something R can
    hold, bring that subset into R -- then we have
    the full power of R
  • Perhaps this means using that data for a while,
    and loading a new set to tackle another aspect of
    the problem
  • If we can't even do the above, then have some R
    apparatus read the data in one row at a time and
    update an R object (e.g. histograms). But then
    you don't get the full power of R
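A minimal sketch of that third strategy (every name and value here is hypothetical, not the actual apparatus): stream one row at a time and update running histogram counts, so only a single row is ever resident in memory.

```r
# readOneRow() stands in for whatever fetches the next event;
# here it is faked with a closure over a small data frame.
make.reader <- function(data) {
  i <- 0
  function() {
    i <<- i + 1
    if (i > nrow(data)) NULL else data[i, ]
  }
}

fake <- data.frame(pt = c(12, 31, 47, 8, 26))  # invented pt values
readOneRow <- make.reader(fake)

breaks <- seq(0, 50, by = 10)          # histogram bin edges
counts <- numeric(length(breaks) - 1)  # running bin contents

# Process one row at a time, updating the histogram object
while (!is.null(row <- readOneRow())) {
  bin <- findInterval(row$pt, breaks)
  if (bin >= 1 && bin < length(breaks)) counts[bin] <- counts[bin] + 1
}
counts
```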

21
Reading Root files into R
  • Do it the R way!
  • root.apply("myTree", "myFile.root", myFunction)
  • C++ and R code written over the evenings of one
    weekend (my wife was out of town, dog was asleep)
  • You supply an R function that receives an entry
    from your Root file (as a list).
  • Function can make requirements on the data and
    return nothing if they fail
  • Function returns a new list of variables to pass
    to R. Can be new derived variables not in the
    Root entry
  • Return of root.apply is a data frame (an R
    database)

22
Example -- selecting dielectrons
  • Select events with two good electrons
  • Only the EM and MET branches are needed
    selectDiE <- function(entry) {
      # Turn electron data into an R data frame
      es <- as.data.frame( entry$EM )
      # Make the requirements for a good electron;
      # joining (AND-ing) the cuts returns a boolean
      # vector, e.g. (T,T,F,T,F)
      goodECuts <- ( es$id == 10 |
                     abs(es$id) == 11 ) &
                   es$pt > 25.0 & es$emfrac > 0.9 &
                   es$fiducial == 1
      # If nothing passed, cut the entry
      if ( ! any(goodECuts) ) return(NULL)
      # Get electron etas (the eta cut definition did
      # not survive in this transcript)
      etas <- abs(es$eta)
      # If no electrons had a good eta, then stop
      if ( ! any(goodEtaCuts) ) return(NULL)
      # Get the list of electrons meeting all cuts
      goodEs <- goodEtaCuts & goodECuts
      # Make a new data frame with the passing
      # electrons; are there 2 or more?
      goodEsDF <- es[goodEs, ]
      if ( nrow(goodEsDF) < 2 ) return(NULL)
      # Construct the return list: order the
      # electrons by decreasing pt
      goodElectronsOrder <- order( -goodEsDF$pt )
      e1 <- goodEsDF[ goodElectronsOrder[1], ]
      names(e1) <- paste("e1", names(e1), sep=".")
      e2 <- goodEsDF[ goodElectronsOrder[2], ]
      # ... (remainder cut off on the slide)
    }

24
Analyzing dielectrons
  • d <- root.apply("Global", "mydata.root",
                    selectDiE)
  • Handed back a data frame with the variables I
    wanted.
  • Can now attack this data with the full power of R

25
Dielectrons
  • > d <- root.apply("Global", "mydata.root", selectDiE)
  • > given.met <- equal.count(d$met,
        number=4, overlap=0.1)
  • > summary(given.met)
    Intervals:
             min       max count
    1 0.05888367  4.103577  3044
    2 3.84661865  6.498108  3045
    3 6.21868896  9.914124  3046
    4 9.42181396 88.125061  3043
    Overlap between adjacent intervals:
    [1] 307 306 308
  • > xyplot(e2.pt ~ e1.pt | given.met,
        data=d)

26
Extracting signal and background from data
  • (From Marc Paterno)
  • Given a data sample, extract the amount of signal
    and background
  • Bump fitting
  • A common HEP problem
  • Try a MC example
  • Generate data based on a signal distribution (a
    Breit-Wigner/Cauchy with mass and width) and a
    background distribution (1/(a+bx)^3)
  • Fit this data with the background and signal
    distributions, but with unknown parameters
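The background shape can be sampled by inversion. This derivation is not on the slides, but it is the standard inverse-CDF argument and matches the generator and density code on the following slides: integrating p(x) = 2a^2 b/(a+bx)^3 over [0, x) gives

```latex
F(x) = \int_0^x \frac{2a^2 b\,\mathrm{d}t}{(a+bt)^3}
     = 1 - \frac{a^2}{(a+bx)^2},
\qquad
F^{-1}(u) = \frac{a}{b}\left(\frac{1}{\sqrt{1-u}} - 1\right),
```

so feeding a uniform random variable u on [0,1) through F^{-1} produces values distributed as the background.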

27
Bump Fitting
  • Generate the background distribution
  • bf returns a function that, when given a uniform
    random variable on [0,1), returns the background
    distribution with parameters a and b
  • rbackground generates the distribution for n
    values
  • Clever use of higher order functions and
    vectorized functions

    bf <- function(a, b) {
      function(x) {
        temp <- 1 - x
        (a/b)*(1/sqrt(temp) - 1)
      }
    }
    rbackground <- function(n, a, b) {
      transform <- bf(a, b)
      transform(runif(n))
    }

28
Bump Fitting
  • Generate the signal
  • Generate n random Breit-Wigner values
  • Require that distribution be positive and less
    than max. Throw away values that fail
  • Recursively call function to make up the amount
    that was lost
  • Make the data
  • Join the signal and background into one
    distribution
    rsignal <- function(n, mass,
                        width, max) {
      temp <- rcauchy(n, mass, width)
      temp <- temp[temp > 0 & temp < max]
      num.more <- n - length(temp)
      if (num.more > 0) {
        more <- rsignal( n - length(temp),
                         mass, width, max )
        temp <- append(temp, more)
      }
      temp
    }
    rexperiment <- function(nsig, mass,
                            width, nback, a, b) {
      append( rsignal(nsig, mass, width,
                      a*b/2),
              rbackground(nback, a, b) )
    }

29
Bump fitting
  • Use an unbinned maximum likelihood fitter
    (fitdistr, from MASS)
  • Rprof significantly sped up the fit (replace ...)

    dbackground <- function(x, a, b) {
      d <- a + b*x
      2*a*a*b/(d*d*d)
    }
    mydistr <- function(x, f, m, s, a, b) {
      (1-f)*dbackground(x,a,b) + f*dcauchy(x,m,s)
    }
    fres2 <- fitdistr( data, densfun=mydistr,
                       start=list(f=FRAC, m=40.0, s=3.0,
                                  a=100, b=2.) )

30
Bump Fitting
  • True distribution total
  • Histogram is generated data
  • Signal fit
  • Background fit
  • Total fit
  • Bottom plot is of residuals (true-fit)

31
What are we considering next?
  • Continue to learn more about R
  • Further Develop the "Three Strategies"
  • Explore doing a physics analysis in R
  • Summary
  • We are exploring using R, a statistical analysis
    package from the statistics community, in an HEP
    environment
  • R has already proven useful for analyzing
    monitoring and benchmarking data
  • We have ideas on how R can be used to read large
    datasets
  • We've done some "proof of principle" studies of
    physics analysis with R
  • As we learn more about R, we expect to be more
    surprised at its capabilities

32
Options for R and Root Interfacing (after discussions)
  • no interest from R community in non-I/O functions
    of Root
  • In order of work required
  • 1) R and Root remain separate-- use the more
    appropriate tool for the task. Use text files to
    communicate between the two if necessary.
  • 2) Root loads R's math and low level statistical
    libraries as shared objects
  • Minimalist approach for some functionality
  • Some access to the math and statistics C
    functions from R
  • These C functions take basic C types, so no
    translation necessary
  • But no upper level functions written in the R
    language available
  • 3) R and Root remain separate, but
  • R package to read Root Trees directly into R
    data frames
  • Still use best tool for particular task
  • Now easier to get HEP data into R
  • 4) Allow calling of selected high level R
    functions from within Root
  • Root runs the R interpreter
  • translation is necessary
  • R functions understand Root objects
  • Root understands R return objects
  • Expose only some R functions
  • may reduce amount of translation

33
More Advanced Integration Options
  • 5) R prompt from the Root prompt
  • R needs seamless knowledge of objects in current
    Root session
  • At end of R session, new R variables translated
    into Root objects
  • Root runs the R interpreter
  • Translation for all types of Root variables into
    R and all types of R variables returned to Root.
  • A major undertaking
  • 6) Root prompt from within R
  • Harder than 5): R is C but Root is C++
  • I don't see much interest in this
  • Things get interesting starting at 3)
  • I have a version 0.0.1 prototype for reading Root
    trees into R.
  • Required for all options above 3.
  • I'll try to work on this as time permits
  • Both Root and R interface to Python
  • Translate with Python as intermediary?
  • Not sure if that's performant enough