Title: R in High Energy Physics (A Somewhat Personal Account)
Adam Lyon
Fermi National Accelerator Laboratory, Computing Division - DØ Experiment
PHYSTAT Workshop on Statistical Software, MSU - March 2004
- Outline
- Some Background
- Why is R interesting to us?
- Some non-analysis examples
- Using R in HEP
- Some thoughts on where this can go
Some background on me
- Graduate student on DØ (400-person Fermilab experiment)
- Marc Paterno and I were among the first to use C++ for analysis at DØ (days of PAW)
- and the first at DØ to use Bayesian statistics for limit calculation
- Postdoc on CLEO (Cornell) (200-person experiment)
- Used PAW, ROOT, and Mathematica for several analyses
- Involved in the experiment's transition to C++
- Back to DØ (now a 700-person Fermilab experiment) as an associate scientist in the Computing Division
- Used R for non-HEP analysis applications
- Pondering (with Marc Paterno and Jim Kowalkowski, also of FNAL/CD) how R can be made useful in HEP analyses
First use of R
- Marc (C++/statistics expert and trouble maker) came across R and showed it to Jim and myself.
- Looked neat, but we didn't have any reason to use it until... monitoring of DØ's Data Handling System
- DØ has 601 Terabytes of data on tape
- SAM (a DØ/CDF joint project) is our
- File storage system (knows where all files live)
- File delivery system (gets those files to you worldwide)
- File cataloging system (stores meta-data for file cataloging)
- Analysis bookkeeping system (remembers what you did)
Data Handling at DØ
- SAM typically delivers 150 TB of data to users per month
- It's perhaps a 0th generation GRID
- SAM is a very complicated system
- There was no monitoring except for huge dumps of text and log files
- Monitoring was sorely needed -- lots of things can go wrong
- Usage statistics were needed for future planning and discovery of bottlenecks
samTV
Monitoring with R
- Turn a text file like this (from parsing big log files):

    station  procId   time        event            fromStation     dur
    cabsrv1  2983599  1074593577  OpenFile         enstore        9343
    cabsrv1  2983599  1074604748  RequestNextFile  NA            11171
    cabsrv1  2983599  1074609598  OpenFile         enstore        4850
    cabsrv1  2983599  1074620392  RequestNextFile  NA            10794
    cabsrv1  2983599  1074620392  OpenFile         fnal-cabsrv1      0
    cabsrv1  2983599  1074631505  RequestNextFile  NA            11113
    cab      3085189  1076666381  OpenFile         cab             415
    cab      3085189  1076673379  RequestNextFile  NA             6998
    cab      3085189  1076673379  OpenFile         cab               0
    cab      3085189  1076680426  RequestNextFile  NA             7047
    cab      3085189  1076680753  OpenFile         enstore         327
    cab      3085189  1076687836  RequestNextFile  NA             7083
    cab      3085189  1076687836  OpenFile         enstore           0
    cab      3085189  1076694821  RequestNextFile  NA             6985
    cab      3085189  1076695114  OpenFile         cab             293
    cab      3085189  1076702701  RequestNextFile  NA             7587
    cab      3085189  1076702701  OpenFile         enstore           0
Into plots like this
- R code:

    library(lattice)
    d <- read.table("data.dat", head=T)
    w <- d[d$event == "OpenFile",]
    w$min <- w$dur/60.0
    bwplot( fromStation ~ min | station,
            data=w, subset=(min < 60),
            xlab="Minutes",
            main="Wait Time for " )
Box and Whisker Plots
Why is R interesting to us?
- Seems to be the "State of the Art" in statistics
- Enormous library of user-contributed add-on packages
- Huge number of statistical tests, fitting, smoothing, ...
- More advanced stuff too: genetic algorithms, support vector machines, kriging (would have been useful for my thesis!)
- Advanced graphics based on William Cleveland's Visualizing Data
- SQL (MySQL, Oracle, SQLite, Postgres, ODBC), XML
- Hooks to COM and CORBA
- Interfaces for Python, Perl, Tk (GUIs), Java
- Pretty easy interface to C, C++, Fortran
- Some nice conveniences (R can save its state)
- It's multiplatform
- It's free!
The R Language
- "Not unlike S"
- The S author (John Chambers) received the 1998 ACM Software System Award
- The ACM's citation notes that Dr. Chambers' work "will forever alter the way people analyze, visualize, and manipulate data . . . S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers."
- (http://www.acm.org/announcements/ss99.html)
- I guess he did good!
Why is the R/S Language Interesting?
- "Programming with Data"
- The fundamental purpose of the language (as I see it) is to provide general tools for efficient data manipulation and analysis while allowing extensions to those tools to be programmed easily.
- It has a specific purpose. You wouldn't write your online data acquisition system in R/S, but analyzing the output of online monitoring is certainly a good task for it.
- R/S is a functional language
- vectorized functions, apply, lazy evaluation
- R/S is an object-oriented language (but with a functional bent)
- Functions with the same name are dispatched based on argument types (it has notions of inheritance and other OO features)
- Is R/S ideal? Don't know, but we've been very surprised by how some complicated tasks can be accomplished with astonishingly simple code
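The functional-with-OO flavor described above can be sketched in a few lines of R. The values and the "track" class below are my own hypothetical illustration, not from the talk: vectorized arithmetic replaces explicit loops, and generic functions dispatch on the class of their argument.

```r
# Vectorized arithmetic: compute pt for all tracks at once, no loop needed
px <- c(10.2, 3.1, 25.0)
py <- c(-4.0, 7.7, 1.2)
pt <- sqrt(px^2 + py^2)     # element-wise over the whole vectors

# S3 dispatch: the generic print() picks print.track() for objects
# whose class attribute is "track" (a made-up class for illustration)
track <- function(pt, eta) structure(list(pt = pt, eta = eta), class = "track")
print.track <- function(x, ...) cat("track: pt =", x$pt, "eta =", x$eta, "\n")

t1 <- track(25.0, 1.1)
print(t1)                   # dispatches to print.track
```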
Some non-analysis examples
- samTV: Plot the mean wait times by file source for each SAM station

    > nrow(w)
    [1] 399135
    > w[1:2,]
      station  procId       time    event fromStation  dur   min
    1 cabsrv1 2983599 1074593577 OpenFile     enstore 9343 155.7
    2 cabsrv1 2983599 1074609598 OpenFile     enstore 4850  80.8
    > w.means <- aggregate(w$min, list(station=w$station, src=w$fromStation), mean)
    > w.means[1:2,]
      station src           x
    1     cab cab 6.861695109
    2 cabsrv1 cab 8.171100917

- 2.2 seconds
samTV cont'd

    > dotplot(src ~ x | station, data=w.means,
              scales=list(cex=1.3),
              main=list("Mean Process Wait Times", cex=1.5),
              xlab=list("Wait time (minutes)", cex=1.5),
              cex=1.7,
              par.strip.text=list(cex=1.7) )
Non-analysis Examples
- We've found R to be great for slogging through text files and database query results to make extremely useful and pretty plots
Non-analysis applications
- Performance of DB server middleware (Marc Paterno)
- Data transfer speed vs. data size for two different servers
- Fit to a model of startup time plus constant throughput

    modpollux <- nls( speed ~ alpha*(1 - alpha*beta/(alpha*beta + mb)),
                      data=clientpollux,
                      start=c(alpha=2.0, beta=0.50), trace=T )
What have we learned so far?
- There seems to be an "R way"
- Do it the functional way!
- Use the apply commands and vectorized functions instead of for loops
- Higher order functions
- One of R's strengths is its user contributions
- but this means some functionality is repeated (e.g. three histogram functions -- albeit each serves a slightly different purpose)
- The learning curve is long (R can do lots!)
- But there are extensive manuals, online documentation, and published books and papers
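As a tiny illustration of that "R way" (the durations below are made-up values, not from the talk), here is the same per-element computation written three ways; the vectorized form is the shortest and the most idiomatic.

```r
dur <- c(9343, 11171, 4850, 10794)      # durations in seconds (made-up)

# Explicit for loop: works, but not idiomatic R
mins.loop <- numeric(length(dur))
for (i in seq_along(dur)) mins.loop[i] <- dur[i] / 60

# Vectorized: one expression over the whole vector
mins.vec <- dur / 60

# sapply: apply a function to each element, collecting the results
mins.sapply <- sapply(dur, function(x) x / 60)
```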
R in HEP
- We are aware of no one using R, or any other statistical package, in the HEP community. Why?
- Our needs are quite specific, and ...
- My postdoc supervisor (Ed Thorndike): "Trust no one"
- "Or at least trust no one outside of HEP"
- With very few exceptions, all of our scientific software tools are written within the community. Many people write their own, reinventing lots of wheels
- Most are unaware of tools from the statistics community and how they could apply to us
- Many of us (including me) have little to no formal statistical training and had no exposure to statistical tools (e.g. SAS, SPSS, MATLAB, R)
R in HEP
- Maybe this is changing, a little
- Root, the most widely used HEP analysis tool, has TGraphSmooth, which implements the Loess smoother (R functions translated into C++)
- Software is getting more complicated (we are doing lots more than just whipping up quick and dirty Fortran). There is some realization that we can't do it all ourselves (e.g. databases; SAM uses consultants)
- But one problem: our datasets tend to be huge
HEP datasets and R
- R seems to want to hold everything in memory
- (recently discovered externalVector; haven't tried it yet)
- In HEP, we typically run successive skims to reduce the data size (601 TB down to 100s of Meg or a few Gig)
- Hard trade-offs between size and utility of skims
- Usually skims are output to a more convenient format (e.g. Root files)
- For example, I use a 4th generation skim with 412 variables and 232K rows (1.9 Gig)
- Even our last-stage skims are probably too large for R
- Efficient handling of large datasets is one reason why Root is very successful
Three strategies for reading HEP data into R
- Realize that I don't need all 412 variables for all rows in memory at the same time
- In fact, I usually concentrate on just a few variables at a time
- Perform even further event requirements
- 1) If the data is small enough, bring it into R
- 2) If I can reduce the data to something R can hold, bring that subset of the data into R -- then I have the full power of R
- Perhaps this means using that data for a while, and loading a new set to tackle another aspect of the problem
- 3) If I can't even do the above, then have some R apparatus read in the data one row at a time and update an R object (e.g. histograms). But you don't get the full power of R
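A minimal sketch of the row-at-a-time strategy, under my own assumptions (a hypothetical whitespace-delimited text skim whose first column is pt): read the file in chunks through a connection, updating histogram bin counts so the full dataset never sits in memory at once.

```r
# Accumulate histogram counts from a large text file, one chunk at a time.
# fname: path to a whitespace-delimited file whose first column is pt
# breaks: histogram bin edges covering the expected pt range
hist.from.file <- function(fname, breaks, chunk.size = 10000) {
  counts <- integer(length(breaks) - 1)
  con <- file(fname, open = "r")
  repeat {
    lines <- readLines(con, n = chunk.size)          # next chunk of rows
    if (length(lines) == 0) break
    pt <- as.numeric(sapply(strsplit(lines, "[ \t]+"), `[`, 1))
    h <- hist(pt, breaks = breaks, plot = FALSE)     # bin this chunk only
    counts <- counts + h$counts                      # accumulate totals
  }
  close(con)
  counts
}
```

The R object that survives is just the vector of bin contents, so the memory footprint is set by `chunk.size`, not by the file size.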
Reading Root files into R
- Do it the R way!
- root.apply("myTree", "myFile.root", myFunction)
- C++ and R code written over the evenings of one weekend (my wife was out of town, the dog was asleep)
- You supply an R function that receives an entry from your Root file (as a list).
- The function can make requirements on the data, returning nothing if they fail
- The function returns a new list of variables to pass to R. These can be new derived variables not in the Root entry
- The return of root.apply is a data frame (an R database)
Example -- selecting dielectrons
- Select events with two good electrons
- Only the EM and MET branches are needed

    # R function definition
    selectDiE <- function(entry) {
      # Turn the electron data into an R data frame
      es <- as.data.frame( entry$EM )
      # Make the requirements for a good electron; this applies the cuts
      # to all electrons at once, returning a boolean vector (T,T,F,T,F)
      goodECuts <- ( es$id == 10 | abs(es$id) == 11 ) &
                   es$pt > 25.0 & es$emfrac > 0.9 &
                   es$fiducial == 1
      # Cut the entry if nothing passed
      if ( ! any(goodECuts) ) return(NULL)
      # Get the electron etas
      etas <- abs(es$eta)
      goodEtaCuts <- ...   # eta requirements (definition elided on the slide)
      # If no electrons had a good eta, then stop
      if ( ! any(goodEtaCuts) ) return(NULL)
      # Join (AND) the requirements
      goodEs <- goodEtaCuts & goodECuts
      # Make a new data frame with the passing electrons;
      # require at least two
      goodEsDF <- es[goodEs,]
      if ( nrow(goodEsDF) < 2 ) return(NULL)
      # Construct the return list:
      # get the ordering for the electrons (decreasing pt)
      goodElectronsOrder <- order( -goodEsDF$pt )
      e1 <- goodEsDF[ goodElectronsOrder[1], ]
      names(e1) <- paste("e1", names(e1), sep=".")
      e2 <- goodEsDF[ goodElectronsOrder[2], ]
      ...
    }
Analyzing dielectrons
- d <- root.apply("Global", "mydata.root", selectDiE)
- Handed back a data frame with the variables I wanted.
- Can now attack this data with the full power of R
Dielectrons

    > d <- root.apply()
    > given.met <- equal.count(d$met, number=4, overlap=0.1)
    > summary(given.met)
    Intervals:
             min       max count
    1 0.05888367  4.103577  3044
    2 3.84661865  6.498108  3045
    3 6.21868896  9.914124  3046
    4 9.42181396 88.125061  3043
    Overlap between adjacent intervals:
    [1] 307 306 308
    > xyplot(e2.pt ~ e1.pt | given.met, data=d)
Extracting signal and background from data
- (From Marc Paterno)
- Given a data sample, extract the amount of signal and background
- Bump fitting: a common HEP problem
- Try a MC example
- Generate data based on a signal distribution (Breit-Wigner/Cauchy with mass and width) and a background distribution (proportional to 1/(a+bx)^3)
- Fit this data with the background and signal distributions, but with unknown parameters
Bump Fitting
- Generate the background distribution
- bf returns a function that, when given a uniform random variable on [0,1), returns the background distribution with parameters a and b
- rbackground generates the distribution for n values
- Clever use of higher order functions and vectorized functions

    bf <- function(a,b) {
      function(x) {
        temp <- 1-x
        (a/b)*(1/sqrt(temp) - 1)
      }
    }

    rbackground <- function(n, a, b) {
      transform <- bf(a,b)
      transform(runif(n))
    }
Bump Fitting
- Generate the signal
- Generate n random Breit-Wigner values
- Require that the distribution be positive and less than max; throw away values that fail
- Recursively call the function to make up the amount that was lost
- Make the data: join the signal and background into one distribution

    rsignal <- function(n, mass, width, max) {
      temp <- rcauchy(n, mass, width)
      temp <- temp[temp > 0 & temp < max]
      num.more <- n - length(temp)
      if (num.more > 0) {
        more <- rsignal(num.more, mass, width, max)
        temp <- append(temp, more)
      }
      temp
    }

    rexperiment <- function(nsig, mass, width, nback, a, b) {
      append(rsignal(nsig, mass, width, a*b/2),
             rbackground(nback, a, b))
    }
Bump fitting
- Use an unbinned maximum likelihood fitter (fitdistr, from MASS)
- Rprof significantly sped up the fit (replace ...)

    dbackground <- function(x, a, b) {
      d <- a + b*x
      2*a*a*b/(d*d*d)
    }

    mydistr <- function(x, f, m, s, a, b) {
      (1-f)*dbackground(x,a,b) + f*dcauchy(x,m,s)
    }

    fres2 <- fitdistr(data, densfun=mydistr,
                      start=list(f=FRAC, m=40.0, s=3.0, a=100, b=2.0))
Bump Fitting
- [Plot] The histogram is the generated data; curves show the true total distribution, the signal fit, the background fit, and the total fit
- The bottom plot shows the residuals (true - fit)
What are we considering next?
- Continue to learn more about R
- Further develop the "Three Strategies"
- Explore doing a physics analysis in R

Summary
- We are exploring the use of R, a statistical analysis package from the statistics community, in an HEP environment
- R has already proven useful for analyzing monitoring and benchmarking data
- We have ideas on how R can be used to read large datasets
- We've done some "proof of principle" studies of physics analysis with R
- As we learn more about R, we expect to be more surprised at its capabilities
Options for R and Root Interfacing (after discussions)
- There is no interest from the R community in the non-I/O functions of Root
- In order of work required:
- 1) R and Root remain separate -- use the more appropriate tool for the task. Use text files to communicate between the two if necessary.
- 2) Root loads R's math and low-level statistical libraries as shared objects
- A minimalist approach for some functionality
- Some access to the math and statistics C code functions from R
- These C functions take basic C types, so no translation is necessary
- But no upper-level functions written in the R language are available
- 3) R and Root remain separate, but an R package reads Root Trees directly into R data frames
- Still use the best tool for a particular task
- Now easier to get HEP data into R
- 4) Allow calling of selected high-level R functions from within Root
- Root runs the R interpreter
- Translation is necessary: R functions must understand Root objects, and Root must understand R return objects
- Expose only some R functions -- may reduce the amount of translation
More Advanced Integration Options
- 5) R prompt from the Root prompt
- R needs seamless knowledge of objects in the current Root session
- At the end of the R session, new R variables are translated into Root objects
- Root runs the R interpreter
- Translation for all types of Root variables into R, and all types of R variables returned to Root
- A major undertaking
- 6) Root prompt from within R
- Harder than 5: R is C, but Root is C++
- I don't see much interest in this
- Things get interesting starting at 3)
- I have a version 0.0.1 prototype for reading Root trees into R
- It is required for all options above 3
- I'll try to work on this as time permits
- Both Root and R interface to Python
- Translate with Python as an intermediary?
- Not sure if that's performant enough