Title: R in High Energy Physics (A Somewhat Personal Account)
Adam Lyon
Fermi National Accelerator Laboratory, Computing Division - DØ Experiment
PHYSTAT Workshop on Statistical Software, MSU - March 2004
- Outline
- Some Background
- Why is R interesting to us?
- Some non-analysis examples
- Using R in HEP
- Some thoughts on where this can go
Some background on me
- Graduate student on DØ (400-person Fermilab experiment)
- Marc Paterno and I were among the first to use C++ for analysis at DØ (days of PAW)
- and the first at DØ to use Bayesian statistics for limit calculation
- Postdoc on CLEO (Cornell) (200-person experiment)
- Used PAW, ROOT, and Mathematica for several analyses
- Involved in the experiment's transition to C++
- Back to DØ (now a 700-person Fermilab experiment) as an associate scientist in the Computing Division
- Used R for non-HEP analysis applications
- Pondering (with Marc Paterno and Jim Kowalkowski, also of FNAL/CD) how R can be made useful in HEP analyses
First use of R
- Marc (C++/statistics expert and trouble maker) came across R and showed it to Jim and myself.
- Looked neat, but we didn't have any reason to use it until... monitoring of DØ's Data Handling System
- DØ has 601 Terabytes of data on tape
- SAM (a DØ/CDF joint project) is our
- File storage system (knows where all files live)
- File delivery system (gets those files to you worldwide)
- File cataloging system (stores meta-data for file cataloging)
- Analysis bookkeeping system (remembers what you did)
Data Handling at DØ
- SAM typically delivers 150 TB of data to users per month
- It's perhaps a 0th generation GRID
- SAM is a very complicated system
- There was no monitoring except for huge dumps of text and log files
- Monitoring was sorely needed -- lots of things can go wrong
- Usage statistics were needed for future planning and discovery of bottlenecks
samTV
Monitoring with R
- Turn a text file like this (from parsing big log files):

    station  procId   time        event            fromStation     dur
    cabsrv1  2983599  1074593577  OpenFile         enstore        9343
    cabsrv1  2983599  1074604748  RequestNextFile  NA            11171
    cabsrv1  2983599  1074609598  OpenFile         enstore        4850
    cabsrv1  2983599  1074620392  RequestNextFile  NA            10794
    cabsrv1  2983599  1074620392  OpenFile         fnal-cabsrv1      0
    cabsrv1  2983599  1074631505  RequestNextFile  NA            11113
    cab      3085189  1076666381  OpenFile         cab             415
    cab      3085189  1076673379  RequestNextFile  NA             6998
    cab      3085189  1076673379  OpenFile         cab               0
    cab      3085189  1076680426  RequestNextFile  NA             7047
    cab      3085189  1076680753  OpenFile         enstore         327
    cab      3085189  1076687836  RequestNextFile  NA             7083
    cab      3085189  1076687836  OpenFile         enstore           0
    cab      3085189  1076694821  RequestNextFile  NA             6985
    cab      3085189  1076695114  OpenFile         cab             293
    cab      3085189  1076702701  RequestNextFile  NA             7587
    cab      3085189  1076702701  OpenFile         enstore           0
Into plots like this
- R code:

    library(lattice)
    d <- read.table("data.dat", head=T)
    w <- d[d$event == "OpenFile",]
    w$min <- w$dur/60.0
    bwplot( fromStation ~ min | station,
            data=w, subset=(min < 60),
            xlab="Minutes",
            main="Wait Time for " )
Box and Whisker Plots
Why is R interesting to us?
- Seems to be the "State of the Art" in statistics
- Enormous library of user-contributed add-on packages
- Huge number of statistical tests, fitting, smoothing, ...
- More advanced stuff too: genetic algorithms, support vector machines, kriging (would have been useful for my thesis!)
- Advanced graphics based on William Cleveland's Visualizing Data
- SQL (MySQL, Oracle, SQLite, Postgres, ODBC), XML
- Hooks to COM and CORBA
- Interfaces for Python, Perl, Tk (GUIs), Java
- Pretty easy interface to C, C++, Fortran
- Some nice conveniences (R can save its state)
- It's multiplatform
- It's free!
The R Language
- "Not unlike S"
- The S author (John Chambers) received the 1998 ACM Software System Award
- The ACM's citation notes that Dr. Chambers' work "will forever alter the way people analyze, visualize, and manipulate data . . . S is an elegant, widely accepted, and enduring software system, with conceptual integrity, thanks to the insight, taste, and effort of John Chambers."
- (http://www.acm.org/announcements/ss99.html)
- I guess he did good!
Why is the R/S Language Interesting?
- "Programming with Data"
- The fundamental purpose of the language (as I see it) is to provide general tools for efficient data manipulation and analysis while allowing extensions to those tools to be programmed easily.
- It has a specific purpose. You wouldn't write your online data acquisition system in R/S, but analyzing the output of online monitoring is certainly a good task for it.
- R/S is a functional language
- vectorized functions, apply, lazy evaluation
- R/S is an object-oriented language (but with a functional bent)
- Functions with the same name are dispatched based on argument types (it has notions of inheritance and other OO features)
- Is R/S ideal? Don't know, but we've been very surprised by how some complicated tasks can be accomplished with astonishingly simple code
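The functional-with-OO flavor described above can be sketched in a few lines of R. The values and the "track" class below are my own hypothetical illustration, not from the talk: vectorized arithmetic replaces explicit loops, and generic functions dispatch on the class of their argument.

```r
# Vectorized arithmetic: compute pt for all tracks at once, no loop needed
px <- c(10.2, 3.1, 25.0)
py <- c(-4.0, 7.7, 1.2)
pt <- sqrt(px^2 + py^2)     # element-wise over the whole vectors

# S3 dispatch: the generic print() picks print.track() for objects
# whose class attribute is "track" (a made-up class for illustration)
track <- function(pt, eta) structure(list(pt = pt, eta = eta), class = "track")
print.track <- function(x, ...) cat("track: pt =", x$pt, "eta =", x$eta, "\n")

t1 <- track(25.0, 1.1)
print(t1)                   # dispatches to print.track
```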
Some non-analysis examples
- samTV: Plot the mean wait times by file source for each SAM station

    > nrow(w)
    [1] 399135
    > w[1:2,]
      station  procId       time    event fromStation  dur   min
    1 cabsrv1 2983599 1074593577 OpenFile     enstore 9343 155.7
    2 cabsrv1 2983599 1074609598 OpenFile     enstore 4850  80.8
    > w.means <- aggregate(w$min, list(station=w$station, src=w$fromStation), mean)
    > w.means[1:2,]
      station src           x
    1     cab cab 6.861695109
    2 cabsrv1 cab 8.171100917

- 2.2 seconds
samTV cont'd

    > dotplot(src ~ x | station, data=w.means,
              scales=list(cex=1.3),
              main=list("Mean Process Wait Times", cex=1.5),
              xlab=list("Wait time (minutes)", cex=1.5),
              cex=1.7,
              par.strip.text=list(cex=1.7) )
Non-analysis Examples
- We've found R to be great for slogging through text files and database query results to make extremely useful and pretty plots
Non-analysis applications
- Performance of DB server middleware (Marc Paterno)
- Data transfer speed vs. data size for two different servers
- Fit to a model of startup time plus constant throughput

    modpollux <- nls( speed ~ alpha*(1 - alpha*beta/(alpha*beta + mb)),
                      data=clientpollux,
                      start=c(alpha=2.0, beta=0.50), trace=T )
What have we learned so far?
- There seems to be an "R way"
- Do it the functional way!
- Use the apply commands and vectorized functions instead of for loops
- Higher order functions
- One of R's strengths is its user contributions
- but this means some functionality is repeated (e.g. three histogram functions -- albeit each serves a slightly different purpose)
- The learning curve is long (R can do lots!)
- But there are extensive manuals, online documentation, and published books and papers
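As a tiny illustration of that "R way" (the durations below are made-up values, not from the talk), here is the same per-element computation written three ways; the vectorized form is the shortest and the most idiomatic.

```r
dur <- c(9343, 11171, 4850, 10794)      # durations in seconds (made-up)

# Explicit for loop: works, but not idiomatic R
mins.loop <- numeric(length(dur))
for (i in seq_along(dur)) mins.loop[i] <- dur[i] / 60

# Vectorized: one expression over the whole vector
mins.vec <- dur / 60

# sapply: apply a function to each element, collecting the results
mins.sapply <- sapply(dur, function(x) x / 60)
```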
R in HEP
- We are aware of no one using R, or any other statistical package, in the HEP community. Why?
- Our needs are quite specific, and ...
- My postdoc supervisor (Ed Thorndike): "Trust no one"
- "Or at least trust no one outside of HEP"
- With very few exceptions, all of our scientific software tools are written within the community. Many people write their own, reinventing lots of wheels
- Most are unaware of tools from the statistics community and how they could apply to us
- Many of us (including me) have little to no formal statistical training and had no exposure to statistical tools (e.g. SAS, SPSS, MATLAB, R)
R in HEP
- Maybe this is changing, a little
- Root, the most widely used HEP analysis tool, has TGraphSmooth, which implements the Loess smoother (R functions translated into C++)
- Software is getting more complicated (we are doing lots more than just whipping up quick and dirty Fortran). There is some realization that we can't do it all ourselves (e.g. databases; SAM uses consultants)
- But one problem: our datasets tend to be huge
HEP datasets and R
- R seems to want to hold everything in memory
- (recently discovered externalVector; haven't tried it yet)
- In HEP, we typically run successive skims to reduce the data size (601 TB down to 100s of Meg or a few Gig)
- Hard trade-offs between size and utility of skims
- Usually skims are output to a more convenient format (e.g. Root files)
- For example, I use a 4th generation skim with 412 variables and 232K rows (1.9 Gig)
- Even our last-stage skims are probably too large for R
- Efficient handling of large datasets is one reason why Root is very successful
Three strategies for reading HEP data into R
- Realize that I don't need all 412 variables for all rows in memory at the same time
- In fact, I usually concentrate on just a few variables at a time
- Perform even further event requirements
- 1) If the data is small enough, bring it into R
- 2) If I can reduce the data to something R can hold, bring that subset of the data into R -- then I have the full power of R
- Perhaps this means using that data for a while, and loading a new set to tackle another aspect of the problem
- 3) If I can't even do the above, then have some R apparatus read in the data one row at a time and update an R object (e.g. histograms). But you don't get the full power of R
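A minimal sketch of the row-at-a-time strategy, under my own assumptions (a hypothetical whitespace-delimited text skim whose first column is pt): read the file in chunks through a connection, updating histogram bin counts so the full dataset never sits in memory at once.

```r
# Accumulate histogram counts from a large text file, one chunk at a time.
# fname: path to a whitespace-delimited file whose first column is pt
# breaks: histogram bin edges covering the expected pt range
hist.from.file <- function(fname, breaks, chunk.size = 10000) {
  counts <- integer(length(breaks) - 1)
  con <- file(fname, open = "r")
  repeat {
    lines <- readLines(con, n = chunk.size)          # next chunk of rows
    if (length(lines) == 0) break
    pt <- as.numeric(sapply(strsplit(lines, "[ \t]+"), `[`, 1))
    h <- hist(pt, breaks = breaks, plot = FALSE)     # bin this chunk only
    counts <- counts + h$counts                      # accumulate totals
  }
  close(con)
  counts
}
```

The R object that survives is just the vector of bin contents, so the memory footprint is set by `chunk.size`, not by the file size.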
Reading Root files into R
- Do it the R way!
- root.apply("myTree", "myFile.root", myFunction)
- C++ and R code written over the evenings of one weekend (my wife was out of town, the dog was asleep)
- You supply an R function that receives an entry from your Root file (as a list).
- The function can make requirements on the data, returning nothing if they fail
- The function returns a new list of variables to pass to R. These can be new derived variables not in the Root entry
- The return of root.apply is a data frame (an R database)
Example -- selecting dielectrons
- Select events with two good electrons
- Only the EM and MET branches are needed

    # R function definition
    selectDiE <- function(entry) {
      # Turn the electron data into an R data frame
      es <- as.data.frame( entry$EM )
      # Make the requirements for a good electron; this applies the cuts
      # to all electrons at once, returning a boolean vector (T,T,F,T,F)
      goodECuts <- ( es$id == 10 | abs(es$id) == 11 ) &
                   es$pt > 25.0 & es$emfrac > 0.9 &
                   es$fiducial == 1
      # Cut the entry if nothing passed
      if ( ! any(goodECuts) ) return(NULL)
      # Get the electron etas
      etas <- abs(es$eta)
      goodEtaCuts <- ...   # eta requirements (definition elided on the slide)
      # If no electrons had a good eta, then stop
      if ( ! any(goodEtaCuts) ) return(NULL)
      # Join (AND) the requirements
      goodEs <- goodEtaCuts & goodECuts
      # Make a new data frame with the passing electrons;
      # require at least two
      goodEsDF <- es[goodEs,]
      if ( nrow(goodEsDF) < 2 ) return(NULL)
      # Construct the return list:
      # get the ordering for the electrons (decreasing pt)
      goodElectronsOrder <- order( -goodEsDF$pt )
      e1 <- goodEsDF[ goodElectronsOrder[1], ]
      names(e1) <- paste("e1", names(e1), sep=".")
      e2 <- goodEsDF[ goodElectronsOrder[2], ]
      ...
    }
Analyzing dielectrons
- d <- root.apply("Global", "mydata.root", selectDiE)
- Handed back a data frame with the variables I wanted.
- Can now attack this data with the full power of R
Dielectrons

    > d <- root.apply()
    > given.met <- equal.count(d$met, number=4, overlap=0.1)
    > summary(given.met)
    Intervals:
             min       max count
    1 0.05888367  4.103577  3044
    2 3.84661865  6.498108  3045
    3 6.21868896  9.914124  3046
    4 9.42181396 88.125061  3043
    Overlap between adjacent intervals:
    [1] 307 306 308
    > xyplot(e2.pt ~ e1.pt | given.met, data=d)
Extracting signal and background from data
- (From Marc Paterno)
- Given a data sample, extract the amount of signal and background
- Bump fitting: a common HEP problem
- Try a MC example
- Generate data based on a signal distribution (Breit-Wigner/Cauchy with mass and width) and a background distribution (proportional to 1/(a+bx)^3)
- Fit this data with the background and signal distributions, but with unknown parameters
Bump Fitting
- Generate the background distribution
- bf returns a function that, when given a uniform random variable on [0,1), returns the background distribution with parameters a and b
- rbackground generates the distribution for n values
- Clever use of higher order functions and vectorized functions

    bf <- function(a,b) {
      function(x) {
        temp <- 1-x
        (a/b)*(1/sqrt(temp) - 1)
      }
    }

    rbackground <- function(n, a, b) {
      transform <- bf(a,b)
      transform(runif(n))
    }
Bump Fitting
- Generate the signal
- Generate n random Breit-Wigner values
- Require that the distribution be positive and less than max; throw away values that fail
- Recursively call the function to make up the amount that was lost
- Make the data: join the signal and background into one distribution

    rsignal <- function(n, mass, width, max) {
      temp <- rcauchy(n, mass, width)
      temp <- temp[temp > 0 & temp < max]
      num.more <- n - length(temp)
      if (num.more > 0) {
        more <- rsignal(num.more, mass, width, max)
        temp <- append(temp, more)
      }
      temp
    }

    rexperiment <- function(nsig, mass, width, nback, a, b) {
      append(rsignal(nsig, mass, width, a*b/2),
             rbackground(nback, a, b))
    }
Bump fitting
- Use an unbinned maximum likelihood fitter (fitdistr, from MASS)
- Rprof significantly sped up the fit (replace ...)

    dbackground <- function(x, a, b) {
      d <- a + b*x
      2*a*a*b/(d*d*d)
    }

    mydistr <- function(x, f, m, s, a, b) {
      (1-f)*dbackground(x,a,b) + f*dcauchy(x,m,s)
    }

    fres2 <- fitdistr(data, densfun=mydistr,
                      start=list(f=FRAC, m=40.0, s=3.0, a=100, b=2.0))
Bump Fitting
- [Plot] The histogram is the generated data; curves show the true total distribution, the signal fit, the background fit, and the total fit
- The bottom plot shows the residuals (true - fit)
What are we considering next?
- Continue to learn more about R
- Further develop the "Three Strategies"
- Explore doing a physics analysis in R

Summary
- We are exploring the use of R, a statistical analysis package from the statistics community, in an HEP environment
- R has already proven useful for analyzing monitoring and benchmarking data
- We have ideas on how R can be used to read large datasets
- We've done some "proof of principle" studies of physics analysis with R
- As we learn more about R, we expect to be more surprised at its capabilities
Options for R and Root Interfacing (after discussions)
- There is no interest from the R community in the non-I/O functions of Root
- In order of work required:
- 1) R and Root remain separate -- use the more appropriate tool for the task. Use text files to communicate between the two if necessary.
- 2) Root loads R's math and low-level statistical libraries as shared objects
- A minimalist approach for some functionality
- Some access to the math and statistics C code functions from R
- These C functions take basic C types, so no translation is necessary
- But no upper-level functions written in the R language are available
- 3) R and Root remain separate, but an R package reads Root Trees directly into R data frames
- Still use the best tool for a particular task
- Now easier to get HEP data into R
- 4) Allow calling of selected high-level R functions from within Root
- Root runs the R interpreter
- Translation is necessary: R functions must understand Root objects, and Root must understand R return objects
- Expose only some R functions -- may reduce the amount of translation
More Advanced Integration Options
- 5) R prompt from the Root prompt
- R needs seamless knowledge of objects in the current Root session
- At the end of the R session, new R variables are translated into Root objects
- Root runs the R interpreter
- Translation for all types of Root variables into R, and all types of R variables returned to Root
- A major undertaking
- 6) Root prompt from within R
- Harder than 5: R is C, but Root is C++
- I don't see much interest in this
- Things get interesting starting at 3)
- I have a version 0.0.1 prototype for reading Root trees into R
- It is required for all options above 3
- I'll try to work on this as time permits
- Both Root and R interface to Python
- Translate with Python as an intermediary?
- Not sure if that's performant enough