Title: Parallel R (pR)
Slide 1: Parallel R (pR)
- For High Performance Statistical Computing
- Nagiza F. Samatova (ORNL)
- Srikanth Yoginath (ORNL)
- Guruprasad Kora (ORNL)
- David Bauer (GT)
- Chongle Pan (UTK/ORNL)
SDM AHM @ Salt Lake City, March 3-4, 2005
Contact: Nagiza Samatova, samatovan@ornl.gov
Slide 2: Outline
- About Parallel R
- Motivation
- About R and its parallelization efforts
- Task and data parallelism with Parallel R (pR)
- Extensibility of Parallel R
- Performance Benchmarks
- Parallel R across Different Applications
- GIS data analysis with GRASS and Parallel R
- Clustered Climate Regimes using Parallel R
- Fusion scenario challenges Parallel R
- Quantitative Proteomics in Biology using Parallel R
- Summary and Future Work
Slide 3: Tera-(Flop Byte) Analyses Could Be Routine for Scientific Applications, But...
- Algorithmic Complexity
- Calculate means: O(n)
- Calculate FFT: O(n log(n))
- Calculate PCA: O(r c)
- Hierarchical clustering: O(n^2)
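The quadratic cost of hierarchical clustering comes from the pairwise distance matrix that the method consumes. A minimal sketch in base R (random data, purely illustrative; `pair_count` is a hypothetical helper name) counts the distances for a few values of n:

```r
# Count the pairwise distances that hierarchical clustering needs as n grows.
# The data are random and only illustrate the growth rate.
set.seed(1)
pair_count <- function(n) n * (n - 1) / 2   # entries in the lower triangle

for (n in c(10, 100, 1000)) {
  x <- matrix(rnorm(n * 2), nrow = n)       # n points in 2 dimensions
  d <- dist(x)                              # lower triangle of the distance matrix
  stopifnot(length(d) == pair_count(n))
  cat(n, "points need", length(d), "pairwise distances\n")
}
```

Multiplying n by 10 multiplies the storage and the work by roughly 100, which is why this step becomes the bottleneck first.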
Data volumes, now and in 5 years:
- Climate: now 20-40 TB per simulated year; 5 yrs: 100 TB/yr, 5-10 PB/yr
- Astrophysics: now and 5 yrs: can soak up anything!
- Fusion: now 100 Mbytes/15 min; 5 yrs: 1000 Mbytes/2 min
Slide 4: Statistical Computing with R
- About R (http://www.r-project.org/)
- R is the most widely used Open Source (GPL) programming environment for statistical analysis and graphics, similar to S.
- Provides good support for both users and developers.
- Highly extensible via dynamically loadable add-on packages.
- Originally developed by Robert Gentleman and Ross Ihaka.
> library(rpvm)
> .PVM.start.pvmd()
> .PVM.addhosts(...)
> .PVM.config()
Towards Enabling Parallel Computing in R
- Rmpi (Hao Yu): R interface to LAM-MPI.
- rpvm (Na Li and Tony Rossini): R interface to PVM; requires knowledge of parallel programming.
- snow (Luke Tierney): general API on top of message passing routines to provide high-level (parallel apply) commands; mostly demonstrated for embarrassingly parallel applications.
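As an illustration of the "parallel apply" style, here is a minimal sketch using the `parallel` package bundled with current R, which provides snow-style cluster functions. This is an assumption for the sake of a runnable example: it is not the 2005 snow release, and `square_task` is a hypothetical stand-in for a real, independent unit of work.

```r
# Sketch of a snow-style parallel apply with R's bundled `parallel` package.
library(parallel)

# Hypothetical task: each call is independent of the others,
# which is what makes the job embarrassingly parallel.
square_task <- function(i) {
  i^2
}

cl <- makeCluster(2)                        # start two local worker processes
res_par <- parLapply(cl, 1:8, square_task)  # distribute the 8 tasks over the workers
stopCluster(cl)                             # always release the workers

res_seq <- lapply(1:8, square_task)         # sequential reference
same <- identical(res_par, res_seq)
print(same)
```

The parallel call differs from the sequential `lapply()` only by the cluster argument, which is the appeal of this interface; it does not help when tasks must exchange data with each other.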
Slide 5: Motivation behind Parallel R (pR)
- Ideal Programming Requirements
- Be able to use existing high-level (i.e. R) code
- Require minimal extra effort for parallelizing
- Have an identical or similar (presumably easy-to-use) interface to R's
- Be able to test codes in sequential settings
- Provide efficient and scalable (in terms of problem size and number of processors) performance
Slide 6: Providing Task and Data Parallelism in pR
Slide 7: Extensibility of Parallel R (pR)
Slide 8: Scalability of Parallel R (pR)
R> solve(A, B)
pR> sla.solve(A, B, NPROWS, NPCOLS, MB)
- A and B are the input matrices
- NPROWS and NPCOLS are process grid specs
- MB is block size
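To make the comparison concrete, the sketch below runs the sequential base R call on a small, made-up system and checks the residual. The matrix values are illustrative only. The `sla.solve` line is left as a comment because it needs RScaLAPACK and an MPI setup; its grid and block-size values are hypothetical, not taken from the slides.

```r
# Sequential baseline: solve A %*% X = B for X with base R.
# A is strictly diagonally dominant, so it is guaranteed to be nonsingular.
A <- matrix(c(4, 1, 0,
              1, 4, 1,
              0, 1, 4), nrow = 3, byrow = TRUE)
B <- matrix(c(1, 2, 3), nrow = 3)

X <- solve(A, B)
residual <- max(abs(A %*% X - B))   # should be close to machine precision
print(residual)

# Parallel counterpart in the form given on the slide (requires RScaLAPACK).
# A 2 x 2 process grid and block size 2 are hypothetical choices:
# X_par <- sla.solve(A, B, 2, 2, 2)
```

The two calls differ only in the extra arguments that describe how the matrices are laid out across processes, which matches the goal of keeping the interface close to R's.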
Slide 9: Overhead due to R Parallel Agent in pR
Slide 10: Parallel R (pR) Distribution
http://www.ASPECT-SDM.org/Parallel-R
- Release History:
- pR enables both data and task parallelism (includes task-pR and RScaLAPACK) (2004/Q4)
- RScaLAPACK provides an R interface to ScaLAPACK, with its scalability in terms of problem size and number of processors, using data parallelism (2004/Q2)
- task-pR achieves parallelism by performing out-of-order execution of tasks. With its intelligent scheduling mechanism it attains a significant gain in execution times (2004/Q3)
- pMatrix provides a parallel platform to perform major matrix operations in parallel using ScaLAPACK and PBLAS Level II and III routines (2005/Q2)
Also available for download from R's CRAN web site (www.R-Project.org), with 37 mirror sites in 20 countries
Slide 11: Geo-statistical and Spatial Data Analysis with GRASS and Parallel R
With George Fann, John Drake, and Bhaduri Budhendra
- About GRASS (http://grass.itc.it/)
- GRASS (Geographic Resources Analysis Support System) is a raster/vector GIS, image processing system, and graphics production system.
- GRASS contains over 350 programs and tools to render maps and images on monitor and paper; manipulate raster, vector, and sites data; process multi-spectral image data; and create, manage, and store spatial data.
- It is Free (Libre) Software/Open Source released under the GNU GPL.
- Parallel R (pR) extension for GRASS
- Leverages the work by Markus Neteler (http://grass.itc.it/statsgrass/grass_geostats.html).
- Offers a richer set of statistical analysis capabilities, including Basic Statistics, Exploratory Data Analysis, Linear Models, Multivariate Analysis, Time Series Analysis, etc.
- Provides a high performance, parallel computational platform for large datasets
Slide 12: GRASS/Parallel-R Examples
Slide 13: Clustered Climate Regimes Analysis
With W. Hargrove, F. Hoffman, and D. Erickson
Slide 14: Scalability of pk-means() in pR
Slide 15: Fusion Scenario Challenges Parallel R
With George Ostrouchov and Don Batchelor
- Mahalanobis Distance → easy
- 250,000 points; 10 sampling for 1 hr analysis
- Hierarchical Model-based Clustering (mclust) → hard
- Expectation Maximization (EM) → easy
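One way to see why the Mahalanobis distance counts as easy to parallelize: once the center and covariance are fixed, each point's distance is independent of every other point, so the rows can be split across workers and the results concatenated. The sketch below uses base R's `mahalanobis()`, which returns squared distances. The data, the chunk count of 4, and all variable names are illustrative assumptions, and the chunks are processed sequentially here to keep the example self-contained.

```r
# Row-wise independence of the Mahalanobis distance (illustrative data).
set.seed(7)
x  <- matrix(rnorm(1000 * 3), ncol = 3)   # 1000 points in 3 dimensions
mu <- colMeans(x)                         # shared center
S  <- cov(x)                              # shared covariance

# Reference: all rows at once (squared distances).
d_all <- mahalanobis(x, center = mu, cov = S)

# Split the row indices into 4 chunks, as 4 workers would receive them.
chunk_id <- rep(1:4, length.out = nrow(x))
chunks   <- split(seq_len(nrow(x)), chunk_id)

# Each chunk only needs its own rows plus mu and S.
d_chunks <- lapply(chunks, function(idx) {
  mahalanobis(x[idx, , drop = FALSE], center = mu, cov = S)
})

# Put the per-chunk results back in the original row order.
d_merged <- numeric(nrow(x))
for (k in seq_along(chunks)) {
  d_merged[chunks[[k]]] <- d_chunks[[k]]
}

agree <- isTRUE(all.equal(d_all, d_merged))
print(agree)
```

Hierarchical clustering has no such decomposition, since each merge depends on the outcome of the previous ones.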
Slide 16: Quantitative Proteomics in Biology
With Bob Hettich, Hays McDonald, and Greg Hurst
Slide 17: Ratio Calculations for 50,000 files
2. Select Peak Window
- Subtract background noise from data
- Generate Covariance Chromatogram (red)
- Apply Savitzky-Golay Smoother (blue)
- Calculate cut-off for search (cyan)
- Find Window with Max. S/N ratio (green)
3. Calculate Ratio = Slope(Eigenvector)
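One plausible reading of the ratio step is that the ratio between two chromatograms is taken as the slope of the principal eigenvector of their 2 x 2 covariance matrix. The sketch below follows that reading on synthetic data. The signals, the true ratio of 2, the noise level, and the names `chrom_a` and `chrom_b` are all assumptions made for illustration, not values from the slides.

```r
# Ratio as the slope of the principal eigenvector (synthetic example).
set.seed(3)
chrom_a <- dnorm(seq(-3, 3, length.out = 200))   # synthetic peak shape
chrom_b <- 2 * chrom_a + rnorm(200, sd = 0.001)  # same peak, twice as intense, plus noise

C <- cov(cbind(chrom_a, chrom_b))   # 2 x 2 covariance matrix
e <- eigen(C)                       # symmetric input: eigenvalues in decreasing order
v <- e$vectors[, 1]                 # principal eigenvector

# The sign of v is arbitrary, but it cancels in the quotient.
ratio <- v[2] / v[1]
print(ratio)
```

With 50,000 files, each file's ratio can be computed independently of the others, so the per-file calculation is a natural unit of work for task parallelism.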
Slide 18: Ratio Estimation over 50,000 files
Slide 19: Ratio Calculations with Parallel R
Slide 20: Performance Results for Ratio Calculation
Slide 21: Summary and Future Work
- Parallel R (pR) is an Open Source high performance library for statistical computing in R
- It has been deployed in a number of applications including climate, GIS, fusion, and biology
- Future improvements in a few major directions:
- Demonstrate more application scenarios
- Add more libraries like RScaLAPACK, pMatrix (e.g. pAlok?, pclust, pnetCDF)
- Improve the performance (reduce overhead, memory management) of Parallel Agent
- Enhance features of Parallel Agent
- Support outside of Master-Slave model
- Better memory management strategies (one-sided put(), get(), release(), etc.)
- Support of parallel I/O over netCDF and HDF files