Parallel R (pR)

Slides: 22
Provided by: nagizas
Learn more at: https://sdm.lbl.gov

Transcript and Presenter's Notes

1
Parallel R (pR)
  • For High Performance Statistical Computing
  • Nagiza F. Samatova (ORNL)
  • Srikanth Yoginath (ORNL)
  • Guruprasad Kora (ORNL)
  • David Bauer (GT)
  • Chongle Pan (UTK/ORNL)

SDM AHM at Salt Lake City, March 3-4, 2005
Contact: Nagiza Samatova, samatovan@ornl.gov
2
Outline
  • About Parallel R
  • Motivation
  • About R and its parallelization efforts
  • Task and data parallelism with Parallel R (pR)
  • Extensibility of Parallel R
  • Performance Benchmarks
  • Parallel R across Different Applications
  • GIS data analysis with GRASS and Parallel R
  • Clustered Climate Regimes using Parallel R
  • Fusion scenario challenges Parallel R
  • Quantitative Proteomics in Biology using Parallel
    R
  • Summary and Future Work

3
Tera-(Flop Byte) Analyses Could Be Routine for
Scientific Applications But
  • Algorithmic Complexity
  • Calculate means: O(n)
  • Calculate FFT: O(n log(n))
  • Calculate PCA: O(r c)
  • Hierarchical clust.: O(n^2)
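
To make the quadratic term concrete, here is a minimal base R sketch. The helper name n_pairs is hypothetical and not part of pR. Hierarchical clustering starts from all pairwise distances, so its input alone holds n(n-1)/2 values, which is where the O(n^2) cost comes from.

```r
# Hypothetical helper: number of pairwise distances among n observations.
n_pairs <- function(n) n * (n - 1) / 2

# dist() materializes exactly that many values for an n-row matrix.
set.seed(1)
x <- matrix(rnorm(200 * 3), nrow = 200)  # 200 synthetic points in 3-D
d <- dist(x)
length(d) == n_pairs(200)

# Growing n by 10x grows the distance table by about 100x.
sizes <- c(1e3, 1e4, 1e5, 1e6)
data.frame(n = sizes,
           pairs = n_pairs(sizes),
           gigabytes = n_pairs(sizes) * 8 / 1e9)  # 8 bytes per double
```

The last table is only arithmetic on storage for the distance values; it says nothing about the running time of any particular clustering routine.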

Climate: Now 20-40TB per simulated year; 5 yrs: 100TB/yr, 5-10PB/yr
Astrophysics: Now and 5 yrs: Can soak up anything!
Fusion: Now 100Mbytes/15min; 5 yrs: 1000Mbytes/2 min
4
Statistical Computing with R
  • About R (http://www.r-project.org/)
  • R is an Open Source (GPL) and most widely used
    programming environment for statistical analysis
    and graphics; it is similar to S.
  • Provides good support for both users and
    developers.
  • Highly extensible via dynamically loadable
    add-on packages.
  • Originally developed by Robert Gentleman and
    Ross Ihaka.

> library(rpvm)
> .PVM.start.pvmd()
> .PVM.addhosts(...)
> .PVM.config()
Towards Enabling Parallel Computing in R
  • Rmpi (Hao Yu): R interface to LAM-MPI.
  • rpvm (Na Li and Tony Rossini): R interface to
    PVM; requires knowledge of parallel programming.
  • snow (Luke Tierney): general API on top of
    message passing routines to provide high-level
    (parallel apply) commands; mostly demonstrated
    for embarrassingly parallel applications.
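
As an illustration of the high-level "parallel apply" style, the sketch below uses the parallel package that ships with current R, whose cluster functions derive from snow. It is an assumption that a two-worker local cluster can be started on the machine; the task function power_sum and its data are made up for the example.

```r
library(parallel)  # bundled with R; its cluster API derives from snow

# Toy task (illustrative only): sum of the p-th powers of a fixed vector.
power_sum <- function(p, values) sum(values^p)

values <- c(2.1, 3.4, 1.9, 5.6, 4.2, 3.3)
powers <- 1:8

# Sequential reference, useful for testing before going parallel.
seq_res <- lapply(powers, power_sum, values = values)

# The same call shape, spread over two local worker processes.
cl <- makeCluster(2)
par_res <- parLapply(cl, powers, power_sum, values = values)
stopCluster(cl)

all.equal(seq_res, par_res)
```

Each element of powers is handled independently, which is what makes this an embarrassingly parallel workload: no worker needs data from another.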

5
Motivation behind Parallel R (pR)
  • Ideal Programming Requirements
  • Be able to use existing high level (i.e. R) code
  • Require minimal extra effort for parallelizing
  • Have an identical/similar (presumably easy-to-use)
    interface to R's
  • Be able to test codes in sequential settings
  • Provide efficient and scalable (in terms of
    problem size and number of processors)
    performance

6
Providing Task and Data Parallelism in pR
7
Extensibility of Parallel R (pR)
8
Scalability of Parallel R (pR)
R> solve(A, B)
pR> sla.solve(A, B, NPROWS, NPCOLS, MB)
A and B are the input matrices; NPROWS and NPCOLS are the process grid
specs; MB is the block size.
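
The sketch below shows the sequential call from the slide and then illustrates what a process grid and block size mean in ScaLAPACK's usual 2D block-cyclic layout. The helper owner_rank is hypothetical (not a pR or RScaLAPACK function), and it assumes square MB x MB blocks and row-major numbering of the grid, since the slide gives only MB. The sla.solve call is quoted from the slide in a comment and is not run.

```r
# Base R: solve A %*% X = B on a single process.
A <- matrix(c(4, 1, 0, 0,
              1, 4, 1, 0,
              0, 1, 4, 1,
              0, 0, 1, 4), nrow = 4)   # small diagonally dominant matrix
B <- matrix(1:8, nrow = 4, ncol = 2)
X <- solve(A, B)
max(abs(A %*% X - B))                  # residual of the computed solution

# pR call as given on the slide (not run here):
#   sla.solve(A, B, NPROWS, NPCOLS, MB)

# Hypothetical helper: rank of the process that owns entry (i, j) on an
# NPROWS x NPCOLS grid with MB x MB blocks dealt out cyclically.
# i and j are 1-based; the returned rank is 0-based, row-major.
owner_rank <- function(i, j, NPROWS, NPCOLS, MB) {
  prow <- ((i - 1) %/% MB) %% NPROWS
  pcol <- ((j - 1) %/% MB) %% NPCOLS
  prow * NPCOLS + pcol
}

# Owner map for an 8 x 8 matrix on a 2 x 2 grid with 2 x 2 blocks.
owner <- outer(1:8, 1:8, owner_rank, NPROWS = 2, NPCOLS = 2, MB = 2)
owner
table(owner)   # how many entries each process holds
```

In this layout the blocks are dealt out like cards, so every process holds pieces from all regions of the matrix, which helps balance work as a factorization proceeds.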
9
Overhead due to R Parallel Agent in pR
10
Parallel R (pR) Distribution
http://www.ASPECT-SDM.org/Parallel-R
  • Releases History
  • pR enables both data and task parallelism
    (includes task-pR and RScaLAPACK) (2004/Q4)
  • RScaLAPACK provides R interface to ScaLAPACK
    with its scalability in terms of problem size and
    number of processors using data parallelism
    (2004/Q2)
  • task-pR achieves parallelism by performing
    out-of-order execution of tasks. With its
    intelligent scheduling mechanism it attains
    significant gain in execution times (2004/Q3)
  • pMatrix provides a parallel platform to perform
    major matrix operations in parallel using
    ScaLAPACK and PBLAS Level II and III routines
    (2005/Q2)

Also available for download from R's CRAN web
site (www.R-Project.org), with 37 mirror sites in
20 countries.
11
Geo-statistical and Spatial Data Analysis with
GRASS and Parallel R
With George Fann, John Drake, and Bhaduri
Budhendra
  • About GRASS (http://grass.itc.it/)
  • GRASS (Geographic Resources Analysis Support
    System) is a raster/vector GIS, image processing
    system, and graphics production system.
  • GRASS contains over 350 programs and tools to
    render maps and images on monitor and paper;
    manipulate raster, vector, and sites data;
    process multi-spectral image data; and create,
    manage, and store spatial data.
  • It is Free (Libre) Software/Open Source released
    under GNU GPL.
  • Parallel R (pR) extension for GRASS
  • Leverages the work by Markus Neteler
    (http://grass.itc.it/statsgrass/grass_geostats.html).
  • Offers a richer set of statistical analysis
    capabilities, including Basic Statistics,
    Exploratory Data Analysis, Linear Models,
    Multivariate Analysis, Time Series Analysis,
    etc.
  • Provides a high-performance parallel
    computational platform for large datasets

12
Grass/Parallel-R Examples
13
Clustered Climate Regimes Analysis
With W. Hargrove, F. Hoffman, and D. Erickson
14
Scalability of pk-means() in pR
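
The transcript gives no details of pk-means(), so the following is only a generic sketch of how one k-means centre update can be split by data, not pR's implementation. The function names partial_stats and update_centers, the synthetic data, and the choice of four chunks are all assumptions made for illustration.

```r
# One k-means iteration split by data chunks (generic sketch, not pR code).

# Assign each row of x to its nearest centre; return per-cluster sums and counts.
partial_stats <- function(x, centers) {
  k <- nrow(centers)
  # Squared distance from every point to every centre (n x k matrix).
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2))
  d2 <- matrix(d2, nrow = nrow(x))   # keep matrix shape for one-row chunks
  nearest <- max.col(-d2, ties.method = "first")
  sums <- matrix(0, k, ncol(x))
  for (j in seq_len(k)) sums[j, ] <- colSums(x[nearest == j, , drop = FALSE])
  list(sums = sums, counts = tabulate(nearest, nbins = k))
}

# Reduce step: add up partial sums and counts, then divide row-wise.
update_centers <- function(parts) {
  sums <- Reduce(`+`, lapply(parts, `[[`, "sums"))
  counts <- Reduce(`+`, lapply(parts, `[[`, "counts"))
  sums / counts
}

set.seed(7)
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
centers <- x[c(1, 51), ]             # one starting centre from each blob

# Four chunks stand in for four workers.
idx <- split(seq_len(nrow(x)), rep(1:4, length.out = nrow(x)))
chunks <- lapply(idx, function(i) x[i, , drop = FALSE])

chunked <- update_centers(lapply(chunks, partial_stats, centers = centers))
single <- update_centers(list(partial_stats(x, centers)))
all.equal(chunked, single)
```

Only the small per-cluster sums and counts need to be exchanged between workers, not the data points themselves, which is why this step tends to scale with the number of processors.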
15
Fusion Scenario Challenges Parallel R
With George Ostrouchov and Don Batchelor
Mahalanobis Distance → easy
250,000 points 10 sampling for 1hr analysis
Hierarchical Model-based Clustering (mclust) → hard
Expectation Maximization (EM) → easy
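
One way to see why the Mahalanobis distance is on the "easy" side: once the centre and covariance are known, each point's distance depends on that point alone, so the rows can be processed in independent chunks. The sketch below uses base R's mahalanobis() on synthetic data; the data, the chunk count, and the variable names are assumptions, not taken from the fusion application.

```r
set.seed(3)
x <- matrix(rnorm(1000 * 4), ncol = 4)   # synthetic stand-in data
mu <- colMeans(x)
S <- cov(x)

# Whole data set in one call (returns squared distances).
d_all <- mahalanobis(x, center = mu, cov = S)

# Same result chunk by chunk; each chunk could go to a different worker.
idx <- split(seq_len(nrow(x)), rep(1:4, each = 250))
d_chunks <- unlist(
  lapply(idx, function(i) mahalanobis(x[i, , drop = FALSE], mu, S)),
  use.names = FALSE
)

all.equal(d_all, d_chunks)
```

Hierarchical clustering is harder to split this way because every merge decision depends on distances between all remaining clusters.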
16
Quantitative Proteomics in Biology
With Bob Hettich, Hays McDonald, and Greg Hurst
17
Ratio Calculations for 50,000 files
2. Select Peak Window
  • Subtract background noise from data
  • Generate Covariance Chromatogram (red)
  • Apply Savitzky-Golay Smoother (blue)
  • Calculate cut-off for search (cyan)
  • Find Window with Max. S/N ratio (green)
3. Calculate Ratio = Slope (Eigenvector)
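
Two of the steps named on this slide can be sketched in base R: Savitzky-Golay smoothing and a ratio taken as the slope of the leading eigenvector. This is a sketch under stated assumptions, not the authors' code: the two synthetic channels chan1 and chan2, the true ratio of 2, and the 5-point quadratic window are all choices made for the example.

```r
set.seed(11)
tm <- seq(-3, 3, length.out = 121)
chan1 <- 100 * exp(-tm^2) + rnorm(121, sd = 2)   # synthetic chromatogram
chan2 <- 200 * exp(-tm^2) + rnorm(121, sd = 2)   # same peak, twice as intense

# Savitzky-Golay smoothing: 5-point quadratic window, textbook coefficients.
sg5 <- c(-3, 12, 17, 12, -3) / 35
smooth5 <- function(y) as.numeric(stats::filter(y, sg5, sides = 2))

keep <- 3:119   # the centred 5-point filter leaves NA at the first and last two points
xy <- cbind(smooth5(chan1)[keep], smooth5(chan2)[keep])

# Slope of the first principal eigenvector of the 2 x 2 covariance matrix.
v <- eigen(cov(xy))$vectors[, 1]
ratio <- v[2] / v[1]
ratio
```

Because the slope is a quotient of the eigenvector's two components, the arbitrary sign that eigen() assigns to the vector cancels out. With one such calculation per file, 50,000 files form independent tasks.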

18
Ratio Estimation over 50,000 files
19
Ratio Calculations with Parallel R
20
Performance Results for Ratio Calculation
21
Summary and Future Work
  • Parallel R (pR) is an Open Source high
    performance library for statistical computing in
    R
  • It has been deployed in a number of applications
    including climate, GIS, fusion, and biology
  • Future improvements in a few major directions
  • Demonstrate more application scenarios
  • Add more libraries like RScaLAPACK, PMatrix (e.g.
    pAlok?, pclust, pnetCDF)
  • Improve the performance (reduce overhead, memory
    management) of Parallel Agent
  • Enhance features of Parallel Agent
  • Support outside of Master-Slave model
  • Better memory management strategies (one-sided
    put(), get(), release(), etc.)
  • Support of parallel I/O over netCDF and HDF files