High Performance Scientific Data Analytics - PowerPoint PPT Presentation

Slides: 44
Provided by: your182
Learn more at: https://sdm.lbl.gov

Transcript and Presenter's Notes


1
High Performance Scientific Data Analytics
  • Nagiza F. Samatova, PhD
  • Department of Computer Science, NCSU
  • Computer Science and Mathematics Division, ORNL

2
Core Team
  • Paul Breimyer
  • Guru
  • Heshan
  • Chandra
Collaborators: co-authors on papers, Scott
Klasky, Roselyne, Mladen, Arie and Alex, Marcia,
Bill Nevins, Bob Hettich, John Drake, Tony
Mezzacappa, etc.
3
Publications
  • CRAN: Samatova NF, Yoginath S, Kora G, Bauer D,
    http://cran.r-project.org/mirrors.html.
  • SciDAC-06: Samatova NF, Branstetter M, Ganguly
    AR, Hettich R, Khan S, Kora G, Li J, Ma X, Pan C,
    Shoshani A, Yoginath S, Journal of Physics:
    Conference Series 46 (2006) 505-509.
  • PDCS-05: Yoginath S, Samatova NF, Bauer D, Kora
    G, Fann G, Geist A, In Proceedings of the 18th
    International Conference on Parallel and
    Distributed Computing Systems (PDCS-2005),
    September 12-14, 2005, Las Vegas, Nevada.
  • AnalChem-06.a: Pan C, Kora G, McDonald WH, Tabb
    DL, VerBerkmoes NC, Hurst GB, Pelletier DA,
    Samatova NF, Hettich RL, Anal Chem. 2006 Oct
    15;78(20):7121-31.
  • AnalChem-06.b: Pan C, Kora G, Tabb DL, Pelletier
    DA, McDonald WH, Hurst GB, Hettich RL, Samatova
    NF, Anal Chem. 2006 Oct 15;78(20):7110-20.
  • TPAMI-05: Ostrouchov G, Samatova NF, IEEE
    Transactions on Pattern Analysis and Machine
    Intelligence, 27:1340-1343, 2005.
  • JCGS-07: Qu YM, Ostrouchov G, Yoginath S,
    Samatova NF, Journal of Computational and
    Graphical Statistics, 2007.
  • MCP-08: Pan C, Oda Y, Lankford PK, Zhang B,
    Samatova NF, Pelletier DA, Harwood CS, Hettich
    RL, "Characterization of anaerobic catabolism of
    p-coumarate in Rhodopseudomonas palustris by
    integrating transcriptomics and quantitative
    proteomics." Mol Cell Proteomics, vol. 7, no. 5,
    pp. 938-48, 2008.
  • CSDA-07: Park BH, Ostrouchov G, Samatova NF,
    "Sampling streaming data with replacement."
    Comput. Stat. Data Anal., vol. 52, no. 2,
    pp. 750-762, 2007.
  • TVCG-07: Sisneros R, Jones C, Huang J, Gao J,
    Park BH, Samatova NF, "A multi-level cache model
    for run-time optimization of remote
    visualization." IEEE Trans Vis Comput Graph,
    vol. 13, no. 5, pp. 991-1003, Sep-Oct 2007.
  • DPD-02: Samatova NF, Ostrouchov G, Geist A,
    Melechko AV, "RACHET: An efficient cover-based
    merging of clustering hierarchies from
    distributed datasets." Distrib. Parallel
    Databases, vol. 11, no. 2, pp. 157-180, Mar 2002.
  • BIBM-08: Breimyer P, Green N, Kumar V, Samatova
    NF, "BioDEAL: Biological
    data-evidence-annotation linkage system."
    Proceedings of the IEEE International Conference
    on Bioinformatics and Biomedicine (BIBM 2008),
    Philadelphia, PA, USA, Nov. 7-9, 2008.
  • Ma X, Li J, Samatova NF, "Automatic
    Parallelization of Scripting Languages: Toward
    Transparent Desktop Parallel Computing."
    Proceedings of IEEE/ACS International Conference
    on Parallel and Distributed Processing Symposium
    (IPDPS 2007), pp. 1-6, 26-30 March, 2007.

4
Publications (cont.)
  • Lin H, Ma X, Chandramohan P, Geist A, Samatova
    NF, "Efficient Data Access for Parallel BLAST."
    Proceedings of 19th IEEE International Parallel
    and Distributed Processing Symposium (IPDPS
    2005), pp. 72, 04-08 April 2005.
  • Yoginath S, Samatova NF, Bauer D, Kora G, Fann G,
    Geist A, "RScaLAPACK: High-performance parallel
    statistical computing with R and ScaLAPACK."
    Proceedings of the 18th International Conference
    on Parallel and Distributed Computing Systems
    (PDCS-2005), Sep 12-14, 2005, Las Vegas, Nevada.
  • Park BH, Ostrouchov G, Samatova NF,
    "Reservoir-based random sampling from data
    stream." Proceedings of the Fourth SIAM
    International Conference on Data Mining, Orlando,
    FL, April 2004.
  • Ostrouchov G, Samatova NF, "Embedding methods and
    robust statistics for dimension reduction."
    COMPSTAT 2004: Proceedings in Computational
    Statistics, Physica-Verlag, A Springer Company,
    2004.
  • Park BH, Samatova NF, Ostrouchov G, Geist A,
    "Xmap: Fast dimension reduction algorithms for
    multivariate streamline data." Proceedings of
    the 6th International Workshop on High
    Performance Data Mining: Pervasive and Data
    Stream Mining (in conjunction with Third
    International SIAM Conference on Data Mining),
    San Francisco, CA, May 1-3, 2003.
  • Abu-Khzam FN, Samatova NF, Ostrouchov G, Langston
    MA, Geist GA, "Distributed dimension reduction
    algorithms for widely dispersed data."
    Proceedings of the Fourteenth IASTED
    International Conference on Parallel and
    Distributed Computing and Systems (IASTED PDCS
    2002), p. 167-174, 2002, ACTA Press.
  • Qu Y, Ostrouchov G, Samatova NF, Geist A,
    "Principal component analysis for dimension
    reduction in massive distributed data sets."
    Proceedings of the Second SIAM International
    Conference on Data Mining, p. 4-9, April 2002.
  • Samatova NF, Ostrouchov G, Geist A, Melechko AV,
    "RACHET: A new algorithm for mining
    multi-dimensional distributed datasets."
    Proceedings of the SIAM Third Workshop on Mining
    Scientific Datasets, Chicago, IL, April 2001.
  • Samatova NF, Breimyer P, Kora G, Pan P, Yoginath
    S, "Parallel R for High Performance Analytics:
    Applications to Biology." In Scientific Data
    Management, A. Shoshani and D. Rotem (editors),
    C. Kamath (co-editor), CRC Press/Taylor and
    Francis, 2008 (coming soon).
  • Samatova NF, Branstetter M, Ganguly AR, Hettich
    R, Khan S, Kora G, Li J, Ma X, Pan C, Shoshani A,
    Yoginath S, "High performance statistical
    computing with parallel R: Applications to
    biology and climate." Journal of Physics:
    Conference Series, SciDAC 2006, v. 46,
    p. 505-509, 2006.
  • Bethel W, Abram G, Sharf J, Frank R, Ahrens J,
    Samatova NF, Miller M, "Interoperability of
    visualization software and data models is not an
    achievable goal." In Proceedings of the IEEE
    Visualization, Seattle, Washington, October
    19-24, 2003, p. 607-610.

5
Tony's Frustrations
Scientific Computing is not only
COMPUTE-INTENSIVE but also DATA-INTENSIVE.
  • Visualization
      • TSB, ParaView, EnSight, VisBench... Which one
        to choose? What if I want the best part of each
        one of them? Will they ever interoperate?
      • Will they support HDF directly? What about
        parallel I/O?
      • Will I have viz pipelines/features customized
        for TSI?
      • Multi-resolution, remote, collaborative,
        interactive, parallel, scalable
  • Data analysis
      • Will I have data analysis pipelines customized
        for TSI?
      • What features to extract?
      • Move from qualitative to quantitative validation
        and verification of models
      • Can I have a compact representation of an entire
        simulation? How to compare simulations? Will
        data analysis be coupled w/ data archives?
      • Will data analysis ever be coupled with
        visualization?

6
More Frustrations
Tony wants to remain a Domain Expert, NOT
become a Jack of All Trades
  • Data Management & Networking
      • Hydro-run 1024^3 produces terabytes per run
      • How to efficiently stream directly to and from
        HPSS?
      • PVFS, SRM, HRM: how to utilize them?
      • Simultaneous transfer of data from simulation
        computer to data analysis/viz cluster
      • File I/O and data transfer take as much time and
        effort as simulation if not more, while limiting
        data size often results in a rerun due to overly
        coarse sampling
      • What about data reduction/compression techniques?
        How aggressive can I be? Will it be enough? What
        about viz and data analysis running on reduced
        data? Will I still preserve the desired features?
      • How to efficiently utilize network resources,
        including data staging, cataloging, and
        scheduling of preprocessing, data analysis, and
        viz tasks?

7
How to Make Tony Happy? Internet Plug-ins
for Ultrascale Computing?
Paraview
IEEE Viz-2003
8
End-to-End Data Analytics
9
Programmer's Dilemma
Productivity, from high-level languages down to low-level languages:
  • Domain-specific (?)
  • Scripting (R, Matlab, IDL)
  • Object Oriented (C++, Java)
  • Functional languages (C, Fortran)
  • Assembly
10
Towards High-Performance High-Level Languages
How do we get there? → Parallelization
Productivity vs. Performance, from high-level languages down to low-level languages:
  • Domain-specific (?)
  • Scripting (R, Matlab, IDL)
  • Object Oriented (C++, Java)
  • Functional languages (C, Fortran)
  • Assembly
11
One Hat Does NOT Fit All: Parallel R for
Data-Intensive Statistical Computing
  • Technical computing
  • Matrix and vector formulations

Statistical computing and graphics
http://www.r-project.org
  • Developed by R. Gentleman & R. Ihaka
  • Expanded by community as open source
  • Extensible via dynamically loadable libs
  • Data Visualization and analysis platform
  • Image processing, vector computing

12
Statistical Computing with R
  • About R (http://www.r-project.org/)
  • Open source; most widely used for statistical
    analysis and graphics; similar to S.
  • Extensible via dynamically loadable add-on
    packages.
  • Originally developed by R. Gentleman and R.
    Ihaka.

Towards Enabling Parallel Computing in R
  • snow (Luke Tierney): general API on top of
    message passing routines to provide high-level
    (parallel apply) commands; mostly demonstrated
    for embarrassingly parallel applications.
  • Rmpi (Hao Yu): R interface to LAM-MPI.
  • rpvm (Na Li and Tony Rossini): R interface to
    PVM; requires knowledge of parallel programming.

> library(rpvm)
> .PVM.start.pvmd()
> .PVM.addhosts(...)
> .PVM.config()
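The "parallel apply" style that snow offers can be illustrated outside R. The following is only a rough Python analogue, not snow's API: `parallel_apply` and `column_mean` are hypothetical names, and a thread pool stands in for the cluster of worker processes that a real deployment would use.

```python
from concurrent.futures import ThreadPoolExecutor


def column_mean(column):
    """Independent unit of work: no communication with other tasks."""
    return sum(column) / len(column)


def parallel_apply(func, items, workers=4):
    """Apply func to every item concurrently; results keep the input order.

    Assumption: a thread pool is used for simplicity. For CPU-bound work
    a process pool or a cluster back end would take its place.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(func, items))


columns = [[1.0, 2.0, 3.0], [10.0, 20.0, 30.0], [5.0, 5.0, 5.0]]
means = parallel_apply(column_mean, columns)
print(means)
```

Because each call to `column_mean` touches only its own column, the work is embarrassingly parallel, which is the case the slide says snow has mostly been demonstrated for.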
13
Lessons Learned from R/Matlab Parallelization
Interactivity and High-Level: Curse & Blessing
Approaches, ordered from high to low Abstraction, Interactivity, and Productivity:
  • pR, back-end approach: data parallelism;
    C/C++/Fortran with MPI; RScaLAPACK (Samatova
    et al., 2005)
  • pR, automatic parallelization: task parallelism;
    task-pR (Samatova et al., 2004)
  • Embarrassing parallelism: data parallelism;
    snow (Tierney, Rossini, Li, Sevcikova, 2006)
  • Manual parallelization: message passing;
    Rmpi (Hao Yu, 2006), rpvm (Na Li & Tony
    Rossini, 2006)
  • Compiled approach: Matlab → C → automatic
    parallelization
Packages: http://cran.r-project.org/
14
Task and Data Parallelism in pR
15
pR Multi-tiered Architecture
Interactive R Client
16
pR in Use
  • Key Features of pR: User's Perspective
  • Be able to use existing high-level R code
  • Require minimal extra effort for parallelizing
  • Have an identical/similar (presumably easy-to-use)
    interface to R's
  • Be able to test codes in sequential settings
  • Provide efficient and scalable (in terms of
    problem size and number of processors)
    performance
  • Integrate with Kepler as front-end interface

17
Scalability of pR: RScaLAPACK
R> solve(A, B)
pR> sla.solve(A, B, NPROWS, NPCOLS, MB)
A, B are input matrices; NPROWS and NPCOLS are
process grid specs; MB is block size.
(Figure: scalability plot; data labels 116, 111, 106, 99, 83, 59.)
Architecture: SGI Altix at CCS of ORNL with 256
Intel Itanium2 processors at 1.5 GHz; 8 GB of
memory per processor (2 TB system memory); 64-bit
Linux OS; 1.5 TeraFLOP/s theoretical total peak
performance.
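To make the roles of NPROWS, NPCOLS, and MB concrete, here is a minimal sketch of the two-dimensional block-cyclic layout that ScaLAPACK uses to spread a matrix over a process grid. The function names are hypothetical, and the sketch assumes square MB x MB blocks, 0-based indices, and a distribution that starts at process (0, 0).

```python
from collections import Counter


def owner(i, j, mb, nprows, npcols):
    """Grid coordinates (process row, process column) that hold entry (i, j).

    Assumptions: 0-based indices, square mb x mb blocks, and the first
    block assigned to process (0, 0).
    """
    return (i // mb) % nprows, (j // mb) % npcols


def entries_per_process(n, mb, nprows, npcols):
    """Count how many entries of an n x n matrix each process holds."""
    return Counter(
        owner(i, j, mb, nprows, npcols) for i in range(n) for j in range(n)
    )


# An 8 x 8 matrix in 2 x 2 blocks on a 2 x 2 process grid.
counts = entries_per_process(n=8, mb=2, nprows=2, npcols=2)
print(owner(5, 6, 2, 2, 2), dict(counts))
```

Blocks are dealt out cyclically in both dimensions, so every process ends up with a similar share of the matrix. A smaller MB usually balances load better, and a larger MB usually means fewer, larger messages.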
18
Overhead due to R pR
19
C/C++/Fortran Plug-in to pR
20
Serial pR Performance over Python and R
(Table: Comparing Method Performance in Seconds; columns: pR, pR Improv. over Python, pR Improv. over R.)
21
RedHat and CRAN Distribution
22
End-to-End Data Analytics
23
Outreach Applications Publications
Across Science Applications
  • Biology: Quantitative Proteomics (B. Hettich, G.
    Hurst, C. Harwood, C. Pan)
  • Climate: Analysis of Extreme Events (M.
    Branstetter, A. Ganguly, S. Khan)
  • GIS: GRASSpR (G. Fann, B. Budhend)
  • Fusion: Scott Klasky, Bill Nevins

24
  • Subtract background noise from data
  • Generate Covariance Chromatogram
  • Apply Savitzky-Golay Smoother
  • Calculate cut-off for search
  • Find Window with Max. S/N ratio
  • ...

ProRata: http://www.MSProRata.org
25
ProRata: Bringing pR to Biologists
DOE OBER Projects Using ProRata
  • J. Banfield, Bob Hettich: AMD (Nature-09)
  • M. Buchanan: CMCS Center (Bioinformatics-08)
  • J. Mielenz: BESC BioEnergy (in submission)
  • C. Harwood, Bob Hettich: R. palustris (MCP-08)

>1,000 downloads
ProRata: http://www.MSProRata.org
AnalChem-06.a, AnalChem-06.b
26
  • About GRASS (grass.itc.it)
  • GRASS (Geographic Resources Analysis Support
    System) is a raster/vector GIS, image processing
    system, and graphics production system.
  • GRASS contains over 350 programs and tools to
    render maps and images on monitor and paper;
    manipulate raster, vector, and sites data;
    process multi-spectral image data; and create,
    manage, and store spatial data.
  • It is Free (Libre) Software/Open Source released
    under GNU GPL.

27
(No Transcript)
28
End-to-End Data Analytics
29
Programmatic Backend Access Via Web Services
Integration to Kepler
Kepler Workflow
30
Dashboard Interface to pR
Scott Klasky Roselyne Nobert
31
End-to-End Data Analytics
32
Parallel, Distributed and Streamline Algorithms
  • Clustering
      • RACHET REF, REF
      • Faisal's
  • Dimension Reduction and Data Compression
      • Distributed PCA REF
      • Streamline XMap
      • RobustMap REF
  • Outlier/Extreme Event Detection
      • RobustMap REF
      • Modeling the Usual to Find the Unusual REF
      • Climate Extreme Events SciDAC-06
  • Streamline Sampling
      • With replacement REF, REF
  • Parallel Graph Mining
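As an illustration of the "Streamline Sampling, with replacement" item, the sketch below draws a fixed-size sample with replacement from a stream in one pass. It is a textbook construction (k independent one-item reservoirs), not necessarily the algorithm of the cited papers, and the function name is hypothetical.

```python
import random


def sample_with_replacement(stream, k, rng=None):
    """One-pass sample of size k, drawn with replacement from a stream.

    Each of the k slots acts as an independent one-item reservoir: the
    n-th stream element overwrites a slot with probability 1/n, which
    leaves every element seen so far equally likely to occupy that slot.
    """
    rng = rng or random.Random()
    slots = []
    for n, item in enumerate(stream, start=1):
        if n == 1:
            slots = [item] * k  # the first element fills every slot
            continue
        for s in range(k):
            if rng.random() < 1.0 / n:
                slots[s] = item
    return slots


sample = sample_with_replacement(range(1000), k=5, rng=random.Random(42))
print(sample)
```

This simple version does O(k) work per stream element and never needs to know the stream length in advance. The same element may appear in several slots, which is what sampling with replacement requires.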

33
RACHET: Distributed Hierarchical Clustering
1. Generate Local Dendrogram
2. Transmit
Send the code, NOT the data
RACHET
3. Merge
4. Visualize
Centroid Descriptive Statistics
Merging Theorem for updating DS
Global Dendrogram
Recursive Agglomeration of Clustering Hierarchies
by Encircling Tactic (RACHET)
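The idea of transmitting centroid descriptive statistics instead of raw data can be sketched with a standard identity for combining summaries. This is an illustration only, not the slide's merging theorem: the choice of statistics (count, centroid, sum of squared distances to the centroid) and all names are assumptions.

```python
from dataclasses import dataclass


@dataclass
class ClusterStats:
    n: int            # number of points in the cluster
    centroid: list    # per-dimension mean
    ss: float         # sum of squared distances of points to the centroid


def summarize(points):
    """Compute the summary from raw points (done locally at each site)."""
    n = len(points)
    dims = range(len(points[0]))
    centroid = [sum(p[d] for p in points) / n for d in dims]
    ss = sum(sum((p[d] - centroid[d]) ** 2 for d in dims) for p in points)
    return ClusterStats(n, centroid, ss)


def merge(a, b):
    """Combine two summaries without access to the raw points."""
    n = a.n + b.n
    centroid = [(a.n * x + b.n * y) / n for x, y in zip(a.centroid, b.centroid)]
    gap = sum((x - y) ** 2 for x, y in zip(a.centroid, b.centroid))
    ss = a.ss + b.ss + (a.n * b.n / n) * gap
    return ClusterStats(n, centroid, ss)


site1 = [(0.0, 0.0), (2.0, 0.0)]
site2 = [(10.0, 0.0), (12.0, 0.0), (14.0, 0.0)]
merged = merge(summarize(site1), summarize(site2))
direct = summarize(site1 + site2)
print(merged, direct)
```

The merged summary matches the one computed from the pooled points, so only a handful of numbers per cluster has to cross the network.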
34
Distributed Streaming Dimension Reduction:
Merging Information Rather Than Raw Data
Stream of simulation data
Incremental update via fusion
  • Merge pivotal points only
  • Linear time for each chunk
  • 5% deviation from monolithic
  • Merge few PCs and local means
  • One time communication
  • Controlled variability preserved

35
Model the Usual to Find the Unusual
To reduce the data to detect extreme/specific
events in global context.
3. Reduce data to model parameters
4. Select extremes for global analysis
5. Cluster the extremes (4)
6. Map back to series
36
End-to-End Data Analytics
37
Climate Data Movement ESGSDM
38
mpiBLAST-pio: Exploiting Parallel I/O
  • Publications: IPDPS-05, SSDBM-08
  • Download: http://mpiblast.lanl.gov or
    http://www.mpiblast.org
  • Collaborators: Xiaosong Ma, Heshan Lin, Wu Feng

39
End-to-End Data Analytics: Summary
40
How to Make Sense of Data? Know Your Limits & Be
Smart
Not humanly possible to browse a petabyte of
data. Analysis must reduce data to quantities of
interest.
  • Ultrascale Computations: Must be smart about
    which probe combinations to see!
  • Physical Experiments: Must be smart about probe
    placement!
To see 1 percent of a petabyte at 10 megabytes
per second takes 35 8-hour days!
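The 35-day figure can be checked with a few lines of arithmetic. The sketch assumes decimal units (1 petabyte = 10^15 bytes, 1 megabyte = 10^6 bytes); the variable names are illustrative.

```python
PETABYTE = 10**15          # bytes, decimal units (assumption)
MEGABYTE = 10**6           # bytes, decimal units (assumption)

bytes_to_view = PETABYTE // 100        # 1 percent of a petabyte
rate = 10 * MEGABYTE                   # 10 megabytes per second

seconds = bytes_to_view // rate        # time to stream the data once
hours = seconds / 3600
workdays = hours / 8                   # 8-hour days

print(seconds, round(hours, 1), round(workdays, 1))
```

Under these assumptions the transfer takes 10^6 seconds, or about 278 hours, which is roughly 35 eight-hour days.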
41
Looking into the Future: NSF Expedition
Nagiza Samatova, Mladen Vouk, Scott Klasky, Alok
Choudhary, Bertram Ludaescher
42
Concept-Driven Analytics
43
Generating Knowledge Hierarchies via In-X
Analytics
Climate Use Case
In-X devices/applications (white spheres)
produce Knowledge Layers (pyramid) for annotation
and further discussion by scientific social
sub-nets (smileys).
  • L1: A supercomputer runs a simulation and
    produces raw data (bottom pyramid layer).
  • L2: As the simulation proceeds, the in-X cloud
    is informed of the pending analytics. While
    streaming time series to their destination, the
    cyberinfrastructure cloud segments them on the
    fly (into 100 time points), fits polynomials to
    each segment, and reduces segments to a few
    polynomial coefficients. The in-network reduced
    data reaches its remote destination, active
    disks.
  • L3: Disks, while storing the data, perform
    in-disk clustering to find similar points in
    low-dimensional coefficient space (the usual) and
    detect outliers to find local extremes (the
    unusual).
  • L4: Disks fit statistical models to clusters of
    similar points (e.g., cluster centroids,
    density).
  • L5: Local/global extremes for different
    variables are analyzed in memory for cause-effect
    linkages.
  • L6: Humanly and/or automatically generated
    hypotheses are recorded in community
    knowledgebases.
  • L8: Databases, while recording the predicted
    relationships and hypotheses, compare, contrast,
    and link them to prior knowledge. In-database
    comparative analysis results are recorded.
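The reduction step in the use case (segment a time series, fit a polynomial to each segment, keep only the coefficients) can be sketched as follows. The slide does not state the polynomial degree, so this sketch assumes degree 1 (a least-squares line) to stay dependency-free; the function names are hypothetical.

```python
def fit_line(values):
    """Least-squares fit y = a + b*t over t = 0..len(values)-1; returns (a, b)."""
    n = len(values)
    if n == 1:
        return (values[0], 0.0)
    t_mean = (n - 1) / 2
    y_mean = sum(values) / n
    sxx = sum((t - t_mean) ** 2 for t in range(n))
    sxy = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(values))
    slope = sxy / sxx
    return (y_mean - slope * t_mean, slope)


def reduce_series(series, segment_length=100):
    """Replace each segment of the series by its fitted coefficients.

    Assumption: time restarts at 0 inside every segment, so coefficients
    from different segments are directly comparable.
    """
    return [
        fit_line(series[start:start + segment_length])
        for start in range(0, len(series), segment_length)
    ]


# Two synthetic segments of 100 points each: a rise, then a fall.
series = [2.0 + 0.5 * t for t in range(100)] + [100.0 - 1.0 * t for t in range(100)]
coefficients = reduce_series(series)
print(coefficients)
```

Here 200 values shrink to two coefficient pairs, and later stages such as clustering and outlier detection can work in that low-dimensional coefficient space.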
44
Semantic Knowledge Annotation with BioDEAL