NetSolve - PowerPoint PPT Presentation


Title: NetSolve


1
NetSolve
  • Henri Casanova and Jack Dongarra
  • University of Tennessee and Oak Ridge National
    Laboratory
  • http://www.cs.utk.edu/netsolve

2
Objectives
  • Harnessing vast computational resources on the
    network
  • Hardware
  • Software
  • Convenient for scientific computing community
  • Reducing installation and programming overhead
  • Masking complexity related to distributed
    computing

3
Computation-Sharing Models: Proxy Computing

4
Computation-Sharing Models: Code Shipping
  • Diagram: Client and Server; Code is copied to the Client, where the Data resides
  • Computation on the client

5
Computation-Sharing Models: Remote Computation
  • Diagram: Client and Server; Data is copied to the Server, where the Code resides
  • Computation on the server

6
Design issues
  • Platform independence to accommodate
    heterogeneity
  • User friendly
  • Extensibility
  • Load balancing
  • Fault tolerance

7
NetSolve Architecture
  • Diagram labels: OS, Resources

8
NetSolve Organization and Operation

9
NetSolve Client Interface
  • C, Fortran, Java, Matlab, and Mathematica
  • Blocking call from Matlab:
  • >> a = rand(100); b = rand(100,1);
  • >> x = netsolve('ax=b', a, b)
  • Non-blocking call from Matlab:
  • >> a = rand(100); b = rand(100,1);
  • >> request = netsolve_nb('send', 'ax=b', a, b)
  • >> x = netsolve_nb('probe', request)
  • Not ready
  • >> x = netsolve_nb('wait', request)
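
The send / probe / wait pattern on this slide can be illustrated outside Matlab. The following is a minimal Python sketch, not the NetSolve client API: the "server" is a local thread, and the names `send`, `probe`, `wait`, and `solve_2x2` are hypothetical stand-ins chosen to mirror the slide.

```python
from concurrent.futures import ThreadPoolExecutor
import time


def solve_2x2(a, b):
    """Stand-in for a remote solver: solve the 2x2 system a*x = b (Cramer's rule)."""
    time.sleep(0.2)  # pretend the remote computation takes a while
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - b[0] * a[1][0]) / det
    return [x0, x1]


# Assumption: a local thread pool plays the role of the remote servers.
pool = ThreadPoolExecutor(max_workers=2)


def send(a, b):
    """Submit the problem and return a request handle immediately."""
    return pool.submit(solve_2x2, a, b)


def probe(request):
    """Return True if the result is ready, without blocking."""
    return request.done()


def wait(request):
    """Block until the result is available and return it."""
    return request.result()


request = send([[2.0, 0.0], [0.0, 4.0]], [2.0, 8.0])
ready_at_once = probe(request)  # usually False: the solver is still running
x = wait(request)               # blocks until the solution arrives
pool.shutdown()
```

Between `send` and `wait` the client is free to do other work or to submit further requests, which is how the non-blocking interface exposes parallelism.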

10
NetSolve Wrappers
  • Problem description file for extensibility
  • @PROBLEM ipars
  • @INCLUDE ipars.h
  • @LIB /home/user/lib/libipars.a
  • @DESCRIPTION
  • Parallel Sub-Surface Flow Simulator
  • @INPUT 2
  • @OBJECT STRING CHAR model
  • @OBJECT FILE CHAR infile
  • Compiled into wrappers around scientific
    libraries
  • XDR for platform-independent data transfer
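
To make the XDR bullet concrete, here is a minimal Python sketch of how a vector of doubles could be put on the wire in XDR form: a 4-byte big-endian element count followed by 8-byte big-endian IEEE 754 doubles. The function names are hypothetical, and this is not NetSolve's own marshalling code.

```python
import struct


def xdr_encode_doubles(values):
    """Encode a list of floats as an XDR variable-length array of doubles."""
    count = struct.pack(">I", len(values))              # 4-byte unsigned count
    body = struct.pack(f">{len(values)}d", *values)     # 8 bytes per double
    return count + body


def xdr_decode_doubles(payload):
    """Decode the byte string produced by xdr_encode_doubles."""
    (count,) = struct.unpack_from(">I", payload, 0)
    return list(struct.unpack_from(f">{count}d", payload, 4))


wire = xdr_encode_doubles([1.0, -2.5, 3.25])
roundtrip = xdr_decode_doubles(wire)
```

Because the byte order and field sizes are fixed by the format rather than by the host, a little-endian client and a big-endian server decode the same values.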

11
NetSolve Load Balancing
  • Assigning a task to the best machine
  • Establishing a performance model
  • Network delay, server properties, task properties
  • Measuring and monitoring dynamic system states
  • Load balancing at a finer granularity
  • Parallelism through non-blocking interface
  • Task migration
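
A performance model of the kind listed above might be sketched as follows. This is an illustrative assumption, not the model NetSolve actually uses: the estimated completion time is taken as network transfer time plus compute time, with a server's speed discounted by its current load. All field names and the discount formula are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Server:
    name: str
    peak_mflops: float      # server property: raw speed
    load_percent: float     # dynamic state: 0 means idle
    latency_s: float        # network delay to this server
    bandwidth_mb_s: float   # network bandwidth to this server


def estimated_time(server, data_mb, work_mflop):
    """Estimated seconds to ship the data and run the task on this server."""
    transfer = server.latency_s + data_mb / server.bandwidth_mb_s
    # Assumed discount: a server at 100% load delivers half its peak speed.
    effective_mflops = server.peak_mflops * 100.0 / (100.0 + server.load_percent)
    return transfer + work_mflop / effective_mflops


def pick_server(servers, data_mb, work_mflop):
    """Assign the task to the server with the smallest estimated time."""
    return min(servers, key=lambda s: estimated_time(s, data_mb, work_mflop))


servers = [
    Server("fast-busy", peak_mflops=1000.0, load_percent=300.0,
           latency_s=0.01, bandwidth_mb_s=10.0),
    Server("slow-idle", peak_mflops=400.0, load_percent=0.0,
           latency_s=0.05, bandwidth_mb_s=5.0),
]
best = pick_server(servers, data_mb=10.0, work_mflop=1000.0)
```

In this example the nominally faster machine loses because of its load, which is why the dynamic system state has to be measured and monitored rather than assumed.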

12
NetSolve Fault Tolerance
  • Inter-server fault tolerance
  • Fault tolerance among NetSolve servers
  • Intra-server fault tolerance
  • Fault tolerance within a NetSolve server

13
NetSolve Fault Tolerance: Inter-server Fault Tolerance
  • Performed by NetSolve agents
  • Basic approach
  • Failure detection, followed by task reallocation
  • Overload detection, followed by task migration
  • Introducing NetSolve storage servers
  • Store checkpoints or any information related to
    fault tolerance (must be platform-independent)
  • No reliance on failed or overloaded server for
    task migration
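
The basic approach (detect a failure, then reallocate the task) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: servers are plain callables, a failure shows up as an exception, and the names `ServerFailure` and `run_with_reallocation` are hypothetical rather than part of NetSolve.

```python
class ServerFailure(Exception):
    """Raised by a stand-in server to simulate a crash or lost connection."""


def run_with_reallocation(task, servers):
    """Try servers in ranked order; on failure, reassign the task to the next one.

    Returns (name of the server that succeeded, result, names of failed servers).
    """
    failed = []
    for name, execute in servers:
        try:
            return name, execute(task), failed
        except ServerFailure:
            failed.append(name)  # failure detected: move on to the next server
    raise RuntimeError(f"all servers failed: {failed}")


def crashing_server(task):
    raise ServerFailure("server unreachable")


def healthy_server(task):
    return sum(task)


server_used, result, failed = run_with_reallocation(
    [1, 2, 3],
    [("s1", crashing_server), ("s2", healthy_server)],
)
```

Here the task restarts from scratch on the second server. Restarting from a checkpoint held on a separate storage server, as the slide proposes, would avoid repeating finished work and would not depend on the failed machine.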

14
NetSolve Fault Tolerance: Intra-server Fault Tolerance
  • Not a new problem
  • Could be invisible to NetSolve
  • Can take advantage of platform-specific features
    for fault tolerance
  • Possible integration with inter-server fault
    tolerance

15
Diskless Checkpointing: Checksums and Reverse Computation
  • Diskless checkpointing eliminates the need for
    stable storage
  • N servers + a checkpointing server
  • At any point, consistent checkpoints taken at N
    servers (stored in memory)
  • A checksum of checkpoints stored at the
    checkpointing server
  • Rollback using reverse computation
  • State recovery using the checksum
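
State recovery from a checksum can be shown with a small Python sketch. It assumes the checksum is a bytewise XOR parity over equal-length in-memory checkpoints, which is one common choice and not necessarily the one used in the work described here. The rollback by reverse computation mentioned above is not modelled.

```python
from functools import reduce


def xor_bytes(x, y):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(a ^ b for a, b in zip(x, y))


def parity(blocks):
    """XOR all blocks together; this plays the role of the checksum."""
    return reduce(xor_bytes, blocks)


# In-memory checkpoints of N = 3 servers (equal length is assumed).
checkpoints = [b"state-A1", b"state-B2", b"state-C3"]

# The checkpointing server stores only the checksum, not the checkpoints.
checksum = parity(checkpoints)

# Server 1 fails and its memory is lost; rebuild its checkpoint from the rest.
lost = 1
survivors = [c for i, c in enumerate(checkpoints) if i != lost]
recovered = parity(survivors + [checksum])
```

XOR-ing the surviving checkpoints with the checksum cancels every term except the lost one, so a single failure among the N servers can be tolerated without any checkpoint ever being written to disk.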

16
Applications
  • MCell with NetSolve
  • Large code, small data
  • Matlab with NetSolve
  • Tradeoffs between parallelism and overhead
  • IPARS with NetSolve
  • ImageVision with NetSolve

17
(No Transcript)

18
Integration with ScaLAPACK

19
Integration with Condor

20
Integration with Ninf

21
Conclusion
  • An interesting infrastructure for sharing
    computational resources
  • Both software and hardware
  • Convenience, performance, and reliability
  • Playground for fault tolerance
  • Both general and specific