Research Issues in Cooperative Computing

1
Research Issues in Cooperative Computing
  • Douglas Thain
  • http://www.cse.nd.edu/~ccl

2
Sharing is Hard!
  • Despite decades of research in distributed
    systems and operating systems, sharing computing
    resources is still technically and socially
    difficult!
  • Most existing systems for sharing require:
      • Kernel-level software.
      • A privileged login.
      • Centralized trust.
      • Loss of control over resources that you own.

3
Cooperative Computing Credo
  • Let's create tools and systems that make it easy
    for users to cooperate (or be selfish) as they
    see fit.
  • Modus operandi:
      • Make tools that are foolproof enough for casual
        use by one or two people in the office.
      • If they really are foolproof, then they will also
        be suitable for deployment in large-scale systems
        such as computational grids.

4
[Diagram: CPUs, disks, an auth server, and secure I/O, annotated with these remarks:]
  • "I need ten more CPUs in order to finish my paper by Friday!"
  • "CSE grads can compute here, but only when I'm not."
  • "May I use your CPUs?"
  • "Is this person a CSE grad?"
  • "My friends in Italy need to access this data."
  • "I'm not root!"
  • "PBs of workstation storage! Can I use this as a cache?"
  • "If I can back up to you, you can back up to me."
5
Storage is a Funny Resource
  • Rebuttal: "Storage is large and practically
    free!"
      • TB -> PB is not free to install or manage.
      • But it comes almost accidentally with CPUs.
      • Aggressive replication (caching) can fill it
        quickly.
  • Storage has unusual properties:
      • Locality: Space needs to be near computation.
      • Non-locality: Redundant copies must be separated.
      • Transfer is very expensive compared to
        reservation.
      • i.e., don't waste an hour transferring unless it
        will succeed!
  • Managing storage is different from managing data.
  • All of this gets worse on the grid.
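
The point that transfer is expensive compared to reservation can be illustrated with a minimal sketch: ask for the space first, and only move data once the reservation succeeds. The names StorageServer, reserve, release, and transfer are hypothetical and are not taken from any real storage system.

```python
from typing import Callable


class OutOfSpace(Exception):
    """Raised when a reservation cannot be satisfied."""


class StorageServer:
    """Hypothetical server that tracks reserved bytes against a fixed capacity."""

    def __init__(self, capacity: int) -> None:
        self.capacity = capacity
        self.reserved = 0

    def reserve(self, nbytes: int) -> None:
        # Cheap check: no data has moved yet.
        if nbytes < 0:
            raise ValueError("nbytes must be non-negative")
        if self.reserved + nbytes > self.capacity:
            raise OutOfSpace(f"need {nbytes}, have {self.capacity - self.reserved}")
        self.reserved += nbytes

    def release(self, nbytes: int) -> None:
        self.reserved -= nbytes


def transfer(server: StorageServer, data: bytes,
             send: Callable[[bytes], None]) -> bool:
    """Reserve space first; start the expensive send only if that succeeds."""
    try:
        server.reserve(len(data))
    except OutOfSpace:
        return False              # fail fast, before any bytes are sent
    try:
        send(data)                # the slow, expensive step
    except Exception:
        server.release(len(data))  # give the space back if the send fails
        raise
    return True
```

With a 10-byte server, an 8-byte transfer succeeds and a following 5-byte transfer is refused without calling send at all.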

6
On the Grid
  • Quick intro to grid computing:
      • The vision: Let us make large-scale computing
        resources as reliable and as accessible as the
        electric power grid or the public water utility.
      • The audience: Scientists with grand challenge
        problems that require unlimited amounts of
        computing power. More computation = better
        results.
      • The reality today: Tie together computing
        clusters and archival storage around the country
        into systems that are (almost) usable by experts.

7
On the Grid
[Diagram: a work queue of jobs.]
8
Grid Computing Experience
  • Ian Foster et al. (102 authors), "The Grid2003
    Production Grid: Principles and Practice,"
    IEEE HPDC 2004.
  • The Grid2003 Project has deployed a multi-virtual
    organization, application-driven grid laboratory
    that has sustained for several months the
    production-level services required by
    ATLAS, CMS, SDSS, LIGO

9
Grid Computing Experience
  • The good news:
      • 27 sites with 2800 CPUs
      • 40985 CPU-days provided over 6 months
      • 10 applications with 1300 simultaneous jobs
  • The bad news:
      • 40-70 percent utilization
      • 30 percent of jobs would fail
      • 90 percent of failures were site problems
      • Most site failures were due to disk space.

10
Storage Matters
  • All of these environments (office, server room,
    grid computing) require storage to be an
    allocable, shareable, accountable resource.
  • We need new tools to accomplish this.

11
What are the Common Problems?
  • Local Autonomy
  • Resource Heterogeneity
  • Complex Access Control
  • Multiple Users
  • Competition for Resources
  • Low Reliability
  • Complex Debugging

12
Vision of Cooperative Storage
  • Make it easy to deploy systems that:
      • Allow sharing of storage space.
      • Respect existing human structures.
      • Provide reasonable space/perf promises.
      • Work easily and transparently without root.
  • Make the non-ideal properties manageable:
      • Limited allocation. (select, renew, migrate)
      • Unreliable networks. (useful fallback modes)
      • Changing configuration. (auto. discovery/config)

13
basic filesystem
14
The Current Situation
[Diagram: five chirp servers.]
15
Demo Time!
16
Research Issues
  • Collective Resource Management: Coordinated CPU-I/O,
    Distributed Debugging
  • Single Resource Management: Space Allocation,
    Dist Access Control
  • Operating Systems Design: Visiting Principals,
    Allocation in FS
[Diagram labels: storage server, storage device, operating system.]
17
Space Allocation
  • Simple implementation:
      • Like quotas, keep a flat lookaside database.
      • Update db on each write, or just periodically.
      • To recover, re-scan entire filesystem.
      • Not scalable to large FS or many allocations.
  • Better implementation:
      • Keep alloc info hierarchically in the FS.
      • To recover, re-scan only the dirty subtrees.
      • A combination of a FS and hierarchical DB.
  • User representation?
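
The "better implementation" above can be sketched in memory as follows. This is only an illustration under stated assumptions: the Node class, write_file, and recover are hypothetical names, the tree lives in memory rather than on disk, and recovery is assumed to always start from the root.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Node:
    """One directory; `files` maps file name -> size in bytes."""
    name: str
    parent: Optional["Node"] = field(default=None, repr=False, compare=False)
    files: Dict[str, int] = field(default_factory=dict)
    children: Dict[str, "Node"] = field(default_factory=dict)
    used: int = 0        # cached space total for this subtree
    dirty: bool = False  # True if `used` may be stale at or below this node

    def child(self, name: str) -> "Node":
        """Return the named subdirectory, creating it if needed."""
        if name not in self.children:
            self.children[name] = Node(name, parent=self)
        return self.children[name]


def write_file(directory: Node, fname: str, size: int) -> None:
    """Record a write and mark the path up to the root dirty (no rescan)."""
    directory.files[fname] = size
    node: Optional[Node] = directory
    # A dirty node already has dirty ancestors, so the walk can stop there.
    while node is not None and not node.dirty:
        node.dirty = True
        node = node.parent


def recover(node: Node, scanned: List[str]) -> int:
    """Recompute `used`, descending only into dirty subtrees."""
    if not node.dirty:
        return node.used                      # trust the cached total
    scanned.append(node.name)                 # record which nodes were rescanned
    node.used = sum(node.files.values()) + sum(
        recover(c, scanned) for c in node.children.values())
    node.dirty = False
    return node.used
```

After a full recovery, a single write under one subdirectory causes the next recovery to visit only the root and that subdirectory; clean siblings are answered from their cached totals.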

18
Distributed Access Control
  • Things I can't do today:
      • Give access rights to any CSE grad student on my
        local (non-AFS) filesystems.
        (Where Dr. Madey makes the list each semester.)
      • Allow members of my conference committee to share
        my storage space in AFS.
        (Where I maintain the membership list.)
      • Give read access to a valuable data repository to
        all faculty at Notre Dame and all members of a
        DHS Biometrics analysis program.
        (Where each list is kept elsewhere in the
        country.)

19
Distributed Access Control
  • What will this require?
      • Separation of ACL services from filesystems.
      • Simple administrative tools.
      • Semantics for dealing with failure.
      • Issues of security and privacy of access lists.
  • Isn't this a solved problem?
      • Not for multiple large-scale organizations.
      • Not for varying degrees of trust and timeliness.
      • (ACLs were still a research issue in SOSP 2003.)
  • The end result:
      • A highly-specialized distributed database. (DNS)
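
One way to picture an ACL service separated from the filesystem, with explicit failure semantics, is a cache of remotely maintained membership lists with a staleness bound. This is a sketch under assumptions, not a description of any existing system: GroupCache, allowed, the fetch callback, and the rights strings are all hypothetical.

```python
import time
from typing import Callable, Dict, Optional, Set, Tuple


class GroupCache:
    """Caches membership lists that are maintained elsewhere."""

    def __init__(self, fetch: Callable[[str], Set[str]], max_age: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.fetch = fetch      # hypothetical remote lookup; may raise OSError
        self.max_age = max_age  # seconds a cached list may still be trusted
        self.clock = clock
        self._cache: Dict[str, Tuple[float, Set[str]]] = {}

    def members(self, group: str) -> Optional[Set[str]]:
        """Return the member set, or None if it is unknown or too stale."""
        now = self.clock()
        try:
            fresh = self.fetch(group)
            self._cache[group] = (now, fresh)
            return fresh
        except OSError:
            cached = self._cache.get(group)
            if cached is not None and now - cached[0] <= self.max_age:
                return cached[1]   # list owner unreachable: use recent copy
            return None


def allowed(cache: GroupCache, acl: Dict[str, str], user: str, right: str) -> bool:
    """`acl` maps group name -> rights string, e.g. {"cse-grads": "rl"}."""
    for group, rights in acl.items():
        if right in rights:
            members = cache.members(group)
            if members is not None and user in members:
                return True
    return False                   # fail closed when nothing can be confirmed
```

The max_age parameter is where "varying degrees of trust and timeliness" would show up: a short bound suits a sensitive repository, a long one suits casual sharing.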

20
Nested Principals
  • How do we represent visiting users?
      • Let visitors use my uid.
      • Let visitors use nobody. (root)
      • Create a new temporary uid. (root)
      • Sandbox user and audit every action. (complex)
  • Simple idea: Let users create sub-principals.
      • root -> root:dthain
      • root:dthain -> root:dthain:afriend
  • The devil is in the details:
      • Semantic issues: superiority, equivalence
      • Implementation issues: AAA, filesystem,
        persistence
      • Philosophical issues: capabilities vs. ACLs
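
The sub-principal idea and the "superiority" question can be sketched with plain string names. This assumes that nested principals are written with a ":" separator and that a principal is superior to everything created beneath it; both are assumptions made for illustration, and the function names are hypothetical.

```python
SEP = ":"  # assumed separator between a principal and its sub-principal


def sub_principal(parent: str, name: str) -> str:
    """Create the name of a sub-principal, e.g. root -> root:dthain."""
    if not name or SEP in name:
        raise ValueError("sub-principal name must be non-empty and contain no separator")
    return parent + SEP + name


def is_superior(a: str, b: str) -> bool:
    """True if `a` is a strict ancestor of `b` in the principal tree."""
    return b.startswith(a + SEP)


def may_act_on(actor: str, owner: str) -> bool:
    """One possible policy: a principal controls itself and its descendants."""
    return actor == owner or is_superior(actor, owner)
```

Under this policy a visitor such as root:dthain:afriend has no authority over root:dthain, while root:dthain can manage anything the visitor creates. Whether that is the right semantics is exactly the open question the slide raises.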

21
Coordinated CPU and I/O
  • We usually think of a cluster as:
      • N CPUs + disks to install the OS on.
      • Use local disks as cache for primary server.
      • Not smart for data-bound applications.
      • (As CPUs get faster, everyone is data bound!)
  • Alternate conception:
      • Cluster = storage device with inline CPUs.
      • Molasses System:
          • Minimize movement of jobs
            and/or the data they consume.
      • Large-scale PIM!
      • Perfect for data exploration.
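
The "storage device with inline CPUs" view suggests placing each job where the least data has to move. A minimal sketch of that placement rule follows; the function names and the idea of describing a job by its input file sizes are assumptions made for illustration, not the design of the Molasses system.

```python
from typing import Dict, Set


def bytes_to_move(inputs: Dict[str, int], held: Set[str]) -> int:
    """Bytes that must be copied if the job runs next to the files in `held`."""
    return sum(size for name, size in inputs.items() if name not in held)


def place(inputs: Dict[str, int], servers: Dict[str, Set[str]]) -> str:
    """Pick the server that minimizes data movement (ties go to the lowest name)."""
    return min(sorted(servers), key=lambda s: bytes_to_move(inputs, servers[s]))
```

For a job reading files a (10 bytes), b (500 bytes), and c (40 bytes), a server already holding b and c wins because only 10 bytes would need to move.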

[Diagram: three CPUs and three jobs.]
22
Coordinated CPU and I/O
[Diagram: three CPUs marked "?" and "!", three storage servers, and three data items.]
23
Distributed Debugging
[Diagram: debugger, kerberos, batch system, auth gateway, workload manager, license manager, archival host, CPUs, storage servers, a job, and many log files.]
24
Distributed Debugging
  • Big challenges!
      • Language issues: storing and combining logs.
      • Ordering: How to reassemble events?
      • Completeness: Gaps, losses, detail.
      • Systems: Distributed data collection.
  • But, could be a big win:
      • "A crashes whenever X gets its creds from Y."
      • "Please try again: I have turned up the detail on
        host B."
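
The ordering and completeness challenges can be made concrete with a small sketch that merges per-host logs into one timeline and reports missing records. It assumes each host's log is already sorted, that timestamps are comparable across hosts (which real clocks do not guarantee), and that every record carries a per-host sequence number; the Event type and function names are hypothetical.

```python
import heapq
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    time: float  # timestamp, assumed comparable across hosts
    host: str
    seq: int     # per-host sequence number, assumed to increase by 1
    msg: str


def merge_logs(*logs: Iterable[Event]) -> List[Event]:
    """Merge already-sorted per-host logs into a single timeline."""
    return list(heapq.merge(*logs, key=lambda e: (e.time, e.host, e.seq)))


def find_gaps(log: List[Event]) -> List[int]:
    """Return the sequence numbers missing from one host's log."""
    seen = {e.seq for e in log}
    if not seen:
        return []
    return [n for n in range(min(seen), max(seen) + 1) if n not in seen]
```

A merged timeline is what would let a tool notice a pattern such as a crash on one host always following a credential fetch on another, and the gap report indicates where the detail level would need to be turned up before trying again.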

25
Research Issues
  • Collective Resource Management: Coordinated CPU-I/O,
    Distributed Debugging
  • Single Resource Management: Space Allocation,
    Dist Access Control
  • Operating Systems Design: Visiting Principals,
    Allocation in FS
[Diagram labels: storage server, storage device, operating system.]
26
Motto
You must build it and use it in order to
understand it!
27
For more information
  • Software, systems, papers, etc.:
      • The Cooperative Computing Lab
      • http://www.cse.nd.edu/~ccl
  • Or stop by to chat:
      • Douglas Thain
      • 356-D Fitzpatrick
      • dthain@cse.nd.edu