Title: NOW and Beyond
1NOW and Beyond
- Workshop on Clusters and Computational Grids for Scientific Computing
- David E. Culler
- Computer Science Division
- Univ. of California, Berkeley
- http://now.cs.berkeley.edu/
2NOW Project Goals
- Make a fundamental change in how we design and construct large-scale systems
- market reality
- 50%/year performance growth => cannot allow 1-2 year engineering lag
- technological opportunity
- single-chip "Killer Switch" => fast, scalable communication
- Highly integrated building-wide system
- Explore novel system design concepts in this new cluster paradigm
3Berkeley NOW
- 100 Sun UltraSparcs
- 200 disks
- Myrinet SAN
- 160 MB/s
- Fast comm.
- AM, MPI, ...
- Ether/ATM switched external net
- Global OS
- Self Config
4Landmarks
- Top 500 Linpack Performance List
- MPI, NPB performance on par with MPPs
- RSA 40-bit Key challenge
- World Leading External Sort
- Inktomi search engine
- NPACI resource site
5Taking Stock
- Surprising successes
- virtual networks
- implicit co-scheduling
- reactive IO
- service-based applications
- automatic network mapping
- Surprising unsuccesses
- global system layer
- xFS file system
- New directions for Millennium
- Paranoid construction
- Computational Economy
- Smart Clients
6Fast Communication
- Fast communication on clusters is obtained through direct access to the network, as on MPPs
- Challenge is to make this general purpose
- system implementation should not dictate how it can be used
7Virtual Networks
- Endpoint abstracts the notion of "attached to the network"
- Virtual network is a collection of endpoints that can name each other.
- Many processes on a node can each have many endpoints, each with its own protection domain.
8How are they managed?
- How do you get direct hardware access for performance with a large space of logical resources?
- Just like virtual memory
- active portion of large logical space is bound to physical resources
[Figure: Processor, Host Memory (Process 1 ... Process n), Network Interface with NIC Mem]
9Network Interface Support
- NIC has endpoint frames
- Services active endpoints
- Signals misses to driver
- using a system endpoint
[Figure: NIC endpoint frames, Frame 0 through Frame 7, each with Transmit and Receive; EndPoint Miss]
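The virtual-memory analogy on the two slides above can be sketched as a small cache of endpoint frames. This is a toy model only, not the NOW firmware: the class name `EndpointFrames`, the LRU replacement policy, and the miss counter are assumptions made for illustration; the slides say only that the NIC holds active endpoints in frames and signals misses to the driver.

```python
from collections import OrderedDict


class EndpointFrames:
    """Toy model of a NIC with a fixed number of endpoint frames (hypothetical)."""

    def __init__(self, num_frames=8):
        self.num_frames = num_frames
        self.resident = OrderedDict()  # endpoint id -> frame index, in LRU order
        self.misses = 0

    def access(self, endpoint_id):
        """Return the frame holding endpoint_id, loading it on a miss."""
        if endpoint_id in self.resident:
            self.resident.move_to_end(endpoint_id)  # mark as recently used
            return self.resident[endpoint_id]
        # Miss: a real NIC would signal the driver; this sketch only counts it.
        self.misses += 1
        if len(self.resident) < self.num_frames:
            frame = len(self.resident)  # a free frame is still available
        else:
            # Assumed policy: evict the least recently used endpoint.
            _, frame = self.resident.popitem(last=False)
        self.resident[endpoint_id] = frame
        return frame
```

As with page frames, a process can own many more endpoints than there are frames; only the active ones occupy NIC memory at any moment.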
10Communication under Load
=> Use of networking resources adapts to demand.
11Implicit Coscheduling
- Problem: parallel programs designed to run in parallel => huge slowdowns with local scheduling
- gang scheduling is rigid, fault prone, and complex
- Coordinate schedulers implicitly using the communication in the program
- very easy to build, robust to component failures
- inherently service on-demand, scalable
- Local service component can evolve.
12Why it works
- Infer non-local state from local observations
- React to maintain coordination
- observation => implication => action
- fast response => partner scheduled => spin
- delayed response => partner not scheduled => block
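The observation/action rule above amounts to a two-phase wait: spin briefly in case the partner is running, then block and give up the processor. The sketch below is one way to express that with a Python `threading.Event` standing in for a message reply; the function name `wait_for_reply` and the `spin_limit_s` and `block_timeout_s` values are illustrative assumptions, not figures from the talk.

```python
import threading
import time


def wait_for_reply(reply_event, spin_limit_s=0.001, block_timeout_s=1.0):
    """Spin briefly for a reply, then block if the partner seems descheduled.

    Returns ("spin", True) if the reply arrived while spinning, otherwise
    ("block", arrived) after a blocking wait.
    """
    deadline = time.monotonic() + spin_limit_s
    while True:
        if reply_event.is_set():
            # Fast response: infer the partner is scheduled, so stay on the CPU.
            return ("spin", True)
        if time.monotonic() >= deadline:
            break
    # Delayed response: infer the partner is not scheduled, so yield the CPU.
    arrived = reply_event.wait(timeout=block_timeout_s)
    return ("block", arrived)


reply = threading.Event()
reply.set()  # the partner has already answered
print(wait_for_reply(reply))
```

Each node makes this decision from purely local observations, which is why no global scheduler is needed.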
13Example
- Range of granularity and load imbalance
- spin wait: 10x slowdown
14I/O Lessons from NOW sort
- Complete system on every node is a powerful basis for data intensive computing
- complete disk sub-system
- independent file systems
- MMAP not read, MADVISE
- full OS => threads
- Remote I/O (with fast comm.) provides same bandwidth as local I/O.
- I/O performance is very temperamental
- variations in disk speeds
- variations within a disk
- variations in processing, interrupts, messaging, ...
15Reactive I/O
- Loosen data semantics
- ex: unordered bag of records
- Build flows from producers (e.g., disks) to consumers (e.g., summation)
- Flow data to where it can be consumed
[Figure panels: Adaptive Parallel Aggregation; Static Parallel Aggregation]
16Performance Scaling
- Allows more data to go to faster consumer
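The difference between static and adaptive aggregation can be illustrated with a small deterministic simulation. This is a sketch under stated assumptions, not the NOW implementation: the function names and the per-record `costs` are hypothetical, and "adaptive" is modeled simply as handing each record from an unordered bag to whichever consumer is free first.

```python
import heapq


def static_finish_time(num_records, costs):
    """Split records evenly up front; the slowest consumer sets the finish time."""
    share, extra = divmod(num_records, len(costs))
    return max((share + (1 if i < extra else 0)) * c for i, c in enumerate(costs))


def adaptive_finish_time(num_records, costs):
    """Give each record to whichever consumer becomes free first."""
    free_at = [(0.0, i) for i in range(len(costs))]  # (time free, consumer id)
    heapq.heapify(free_at)
    counts = [0] * len(costs)
    finish = 0.0
    for _ in range(num_records):
        t, i = heapq.heappop(free_at)
        done = t + costs[i]  # consumer i spends costs[i] time units per record
        counts[i] += 1
        finish = max(finish, done)
        heapq.heappush(free_at, (done, i))
    return finish, counts


# One consumer takes 1 time unit per record, the other takes 2.
print(static_finish_time(12, [1.0, 2.0]))
print(adaptive_finish_time(12, [1.0, 2.0]))
```

Because the records form an unordered bag, nothing forces an even split, so the faster consumer ends up absorbing more of the data and the run finishes sooner than with the static split.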
17Service Based Applications
[Figure: TranSend transcoding proxy: service request, front-end service threads, user profile database, manager, caches, physical processor]
- Application provides services to clients
- Grows/Shrinks according to demand, availability,
and faults
18On the other hand
- Glunix
- offered much that was not available elsewhere
- interactive use, load balancing, transparency (partial)
- straightforward master-slaves architecture
- millions of jobs served, reasonable scalability, flexible partitioning
- crash-prone, inscrutable, unaware
- xFS
- very sophisticated co-operative caching, network RAID
- integrated at vnode layer
- never robust enough for real use
- Both are hard, outstanding problems
19Lessons
- Strength of clusters comes from
- complete, independent components
- incremental scalability (up and down)
- nodal isolation
- Performance heterogeneity and change are fundamental
- Subsystems and applications need to be reactive and self-tuning
- Local intelligence + simple, flexible composition
20Millennium
- Campus-wide cluster of clusters
- PC based (Solaris/x86 and NT)
- Distributed ownership and control
- Computational science and internet systems testbed
21Paranoid Construction
- What must work for RSH, dCOM, RMI, read, ...?
- A page of C to safely read a line from a socket!
- => carefully controlled set of cluster system ops
- => non-blocking with timeout and full error checking
- even if need a watcher thread
- => optimistic with fail-over of implementation
- => global capability at physical level
- => indirection used for transparency must track fault envelope, not just provide mapping
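To make the "safely read a line from a socket" point concrete, here is a minimal sketch in Python rather than the page of C the slide mentions. It is an illustration of the idea (an overall deadline, explicit handling of EOF, and a bounded line length), not the Millennium code; the function name `read_line` and the default `timeout_s` and `max_len` values are assumptions.

```python
import socket
import time


def read_line(sock, timeout_s=5.0, max_len=4096):
    """Read one newline-terminated line, or raise on timeout, EOF, or overflow."""
    deadline = time.monotonic() + timeout_s
    buf = bytearray()
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            raise TimeoutError("no complete line before the deadline")
        sock.settimeout(remaining)  # bound every recv by the overall deadline
        try:
            chunk = sock.recv(1)  # one byte at a time: slow, but never over-reads
        except socket.timeout:
            raise TimeoutError("no complete line before the deadline")
        if not chunk:
            raise ConnectionError("peer closed the connection mid-line")
        if chunk == b"\n":
            return bytes(buf)
        buf += chunk
        if len(buf) > max_len:
            raise ValueError("line exceeds max_len")
```

Every failure mode surfaces as a distinct exception, so a caller can decide whether to retry, fail over, or give up instead of hanging on a dead peer.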
22Computational Economy Approach
- System has a supply of various resources
- Demand for resources revealed in price
- distinct from the cost of acquiring the resources
- User has unique assessment of value
- Client agent negotiates for system resources on user's behalf
- submits requests, receives bids or participates in auctions
- selects resources of highest value at least cost
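One plausible reading of "highest value at least cost" is that the client agent picks the offer with the largest surplus, that is, the user's own value for a resource minus the quoted price. The sketch below assumes that reading; the `Bid` type, the `choose_bid` function, and the `value_of` callback are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Bid:
    resource: str   # name of the offered resource (hypothetical)
    price: float    # price quoted by the system for that resource


def choose_bid(bids, value_of):
    """Pick the bid with the largest surplus (user's value minus price)."""
    best = None
    best_surplus = 0.0
    for bid in bids:
        surplus = value_of(bid.resource) - bid.price
        if surplus > best_surplus:
            best, best_surplus = bid, surplus
    return best  # None means no offer is worth its price to this user


offers = [Bid("fast-node", 8.0), Bid("slow-node", 2.0)]
my_value = {"fast-node": 9.0, "slow-node": 5.0}
print(choose_bid(offers, my_value.get))
```

The valuation lives entirely in the client, which matches the point that each user has a unique assessment of value and the system sees only the resulting demand.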
23Advantages of the Approach
- Decentralized load balancing
- according to user's perception of importance, not system's
- adapts to system and workload changes
- Creates incentive to adopt efficient modes of use
- maintain resources in usable form
- avoid excessive usage when needed by others
- exploit under-utilized resources
- maximize flexibility (e.g., migratable, restartable applications)
- Establishes user-to-user feedback on resource usage
- basis for exchange rate across resources
- Powerful framework for system design
- Natural for client to be watchful, proactive, and wary
- Generalizes from resources to services
- Rich body of theory ready for application
24Resource Allocation
[Figure: a stream of (incomplete) client requests and a stream of (partial, delayed, or incomplete) resource status information feed the Allocator]
- Traditional approach allocates requests to resources to optimize some system utility function
- e.g., put work on least loaded, most free mem, short queue, ...
- Economic approach views each user as having a distinct utility function
- e.g., can exchange resource and have both happy!
25Pricing and all that
- What's the value of a CPU-minute, a MB-sec, a GB-day?
- Many iterative market schemes
- raise price till load drops
- Auctions avoid setting a price
- Vickrey (second price sealed bid) will cause resources to go to where they are most valued at the lowest price
- In self-interest to reveal true utility function!
- Small problem: auctions are awkward for most real allocation problems
- Big problem: people (and their surrogates) don't know what value to place on computation and storage!
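The Vickrey rule mentioned above is short enough to state in code: the highest sealed bid wins, but the winner pays the second-highest bid. This is a minimal sketch of the textbook single-item auction; the function name `vickrey_auction`, the bidder names, and the tie-breaking by insertion order are assumptions for illustration.

```python
def vickrey_auction(bids):
    """Second-price sealed-bid auction for a single item.

    bids maps bidder name -> sealed bid. Returns (winner, price paid),
    where the price is the second-highest bid.
    """
    if len(bids) < 2:
        raise ValueError("need at least two bidders")
    # Stable sort: among equal bids, the bidder listed first ranks higher.
    ranked = sorted(bids.items(), key=lambda kv: kv[1], reverse=True)
    winner = ranked[0][0]
    price = ranked[1][1]
    return winner, price


print(vickrey_auction({"alice": 10.0, "bob": 7.0, "carol": 4.0}))
```

Because the price paid does not depend on the winner's own bid, shading a bid below one's true value can only lose the item without lowering the price, which is why revealing the true value is in each bidder's self-interest.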
26Smart Clients
- Adopt the NT "everything is two-tier, at least"
- UI stays on the desktop and interacts with computation in the cluster of clusters via distributed objects
- Single-system image provided by wrapper
- Client can provide complete functionality
- resource discovery, load balancing
- request remote execution service
- Flexible applications will monitor availability and adapt.
- Higher level services: 3-tier optimization
- directory service, membership, parallel startup
27Everything is a service
- Load-balancing
- Brokering
- Replication
- Directories
- => they need to be cost-effective or client will fall back to self support
- if they are cost-effective, competitors might arise
- Useful applications should be packaged as services
- their value may be greater than the cost of resources consumed
28Conclusions
- We've got the building blocks for very interesting clustered systems
- fast communication, authentication, directories, distributed object models
- Transparency and uniform access are convenient, but...
- It is time to focus on exploiting the new characteristics of these systems in novel ways.
- We need to get real serious about availability.
- Agility (wary, reactive, adaptive) is fundamental.
- Gronky F77 MPI and no IO codes will seriously hold us back
- Need to provide a better framework for cluster applications