The Kangaroo Approach to Data Movement on the Grid presentation

Title: The Kangaroo Approach to Data Movement on the Grid

1
The Kangaroo Approach to Data Movement on the Grid

Jim Basney, Miron Livny, Se-Chang Son, and
Douglas Thain
Condor Project
University of Wisconsin

2
Outline

A Vision of Grid Data Movement
Architecture and Example
Semantics and Design
Necessary Mechanisms
The First Hop
What Next?

3
An Old Problem

Run programs that make use of CPUs and storage in
separate locations.
There are basic, working solutions to this
problem, but they do not address many of its
subtleties.

4
The Problem is Not Trivial

Distributed systems are subject to failures that
most applications are not designed to handle.
Oops, a router died.
Oops, the switch is in half-duplex mode.
Oops, I forgot to start one server.
Oops, I forgot to update my AFS tokens.
We want to avoid wasting resources (cpu, network,
disk) that charge for tenancy.
Co-allocation is a common solution, but external
factors can get in the way.
Co-allocation in and of itself is wasteful!
Can't we overlap I/O and cpu?

5
Example
[Figure: compute machines linked to a workstation. Link labels: 1000 Mb/s at 1 ms, 100 Mb/s at 1 ms, 10 Mb/s at 100 ms, 240 Mb/s at 5 ms.]

6
What's in Our Toolbox?

Partial File Transfer
Condor Remote I/O
Storage Resource Broker (SRB)
(NFS?)
Whole file transfer
Globus GASS
FTP, GridFTP
(AFS?)
It's not just what you move, but when you move it.

7
A Taxonomy of Existing Systems

Data Movement Systems
Whole File
Get whole file at open, and write out at
close. Examples: Globus GASS in app, AFS.

8
Offline I/O

Benefits
Makes good throughput by pipelining.
Co-allocation of cpu and network not needed.
Easy to schedule.
Drawbacks
Must know needed files in advance.
Co-use of cpu and network not possible.
Must pull/push whole file, even when only partial
is needed.

9
Online I/O

Benefits
Need not know I/O requirements up front. (Some
programs compute file names.)
Gives user incremental results.
(Partial) Only moves what is actually used.
Drawbacks
Very difficult to schedule small or un-announced
operations.
(Partial) Stop-and-wait does not scale to high
latency networks.
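
To put rough numbers on the stop-and-wait drawback, here is a small sketch. It assumes each request moves one block and then waits a full round trip before the next one starts. The 64 KB block size and the function name are illustrative assumptions; the link figures are borrowed from the Example slide, with the listed delay treated as the round-trip time.

```python
def stop_and_wait_mbps(block_bytes, bandwidth_mbps, rtt_s):
    """Effective throughput (Mb/s) when each block waits one round trip."""
    bits = block_bytes * 8
    transfer_s = bits / (bandwidth_mbps * 1e6)   # time spent on the wire
    return bits / (transfer_s + rtt_s) / 1e6

# Hypothetical 64 KB blocks over two of the links from the Example slide.
lan = stop_and_wait_mbps(64 * 1024, 100, 0.001)   # 100 Mb/s, 1 ms
wan = stop_and_wait_mbps(64 * 1024, 10, 0.100)    # 10 Mb/s, 100 ms
print(f"LAN: {lan:.1f} Mb/s, WAN: {wan:.1f} Mb/s")
```

Under these assumptions the short link stays close to its nominal rate, while the 100 ms link delivers only about a third of its 10 Mb/s.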

10
Problems with Both

Error handling
GASS, AFS - close fails?!?
Condor - disconnect causes rollback
The longer the distance, the worse the
performance
Drop rate is multiplied with each additional
link.
Latency increases with each link.
TCP throughput is limited to the slowest link.
Resource allocation
Network allocation is done end-to-end.
CPU and I/O rarely overlap.
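
The distance effects listed above can be sketched as a small calculation. Bandwidth and latency come from the Example slide; treating the four links as a single path and the 1% drop rate per link are assumptions made only for illustration.

```python
from math import prod

def path_metrics(links):
    """links: list of (bandwidth_mbps, latency_ms, drop_rate), one per hop."""
    delivery = prod(1.0 - drop for _, _, drop in links)   # must survive every hop
    latency = sum(lat for _, lat, _ in links)             # delays add up
    bottleneck = min(bw for bw, _, _ in links)            # slowest link limits TCP
    return delivery, latency, bottleneck

# The drop rate of 0.01 per link is a made-up figure.
links = [(1000, 1, 0.01), (100, 1, 0.01), (10, 100, 0.01), (240, 5, 0.01)]
delivery, latency, bottleneck = path_metrics(links)
print(f"delivery={delivery:.3f} latency={latency} ms bottleneck={bottleneck} Mb/s")
```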

11
Our Vision

A no-futz wide-area data movement system that
provides end-to-end reliability, maximizes
throughput, and adapts to local conditions and
policies.
Basic idea
Add buffers.
Add a process to oversee.

12
Our Vision
[Figure: compute machines and a home machine, with three RAM buffers along the path. Link labels: 1000 Mb/s at 1 ms, 100 Mb/s at 1 ms, 10 Mb/s at 100 ms, 300 Mb/s at 5 ms.]

13
Our Vision: A Grid Data Movement System
[Figure: a network of nodes, each labeled K.]

14
Our Vision

Requirements
Must be fire-and-forget. Relieve the
application of error handling! Robust with respect
to machine and software crashes. (No-futz)
Must provide incremental output results.
Hide latency from applications by overlapping I/O
and cpu.
Maximize use of resources (cpu, network, disk)
when available, and evacuate same when required.

15
Our Vision

Concessions
No inter-process consistency needed.
Increased latency of actual data movement is
acceptable.

16
The First Hop

A working test bed that validates the core
architecture.
Supports applications using standard POSIX
operations.
Concentrate on write-behind because it doesn't
require speculation.
Leave room in the architecture to experiment with
read-ahead.
Preview of results
Small scale, overlapping is slower.
Large scale, overlapping is faster.

17
Outline

A Vision of Grid Data Movement
Architecture and Example
Necessary Mechanisms
Semantics and Design
The First Hop
What Next?

18
Architecture

Layers
Application
Adaptation
Consistency
Transport
Example

19
Architecture
[Figure: layer diagram. Labels: Application; File System (open, read, write, close, fsync); Adaptation (open, read, write, close, fsync on one side; get, put, push, abort on the other); two Consistency boxes exchanging put and ack; three Transport boxes passing put.]

20
Transport Layer

Interface
Send message, query route, query status
Semantics
Ordering - None (or worse!)
Reliability - Likely, but not guaranteed.
Duplication - Unlikely, but possible.
Performance
Uses all available resources (net, mem, disk) to
maximize throughput.
Subject to local conditions (traffic, failures)
and policies (priority, bw limits)

21
Transport Layer
If output is blocked, then save input to disk
until it is full.
When output is ready again, read from disk,
memory, or input?
The freedom to reorder transported blocks may
allow us to improve throughput.
[Figure: a Transport node with In and Out links. Labels: 1 Gb/s, 1 Gb/s, RAM, 300 Mb/s.]
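
A toy sketch of the buffering behavior described on this slide, written in Python. The class and method names are hypothetical, the "disk" is just a second in-memory queue standing in for a spool directory, and the policy of draining RAM before disk is one possible answer to the slide's open question, not a claim about what Kangaroo does.

```python
from collections import deque

class SpillBuffer:
    """Toy transport buffer: RAM first, then disk, then refuse input."""

    def __init__(self, ram_slots, disk_slots):
        self.ram = deque()
        self.disk = deque()          # stands in for a spool directory
        self.ram_slots = ram_slots
        self.disk_slots = disk_slots

    def accept(self, block):
        """Store an incoming block; return False when both tiers are full."""
        if len(self.ram) < self.ram_slots:
            self.ram.append(block)
        elif len(self.disk) < self.disk_slots:
            self.disk.append(block)
        else:
            return False             # back-pressure on the input
        return True

    def next_to_send(self):
        """Prefer RAM (cheapest to read); fall back to disk."""
        if self.ram:
            return self.ram.popleft()
        if self.disk:
            return self.disk.popleft()
        return None

buf = SpillBuffer(ram_slots=2, disk_slots=2)
accepted = [buf.accept(b) for b in ["b0", "b1", "b2", "b3", "b4"]]
first = buf.next_to_send()           # output becomes ready again
buf.accept("b5")                     # new input lands in the freed RAM slot
rest = [buf.next_to_send() for _ in range(4)]
print(accepted, first, rest)
```

With this policy the late block b5 leaves before the spooled blocks b2 and b3, which is the kind of reordering the transport semantics permit.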
22
Consistency Layer

Interface
Get block, put block, sync file, abort file
Semantics
Ordering - Order preserving or not?
Reliability - Detects success
Duplication - Delivers at most once
Performance
Must cache dirty blocks until delivered
Might cache clean blocks
Might speculatively read clean blocks

23
Consistency Layer
Receiver: Keeps records to enforce ordering and
suppress duplicates.
Sender: Keeps records to detect success, cache
writes.
[Figure: two Consistency endpoints connected through three Transport nodes.]

24
Adaptation Layer

Converts POSIX operations into Kangaroo
operations
Open:
With O_CREAT, always succeeds.
Otherwise, checks for existence with a get.
Read: kangaroo get
Write: kangaroo put
Close: NOP
Fsync: kangaroo sync
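
The mapping above can be sketched in Python as follows. FakeKangaroo is an in-memory stand-in invented for this example, and the get/put/sync signatures (path plus block offset) are assumptions; the slides name the operations but not their arguments.

```python
import os

class FakeKangaroo:
    """In-memory stand-in for a local Kangaroo server (hypothetical API)."""
    def __init__(self):
        self.blocks = {}              # (path, offset) -> bytes
        self.syncs = 0
    def get(self, path, offset):
        return self.blocks.get((path, offset))
    def put(self, path, offset, data):
        self.blocks[(path, offset)] = data
    def sync(self, path):
        self.syncs += 1

class Adaptation:
    """Maps POSIX-style calls onto get/put/sync as listed on the slide."""
    def __init__(self, server):
        self.server = server
    def open(self, path, flags):
        if flags & os.O_CREAT:
            return path               # with O_CREAT, always succeeds
        if self.server.get(path, 0) is None:
            raise FileNotFoundError(path)   # existence check via a get
        return path
    def read(self, handle, offset):
        return self.server.get(handle, offset)
    def write(self, handle, offset, data):
        self.server.put(handle, offset, data)
    def close(self, handle):
        pass                          # close is a no-op
    def fsync(self, handle):
        self.server.sync(handle)

k = FakeKangaroo()
fs = Adaptation(k)
h = fs.open("/tmp/out", os.O_CREAT | os.O_WRONLY)
fs.write(h, 0, b"hello")
fs.fsync(h)
fs.close(h)
data = fs.read(fs.open("/tmp/out", os.O_RDONLY), 0)
print(data)
```

Because close does nothing, an application that never calls fsync gets no delivery guarantee from this layer, which matches the fire-and-forget intent.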

25
Example
[Figure: call path through Application, File System, Adaptation, two Consistency endpoints, and three Transport nodes. Legend: blocking procedure call; non-blocking message.]

26
Outline

A Vision of Grid Data Movement
Architecture and Example
Semantics and Design
Necessary Mechanisms
The First Hop
What Next?

27
Semantics and Design

A data movement system is a bridge between file
systems.
It addresses many of the same issues as file
systems
Consistency
Committal
Ordering
Replication

28
Consistency

Single Node
A put/get blocks until the local server has
atomically accepted it.
Multiple processes that are externally
synchronized will see a consistent view.
Multiple Nodes
No guarantees unless you use an explicit sync.
This is reasonable in a Grid environment, because
most users make use of a wide-area scheduler to
partition jobs and data.

29
Committal

Possible meanings of commit
Force this data to the safest medium available.
Make these changes visible to others.
Make this data safe from a typical crash.
Possible implementations in Kangaroo
Push all the way to target, and force to disk
(tape?)
Push to the target server.
Push to the nearest disk.

30
Committal

Safest choice is to implement the most
conservative -- push all the way to the server,
and force it to disk there.
Some applications may want the more relaxed
meanings.
POSIX only provides one interface: fsync().
Easy solution: implement all three, and provide a
flexible binding in the Adaptation layer.
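
One way such a binding could look, as a sketch only. The enum, the function names, and the pairing of each meaning with an implementation (taken from the order in which the previous slide lists them) are assumptions; the actions return a description instead of moving data.

```python
from enum import Enum

class Commit(Enum):
    NEAREST_DISK = 1     # safe from a typical crash
    TARGET_SERVER = 2    # visible to others
    TARGET_DISK = 3      # safest medium available

def make_fsync(binding, actions):
    """Return an fsync() bound to one of the three commit meanings."""
    def fsync(path):
        return actions[binding](path)
    return fsync

# Hypothetical actions; a real system would push data here.
actions = {
    Commit.NEAREST_DISK: lambda p: f"push {p} to nearest disk",
    Commit.TARGET_SERVER: lambda p: f"push {p} to target server",
    Commit.TARGET_DISK: lambda p: f"push {p} to target and force to disk",
}

fsync = make_fsync(Commit.TARGET_DISK, actions)   # conservative default
relaxed = make_fsync(Commit.NEAREST_DISK, actions)
print(fsync("/tmp/out"))
```

The application keeps calling a single fsync(); only the binding chosen at setup time changes how far the data is pushed.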

31
Ordering

Does the system commit operations in the same
order they were sent?
Relaxed -- no ordering
Satisfies the large majority of apps that do not
overlap writes.
Interesting case of output log files.
Need to wait max TTL before re-using an output
file name
Strict -- exact ordering, enforced at receiver
Increases queue lengths everywhere.
Doesn't burden user with determining if
application is safe to relax.

32
Strict Ordering Algorithm

Much like TCP
Sender keeps copies of data blocks until they are
acknowledged.
Receiver sends cumulative acks and commits
unbroken sequences.

33
Strict Ordering Algorithm

But some differences from TCP
No connection semantics.
Block ID is (birthday,sequence).
Receiver keeps on disk the last acked ID of all
senders it has ever talked to.
If sender reboots:
Compute the next ID from blocks on disk.
If none, reset b to current time, s to 0.
If receiver reboots:
Last received ID of all senders is on disk.
Garbage problem: fix with a long receiver timeout;
a reset message causes the sender to start over.
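
A minimal sketch of the receiver side of this algorithm, under stated assumptions: state lives in memory here (the slide keeps it on disk), a sequence of -1 in the ack means "nothing committed yet", and a larger birthday is taken to mean the sender restarted. The class and method names are invented for the example, and timeouts and reset messages are left out.

```python
class Receiver:
    """Commits blocks per sender in sequence order, at most once."""

    def __init__(self):
        self.last_acked = {}    # sender -> (birthday, sequence)
        self.pending = {}       # sender -> {sequence: data} held out of order
        self.committed = []

    def receive(self, sender, birthday, seq, data):
        """Handle one block and return the cumulative ack for this sender."""
        last_b, last_s = self.last_acked.get(sender, (birthday, -1))
        if birthday > last_b:               # sender restarted with a new ID
            last_b, last_s = birthday, -1
            self.pending[sender] = {}
        if birthday == last_b and seq > last_s:
            self.pending.setdefault(sender, {})[seq] = data
        # Commit the unbroken run that starts right after the last ack.
        held = self.pending.get(sender, {})
        while last_s + 1 in held:
            last_s += 1
            self.committed.append(held.pop(last_s))
        self.last_acked[sender] = (last_b, last_s)
        return (last_b, last_s)

r = Receiver()
acks = [
    r.receive("a", 100, 1, "B1"),   # arrives early, held back
    r.receive("a", 100, 0, "B0"),   # fills the gap, both commit
    r.receive("a", 100, 1, "B1"),   # duplicate, suppressed
    r.receive("a", 200, 0, "C0"),   # sender restarted with a new birthday
]
print(acks, r.committed)
```

The sender side would keep each block until the cumulative ack covers its ID, as on the previous slide.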

34
Replication Issues

We would like to delete data stored at the sender
ASAP, but
Do I Trust this Disk?
Buffer Storage - Could disappear at any time.
Reliable Storage - No deliberate destruction.
Reliability is not everything
If delivery is highly likely and recomputation is
relatively cheap, then losing data is acceptable
but only if delivery failure is detectable!
Reliability = More copies.
User should be able to configure a range from
most reliable to fewest copies.

35
Replication Issues

End-to-End Argument
Regardless of whatever duplication is done
internally for performance or reliability, only
the end points can be responsible for ensuring
(or detecting) correct delivery.
So, the sender must retain a record of what was
sent, even if it does not retain the actual data.

36
Replication Techniques

Pass the Buck
Hold the Phone
Don't Trust Strangers

37
Pass the Buck

Delete the local copy after a one-hop ack.
Requires atomic accept and sync. (Similar to
email)

[Figure: chain of four nodes labeled K, with marker R.]

38
Hold the Phone

Sender keeps a copy of local data until the
end-to-end ack is received. Midway hops need not
immediately flush to disk.

[Figure: chain of four nodes labeled K, with markers D and R.]

39
Don't Trust Strangers

If the sender determines the receiver to be
reliable, then delete, otherwise hold.

[Figure: chain of four nodes labeled K, with markers R and D.]

40
Replication Comparison

Pass the Buck
Evacuates source ASAP. One copy of data.
Dirty reads must hop through all nodes.
No retry of failures. (Success still likely.)
Hold the Phone
Evacuates source more slowly. Two copies.
Dirty reads always satisfied at source.
Sender can retry failures.
Don't Trust Strangers
Evacuates source like PTB, but still 2 copies.
Dirty reads hop.
Retries done midway.
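
The three techniques differ mainly in when the sender may drop its local copy. The rule can be sketched as one function; the policy strings and parameter names are made up for the example, and the reliability judgment is passed in as a plain boolean since the slides do not say how it is made.

```python
def may_delete_local_copy(policy, one_hop_ack, end_to_end_ack, next_hop_reliable):
    """Decide whether the sender may drop its copy of a block."""
    if policy == "pass_the_buck":
        return one_hop_ack
    if policy == "hold_the_phone":
        return end_to_end_ack
    if policy == "dont_trust_strangers":
        # Hand off to reliable receivers, otherwise hold until the end.
        return (one_hop_ack and next_hop_reliable) or end_to_end_ack
    raise ValueError(f"unknown policy: {policy}")

# State: the next hop has acked, the final target has not.
state = dict(one_hop_ack=True, end_to_end_ack=False)
ptb = may_delete_local_copy("pass_the_buck", next_hop_reliable=False, **state)
htp = may_delete_local_copy("hold_the_phone", next_hop_reliable=False, **state)
dts_stranger = may_delete_local_copy("dont_trust_strangers", next_hop_reliable=False, **state)
dts_trusted = may_delete_local_copy("dont_trust_strangers", next_hop_reliable=True, **state)
print(ptb, htp, dts_stranger, dts_trusted)
```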

41
Outline

A Vision of Grid Data Movement
Architecture and Example
Necessary Mechanisms
Semantics and Design
The First Hop
What Next?

42
Necessary Mechanisms

Adaptation Layer
Needs a tool for trapping and rerouting an
application's I/O calls without special
privileges: Bypass
Transport Layer
Needs a tool for detecting network conditions and
enforcing policies: Cedar

43
Bypass

General-purpose tool for trapping and redirecting
standard library procedures.
Trap all I/O operations. Those involving
Kangaroo are sent to Adaptation layer.
Otherwise, execute without modification.
Can be applied at run-time to any
dynamically-linked program:
vi kangaroo://home.cs.wisc.edu/tmp/file
grep thain gsiftp://ftp.cs.wisc.edu/etc/passwd
gcc http://www/example.c -o kangaroo://home/output

44
Cedar

Standard socket abstraction.
Enforces limits on how much bandwidth can be
consumed across multiple time scales.
Also measures congestion and reports to
locally-determined manager.
Example:
If conditions are good, do not exceed 10 Mb/s.
If there is competition for the link, fall back
to no more than 1 Mb/s.
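
The example policy can be approximated with a token bucket. This is a stand-in technique chosen for illustration, not Cedar's actual mechanism; it works on a single time scale, time is advanced by hand, and the class and method names are hypothetical.

```python
class RateLimiter:
    """Token bucket that caps bytes sent per second (not Cedar's code)."""

    def __init__(self, good_mbps=10, contended_mbps=1):
        self.limits = {False: good_mbps, True: contended_mbps}
        self.tokens = 0.0          # bytes we are currently allowed to send

    def tick(self, seconds, contended):
        """Add allowance for elapsed time; keep at most one second's worth."""
        rate = self.limits[contended] * 1e6 / 8      # bytes per second
        self.tokens = min(self.tokens + rate * seconds, rate)

    def try_send(self, nbytes):
        """Send only if the allowance covers the whole message."""
        if nbytes <= self.tokens:
            self.tokens -= nbytes
            return True
        return False

lim = RateLimiter()
lim.tick(1.0, contended=False)       # quiet link: 1.25 MB allowance
ok_big = lim.try_send(1_000_000)     # fits under 10 Mb/s
lim.tick(1.0, contended=True)        # competition: cap drops to 1 Mb/s
ok_after = lim.try_send(200_000)     # exceeds the 125 KB allowance
ok_small = lim.try_send(100_000)     # fits under the reduced cap
print(ok_big, ok_after, ok_small)
```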

45
Why Limit Bandwidth?

Isn't TCP flow control sufficient?
An overloaded receiver can squelch a sender with
back-pressure.
Competing TCPs will tend to split the available
bw equally.
No. Three reasons:
To enforce local policies on resources consumed
by visiting processes.
To clamp processes competing for a single
resource.
To leave some bandwidth available for small-scale
unscheduled operations.

46
Outline

A Vision of Grid Data Movement
Architecture and Example
Semantics and Design
Necessary Mechanisms
The First Hop
What Next?

47
The First Hop

We have implemented a Kangaroo testbed which has
most of the critical features:
Each node runs a kangaroo_server process which
accepts messages on TCP and UNIX-domain sockets.
Outgoing data is placed into a spool dir in the
file system for a kangaroo_mover process to pick
it up and send it out.
Bypass is used to attach unmodified UNIX
applications to a libkangaroo.a which contacts
the local server to execute puts and gets.

48
The First Hop

Several important elements are yet to be
implemented:
Only one sync algorithm:
push to server but not to disk
Only one replication algorithm:
hold the phone
Consistency layer detects delivery success, but
does not timeout and retry.
Receiver implements only relaxed ordering.
Reads are implemented simply as minimal blocking
RPCs to the target server.

49
Measurements

Micro: How fast can an app write output?
Plain file
Plain file through Kangaroo
Kangaroo
Mini: How fast can output be moved?
Online: Stream from memory to network.
Offline: Stage to disk, then write to network.
Kangaroo
Macro: How fast can we run an event-processing
program?
Online: Read and write over network.
Offline: Stage input, run program, stage output.
Kangaroo

50
Measurements

Two types of machines used:
Disk > Network (Linux Workstations)
100 Mb/s switched Ethernet
512 MB RAM
10.2 GB Quantum Fireball Plus LM
Ultra ATA/66, 7200 RPM, 2 MB cache
650 MHz P3
Network > Disk (Linux Cluster Nodes)
100 Mb/s switched Ethernet
1024 MB RAM
9.1 GB IBM 08L8621
Ultra2 Wide SCSI-3, 10000 RPM, 4 MB cache
2 x 550 MHz P3 Xeon

59
Macrobenchmark: Event Processing

A fair number of standard, but non-Grid-aware,
applications look like this:
For I = 1 to N
  Read input
  Compute results
  Write output
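
The same loop with write-behind can be sketched using a queue and a helper thread, so that the next iteration's compute can start before the previous output has been written. This uses plain Python threads for illustration and is not Kangaroo code; run_jobs and its callback parameters are hypothetical names.

```python
import queue
import threading

def run_jobs(n, read_input, compute, write_output):
    """Event loop where writes are queued and drained by a helper thread."""
    outbox = queue.Queue()

    def drain():
        while True:
            item = outbox.get()
            if item is None:          # sentinel: no more output
                break
            write_output(item)

    writer = threading.Thread(target=drain)
    writer.start()
    for i in range(1, n + 1):
        result = compute(read_input(i))
        outbox.put(result)            # returns at once; the write happens later
    outbox.put(None)
    writer.join()                     # comparable to a final sync

written = []
run_jobs(3, read_input=lambda i: i, compute=lambda x: x * x,
         write_output=written.append)
print(written)
```

A single writer thread drains the queue in FIFO order, so the outputs arrive in the order they were produced.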

60
Macrobenchmark: I/O Models
[Figure: timelines of IN, CPU, and OUTPUT phases for three models: Offline I/O, Online I/O, and Current Kangaroo.]

61
Macrobenchmark: Event Processing

Synthetic Example
Ten loops of:
1 MB input
15 seconds CPU
100 MB output
Results on workstations:
Offline: 289 seconds (disk bound)
Online: 249 seconds (network bound)
Kangaroo: 183 seconds
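
A quick arithmetic check on these results, using only the numbers on this slide. The online rate estimate additionally assumes that online I/O does not overlap with CPU at all, which the slides do not state.

```python
loops, cpu_s, in_mb, out_mb = 10, 15, 1, 100
cpu_total = loops * cpu_s                      # 150 s of pure computation
totals = {"offline": 289, "online": 249, "kangaroo": 183}

for name, total in totals.items():
    exposed_io = total - cpu_total             # time not hidden behind CPU
    speedup = totals["offline"] / total
    print(f"{name}: {exposed_io} s beyond CPU, {speedup:.2f}x vs offline")

# If online I/O never overlaps CPU, its implied network rate is:
online_mbps = loops * (in_mb + out_mb) * 8 / (totals["online"] - cpu_total)
print(f"online implied rate: {online_mbps:.1f} Mb/s")
```

The implied online rate lands near 82 Mb/s, which is consistent with the "network bound" label on 100 Mb/s Ethernet, and Kangaroo leaves only 33 seconds beyond the 150 seconds of CPU time.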

62
Summary

Micro view: Kangaroo imposes a severe penalty,
due to additional memory copies and contention
for disk and directory ops.
Mini view: Kangaroo is competitive with staging
and streaming, depending on the circumstances.
Macro view: Kangaroo provides a big win when
there is ample opportunity to overlap CPU and I/O.

63
Outline

A Vision of Grid Data Movement
Architecture and Example
Semantics and Design
Necessary Mechanisms
The First Hop
What Next?

64
Implementation Details

Error Reporting
Where is my data?
Acute failures should leave an error record that
can be queried.
Chronic failures should trigger e-mail.
Strict Ordering
Read-Ahead

65
Research Issues

Prioritizing Reads over Writes
Easy to do at a single node.
Hard to synchronize between several.
Virtual Memory
Need a disk system optimized for read-once,
write-once, delete-once.
Interaction with CPU scheduling
Long delay for input? Start another job.
Multi-Hop Staging
Probably a win for buffering between mismatched
networks. Where is the boundary?

66
Conclusion

We have built a naïve implementation of Kangaroo
using existing building blocks.
Despite its inefficiencies, the benefits of
write-behind can be a big win.
Many open research issues!
