The Kangaroo Approach to Data Movement on the Grid

1
The Kangaroo Approach to Data Movement on the Grid
  • Jim Basney, Miron Livny, Se-Chang Son, and
    Douglas Thain
  • Condor Project
  • University of Wisconsin

2
Outline
  • A Vision of Grid Data Movement
  • Architecture and Example
  • Semantics and Design
  • Necessary Mechanisms
  • The First Hop
  • What Next?

3
An Old Problem
  • Run programs that make use of CPUs and storage in
    separate locations.
  • There are basic, working solutions to this
    problem, but they do not address many of its
    subtleties.

4
The Problem is Not Trivial
  • Distributed systems are subject to failures that
    most applications are not designed to handle.
  • Oops, a router died.
  • Oops, the switch is in half-duplex mode.
  • Oops, I forgot to start one server.
  • Oops, I forgot to update my AFS tokens.
  • We want to avoid wasting resources (cpu, network,
    disk) that charge for tenancy.
  • Co-allocation is a common solution, but external
    factors can get in the way.
  • Co-allocation in and of itself is wasteful!
  • Can't we overlap I/O and cpu?

5
Example
(Figure: Compute Machines and a Workstation; the diagram is labeled 1000 Mb/s at 1 ms, 100 Mb/s at 1 ms, 10 Mb/s at 100 ms, and 240 Mb/s at 5 ms.)

6
What's in Our Toolbox?
  • Partial File Transfer
  • Condor Remote I/O
  • Storage Resource Broker (SRB)
  • (NFS?)
  • Whole file transfer
  • Globus GASS
  • FTP, GridFTP
  • (AFS?)
  • It's not just what you move, but when you move it.

7
A Taxonomy of Existing Systems
Data Movement Systems
  • Whole File: Get whole file at open, and write out at
    close. Examples: Globus GASS in app, AFS.

8
Offline I/O
  • Benefits
  • Makes good throughput by pipelining.
  • Co-allocation of cpu and network not needed.
  • Easy to schedule.
  • Drawbacks
  • Must know needed files in advance.
  • Co-use of cpu and network not possible.
  • Must pull/push whole file, even when only partial
    is needed.

9
Online I/O
  • Benefits
  • Need not know I/O requirements up front. (Some
    programs compute file names.)
  • Gives user incremental results.
  • (Partial) Only moves what is actually used.
  • Drawbacks
  • Very difficult to schedule small or un-announced
    operations.
  • (Partial) Stop-and-wait does not scale to high
    latency networks.
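
The stop-and-wait drawback can be illustrated with a back-of-the-envelope model. This is only a sketch: the 64 KiB block size, the round-trip times, and the function name are assumptions chosen for illustration, not values from the slides.

```python
def stop_and_wait_mbps(block_bytes, rtt_s, link_mbps):
    """Idealized throughput when every block waits for its reply before the next is sent."""
    bits = block_bytes * 8
    transmit_s = bits / (link_mbps * 1e6)      # time to push the block onto the wire
    return bits / (transmit_s + rtt_s) / 1e6   # one block per (transmit + round trip)

# Hypothetical 64 KiB blocks on a 100 Mb/s link.
lan = stop_and_wait_mbps(64 * 1024, rtt_s=0.001, link_mbps=100)
wan = stop_and_wait_mbps(64 * 1024, rtt_s=0.100, link_mbps=100)
print(f"1 ms round trip:   {lan:.1f} Mb/s")
print(f"100 ms round trip: {wan:.1f} Mb/s")
```

Under these assumptions the same link delivers about 84 Mb/s at a 1 ms round trip but only about 5 Mb/s at 100 ms, because the sender sits idle for most of each round trip.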

10
Problems with Both
  • Error handling
  • GASS, AFS - close fails?!?
  • Condor - disconnect causes rollback
  • The longer the distance, the worse the
    performance
  • Drop rate is multiplied with each additional
    link.
  • Latency increases with each link.
  • TCP throughput is limited to the slowest link.
  • Resource allocation
  • Network allocation is done end-to-end.
  • CPU and I/O rarely overlap.
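
The three distance effects listed above compose in different ways: loss compounds across links, latency adds, and bandwidth is capped by the slowest link. A minimal sketch follows; the bandwidth and latency figures echo the earlier example slide, while the 1% per-link drop rate and the function name are made up for illustration.

```python
from math import prod

def path_properties(links):
    """links: list of (bandwidth_mbps, latency_ms, drop_rate), one entry per link."""
    bandwidth = min(bw for bw, _, _ in links)               # slowest link caps throughput
    latency = sum(lat for _, lat, _ in links)               # latency adds up per link
    delivered = prod(1.0 - drop for _, _, drop in links)    # must survive every link
    return bandwidth, latency, 1.0 - delivered

# Hypothetical three-link path with a 1% drop rate on each link.
links = [(1000, 1, 0.01), (100, 1, 0.01), (10, 100, 0.01)]
bandwidth, latency, loss = path_properties(links)
print(bandwidth, latency, round(loss, 4))
```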

11
Our Vision
  • A no-futz wide-area data movement system that
    provides end-to-end reliability, maximizes
    throughput, and adapts to local conditions and
    policies.
  • Basic idea
  • Add buffers.
  • Add a process to oversee.

12
Our Vision
(Figure: Compute Machines and a Home Machine with three RAM buffers; the diagram is labeled 1000 Mb/s at 1 ms, 100 Mb/s at 1 ms, 10 Mb/s at 100 ms, and 300 Mb/s at 5 ms.)

13
Our Vision: A Grid Data Movement System
(Figure: seven nodes, each labeled K.)

14
Our Vision
  • Requirements
  • Must be fire-and-forget. Relieve the
    application of error handling! Robust with respect
    to machine and software crashes. (No-futz)
  • Must provide incremental output results.
  • Hide latency from applications by overlapping I/O
    and cpu.
  • Maximize use of resources (cpu, network, disk)
    when available, and evacuate same when required.

15
Our Vision
  • Concessions
  • No inter-process consistency needed.
  • Increased latency of actual data movement is
    acceptable.

16
The First Hop
  • A working test bed that validates the core
    architecture.
  • Supports applications using standard POSIX
    operations.
  • Concentrate on write-behind because it doesn't
    require speculation.
  • Leave room in the architecture to experiment with
    read-ahead.
  • Preview of results
  • Small scale, overlapping is slower.
  • Large scale, overlapping is faster.

17
Outline
  • A Vision of Grid Data Movement
  • Architecture and Example
  • Necessary Mechanisms
  • Semantics and Design
  • The First Hop
  • What Next?

18
Architecture
  • Layers
  • Application
  • Adaptation
  • Consistency
  • Transport
  • Example

19
Architecture
(Figure: layer diagram. The Application issues open, read, write, close, and fsync to either the File System or the Adaptation layer. Adaptation issues get, put, push, and abort to the Consistency layer. Two Consistency layers exchange put and ack messages across three Transport layers.)

20
Transport Layer
  • Interface
  • Send message, query route, query status
  • Semantics
  • Ordering - None (or worse!)
  • Reliability - Likely, but not guaranteed.
  • Duplication - Unlikely, but possible.
  • Performance
  • Uses all available resources (net, mem, disk) to
    maximize throughput.
  • Subject to local conditions (traffic, failures)
    and policies (priority, bw limits)

21
Transport Layer
(Figure: a Transport node with In and Out links at 1 Gb/s each, RAM, and a 300 Mb/s label.)
  • If output is blocked, then save input to disk
    until it is full.
  • When output is ready again, read from disk,
    memory, or input?
  • The freedom to reorder transported blocks may
    allow us to improve throughput.

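
One possible buffering policy for a transport node can be sketched as follows. The slide leaves open which tier to drain first, so the memory-first choice here is an assumption, as are the class name, the block-count capacities, and the use of an in-memory queue to stand in for the disk spool.

```python
from collections import deque

class TransportBuffer:
    """Two-tier buffer: fill memory first, spill to 'disk', refuse when both are full."""

    def __init__(self, mem_blocks, disk_blocks):
        self.mem = deque()
        self.disk = deque()          # stands in for an on-disk spool in this sketch
        self.mem_blocks = mem_blocks
        self.disk_blocks = disk_blocks

    def accept(self, block):
        """Store an incoming block; return False to apply back-pressure."""
        if len(self.mem) < self.mem_blocks:
            self.mem.append(block)
        elif len(self.disk) < self.disk_blocks:
            self.disk.append(block)
        else:
            return False
        return True

    def next_to_send(self):
        """Called when output is ready: drain memory first (assumed policy), then disk."""
        if self.mem:
            return self.mem.popleft()
        if self.disk:
            return self.disk.popleft()
        return None

buf = TransportBuffer(mem_blocks=2, disk_blocks=2)
accepted = [buf.accept(b) for b in ["a", "b", "c", "d", "e"]]   # "e" is refused
first = buf.next_to_send()                                      # frees one memory slot
buf.accept("e")                                                 # lands in memory, ahead of "c"
order = [first] + [buf.next_to_send() for _ in range(4)]
```

Note that the late block "e" leaves before the spilled blocks "c" and "d". That reordering is permitted by the transport semantics (no ordering guarantee) and is left to the Consistency layer to repair.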
22
Consistency Layer
  • Interface
  • Get block, put block, sync file, abort file
  • Semantics
  • Ordering - Order preserving or not?
  • Reliability - Detects success
  • Duplication - Delivers at most once
  • Performance
  • Must cache dirty blocks until delivered
  • Might cache clean blocks
  • Might speculatively read clean blocks

23
Consistency Layer
  • Receiver: Keeps records to enforce ordering and
    suppress duplicates.
  • Sender: Keeps records to detect success and cache
    writes.
(Figure: two Consistency layers communicating across three Transport layers.)

24
Adaptation Layer
  • Converts POSIX operations into Kangaroo
    operations
  • Open
  • O_CREAT, always succeeds
  • Otherwise, checks for existence with a get
  • Read: kangaroo get
  • Write: kangaroo put
  • Close: NOP
  • Fsync: kangaroo sync
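
The mapping above can be sketched as a thin wrapper. Everything named here is hypothetical: FakeKangaroo is an in-memory stand-in for the Consistency layer, and addressing blocks by (path, offset) is an assumption made only to keep the example small.

```python
import os

class FakeKangaroo:
    """In-memory stand-in for the Consistency layer (hypothetical interface)."""
    def __init__(self):
        self.blocks = {}      # (path, offset) -> bytes
        self.synced = []      # paths for which sync was requested

    def get(self, path, offset, length):
        data = self.blocks.get((path, offset))
        return None if data is None else data[:length]

    def put(self, path, offset, data):
        self.blocks[(path, offset)] = data

    def sync(self, path):
        self.synced.append(path)

class Adaptation:
    """Translates POSIX-style calls into get, put, and sync."""
    def __init__(self, kangaroo):
        self.k = kangaroo

    def open(self, path, flags):
        if flags & os.O_CREAT:
            return path                        # always succeeds, no round trip
        if self.k.get(path, 0, 1) is None:     # otherwise probe for existence with a get
            raise FileNotFoundError(path)
        return path

    def read(self, handle, offset, length):
        return self.k.get(handle, offset, length)

    def write(self, handle, offset, data):
        self.k.put(handle, offset, data)
        return len(data)

    def close(self, handle):
        pass                                   # NOP: nothing is flushed at close

    def fsync(self, handle):
        self.k.sync(handle)
```

Because close is a no-op, an application that needs to know its output has been delivered must call fsync explicitly.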

25
Example
(Figure: the stack of Application, File System, Adaptation, two Consistency layers, and three Transport layers, with arrows distinguishing blocking procedure calls from non-blocking messages.)

26
Outline
  • A Vision of Grid Data Movement
  • Architecture and Example
  • Semantics and Design
  • Necessary Mechanisms
  • The First Hop
  • What Next?

27
Semantics and Design
  • A data movement system is a bridge between file
    systems.
  • It addresses many of the same issues as file
    systems
  • Consistency
  • Committal
  • Ordering
  • Replication

28
Consistency
  • Single Node
  • A put/get blocks until the local server has
    atomically accepted it.
  • Multiple processes that are externally
    synchronized will see a consistent view.
  • Multiple Nodes
  • No guarantees unless you use an explicit sync.
  • This is reasonable in a Grid environment, because
    most users make use of a wide-area scheduler to
    partition jobs and data.

29
Committal
  • Possible meanings of commit
  • Force this data to the safest medium available.
  • Make these changes visible to others.
  • Make this data safe from a typical crash.
  • Possible implementations in Kangaroo
  • Push all the way to target, and force to disk
    (tape?)
  • Push to the target server.
  • Push to the nearest disk.

30
Committal
  • Safest choice is to implement the most
    conservative -- push all the way to the server,
    and force it to disk there.
  • Some applications may want the more relaxed
    meanings.
  • POSIX provides only one interface: fsync().
  • Easy solution: implement all three, and provide a
    flexible binding in the Adaptation layer.
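
A flexible binding of fsync() to one of the three commit levels might look like the sketch below. The enum names, the path-prefix configuration, and the pairing of each meaning with each implementation are assumptions; the slides only list them side by side.

```python
from enum import Enum

class CommitLevel(Enum):
    NEAREST_DISK = 1     # push to the nearest disk: safe from a typical crash
    TARGET_SERVER = 2    # push to the target server: visible to others
    TARGET_DISK = 3      # push to the target and force to disk: safest medium

def fsync_binding(config, path):
    """Return the commit level fsync() should request for this path.

    config is a list of (path_prefix, CommitLevel) pairs; first match wins.
    """
    for prefix, level in config:
        if path.startswith(prefix):
            return level
    return CommitLevel.TARGET_DISK   # most conservative choice when nothing matches

# Hypothetical per-directory policy.
config = [
    ("/scratch/", CommitLevel.NEAREST_DISK),
    ("/results/", CommitLevel.TARGET_SERVER),
]
```

Defaulting to the strictest level keeps unconfigured applications safe, and only applications that opt in get the relaxed meanings.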

31
Ordering
  • Does the system commit operations in the same
    order they were sent?
  • Relaxed -- no ordering
  • Satisfies the large majority of apps that do not
    overlap writes.
  • Interesting case of output log files.
  • Need to wait max TTL before re-using an output
    file name
  • Strict -- exact ordering, enforced at receiver
  • Increases queue lengths everywhere.
  • Doesn't burden user with determining if
    application is safe to relax.

32
Strict Ordering Algorithm
  • Much like TCP
  • Sender keeps copies of data blocks until they are
    acknowledged.
  • Receiver sends cumulative acks and commits
    unbroken sequences.

33
Strict Ordering Algorithm
  • But some differences from TCP
  • No connection semantics.
  • Block ID is (birthday,sequence).
  • Receiver keeps on disk the last acked ID of all
    senders it has ever talked to.
  • If sender reboots
  • Compute the next ID from blocks on disk
  • If none, reset b to current time, s to 0
  • If receiver reboots
  • Last received ID of all senders is on disk.
  • Garbage problem: fix with a long receiver timeout;
    a reset message causes the sender to start over.
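
The receiver side of the strict ordering algorithm can be sketched as below. This is a sketch under stated assumptions: the class and method names are hypothetical, state is kept in memory rather than on disk, and blocks left over from an older birthday are simply discarded. Timeouts and reset messages are not modeled.

```python
class OrderedReceiver:
    def __init__(self):
        self.last_acked = {}   # sender -> (birthday, sequence); kept on disk in the slides
        self.pending = {}      # sender -> {sequence: data}, held until the gap closes
        self.committed = []    # (sender, data) in commit order

    def receive(self, sender, birthday, seq, data):
        """Handle one block and return the cumulative ack as (birthday, sequence)."""
        last_b, last_s = self.last_acked.get(sender, (birthday, -1))
        if birthday > last_b:
            # Sender restarted with a fresh birthday, so its sequence begins again at 0.
            last_b, last_s = birthday, -1
            self.pending[sender] = {}
        if birthday < last_b or seq <= last_s:
            return (last_b, last_s)            # stale or duplicate: re-ack, commit nothing
        held = self.pending.setdefault(sender, {})
        held[seq] = data
        while last_s + 1 in held:              # commit the unbroken run
            last_s += 1
            self.committed.append((sender, held.pop(last_s)))
        self.last_acked[sender] = (last_b, last_s)
        return (last_b, last_s)

r = OrderedReceiver()
ack0 = r.receive("a", 100, 0, "b0")      # in order: committed at once
ack2 = r.receive("a", 100, 2, "b2")      # gap at 1: held back
ack1 = r.receive("a", 100, 1, "b1")      # gap closes: 1 and 2 commit together
dup = r.receive("a", 100, 1, "b1")       # duplicate: suppressed
fresh = r.receive("a", 200, 0, "c0")     # new birthday after a sender reset
```

As in TCP, the ack names the end of the longest unbroken run, so a sender can discard every retained block up to that ID.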

34
Replication Issues
  • We would like to delete data stored at the sender
    ASAP, but
  • Do I Trust this Disk?
  • Buffer Storage - Could disappear at any time.
  • Reliable Storage - No deliberate destruction.
  • Reliability is not everything
  • If delivery is highly likely and recomputation is
    relatively cheap, then losing data is acceptable
    but only if delivery failure is detectable!
  • Reliability means more copies.
  • User should be able to configure a range from
    most reliable to fewest copies.

35
Replication Issues
  • End-to-End Argument
  • Regardless of whatever duplication is done
    internally for performance or reliability, only
    the end points can be responsible for ensuring
    (or detecting) correct delivery.
  • So, the sender must retain a record of what was
    sent, even if it does not retain the actual data.

36
Replication Techniques
  • Pass the Buck
  • Hold the Phone
  • Don't Trust Strangers

37
Pass the Buck
  • Delete the local copy after a one-hop ack.
    Requires atomic accept and sync. (Similar to
    email)

38
Hold the Phone
  • Sender keeps a copy of local data until the
    end-to-end ack is received. Midway hops need not
    immediately flush to disk.

39
Don't Trust Strangers
  • If the sender determines the receiver to be
    reliable, then delete, otherwise hold.

40
Replication Comparison
  • Pass the Buck
  • Evacuates source ASAP. One copy of data.
  • Dirty reads must hop through all nodes.
  • No retry of failures. (Success still likely.)
  • Hold the Phone
  • Evacuates source more slowly. Two copies.
  • Dirty reads always satisfied at source.
  • Sender can retry failures.
  • Don't Trust Strangers
  • Evacuates source like PTB, but still 2 copies.
  • Dirty reads hop.
  • Retries done midway.
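
The three techniques differ mainly in when the sender may delete its local copy. The decision can be sketched as a single predicate; the enum, the function name, and the boolean inputs are hypothetical simplifications of the slides.

```python
from enum import Enum

class Policy(Enum):
    PASS_THE_BUCK = "pass the buck"
    HOLD_THE_PHONE = "hold the phone"
    DONT_TRUST_STRANGERS = "don't trust strangers"

def may_delete_local_copy(policy, one_hop_ack, end_to_end_ack, next_hop_reliable):
    """Decide whether the sender may drop its copy of a block."""
    if policy is Policy.PASS_THE_BUCK:
        return one_hop_ack                     # the next hop has atomically accepted it
    if policy is Policy.HOLD_THE_PHONE:
        return end_to_end_ack                  # only the final receiver's ack counts
    # Don't Trust Strangers: hand off early only to a hop judged reliable.
    return end_to_end_ack or (one_hop_ack and next_hop_reliable)
```

For example, with a one-hop ack in hand but no end-to-end ack yet, Pass the Buck deletes, Hold the Phone keeps the copy, and Don't Trust Strangers deletes only if the next hop is considered reliable storage.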

41
Outline
  • A Vision of Grid Data Movement
  • Architecture and Example
  • Necessary Mechanisms
  • Semantics and Design
  • The First Hop
  • What Next?

42
Necessary Mechanisms
  • Adaptation Layer
  • Needs a tool for trapping and rerouting an
    application's I/O calls without special
    privileges: Bypass
  • Transport Layer
  • Needs a tool for detecting network conditions and
    enforcing policies: Cedar

43
Bypass
  • General-purpose tool for trapping and redirecting
    standard library procedures.
  • Trap all I/O operations. Those involving
    Kangaroo are sent to Adaptation layer.
    Otherwise, execute without modification.
  • Can be applied at run-time to any
    dynamically-linked program
  • vi kangaroo://home.cs.wisc.edu/tmp/file
  • grep thain gsiftp://ftp.cs.wisc.edu/etc/passwd
  • gcc http://www/example.c -o kangaroo://home/output

44
Cedar
  • Standard socket abstraction.
  • Enforces limits on how much bandwidth can be
    consumed across multiple time scales.
  • Also measures congestion and reports to
    locally-determined manager.
  • Example
  • If conditions are good, do not exceed 10Mb/s.
  • If there is competition for the link, fall back
    to no more than 1Mb/s.
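
The example policy above could be enforced with a token bucket. This is a generic rate-limiting technique swapped in for illustration, not Cedar's actual mechanism or API; the class name, the burst size, and the explicit clock argument are all assumptions.

```python
class RateLimiter:
    """Token bucket with two refill rates: one for a quiet link, one under contention."""

    def __init__(self, good_bps=10_000_000, contended_bps=1_000_000, burst_bits=1_000_000):
        self.good_bps = good_bps
        self.contended_bps = contended_bps
        self.burst_bits = burst_bits
        self.tokens = burst_bits     # start with a full bucket
        self.last = 0.0              # time of the previous call, in seconds

    def allow(self, nbits, now, contended):
        """Return True if nbits may be sent at time `now` without exceeding the limit."""
        rate = self.contended_bps if contended else self.good_bps
        self.tokens = min(self.burst_bits, self.tokens + (now - self.last) * rate)
        self.last = now
        if nbits <= self.tokens:
            self.tokens -= nbits
            return True
        return False

limiter = RateLimiter()
decisions = [
    limiter.allow(1_000_000, now=0.0, contended=False),   # bucket starts full
    limiter.allow(1_000_000, now=0.05, contended=False),  # only half refilled at 10 Mb/s
    limiter.allow(1_000_000, now=0.5, contended=False),   # refilled to the burst cap
    limiter.allow(1_000_000, now=1.0, contended=True),    # 1 Mb/s refills too slowly
    limiter.allow(1_000_000, now=3.0, contended=True),    # enough time has passed
]
```

Passing the clock in as an argument keeps the sketch deterministic; a real limiter would read a monotonic clock and take its contention signal from a congestion measurement.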

45
Why Limit Bandwidth?
  • Isn't TCP flow control sufficient?
  • An overloaded receiver can squelch a sender with
    back-pressure.
  • Competing TCPs will tend to split the available
    bw equally.
  • No. Three reasons:
  • To enforce local policies on resources consumed
    by visiting processes.
  • To clamp processes competing for a single
    resource.
  • To leave some bandwidth available for small-scale
    unscheduled operations.

46
Outline
  • A Vision of Grid Data Movement
  • Architecture and Example
  • Semantics and Design
  • Necessary Mechanisms
  • The First Hop
  • What Next?

47
The First Hop
  • We have implemented a Kangaroo testbed which has
    most of the critical features:
  • Each node runs a kangaroo_server process which
    accepts messages on TCP and UNIX-domain sockets.
  • Outgoing data is placed into a spool dir in the
    file system for a kangaroo_mover process to pick
    it up and send it out.
  • Bypass is used to attach unmodified UNIX
    applications to a libkangaroo.a which contacts
    the local server to execute puts and gets.
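
The spool-directory handoff between the server and the mover can be sketched as follows. The file naming scheme, the function names, and the `send` callback are hypothetical; the slides do not describe the spool format. In this sketch `send` is assumed to return True only once delivery has been acknowledged.

```python
import os
import tempfile

def spool_put(spool_dir, seq, data):
    """Write one block into the spool; the final rename makes it appear atomically."""
    tmp = os.path.join(spool_dir, f".tmp-{seq:08d}")
    final = os.path.join(spool_dir, f"{seq:08d}.blk")
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())        # make the block durable before exposing it
    os.replace(tmp, final)

def mover_pass(spool_dir, send):
    """One pass of the mover: send each spooled block, delete it on success."""
    sent = 0
    for name in sorted(os.listdir(spool_dir)):
        if not name.endswith(".blk"):
            continue                # skip blocks that are still being written
        path = os.path.join(spool_dir, name)
        with open(path, "rb") as f:
            data = f.read()
        if send(data):
            os.remove(path)
            sent += 1
    return sent
```

Writing to a temporary name and renaming means the mover never sees a half-written block, and a block whose send fails simply stays in the spool for the next pass.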

48
The First Hop
  • Several important elements are yet to be
    implemented
  • Only one sync algorithm: push to server but not to
    disk
  • Only one replication algorithm: hold the phone
  • Consistency layer detects delivery success, but
    does not timeout and retry.
  • Receiver implements only relaxed ordering.
  • Reads are implemented simply as minimal blocking
    RPCs to the target server.

49
Measurements
  • Micro: How fast can an app write output?
  • Plain file
  • Plain file through Kangaroo
  • Kangaroo
  • Mini: How fast can output be moved?
  • Online: Stream from memory to network.
  • Offline: Stage to disk, then write to network.
  • Kangaroo
  • Macro: How fast can we run an event-processing
    program?
  • Online: Read and write over network.
  • Offline: Stage input, run program, stage output.
  • Kangaroo

50
Measurements
  • Two types of machines used
  • Disk > Network (Linux Workstations)
  • 100 Mb/s switched Ethernet
  • 512 MB RAM
  • 10.2 GB Quantum Fireball Plus LM
  • Ultra ATA/66, 7200 RPM, 2MB cache
  • 650 MHz P3
  • Network > Disk (Linux Cluster Nodes)
  • 100 Mb/s switched Ethernet
  • 1024 MB RAM
  • 9.1 GB IBM 08L8621
  • Ultra2 Wide SCSI-3, 10000 RPM, 4MB cache
  • 2 x 550 MHz P3 Xeon

59
Macrobenchmark: Event Processing
  • A fair number of standard, but non-Grid-aware,
    applications look like this
  • For I = 1 to N
  • Read input
  • Compute results
  • Write output

60
Macrobenchmark: I/O Models
(Figure: timelines of IN, CPU, and OUTPUT phases over three iterations for each of three models: Offline I/O, Online I/O, and Current Kangaroo.)

61
Macrobenchmark: Event Processing
  • Synthetic Example
  • Ten loops of
  • 1 MB input
  • 15 seconds CPU
  • 100 MB output
  • Results on workstations
  • Offline: 289 seconds (disk bound)
  • Online: 249 seconds (network bound)
  • Kangaroo: 183 seconds
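
An idealized model shows where the gain from write-behind comes from. It is a sketch only: it assumes a 100 Mb/s network with no protocol or disk overhead, treats 1 MB as 8 Mb, and ignores contention between input reads and draining output, so it gives lower bounds rather than the measured times. The function names are hypothetical.

```python
def online_seconds(loops, cpu_s, in_mb, out_mb, net_mbps):
    """Online I/O: each loop waits for its input and output transfers to finish."""
    xfer = lambda mb: mb * 8 / net_mbps
    return loops * (xfer(in_mb) + cpu_s + xfer(out_mb))

def write_behind_seconds(loops, cpu_s, in_mb, out_mb, net_mbps):
    """Write-behind: output drains in the background while later loops compute."""
    xfer = lambda mb: mb * 8 / net_mbps
    compute_done = 0.0    # when the application finishes its current loop
    network_free = 0.0    # when the background mover finishes its current output
    for _ in range(loops):
        compute_done += xfer(in_mb) + cpu_s           # input is still read synchronously
        network_free = max(network_free, compute_done) + xfer(out_mb)
    return network_free

online = online_seconds(10, 15, 1, 100, 100)
kangaroo = write_behind_seconds(10, 15, 1, 100, 100)
print(f"online lower bound:       {online:.1f} s")
print(f"write-behind lower bound: {kangaroo:.1f} s")
```

Under these assumptions the bounds are 230.8 s and 158.8 s. Both sit below the measured 249 s and 183 s, as expected once real overheads are added, and the ordering matches the measurements: only the final output transfer remains exposed under write-behind.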

62
Summary
  • Micro view: Kangaroo imposes a severe penalty,
    due to additional memory copies and contention
    for disk and directory ops.
  • Mini view: Kangaroo is competitive with staging
    and streaming, depending on the circumstances.
  • Macro view: Kangaroo provides a big win when
    there is ample opportunity to overlap CPU and I/O.

63
Outline
  • A Vision of Grid Data Movement
  • Architecture and Example
  • Semantics and Design
  • Necessary Mechanisms
  • The First Hop
  • What Next?

64
Implementation Details
  • Error Reporting
  • Where is my data?
  • Acute failures should leave an error record that
    can be queried.
  • Chronic failures should trigger e-mail.
  • Strict Ordering
  • Read-Ahead

65
Research Issues
  • Prioritizing Reads over Writes
  • Easy to do at a single node.
  • Hard to synchronize between several.
  • Virtual Memory
  • Need a disk system optimized for read-once,
    write-once, delete-once.
  • Interaction with CPU scheduling
  • Long delay for input? Start another job.
  • Multi-Hop Staging
  • Probably a win for buffering between mismatched
    networks. Where is the boundary?

66
Conclusion
  • We have built a naïve implementation of Kangaroo
    using existing building blocks.
  • Despite its inefficiencies, the benefits of
    write-behind can be a big win.
  • Many open research issues!