Title: Parrot: Transparent User-Level Middleware for Data-Intensive Computing
1Parrot: Transparent User-Level Middleware for
Data-Intensive Computing
- Douglas Thain
- Condor Project, University of Wisconsin
- Workshop on Adaptive Grid Middleware
- 28 September 2003
2The Reality of the Grid
afwuhweiuhsdvxmndf (and then a miracle
happens) P = NP
I think you have a problem here...
Look at my new proof!
3Condor
PBS
NQE
LSF
Load Leveler
run this batch job
Local Operating System
Process Interface (main, exit, abort, kill, sleep)
Users App
Parrot
4Applications of Parrot
- Interactive Browsing
- tcsh, tar, gzip, make, acroread, gv, xv...
- Improved Reliability
- Transparent retry/reassignment/reallocation
- Files, sockets, even repair broken apps.
- Private Namespaces
- Make /home/thain appear the same everywhere.
- Make /usr/data/calibration different everywhere.
- Dynamic/Distributed Program Construction
- Remote link, remote exec, remote eval...
- Profiling and Debugging
- Users may not know low-level I/O patterns.
5Challenges
- Technical Methods of Interposition
- Semantic Differences
- Error Management
- CPU I/O Integration
- Performance
- The butterfly effect
- Subtle underlying differences can have large
effects in performance and usability.
6Internal Techniques
[Diagram: three panels labeled Polymorphic Extension, Binary Rewriting, and Static or Dynamic Re-Linking. Box labels: App Code, Standard Library, Library, M1, M2, NEW, New Code, New Library.]
7External Techniques
[Diagram: three panels labeled Debugger Trap, Remote Filesystem, and Kernel Callout. Box labels: App, Agent, Kernel, and the filesystems NFS, LFS, FFS, USR.]
8Techniques Compared
technique        burden    speed      hole detection
polymorphic      rewrite   fast       easy
static link      relink    fast       hard
dynamic link     dynlink   medium     hard
binary rewrite   dynlink   fast       hard
remote fs        root      varies     easy
callout          root      slow       easy
debugger         none      very slow  easy
9Hole Detection Matters
- Dynamic Linking
- Bypass Toolkit, ca. 2000
- Works with some standard tools.
- Many still crash in strange ways.
- Doesn't apply to static executables: always a surprise.
- Debugger Trap
- Parrot: coding began in May of 2003.
- Works reliably with almost everything in /usr/bin.
- Caveat 1: Twice as much code
- Caveat 2: Higher latency
10Debugger Trap
- For the rest of this talk, we select the debugger
trap for completeness and reliability. Much of
the discussion still applies to the other
techniques too.
- Some technical details in the paper
- Only on Linux.
- Must manage process ancestry.
- Must fudge some broken ptrace behavior.
- Cannot write directly to process, must take
roundabout path through temp file.
11User Process
[Diagram: the user process issues SYS_open, SYS_read, and SYS_write; the debugger trap routes them to parrot_open, parrot_read, and parrot_write. Inside Parrot: file descriptors (0 through 9, ...), file pointers (pos 100, pos 0, pos 0, pos 1 MB, pos 42), file objects (outfile, infile, config, data), a name resolver with a mount list driver and a chirp lookup driver, and device drivers (Local, Chirp, FTP, NeST, RFIO, DCAP).]
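The layering on this slide (descriptor, then file pointer with its own position, then a driver that does the actual I/O) can be sketched in a few lines. This is an illustration only, not Parrot's code: `DescriptorTable`, `FilePointer`, and `MemoryDriver` are hypothetical names, and the driver here serves bytes from memory rather than speaking a real protocol. Only `parrot_open` and `parrot_read` are names taken from the slide.

```python
class MemoryDriver:
    """Hypothetical stand-in for a device driver; serves bytes from a dict."""

    def __init__(self, files):
        self.files = files

    def pread(self, name, length, offset):
        # Drivers are stateless here: the caller supplies the offset.
        return self.files[name][offset:offset + length]


class FilePointer:
    """Pairs a named file on some driver with a current position."""

    def __init__(self, driver, name):
        self.driver = driver
        self.name = name
        self.pos = 0


class DescriptorTable:
    """Maps small integers to file pointers, as a per-process table would."""

    def __init__(self):
        self.fds = {}

    def parrot_open(self, driver, name):
        fd = 0
        while fd in self.fds:  # lowest unused descriptor, UNIX style
            fd += 1
        self.fds[fd] = FilePointer(driver, name)
        return fd

    def parrot_read(self, fd, length):
        fp = self.fds[fd]
        data = fp.driver.pread(fp.name, length, fp.pos)
        fp.pos += len(data)  # the position lives in the pointer, not the driver
        return data
```

Keeping the position in the file pointer rather than in the driver is what lets the same table sit in front of very different back ends.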
12Adaptation
On same host
/mydata -> /usr/data
App
open(/mydata/foo)
Parrot
Local
FTP
Chirp
/usr/data
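The redirection on this slide can be sketched as a longest-prefix lookup in a mount list. This is a minimal illustration under stated assumptions, not Parrot's resolver: `MOUNTS` and `resolve` are hypothetical names, and the single rule is the one shown on the slide.

```python
import posixpath

# Hypothetical mount list: logical prefix -> physical prefix.
# The one entry mirrors the slide's /mydata -> /usr/data rule.
MOUNTS = {"/mydata": "/usr/data"}


def resolve(path, mounts=MOUNTS):
    """Rewrite a logical path using the longest matching mount prefix."""
    path = posixpath.normpath(path)
    # Try longer prefixes first so nested rules take precedence.
    for prefix in sorted(mounts, key=len, reverse=True):
        stem = prefix.rstrip("/")
        if path == prefix or path.startswith(stem + "/"):
            return mounts[prefix] + path[len(stem):]
    return path  # no rule applies: leave the name untouched
```

With this table, `resolve("/mydata/foo")` yields `/usr/data/foo`, while a name such as `/mydatax/foo` passes through unchanged because matching is done on whole path components.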
13What Protocol?
- File Transfer Protocol
- Internet standard, many implementations.
- High bandwidth sequential access.
- NeST
- General purpose storage appliance from UW.
- Virtual users, namespace, and allocation.
- RFIO
- Remote I/O protocol used with CERN CASTOR.
- UNIX-like; most ops require a new TCP connection.
- DCAP
- Remote I/O protocol used with Fermi D-Cache.
- UNIX-like, WORM semantics, no directories,
caching/
- Chirp
- Protocol developed at UW for Parrot.
- Corresponds very closely to UNIX, incl. errnos.
14Small Details Matter
- Standard tools need to know subtle details;
otherwise, they break.
- ls -lR performs getdents(foo)
- on success descend
- on ENOTDIR display and continue
- on ENOENT display error and stop.
- FTP does not provide this detail
- Failed LIST -> error 550
- Failed GET -> error 550
- Failed CDIR -> error 550
- Simple assignment doesn't work
- Making 550 = ENOENT breaks many tools.
15Example Solution
[Diagram: decision tree that probes the server with LIST foo, then CWD foo, then SIZE foo, branching on the reply to each (200, 550, or other). Outcomes: Success, Transient Error, Not a dir., Access denied., No such entry.]
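One reading of the chain of probes on this slide, written as a function. The pairing of replies with outcomes is an interpretation of the diagram's labels, not something the slide states outright, and the choice of `EAGAIN` for a transient error is an assumption as well. `classify` and its `send` callback are hypothetical names. The code follows the slide's shorthand of 200 for success; real FTP servers answer these commands with other 2xx codes.

```python
import errno


def classify(send, name):
    """Turn FTP replies about `name` into 0 (success) or an errno value.

    `send` is a hypothetical callable: send("LIST foo") -> integer reply code.
    """
    def probe(command):
        code = send(f"{command} {name}")
        # Anything other than 200 or 550 is treated as a transient error.
        return code if code in (200, 550) else None

    code = probe("LIST")
    if code is None:
        return errno.EAGAIN
    if code == 200:
        return 0                  # listing worked

    code = probe("CWD")
    if code is None:
        return errno.EAGAIN
    if code == 200:
        return errno.EACCES       # a directory we were not allowed to list

    code = probe("SIZE")
    if code is None:
        return errno.EAGAIN
    if code == 200:
        return errno.ENOTDIR      # exists, but is a plain file
    return errno.ENOENT           # nothing by that name
```

The point of the extra round trips is that a tool such as ls can then tell ENOTDIR from ENOENT, which a bare 550 cannot convey.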
16CPU-IO Integration
- Errors that cannot be expressed in the client's
interface must be passed to a higher level (the
batch system).
- Simple options
- kill -9 application (retry app elsewhere)
- exit(1) application (don't retry app)
- Complex options (Condor only)
- restart with (Subnet != 128.101.175)
- restart with (CurrentTime > 5pm)
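The two simple options can be sketched as a small decision function. Everything here is hypothetical: the name `plan`, the `EXPRESSIBLE` set (an illustrative subset of errors an application's interface can carry), and the tuple it returns. The sketch only decides what to do; it does not signal or terminate anything.

```python
import errno
import signal

# Errors the application's interface can express (illustrative subset).
EXPRESSIBLE = {errno.ENOENT, errno.EACCES, errno.ENOTDIR}


def plan(error, retry_elsewhere):
    """Decide how to surface an I/O error; returns an (action, value) pair."""
    if error in EXPRESSIBLE:
        return ("return_errno", error)   # hand it to the application
    if retry_elsewhere:
        return ("kill", signal.SIGKILL)  # kill -9: batch system retries the app
    return ("exit", 1)                   # exit(1): batch system does not retry
```

An error such as an expired credential has no errno the application would understand, so it falls through to one of the two escalation paths.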
17Bandwidth by Protocol
18Latency by Protocol (ms)
        stat     open/close  read 1B  read 8KB  write 1B  write 8KB
chirp   0.50     0.84        0.61     2.80      0.38      2.23
ftp     0.87     2.82        -        -         -         -
nest    2.51     2.53        2.96     4.48      5.53      7.41
rfio    13.41    23.11       0.50     3.32      39.8      2.85
dcap    152.53   159.09      40.05    3.01      40.14     3.14
19Andrew-Like Benchmark
- Original Andrew benchmark is no longer
appropriate, so replace it with the Parrot source:
296 files, 955 KB.
- Copy the source to a remote device, then
manipulate in five stages:
- copy: cp -rp
- list: ls -lR
- scan: grep searchstring -r
- make: make
- delete: rm -rf
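The five stages can be driven by a short timing harness. This is a sketch under assumptions, not the script used for the talk: `stage_commands` and `run_stages` are hypothetical names, the paths and search string are parameters rather than the talk's actual values, and `make -C` is used here simply to run make inside the copied tree.

```python
import subprocess
import time


def stage_commands(target, pattern, source="."):
    """Return the five (stage name, argv) pairs for a copy at `target`."""
    return [
        ("copy", ["cp", "-rp", source, target]),
        ("list", ["ls", "-lR", target]),
        ("scan", ["grep", "-r", pattern, target]),
        ("make", ["make", "-C", target]),
        ("delete", ["rm", "-rf", target]),
    ]


def run_stages(stages, runner=subprocess.run):
    """Run each stage in order and return wall-clock seconds per stage."""
    timings = {}
    for name, argv in stages:
        start = time.perf_counter()
        # Output is discarded so terminal speed does not skew the timing.
        runner(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        timings[name] = time.perf_counter() - start
    return timings
```

Passing a different `runner` makes it possible to dry-run the stage list without touching the filesystem.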
20Overheads Compared
21Overheads Compared
22Protocols Compared
23Protocols Compared
24Moral of the story
- The butterfly effect: Small underlying
differences can have big effects on performance
and reliability.
- Examples in interposition
- Dynamic linking: fast but poor hole detection.
- Debugger trap: slow but good hole detection.
- Examples in protocols
- Chirp: UNIX semantics restrict bandwidth.
- FTP: Need for multiple ops increases latency.
- NeST: Powerful virtualization increases latency.
- RFIO: Connection per op doesn't scale.
25For more info...
- Douglas Thain
- thain@cs.wisc.edu
- Miron Livny
- miron@cs.wisc.edu
- Software, manuals, more info
- http://www.cs.wisc.edu/condor/parrot
- The Condor Project
- http://www.cs.wisc.edu/condor