Systems Seminar Schedule

Provided by: dougla9
Learn more at: http://www.cse.nd.edu
Transcript and Presenter's Notes

1
Systems Seminar Schedule
  • 1 October - Douglas Thain
  • Error Management in a Virtual Operating System
  • 15 October - Andrea Arpaci-Dusseau
  • Information and Control in Gray-Box Systems
  • 29 October - John Bent
  • Creating Communities for Grid I/O
  • 12 November - Open
  • 26 November - Open
  • 10 December - Open

2
Error Management in a Virtual Operating System
  • Douglas Thain
  • Condor Project
  • University of Wisconsin

3
What is a Virtual OS?
App 2
App 3
Virtual OS 2
App 4
Device Drivers
Operating System
Device Drivers
Hardware
4
Why Use a Virtual OS?
  • To test and deploy software that would otherwise
    require destructive changes. (Wine, User Mode
    Linux)
  • To improve indirection or fault-tolerance.
    (Rocks, Socks, Grid Console)
  • To transparently harness exterior resources.
    (UFO, Condor, PFS)

5
Harness the Grid
Virtual OS 1
Virtual OS 2
App 3
App 4
App 1
App 2
6
In a Standard OS, Errors are not Difficult
App
errno
Standard Library
  • Layers are members of a unified engineering
    effort.
  • A standard namespace and scheme are used
    end-to-end.
  • Most interfaces closely resemble the underlying
    implementation.
  • Most catastrophic failures are coordinated.

errno
OS Kernel
errno
File System
errno
Device Driver
7
Handling Errors is a Serious Problem On the Grid
  • It is an important problem to solve
  • As systems grow more complex, MTBF -> 0.
  • Failures are generally uncoordinated.
  • Propagating knowledge of failure is more
    important than increasing likelihood of success.
  • It is a difficult problem to solve
  • Theoretical: Matching different abstractions.
  • Technical: Mating different languages and
    conventions.
  • Social: Coordinating distinct engineering efforts.

8
Error Management: A Problem of Depth
App
POSIX
Tape Archive
Virtual OS
DDI
DDI
Disk Cache
FTP Driver
DDI
Globus
Globus FTP Library
Globus FTP Library
Unitree OS
FTP Server
Unitree
Globus
FTP
9
A Problem of Width
App
errno
Virtual Operating System
UNIX Driver
SRB Driver
FTP Driver
NeST Driver
Kangaroo Driver
Globus GASS Driver
An Alphabet Soup of Protocols, APIs, Systems,
Authorities, and Authors
10
A Problem of Design Direction
App
App
Bottom Up Design
???
errno
Application Library
Virtual OS
Outside In Design
errno
DDI
Standard Library
FTP Driver
errno
Globus
OS Kernel
FTP Library
11
How do we correctly represent errors in a virtual
operating system?
12
Spirit of this Talk
  • Software design involves striking balances --
    there is no trivial answer.
  • Concentrate on presenting several concrete
    problems and working solutions.
  • Given these data points, I will present some
    reasonable generalizations.
  • Languages and conventions are ancillary issues.
  • e.g. Exceptions vs. signals vs. errnos
  • Discussion and disagreement are welcome!

13
App
Bypass
The Pluggable File System
Local Driver
SRB Driver
Kangaroo Driver
GridFTP Driver
NeST Driver
HTTP Driver
Kangaroo Library
SRB Library
GridFTP Library
NeST Library
HTTP Library
Grid Services
Host Operating System
14
Examples of PFS
  • vi /gsiftp/vulture.cs.wisc.edu/etc/hosts
  • grep phone /http/www.cs.wisc.edu/
  • gcc /nest/turkey.cs.wisc.edu/input.c
    -o /kangaroo/khaki.ncsa.uiuc.edu/output
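
The path names above appear to follow a /driver/host/path pattern. A minimal sketch of how such a name could be split before being handed to a driver is shown below. The helper `split_pfs_path` and the `PfsTarget` type are hypothetical names made up for illustration; the slides do not show how PFS itself parses names.

```python
from typing import NamedTuple


class PfsTarget(NamedTuple):
    driver: str  # e.g. "gsiftp", "http", "nest", "kangaroo"
    host: str    # remote host named in the path
    path: str    # remainder, as the remote service would see it


def split_pfs_path(name: str) -> PfsTarget:
    """Split '/driver/host/rest' into three parts (hypothetical helper)."""
    parts = name.split("/", 3)  # ['', driver, host, rest]
    if len(parts) < 3 or parts[0] != "" or not parts[1] or not parts[2]:
        raise ValueError(f"not a /driver/host/path name: {name!r}")
    rest = "/" + parts[3] if len(parts) == 4 else "/"
    return PfsTarget(parts[1], parts[2], rest)
```

For example, `split_pfs_path("/gsiftp/vulture.cs.wisc.edu/etc/hosts")` yields the driver `gsiftp`, the host `vulture.cs.wisc.edu`, and the remote path `/etc/hosts`.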

15
A Kernel on Top of a Kernel
The Pluggable File System
File Descriptors (0 through 12)
namei
65 1001 0 150 126
File Pointers
Current Working Directory
/tmp/input
/gsiftp/host/out.10
/srb/host/tmp/data
/kangaroo/host/etc/hosts
File Objects
Mount Table
Local Driver
SRB Driver
Kangaroo Driver
GridFTP Driver
NeST Driver
HTTP Driver
Host Operating System
16
Not a Complete Virtual OS
  • Does not address process management,
    synchronization, etc...
  • Complete enough to be put to good use with real,
    non-trivial applications.
  • Gaussian - atomic model simulation
  • CMSIM - simulation of CERN LHC
  • POVray - ray tracing software
  • Structure and concept are developed enough to
    explore other OS issues; others welcome!

17
Top-Level Error Space
  • A single namespace of integer errors that apply
    to all levels of the system.
  • Any call is free to return any possible error.
    (124)
  • General vs specific
  • ENOENT vs ECHILD
  • Some artifacts
  • EACCES vs EPERM
  • EADV and EDOTDOT

EPERM    1   /* Operation not permitted */
ENOENT   2   /* No such file or directory */
ESRCH    3   /* No such process */
EINTR    4   /* Interrupted system call */
EIO      5   /* I/O error */
ENXIO    6   /* No such device or address */
E2BIG    7   /* Arg list too long */
ENOEXEC  8   /* Exec format error */
EBADF    9   /* Bad file number */
ECHILD   10  /* No child processes */
EAGAIN   11  /* Try again */
ENOMEM   12  /* Out of memory */
EACCES   13  /* Permission denied */
...
18
Concrete Problems and Solutions
  • Too little information - file transfer replies
    (FTP)
  • Stick your head in the sand.
  • Grope in the dark.
  • Never forget a face.
  • Too much information - infinite namespace (SRB)
  • Divide and conquer.
  • Appeal to a higher power.
  • New failure modes - login errors (Globus)
  • Take it easy.
  • Split hairs.

19
The Problem of Too Little Information
20
Too Little Information: FTP Replies
  • Integer codes indicate the severity of a response
    to an action.
  • Many transfer problems are identified, but few
    file system problems are.
  • Third digit specified infrequently, and for wide
    classes of errors.

First digit (severity):
100 - Positive Preliminary
200 - Positive Completion
300 - Positive Intermediate
400 - Transient Negative
500 - Permanent Negative
Second digit (category):
000 - Syntax
010 - Information
020 - Connections
030 - Authentication
040 - Unspecified
050 - File System
550 - e.g. File not found, no access
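
The digit scheme above can be sketched as a small classifier. This is only an illustration of the table on this slide; the function name `classify_reply` is an assumption, not part of any FTP library.

```python
from typing import Tuple

# First digit: severity of the reply (from the slide's table).
SEVERITY = {
    1: "Positive Preliminary",
    2: "Positive Completion",
    3: "Positive Intermediate",
    4: "Transient Negative",
    5: "Permanent Negative",
}

# Second digit: broad category of the reply (from the slide's table).
CATEGORY = {
    0: "Syntax",
    1: "Information",
    2: "Connections",
    3: "Authentication",
    4: "Unspecified",
    5: "File System",
}


def classify_reply(code: int) -> Tuple[str, str]:
    """Return (severity, category) for a three-digit FTP reply code."""
    severity = SEVERITY.get(code // 100, "Unknown")
    category = CATEGORY.get((code // 10) % 10, "Unknown")
    return severity, category
```

Note how little the result says: `classify_reply(550)` reports a permanent file system failure, but cannot distinguish "file not found" from "no access".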
21
Too Little Information: FTP Replies
App
Virtual OS
FTP Server
FTP Driver
22
Too Little Information: Stick your head in the sand
  • If you don't understand the failure, keep trying
    until the result is acceptable.
  • Might work for transient errors.
  • Might even work for the savvy user that can
    identify and fix problems.
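
A minimal sketch of this strategy is a blind retry loop. The bound on attempts and the `Flaky` stand-in operation are assumptions added for illustration; the slide itself only says to keep trying.

```python
import time


def retry_until_ok(operation, attempts=5, delay=0.0):
    """Call operation() until it succeeds or the attempts run out."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return operation()
        except OSError as err:
            last_error = err   # remember why the latest try failed
            time.sleep(delay)  # no diagnosis, just wait and try again
    raise last_error


class Flaky:
    """Hypothetical operation that fails twice, then succeeds."""

    def __init__(self):
        self.calls = 0

    def __call__(self):
        self.calls += 1
        if self.calls < 3:
            raise TimeoutError("transient failure")
        return "ok"
```

The loop never inspects the error, so it helps with transient failures and simply wastes time on permanent ones.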

23
Too Little Information: Grope in the Dark
  if GET succeeds
      return success
  else
      if CHDIR succeeds
          return EISDIR
      else
          if LIST succeeds
              return EACCES
          else
              return ENOENT
          end
      end
  end

GET
CHDIR
LIST
EACCES
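
The probing sequence on this slide can be sketched in Python as follows. The `ftp` object and its `get`, `chdir`, and `list` methods are hypothetical stand-ins that return True on success; `FakeFtp` exists only so the sketch can be run without a server.

```python
import errno


def grope(ftp, path):
    """Infer an errno for a failed GET by probing with further commands."""
    if ftp.get(path):
        return 0                 # success
    if ftp.chdir(path):
        return errno.EISDIR      # the name is a directory
    if ftp.list(path):
        return errno.EACCES      # visible, but not retrievable
    return errno.ENOENT          # nothing by that name


class FakeFtp:
    """Hypothetical in-memory server used only to exercise grope()."""

    def __init__(self, files=(), dirs=(), unreadable=()):
        self.files = set(files)
        self.dirs = set(dirs)
        self.unreadable = set(unreadable)

    def get(self, path):
        return path in self.files

    def chdir(self, path):
        return path in self.dirs

    def list(self, path):
        return path in self.files | self.dirs | self.unreadable
```

Each extra probe is another round trip to the server, which is where the latency cost of this approach comes from.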
24
Too Little Information: Never Forget a Face
  • Each error condition has a signature
  • Server identifier: wuftpd 4.1 ftp.cs
  • Operation attempted: GET
  • Message in reply: 550 Pas de tallenmand...
  • First Grope and then cache the determined error
    along with the signature.
  • Problems
  • Server must be consistent
  • Groping is not atomic
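
One way to picture the signature cache is a dictionary keyed by the three signature fields. The class name `SignatureCache` and the `grope` callback are assumptions for illustration; the slide does not describe the actual data structure.

```python
from typing import Callable, Dict, Tuple

# (server identifier, operation attempted, message in reply)
Signature = Tuple[str, str, str]


class SignatureCache:
    """Remember which errno a given failure signature turned out to mean."""

    def __init__(self) -> None:
        self._known: Dict[Signature, int] = {}

    def resolve(self, signature: Signature, grope: Callable[[], int]) -> int:
        # Grope only the first time a signature is seen; reuse the answer after.
        if signature not in self._known:
            self._known[signature] = grope()
        return self._known[signature]
```

The cached answer is only as good as the server's consistency: if the same reply text can mean two different things, the cache will return the wrong errno for one of them.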

25
The Problem of Too Much Information
26
Too Much Info: SRB Replies
  • Multiplexes many server backends into one client
    interface.
  • Error space is an amalgam of all back end error
    spaces.
  • Any call may return any error.
  • 1026 and growing!

UNIX_EPERM -1301
UNIX_ENOENT -1302
. . .
UNIX_EDEADLOCK -1356
HPSS_EPERM -1401
HPSS_ENOENT -1402
. . .
HPSS_NOCOS -1499
SQL_RSLT_TOO_LONG -1600
HTTP_ERR_BAD_PATH -1700
MCAT_OPEN_ERROR -3001
MCAT_CONNECT_ERROR -3002
. . .
MCAT_USER_NOT_IN_DOMN -3032
27
Too Much Information: Divide and Conquer
SRB error groups:
UNIX_EPERM -1301
UNIX_ENOENT -1302
. . .
UNIX_EDEADLOCK -1356
HPSS_EPERM -1401
HPSS_ENOENT -1402
. . .
HPSS_NOCOS -1499
SQL_RSLT_TOO_LONG -1600
HTTP_ERR_BAD_PATH -1700
MCAT_OPEN_ERROR -3001
MCAT_CONNECT_ERROR -3002
. . .
MCAT_USER_NOT_IN_DOMN -3032
Mapped onto:
EPERM
ENOENT
ESRCH
EINTR
EIO
EACCES
EISDIR
OTHER
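
A rough sketch of a divide-and-conquer conversion: handle each numeric range of SRB codes separately, and send anything unrecognized to an OTHER bucket. The exact arrows of the original diagram did not survive, so only the four name-based pairs in `EXACT` follow from the slide; the per-range defaults and the `OTHER` sentinel value are illustrative guesses.

```python
import errno

OTHER = -1  # stand-in for the slide's "OTHER" bucket (value is an assumption)

# Codes whose names make the intended errno clear.
EXACT = {
    -1301: errno.EPERM,   # UNIX_EPERM
    -1302: errno.ENOENT,  # UNIX_ENOENT
    -1401: errno.EPERM,   # HPSS_EPERM
    -1402: errno.ENOENT,  # HPSS_ENOENT
}

# (low, high, default errno) per group; the defaults are guesses.
RANGES = [
    (-1356, -1301, errno.EIO),     # UNIX_* group
    (-1499, -1401, errno.EIO),     # HPSS_* group
    (-1700, -1700, errno.ENOENT),  # HTTP_ERR_BAD_PATH
    (-3032, -3001, errno.EACCES),  # MCAT_* group
]


def srb_to_errno(code: int) -> int:
    """Convert an SRB error code to an errno, or OTHER if unrecognized."""
    if code in EXACT:
        return EXACT[code]
    for low, high, default in RANGES:
        if low <= code <= high:
            return default
    return OTHER
```

Codes that land in `OTHER` are the ones that would need a different strategy, such as an appeal to a human.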
28
Appeal to a Higher Power
App
Human
Virtual OS
SRB Server
SRB Driver
29
The Problem of New Failure Modes
30
New Failure Modes: Login Errors
App
Virtual OS
GSI Driver
GSI Resource
Identity Certificate
31
New Failure Modes: Login Errors
class Error {
    Error trigger;
    Module place_in_code;
    Object thing_in_question;
    String message;
}
  • Hierarchy of error objects, much like Java.
  • Errors may be identified by individual type or
    their membership in a more general type.

Error
Authentication
Authorization
Communication
No Creds
Expired Creds
No Trust
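
The hierarchy on this slide could be sketched with Python exception classes. The field names come from the slide; placing the three credential errors under Authentication is an assumption, since the diagram's edges were lost, and `react` is a hypothetical caller.

```python
class Error(Exception):
    """Base error carrying the fields listed on the slide."""

    def __init__(self, message, trigger=None, place_in_code=None,
                 thing_in_question=None):
        super().__init__(message)
        self.message = message
        self.trigger = trigger                      # the error that caused this one
        self.place_in_code = place_in_code          # module that raised it
        self.thing_in_question = thing_in_question  # object involved


class AuthenticationError(Error): pass
class AuthorizationError(Error): pass
class CommunicationError(Error): pass

# Assumed to be subtypes of AuthenticationError.
class NoCredentials(AuthenticationError): pass
class ExpiredCredentials(AuthenticationError): pass
class NoTrust(AuthenticationError): pass


def react(err: Error) -> str:
    """Hypothetical caller: match the specific type first, then the general."""
    if isinstance(err, ExpiredCredentials):
        return "renew credentials"
    if isinstance(err, AuthenticationError):
        return "ask user to log in"
    if isinstance(err, CommunicationError):
        return "retry later"
    return "give up"
```

A caller that knows about `ExpiredCredentials` can handle it precisely, while one that only knows `AuthenticationError` still reacts sensibly to every subtype.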
32
New Failure Modes: Take it Easy
Not Authorized
  • Easy for program to interpret and react.
  • Difficult for a human to debug.

Couldn't Authenticate
EACCES
Protocol Not Supp.
No identity
33
New Failure Modes: Split Hairs
Not Authorized
EACCES
  • Preserves unique error types for the savvy user.
  • Program may not be prepared to react to arbitrary
    error values.

EPERM
Couldn't Authenticate
Protocol Not Supp.
EPROTO
No identity
ESRCH
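
The two strategies can be compared as two conversion tables over the same four login failures. The pairings in `SPLIT_HAIRS` are one reading of the flattened diagram and should be treated as an assumption; `convert` is a hypothetical helper.

```python
import errno

FAILURES = [
    "Not Authorized",
    "Couldn't Authenticate",
    "Protocol Not Supp.",
    "No identity",
]

# Take it Easy: every new failure mode collapses into one familiar errno.
TAKE_IT_EASY = {name: errno.EACCES for name in FAILURES}

# Split Hairs: each failure mode gets its own errno (assumed pairings).
SPLIT_HAIRS = {
    "Not Authorized": errno.EACCES,
    "Couldn't Authenticate": errno.EPERM,
    "Protocol Not Supp.": errno.EPROTO,
    "No identity": errno.ESRCH,
}


def convert(failure: str, table: dict) -> int:
    """Look up the errno that a given strategy assigns to a failure."""
    return table[failure]
```

With `TAKE_IT_EASY` a program sees only EACCES and can react uniformly; with `SPLIT_HAIRS` a human can tell the cases apart, but a program written for ordinary files may not expect ESRCH or EPROTO from an open call.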
34
New Failure Modes: Rocks Solution
App
  • Reliable Sockets by Vic Zandy
  • Give a general error code along the standard
    channel.
  • Give a detailed message along a back channel.

Connection Lost
Reliable Sockets
rserrno
Reconnection Timeout Expired
Connection Refused
Standard Sockets
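
A minimal sketch of the two-channel idea: the caller receives a general code through the normal return path, and the detailed reason is kept in a side channel it may consult. The name `rserrno` is borrowed from the slide, but this is not the Rocks API; `report_failure` and the choice of ECONNRESET for "Connection Lost" are assumptions.

```python
import errno
import threading

# Back channel: per-thread storage for the most recent detailed reason.
_back_channel = threading.local()


def rserrno():
    """Return the detailed message for the last failure, if any."""
    return getattr(_back_channel, "detail", None)


def report_failure(detail: str) -> int:
    """Record the detail on the back channel; return the general code."""
    _back_channel.detail = detail
    return errno.ECONNRESET  # general "connection lost" code (assumed)
```

A program that only understands standard socket errors handles the general code as usual, while a curious user or a smarter layer can call `rserrno()` to learn that, say, a reconnection timeout expired.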
35
A Toolbox for Error Conversions
  • Simple Conversions
  • Take it Easy
  • Split Hairs
  • Divide and Conquer
  • Grope in the Dark
  • Never Forget a Face
  • Appeal to a Higher Power
  • Stick your Head in the Sand

Increasing Cost
36
Error Accuracy can be A Performance Concern
  • We can always find some way to produce a correct
    -- even if undesired -- execution.
  • But -
  • An Appeal to a Higher Power causes badput.
  • Groping in the Dark yields high latencies.
  • Head in the Sand may keep trying when no
    automatic recovery is possible.
  • ...or, a failure to retry results in unnecessary
    user interaction.

37
Hints for Error Design
  • 1 - Express errors in terms of the interface.
  • 2 - Assume the audience is a program.
  • 3 - Leave room to expand, but avoid using it.
  • 4 - Give the essence, not the details.

38
1 - Express Errors in Terms of the Interface
Application
  • Essence of separation of interface and
    implementation.
  • The user of an interface should not see a moving
    target as the implementation changes.

File Interface
Disk Impl
Network Impl
Memory Impl
???
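
As a sketch of this hint, an in-memory implementation of a file interface can translate its own internal failure into the error a file interface user expects. The class `MemoryFiles` is hypothetical and stands in for the "Memory Impl" box above.

```python
import errno
import os


class MemoryFiles:
    """Hypothetical in-memory implementation of a file-reading interface."""

    def __init__(self, data):
        self._data = dict(data)

    def read(self, name: str) -> bytes:
        try:
            return self._data[name]
        except KeyError:
            # KeyError is a detail of this implementation. Report the failure
            # in the vocabulary of the file interface instead.
            raise FileNotFoundError(
                errno.ENOENT, os.strerror(errno.ENOENT), name
            ) from None
```

A caller written against the file interface catches `FileNotFoundError` and never needs to know whether a disk, a network, or a dictionary sits underneath.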
39
2 - Assume the Audience is a Program
  • A computer-readable error can be used as the
    basis for a decision at any level.
  • A human-readable error can only result in a blind
    retry or an Appeal.
  • Computer-readable errors are easily made
    human-readable.
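
The last point can be illustrated with Python's standard `errno` and `os` modules: one integer code serves both a program's decision and a human's message. The helper names `describe` and `should_retry` are made up for this sketch, and the set of retryable codes is an assumption.

```python
import errno
import os


def should_retry(code: int) -> bool:
    """Program-facing decision based on the error code (assumed policy)."""
    return code in (errno.EAGAIN, errno.EINTR)


def describe(code: int) -> str:
    """Human-facing text derived from the same error code."""
    name = errno.errorcode.get(code, "UNKNOWN")
    return f"{name} ({code}): {os.strerror(code)}"
```

Going the other way, from free-form error text back to a decision, would require parsing messages, which is why a text-only error tends to end in a blind retry or an Appeal.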

Human
Decision
Decision
Layer 2
Decision
???
Layer 1
???
Decision
Error Text
Error Code
Layer 0
40
3 - Leave Room to Expand... but Avoid Using It
  • Any significantly different implementation of an
    interface will introduce new failure modes.
  • Possibilities for a new failure:
  • Best case: fit it into an existing error.
  • Medium case: return unknown error.
  • Worst case: Appeal to a Higher Power.

41
4 - Give the Essence, not the Details
  • The details distract the caller from the nature
    of the problem and result in cascading Appeals.
  • Example in file systems:
  • "Fell off the end of the directory linked list."
  • or "No file by that name."
  • Example in networking:
  • "Timer went off, but no network interrupt
    received."
  • or "Connection lost."
  • Example in security:
  • "Failure in PEM_do_header while reading
    password."
  • or "You have no credentials."
  • A restatement of hint 1.

42
Hall of Fame
  • All authors remain anonymous.
  • Error in return value.
  • A system call failed!
  • Could not execute job. Reason: Success

43
In Summary...
  • Error management is part of the art of software
    engineering.
  • The importance and the difficulty of error
    management are magnified in a virtual operating
    system.
  • All errors have some value, but low-signal errors
    result in performance problems.
  • Hints for error interface design.

44
Contact Info
  • Douglas Thain
  • thain@cs.wisc.edu
  • Software and other info
  • http://www.cs.wisc.edu/condor/pfs
  • Questions and discussion?