Title: Debugging operating systems with
1. Debugging operating systems with time-traveling virtual machines
by S.T. King, G.W. Dunlap, P.M. Chen
Presented by Mirna Limic
2. What is it? And why have one?
virtual machine (VM): a software abstraction of a physical machine
time travel: the ability to navigate through an immutable execution history
time-traveling virtual machine (TTVM): a VM that supports time travel
Used to debug an operating system (OS), because an OS:
- is non-deterministic
- runs for long periods of time
- may have its state perturbed by debugging
- interacts directly with hardware devices
3. virtual machine monitor (VMM): a software layer that provides the abstraction of a virtual machine
guest OS: the OS that runs on a VMM
host OS: the OS on which the VMM runs
Idea: extend gdb to make use of time travel.
[Figure labels: guest-user host process, guest-kernel host process, gdb, TTVM functionality (checkpointing, logging, replay), host OS]
4. VM state: the VM's physical memory, the virtual disk, CPU registers, and any state in the VMM or host kernel that affects the execution of the virtual machine
run: the time from when the virtual machine was powered on to the last instruction it executed
TTVM's capabilities:
1. reconstruct the complete state of the VM at any point in a run
2. start from any point in a run and replay the instruction stream executed during the original run
Achieved with: logging, replay, checkpointing
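The two capabilities above can be illustrated with a toy model. This is a minimal sketch, not the TTVM implementation: the VM state is reduced to a single integer, `step` is a hypothetical deterministic transition function, and the list `inputs` stands in for the log of non-deterministic events. The point it shows is that the state at any time can be rebuilt by restoring the nearest earlier checkpoint and replaying the log from there.

```python
from bisect import bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class Checkpoint:
    time: int   # number of steps executed when the snapshot was taken
    state: int  # stand-in for memory, disk, and CPU registers


def step(state: int, logged_input: int) -> int:
    """Hypothetical deterministic transition: one step of the guest."""
    return (state * 31 + logged_input) % 1_000_003


def record_run(inputs, checkpoint_every):
    """Execute the run once, taking a checkpoint every `checkpoint_every` steps."""
    state = 0
    checkpoints = [Checkpoint(0, state)]
    for t, value in enumerate(inputs, start=1):
        state = step(state, value)
        if t % checkpoint_every == 0:
            checkpoints.append(Checkpoint(t, state))
    return checkpoints, state


def state_at(target, inputs, checkpoints):
    """Rebuild the state at time `target`: restore a checkpoint, then replay."""
    times = [c.time for c in checkpoints]
    start = checkpoints[bisect_right(times, target) - 1]  # latest checkpoint <= target
    state = start.state
    for value in inputs[start.time:target]:
        state = step(state, value)
    return state
```

With more frequent checkpoints, `state_at` replays fewer steps, at the cost of storing more snapshots.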
5. VMM, logging and replaying
The VMM used is User-Mode Linux (UML). The logging/replay system used is ReVirt.
Host device drivers in the guest OS
UML exports a set of virtual devices with no hardware equivalent.
Problem: How to debug device drivers?
Workaround: modify UML to run real device drivers in the guest OS.
Result: I/O instructions and DMA requests of the guest OS are forwarded to the host hardware.
6. Host device drivers in the guest OS
- Logging is performed on any information sent from the device to the driver (IN instructions, memory-mapped I/O instructions, and DMA memory loads).
- The host OS provides regions of its physical memory for the guest's memory-mapped I/O and DMA.
- Potential problem: corruption of the host's memory?
- Solution: deny access to memory outside the intended region.
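The points above can be sketched in a few lines. This is an illustration only: `DmaRegion` and `LoggedDevice` are hypothetical names, not part of UML or ReVirt, and real devices are replaced by a Python callable. The sketch shows a bounds check that refuses accesses outside the region handed to the guest, and a wrapper that records device-to-driver data while logging and returns the recorded data during replay.

```python
class DmaRegion:
    """Hypothetical window of host memory handed to the guest driver."""

    def __init__(self, base, size):
        self.base = base
        self.size = size
        self.memory = bytearray(size)

    def _check(self, addr, length):
        # Deny any access that is not fully inside [base, base + size).
        if length < 0 or addr < self.base or addr + length > self.base + self.size:
            raise PermissionError(f"access outside region: {addr:#x}+{length}")

    def write(self, addr, data):
        self._check(addr, len(data))
        offset = addr - self.base
        self.memory[offset:offset + len(data)] = data

    def read(self, addr, length):
        self._check(addr, length)
        offset = addr - self.base
        return bytes(self.memory[offset:offset + length])


class LoggedDevice:
    """Logs device-to-driver data; with no device attached, replays from the log."""

    def __init__(self, read_port=None, log=None):
        self.read_port = read_port          # callable(port) -> value, or None on replay
        self.log = list(log or [])
        self.cursor = 0

    def inb(self, port):
        if self.read_port is None:          # replay: the device is never touched
            logged_port, value = self.log[self.cursor]
            self.cursor += 1
            if logged_port != port:
                raise RuntimeError("replay diverged from the original run")
            return value
        value = self.read_port(port)        # logging: forward to the device and record
        self.log.append((port, value))
        return value
```

Only the device-to-driver direction is recorded here, since data flowing from the driver to the device is regenerated by the replayed guest itself.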
7. Checkpointing
It is used to speed up time travel over long time periods. It is done by logging memory and disk accesses into undo and redo logs.
Difference:
memory: log the actual pages at every checkpoint into the undo and redo logs
disk: log multiple versions of guest disk blocks, but keep only the changes to the guest -> host disk block map in the undo and redo logs
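A minimal sketch of the memory side of this scheme, under simplifying assumptions: pages are dictionary entries, the class name `CheckpointedMemory` is hypothetical, and `restore` is only called when no writes are pending since the last checkpoint. For each interval between two checkpoints, the undo log holds the contents the modified pages had at the earlier checkpoint, and the redo log holds their contents at the later one.

```python
class CheckpointedMemory:
    """Toy page store with one undo log and one redo log per checkpoint interval."""

    def __init__(self, pages):
        self.pages = dict(pages)   # checkpoint 0 is this initial state
        self.undo_logs = []        # undo_logs[i]: takes checkpoint i+1 back to i
        self.redo_logs = []        # redo_logs[i]: takes checkpoint i forward to i+1
        self._pre_images = {}      # pages written since the last checkpoint

    def write(self, page, data):
        # Save the old contents only on the first write in this interval.
        self._pre_images.setdefault(page, self.pages[page])
        self.pages[page] = data

    def checkpoint(self):
        """Close the current interval and return the new checkpoint's index."""
        self.undo_logs.append(self._pre_images)
        self.redo_logs.append({p: self.pages[p] for p in self._pre_images})
        self._pre_images = {}
        return len(self.undo_logs)

    def restore(self, current, target):
        """Move the state from checkpoint `current` to checkpoint `target`."""
        while current > target:
            current -= 1
            self.pages.update(self.undo_logs[current])
        while current < target:
            self.pages.update(self.redo_logs[current])
            current += 1
```

Only pages modified in an interval appear in its logs, so the cost of moving between adjacent checkpoints depends on how much memory was written, not on the total memory size.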
8. Checkpointing: logging of memory
[Figure: writes to memory pages A through E between checkpoint1, checkpoint2, and checkpoint3, with the resulting undo logs and redo logs]
9. TTVM-aware gdb
Commands added to gdb:
reverse continue: takes the VM back to a previous point (a point is the reverse equivalent of a forward breakpoint, watchpoint, or step)
reverse step: goes back a specified number of instructions
goto: jumps to an arbitrary time in the execution
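One plausible way to build reverse continue on top of checkpointing and replay is a two-pass scheme, sketched below. This is an assumption-laden toy, not gdb or TTVM code: the state is an integer, `step` is a hypothetical deterministic transition, and `is_breakpoint` is a hypothetical predicate on the state. The first pass replays forward from an earlier checkpoint and remembers the last breakpoint hit before the current time; the second pass replays again and stops there.

```python
def step(state, logged_input):
    """Hypothetical deterministic transition: one step of the guest."""
    return state + logged_input


def reverse_continue(now, checkpoint_time, checkpoint_state, inputs, is_breakpoint):
    """Return (time, state) of the last breakpoint hit strictly before `now`, or None."""
    # Pass 1: replay from the checkpoint up to (not including) `now`,
    # remembering the most recent time at which a breakpoint fired.
    state = checkpoint_state
    last_hit = None
    for t in range(checkpoint_time + 1, now):
        state = step(state, inputs[t - 1])
        if is_breakpoint(state):
            last_hit = t
    if last_hit is None:
        return None  # a fuller version would retry from an earlier checkpoint

    # Pass 2: restore the checkpoint again and replay only up to that hit.
    state = checkpoint_state
    for t in range(checkpoint_time + 1, last_hit + 1):
        state = step(state, inputs[t - 1])
    return last_hit, state
```

In the same model, reverse step and goto reduce to replaying from a checkpoint to a known target time, so they need only the second pass.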
10. Performance
Machine: uniprocessor 3 GHz Pentium 4, 1 GB memory, 120 GB Hitachi Deskstar GXP disk
Host OS: Linux 2.4.18 with UML running in skas mode, and TTVM modifications
Guest OS: 256 MB memory, 5 GB disk
Both guest and host filesystems are initialized from RedHat 9.
Three guest workloads measured:
- SPECweb99 using Apache (SPECweb99 is a benchmark for evaluating the performance of WWW servers)
- 3 successive builds of the Linux 2.4 kernel, where each build executes make clean, make dep, make bzImage
- PostMark filesystem benchmark
11. Performance (cont'd)
Time and space overhead of logging for the three workloads
Logging without checkpointing:
Workload       Time overhead   Space overhead
SPECweb99      12%             85 KB/sec
kernel build   11%             7 KB/sec
PostMark       3%              2 KB/sec
Replay without checkpointing: 1-3% longer for all three workloads
12. Performance (cont'd)
Running time with checkpointing
Running times are normalized to running the workload without any checkpoints.
Running time without checkpoints:
SPECweb99      1135 sec
kernel build   1027 sec
PostMark       1114 sec
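As a rough back-of-the-envelope check, the reported log growth rates (85, 7, and 2 KB/sec) can be multiplied by these running times to estimate the total log size per run. This is only a sketch: it assumes the logging rate is constant over the whole run and that the running times above apply to the logged runs, neither of which is stated on the slides.

```python
# Values taken from the slides; the pairing of rates with run times is an assumption.
LOG_RATE_KB_PER_SEC = {"SPECweb99": 85, "kernel build": 7, "PostMark": 2}
RUN_TIME_SEC = {"SPECweb99": 1135, "kernel build": 1027, "PostMark": 1114}


def estimated_log_kb(workload):
    """Estimated log size in KB, assuming a constant logging rate."""
    return LOG_RATE_KB_PER_SEC[workload] * RUN_TIME_SEC[workload]


def estimated_log_mb(workload):
    """Same estimate in MB (1 MB = 1024 KB)."""
    return estimated_log_kb(workload) / 1024


if __name__ == "__main__":
    for name in LOG_RATE_KB_PER_SEC:
        print(f"{name}: about {estimated_log_mb(name):.1f} MB of log")
```

Under these assumptions, even the most log-intensive workload stays under 100 MB for a run of roughly 19 minutes.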
13. Performance (cont'd)
Space overhead of checkpointing
14. Performance (cont'd)
Time to restore a checkpoint
15. A common problem with traditional debuggers is that using the debugger changes the timing of events in the application. Are you convinced that this particular implementation can replay execution reliably enough for debugging purposes?
Would you say that the authors can justify their claims about the debugging strength of TTVM based on the debugging examples given in the paper?
16. Do you think that this technique can be adapted for debugging parallel applications (multiprocessors), which generally involve high replay cost and complexity?
In general, which OS processes experience the greatest number of bugs and the most significant bugs? Would it be sufficient to monitor only those sections of the OS with TTVM?
17. How can TTVM be enhanced to identify OS bugs that it has not yet encountered or might not encounter in the near future? How can the entire range of OS bugs be identified?
Do you think that this idea can be easily applied to non-x86 architectures?
Would the guest kernel need to be modified if TTVM were implemented on hardware-based virtualization technology (e.g., AMD-V, Intel VT)?
18. During checkpointing, how would you capture network state and replay it later? What would you log?