Xen and the Art of Virtualization (SOSP 2003)
- Paul Barham, Boris Dragovic, Keir Fraser, Steven Hand, Tim Harris, Alex Ho, Rolf Neugebauer, Ian Pratt, Andrew Warfield
- University of Cambridge Computer Laboratory
Virtualization Overview
- Single OS image: Virtuozzo, VServers, Zones
  - Group user processes into resource containers
  - Hard to get strong isolation
- Full virtualization: VMware, VirtualPC, QEMU
  - Run multiple unmodified guest OSes
  - Hard to efficiently virtualize x86
- Para-virtualization: UML, Xen
  - Run multiple guest OSes ported to a special arch
  - The Xen/x86 arch is very close to normal x86
x86 CPU Virtualization
- Xen runs in ring 0 (most privileged)
- Ring 1/2 for guest OS, 3 for user-space
- GPF if guest attempts to use privileged instr
- Xen lives in top 64MB of linear addr space
- Segmentation used to protect Xen, as switching page tables is too slow on standard x86
- Hypercalls jump to Xen in ring 0 (see the sketch after this slide)
- Guest OS may install fast trap handler
- Direct user-space to guest OS system calls
- MMU virtualisation: shadow vs. direct mode
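A minimal sketch of the hypercall path above, assuming the early 32-bit Xen convention (software interrupt vector 0x82, call number in EAX, arguments in EBX/ECX); the wrapper name and two-argument form are illustrative, not the exact Xen ABI:

```c
/* Hypothetical two-argument hypercall wrapper for a 32-bit
 * paravirtualized guest: "int $0x82" transfers control to Xen in
 * ring 0, with the call number in EAX and arguments in EBX/ECX. */
static inline long hypercall2(unsigned int nr, unsigned long arg1,
                              unsigned long arg2)
{
    long ret;
    __asm__ __volatile__(
        "int $0x82"                        /* trap into the hypervisor */
        : "=a" (ret)                       /* result returned in EAX */
        : "a" (nr), "b" (arg1), "c" (arg2) /* call number + arguments */
        : "memory");
    return ret;
}
```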
MMU Virtualization
- Critical for performance; challenging to make fast, especially on SMP
- Xen supports three MMU virtualization modes:
  1. Direct page tables
  2. Shadow page tables
  3. Hardware-assisted paging
- OS paravirtualization is compulsory for mode 1, optional (and very beneficial) for modes 2 and 3
MMU Virtualization: Direct-Mode
MMU Virtualization: Shadow-Mode
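The Direct-Mode and Shadow-Mode slides were diagrams in the original deck. As a rough stand-in for the shadow-mode picture: the guest edits its own page tables, the hypervisor intercepts those writes, and each validated entry is copied into the shadow table that the hardware actually walks, translating guest (pseudo-physical) frames into machine frames. The p2m[] mapping, PTE layout, and function name below are simplified assumptions:

```c
/* Sketch, not Xen source: propagate one guest PTE into the shadow
 * page table used by the MMU.  p2m[] maps guest pseudo-physical
 * frame numbers to machine frame numbers. */
#include <stdint.h>

#define PTE_PRESENT   0x1u
#define PTE_FLAG_MASK 0xFFFu

extern uint64_t p2m[];   /* pseudo-physical frame -> machine frame */

void shadow_sync_pte(uint64_t guest_pte, uint64_t *shadow_pte)
{
    if (!(guest_pte & PTE_PRESENT)) {   /* not-present entries copy through */
        *shadow_pte = guest_pte;
        return;
    }
    uint64_t gfn = guest_pte >> 12;                    /* guest frame number */
    uint64_t mfn = p2m[gfn];                           /* machine frame      */
    *shadow_pte = (mfn << 12) | (guest_pte & PTE_FLAG_MASK);
}
```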
Para-Virtualizing the MMU
- Guest OSes allocate and manage own PTs
- Hypercall to change PT base
- Xen must validate PT updates before use
- Allows incremental updates, avoids revalidation
- Validation rules applied to each PTE
- Guest may only map pages it owns
- Page table pages may only be mapped RO
- Xen traps PTE updates and emulates them, or unhooks the PTE page for bulk updates (validation sketch below)
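A minimal sketch of the validation rules above, using hypothetical hypervisor-side helpers (page_owner_is() and is_page_table_page() are invented names): an update is accepted only if the guest owns the target frame, and frames that hold page tables may only be mapped read-only.

```c
/* Sketch of per-PTE validation; the helper functions are placeholders
 * for the hypervisor's page ownership and type tracking. */
#include <stdbool.h>
#include <stdint.h>

#define PTE_PRESENT  0x1u
#define PTE_WRITABLE 0x2u

bool page_owner_is(uint64_t mfn, int domain_id);   /* hypothetical */
bool is_page_table_page(uint64_t mfn);             /* hypothetical */

bool pte_update_is_valid(uint64_t new_pte, int domain_id)
{
    if (!(new_pte & PTE_PRESENT))
        return true;                       /* clearing an entry is always allowed */

    uint64_t mfn = new_pte >> 12;
    if (!page_owner_is(mfn, domain_id))
        return false;                      /* guest may only map pages it owns */

    if (is_page_table_page(mfn) && (new_pte & PTE_WRITABLE))
        return false;                      /* PT pages may only be mapped read-only */

    return true;
}
```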
MMU Micro-Benchmarks
Figure: lmbench results for page fault latency (µs) and process fork time (µs) on Linux (L), Xen (X), VMware Workstation (V), and UML (U).
I/O Architecture
- Xen IO-Spaces delegate to guest OSes protected access to specified h/w devices
  - Virtual PCI configuration space
  - Virtual interrupts
- Devices are virtualised and exported to other VMs via Device Channels (ring sketch below)
  - Safe asynchronous shared-memory transport
  - Backend drivers export to frontend drivers
  - Net: use normal bridging, routing, iptables
  - Block: export any block device, e.g. sda4, loop0, vg3
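A device channel's shared-memory transport can be pictured as a simple producer/consumer ring: the frontend places requests on a shared page and advances a producer index, the backend posts responses the same way, and event-channel notifications wake the other side. The structure below is a simplified assumption, not the actual Xen ring layout:

```c
/* Sketch of a shared-memory I/O ring between a frontend and backend
 * driver.  Field names, the block-request shape, and the fixed ring
 * size are illustrative. */
#include <stdint.h>

#define RING_SIZE 32                    /* power of two */

struct blk_request  { uint64_t sector; uint32_t nr_sectors; uint32_t id; };
struct blk_response { uint32_t id; int32_t status; };

struct io_ring {
    volatile uint32_t req_prod;         /* advanced by the frontend */
    volatile uint32_t rsp_prod;         /* advanced by the backend  */
    struct blk_request  req[RING_SIZE];
    struct blk_response rsp[RING_SIZE];
};

/* Frontend side: queue one request on the shared page. */
static void ring_put_request(struct io_ring *r, struct blk_request rq)
{
    r->req[r->req_prod % RING_SIZE] = rq;
    __sync_synchronize();               /* publish the request before the index */
    r->req_prod++;
    /* ...then notify the backend over the event channel. */
}
```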
Device Channel Interface
System Performance
Figure: benchmark suite results for SPEC INT2000 (score), Linux build time (s), OSDB-OLTP (tup/s), and SPEC WEB99 (score), running on Linux (L), Xen (X), VMware Workstation (V), and UML (U).
TCP Results
Figure: TCP bandwidth (Mbps) for Tx and Rx at MTU 1500 and MTU 500 on Linux (L), Xen (X), VMware Workstation (V), and UML (U).
Xen 3.0 Architecture
Diagram: VM0 runs the Device Manager and Control s/w on a GuestOS (XenLinux) with native device drivers (AGP, ACPI, PCI) and back-end drivers; VM1 and VM2 run unmodified user software on GuestOS (XenLinux) with front-end device drivers; VM3 runs an unmodified GuestOS (WinXP) on VT-x. The 32/64-bit, SMP-capable Xen Virtual Machine Monitor provides the Control IF, Safe HW IF, Event Channel, Virtual CPU, and Virtual MMU, above the Hardware (SMP, MMU, physical memory, Ethernet, SCSI/IDE).
Live Migration of Virtual Machines
- Christopher Clark, Keir Fraser, Steven Hand, Jacob Gorm Hansen, Eric Jul, Christian Limpach, Ian Pratt, Andrew Warfield
- University of Cambridge Computer Laboratory
Motivation
- VM relocation enables
- High-availability
- Machine maintenance
- Load balancing
- Statistical multiplexing gain
Assumptions
- Networked storage
  - NAS: NFS, CIFS
  - SAN: Fibre Channel
  - iSCSI, network block device
  - DRBD network RAID
- Good connectivity
  - Common L2 network
  - L3 re-routing
Strategy
Strategy (2)
- Stage 0 (pre-migration): VM active on host A; destination host selected (block devices mirrored)
- Stage 1 (reservation): initialize container on target host
- Stage 2 (iterative pre-copy): copy dirty pages in successive rounds (loop sketched below)
- Stage 3 (stop-and-copy): suspend VM on host A; redirect network traffic; synchronize remaining state
- Stage 4 (commitment): activate on host B; VM state on host A released
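A sketch of the iterative pre-copy loop behind stages 2 through 4; every helper and constant here (page_set_t, send_all_pages(), pages_dirtied_since_last_round(), the thresholds) is a hypothetical placeholder for the real migration machinery:

```c
/* Rough shape of pre-copy migration: push everything once, then keep
 * re-sending only the pages dirtied during the previous round until
 * the remaining set is small, then stop-and-copy. */
typedef struct page_set page_set_t;

void        send_all_pages(void);
page_set_t *pages_dirtied_since_last_round(void);
unsigned    set_size(page_set_t *s);
void        send_pages(page_set_t *s);
void        suspend_vm(void);
void        redirect_network_traffic(void);
void        activate_on_destination(void);

enum { MAX_ROUNDS = 30, STOP_COPY_THRESHOLD = 50 /* pages */ };

void precopy_migrate(void)
{
    send_all_pages();                                /* round 1: copy every page */

    for (int round = 2; round <= MAX_ROUNDS; round++) {
        page_set_t *dirty = pages_dirtied_since_last_round();
        if (set_size(dirty) < STOP_COPY_THRESHOLD)
            break;                                   /* writable working set is small enough */
        send_pages(dirty);                           /* re-send only what was dirtied */
    }

    suspend_vm();                                    /* stage 3: stop-and-copy */
    send_pages(pages_dirtied_since_last_round());    /* synchronize remaining state */
    redirect_network_traffic();
    activate_on_destination();                       /* stage 4: commitment */
}
```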
Pre-Copy Migration: Round 1
Pre-Copy Migration: Round 2
Pre-Copy Migration: Final
Writable Working Set
- Set of pages written to by OS/application
- Pages that are dirtied must be re-sent
- Hot pages
  - E.g. process stacks
  - Top of free page list (works like a stack)
  - Buffer cache
  - Network receive / disk buffers
Page Dirtying Rate
- Dirtying rate determines VM down-time (rough estimate sketched below)
- Shorter iterations → less dirtying → shorter iterations
- Stop and copy final pages
- Application phase changes create spikes
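A purely illustrative back-of-the-envelope estimate of the relationship above: once pre-copy rounds stop shrinking, the final stop-and-copy must transfer roughly the writable working set over the migration link, so down-time is approximately WWS size divided by bandwidth. The numbers below are made up:

```c
/* Toy estimate: down-time of the stop-and-copy phase, assuming the
 * whole writable working set must cross the migration link. */
#include <stdio.h>

int main(void)
{
    double bandwidth_mbps = 1000.0;   /* migration link, Mbit/s   */
    double wws_mbytes     = 16.0;     /* writable working set, MB */

    double downtime_s = (wws_mbytes * 8.0) / bandwidth_mbps;
    printf("approx. stop-and-copy down-time: %.0f ms\n", downtime_s * 1000.0);
    return 0;
}
```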
Thanks / The End