Title: Accelerating Two-Dimensional Page Walks for Virtualized Systems
1. Accelerating Two-Dimensional Page Walks for Virtualized Systems
2. Introduction
- Native (non-virtualized) system
  - An OS runs directly on a physical system.
  - The OS communicates with the physical hardware directly.
- Address mapping
  - Virtual Address (VA): the address used by the OS and application software.
  - Physical Address (PA): the address used by the physical machine.
  - In a native system the translation is VA -> PA.
3. Introduction
- Virtualization
  - Multiple OSes can run simultaneously, but separately, on one physical system.
  - Hypervisor: the underlying software that inserts abstractions into the virtualized system and manages the communication between the guest OSes and the physical system.
4. Introduction
- Virtualization
  - Address mapping for a virtual machine:
    - Guest OS: Guest Virtual Address (GVA) and Guest Physical Address (GPA).
    - Physical system: System Physical Address (SPA).
  - Address translation: GVA -> GPA -> SPA.
5. Introduction
- Virtualization
  - Traditional approach: memory translation is managed by the hypervisor.
    - Drawback: the hypervisor intercepts the operation, exits the guest, emulates the operation, performs the memory translation, and then returns to the guest -> high overhead.
  - Alternative approach: let hardware perform the translation.
    - The hypervisor is no longer involved in every translation, saving that overhead.
6. Background
- x86 Native Page Translation
  - Page table: a hierarchy of address-translation tables that maps VA to PA.
  - Page walk: an iterative process.
    - To obtain the final PA for a VA, the hardware must walk the page table, traversing every level of the hierarchy.
7. Background
- x86 Native Page Translation
  - The walk proceeds from level 4 (L4) down to level 1 (L1).
  - At each level, the physical address from the level above is used as the table base, and a 9-bit field of the VA is used as the index into that table (see the sketch below).
  - The TLB (translation look-aside buffer) caches the final physical address to reduce the frequency of page walks.
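To make the iteration concrete, here is a minimal C sketch of the native 4-level walk. It assumes a simplified entry format and a hypothetical read_phys() helper standing in for a physical-memory read; real x86 entries carry permission and status bits that are omitted here.

    #include <stdint.h>

    /* Hypothetical helper: stands in for a read of physical memory. */
    uint64_t read_phys(uint64_t pa);

    /* Native x86-64 4-level walk: each level consumes 9 bits of the VA,
     * from L4 down to L1, then the low 12 bits are the page offset. */
    uint64_t native_page_walk(uint64_t cr3, uint64_t va)
    {
        uint64_t base = cr3;                                      /* L4 table base (physical) */
        for (int level = 4; level >= 1; level--) {
            uint64_t idx   = (va >> (12 + 9 * (level - 1))) & 0x1FF; /* 9-bit index */
            uint64_t entry = read_phys(base + idx * 8);              /* one memory reference */
            base = entry & ~0xFFFULL;                                /* base of next level (or final page) */
        }
        return base | (va & 0xFFF);                               /* final PA = page base + offset */
    }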
8. Background
- Memory Management for Virtualization
  - Without hardware support, the hypervisor must manage this translation, typically with shadow page tables that map GVA directly to SPA. This is a major source of hypervisor overhead.
  - Hardware mechanism:
    - Same idea as native x86 page walking, extended to two dimensions (the 2D page walk).
    - Nested page tables map GPA to SPA.
9. Background
- Memory Management for Virtualization
  - The guest page table is traversed to translate GVA to GPA. At each guest level (gL), the guest-physical address of the table must first be translated to an SPA by walking the nested page table before the guest entry can be read (see the sketch below).
  - The TLB caches the final SPA to reduce page walk overhead.
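A minimal sketch of the 2D walk follows, reusing the hypothetical read_phys() helper from the previous sketch. It is only meant to show why each guest level triggers a full nested walk; with 4-level tables in both dimensions this comes to 4 x (4 + 1) + 4 = 24 page entry references.

    #include <stdint.h>

    uint64_t read_phys(uint64_t pa);   /* hypothetical physical-memory read, as before */

    /* Nested walk: translate a guest-physical address (GPA) to a system-physical
     * address (SPA) by walking the 4-level nested page table rooted at nCR3. */
    uint64_t nested_walk(uint64_t ncr3, uint64_t gpa)
    {
        uint64_t base = ncr3;
        for (int level = 4; level >= 1; level--) {                   /* nL4 .. nL1 */
            uint64_t idx   = (gpa >> (12 + 9 * (level - 1))) & 0x1FF;
            uint64_t entry = read_phys(base + idx * 8);
            base = entry & ~0xFFFULL;
        }
        return base | (gpa & 0xFFF);
    }

    /* 2D walk: the guest walk operates on guest-physical addresses, so every
     * guest table pointer must itself be translated by a full nested walk
     * before the guest entry can be read. */
    uint64_t two_dim_page_walk(uint64_t gcr3_gpa, uint64_t ncr3, uint64_t gva)
    {
        uint64_t gpa_base = gcr3_gpa;                                /* guest CR3 holds a GPA */
        for (int glevel = 4; glevel >= 1; glevel--) {                /* gL4 .. gL1 */
            uint64_t spa_base = nested_walk(ncr3, gpa_base);         /* nL4..nL1 for this gL */
            uint64_t idx   = (gva >> (12 + 9 * (glevel - 1))) & 0x1FF;
            uint64_t entry = read_phys(spa_base + idx * 8);          /* the guest entry itself */
            gpa_base = entry & ~0xFFFULL;                            /* next level, still a GPA */
        }
        /* The final data GPA must also be translated before the data access. */
        return nested_walk(ncr3, gpa_base | (gva & 0xFFF));
    }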
10. Background
- Large page size advantages
  - Memory saving
    - With 4 KB pages, the OS needs an entire L1 table, which is itself 4 KB. If the 512 4 KB pages it maps are combined into one contiguous 2 MB block, the L1 table can be skipped, saving that 4 KB.
  - Reduction in TLB pressure
    - One large-page entry occupies a single TLB entry, whereas the corresponding regular pages require 512 4 KB TLB entries to map the same 2 MB range of virtual addresses.
  - Shorter page walk
    - Skipping the entire L1 level makes the page walk shorter and therefore saves some overhead (see the sketch below).
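A small illustrative sketch of the 2 MB case, assuming the usual x86-64 layout (the helper below is an example, not something from the paper): the walk stops at L2, so the L1 table is never touched.

    #include <stdint.h>

    #define LARGE_PAGE_SHIFT 21                                    /* 2 MB = 2^21 bytes */
    #define LARGE_PAGE_MASK  ((1ULL << LARGE_PAGE_SHIFT) - 1)

    /* With a 2 MB page the L2 entry supplies the 2 MB-aligned physical base and
     * the low 21 bits of the VA are used directly as the offset, so the L1 level
     * (512 entries * 8 bytes = one 4 KB table) is skipped entirely. */
    uint64_t large_page_pa(uint64_t l2_entry, uint64_t va)
    {
        return (l2_entry & ~LARGE_PAGE_MASK) | (va & LARGE_PAGE_MASK);
    }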
11. Page walk characterization
- Page walk cost
  - Perfect TLB Opportunity is the performance improvement that could be achieved with a perfect TLB, one that eliminates cold misses as well as conflict and capacity misses (one assumed formulation is sketched below).
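One plausible way to quantify the metric (an assumed formulation for illustration; the paper's exact accounting may differ), comparing run time with the real TLB against run time with an ideal TLB:

    % Assumed formulation: T_real is execution time with the real TLB,
    % T_perfect is execution time if every TLB access were a hit.
    \[
    \text{Perfect TLB Opportunity} \;\approx\; \frac{T_{\text{real}} - T_{\text{perfect}}}{T_{\text{real}}}
    \]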
12. Page walk characterization
13. Page walk characterization
14. Page walk characterization
- Page entry reuse
  - Nested page tables have much higher reuse than guest page tables, in part due to the inherent redundancy of the nested page walk.
  - There are many more nested accesses than guest accesses in a 2D page walk: every level of the nested page table hierarchy must be accessed for each guest level, so the same nested page entries are often accessed multiple times within a single 2D page walk (a high reuse rate).
15. Page walk characterization
<gL1, G> and <gPA, nL1> both have a high number of unique page entries because both map guest data into their respective address spaces: <gL1, G> maps GVA -> GPA and <gPA, nL1> maps GPA -> SPA. These two entry types are therefore the most difficult to cache.
16. Page Walk Acceleration
- AMD Opteron Translation Caching
  - Page walk cache (PWC)
    - Stores page entries from all page table levels except L1, whose translations are held in the TLB instead.
    - All page entries are initially brought into the L2 cache; on a PWC miss, the page entry data may still reside in the L2 cache or the L3 cache (if present). A lookup sketch follows below.
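A minimal sketch of how such a lookup might be ordered, with hypothetical pwc_lookup(), l2_lookup(), l3_lookup(), and read_dram() helpers standing in for the real hardware structures:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helpers: the PWC is probed by the system physical address of
     * the page entry; the L2/L3 lookups model the data cache hierarchy. */
    bool pwc_lookup(uint64_t entry_spa, uint64_t *entry);
    bool l2_lookup(uint64_t entry_spa, uint64_t *entry);
    bool l3_lookup(uint64_t entry_spa, uint64_t *entry);
    uint64_t read_dram(uint64_t entry_spa);

    /* Fetch one page table entry during a walk. */
    uint64_t fetch_page_entry(uint64_t entry_spa, int level)
    {
        uint64_t entry;
        /* L2/L3/L4 entries may hit in the PWC; L1 entries are not cached there
         * because the final translation already lives in the TLB. */
        if (level > 1 && pwc_lookup(entry_spa, &entry))
            return entry;
        /* PWC miss: the entry may still reside in the L2 or L3 data cache. */
        if (l2_lookup(entry_spa, &entry) || l3_lookup(entry_spa, &entry))
            return entry;
        return read_dram(entry_spa);   /* worst case: a memory access */
    }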
17. Page Walk Acceleration
- Translation caching for 2D page walks
18. Page Walk Acceleration
- Translation caching for 2D page walks
  - One-Dimensional PWC (1D_PWC)
    - Only page entry data from the guest dimension are stored in the PWC, and the entries are tagged by their system physical address.
    - The lowest-level guest page table entry, <G, gL1>, is not cached in the PWC because of its low reuse rate.
  - Two-Dimensional PWC (2D_PWC)
    - Extends 1D_PWC into the nested dimension of the 2D page walk, turning the 20 unconditional cache hierarchy accesses into 16 likely PWC hits (dark-filled references in Figure 5(b)) and four possible PWC hits (checkered references). Like 1D_PWC, all page entries are tagged with their system physical address and <G, gL1> is not cached (see the cacheability sketch below).
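A small sketch of the cacheability rule under each scheme; the encoding of a reference as a (guest level, nested level) pair is an illustrative convention, not the paper's notation.

    #include <stdbool.h>

    typedef enum { PWC_1D, PWC_2D } pwc_scheme_t;

    /* A 2D-walk reference is identified here by (gl_level, nl_level):
     *   nl_level == 0  -> a guest entry <G, gLx>
     *   gl_level == 0  -> a nested entry for the final data address <nLx, gPA>
     * In both schemes, entries are tagged by their system physical address. */
    bool pwc_cacheable(pwc_scheme_t scheme, int gl_level, int nl_level)
    {
        bool is_guest_entry = (nl_level == 0);

        if (scheme == PWC_1D)
            /* Guest dimension only, and never the low-reuse <G, gL1> entry. */
            return is_guest_entry && gl_level > 1;

        /* 2D_PWC also caches nested entries; <G, gL1> still stays out. */
        return !(is_guest_entry && gl_level == 1);
    }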
19. Page Walk Acceleration
- Translation caching for 2D page walks
  - Two-Dimensional PWC with Nested Translations (2D_PWC_NT)
    - Augments 2D_PWC with a dedicated GPA-to-SPA translation buffer, the Nested TLB (NTLB), which reduces the average number of page entry references during a 2D page walk.
    - The NTLB uses the guest physical address of a guest page entry to cache the corresponding nL1 entry.
    - The page walk begins by probing the NTLB with the guest physical address of <G, gL4>, producing the data of <nL1, gL4> and allowing nested references 1-4 to be skipped. On an NTLB hit, the system physical address of <G, gL4> needed for the PWC access is computed directly (see the sketch below).
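A minimal sketch of the NTLB fast path, assuming the nested_walk() helper from the earlier 2D-walk sketch and hypothetical ntlb_lookup() / ntlb_insert() routines:

    #include <stdbool.h>
    #include <stdint.h>

    uint64_t nested_walk(uint64_t ncr3, uint64_t gpa);              /* from the earlier sketch */
    bool ntlb_lookup(uint64_t guest_entry_gpa, uint64_t *spa_page); /* hypothetical NTLB probe */
    void ntlb_insert(uint64_t guest_entry_gpa, uint64_t spa_page);

    /* Translate the GPA of a guest page entry to its SPA. On an NTLB hit the
     * whole 4-reference nested walk for that guest level is skipped and the SPA
     * needed for the PWC access is computed directly; on a miss, the nested walk
     * runs and its result is installed in the NTLB. */
    uint64_t guest_entry_spa(uint64_t ncr3, uint64_t guest_entry_gpa)
    {
        uint64_t spa_page;
        if (ntlb_lookup(guest_entry_gpa, &spa_page))
            return spa_page | (guest_entry_gpa & 0xFFF);            /* nested refs 1-4 skipped */

        uint64_t spa = nested_walk(ncr3, guest_entry_gpa);          /* full nested walk */
        ntlb_insert(guest_entry_gpa, spa & ~0xFFFULL);
        return spa;
    }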
20. Result
- The benchmarks used in the following slides.
21. Result
The three hardware-only page walk caching schemes improve performance by turning page entry memory hierarchy references into lower-latency PWC accesses and, in the case of 2D_PWC_NT, by skipping some page entry references entirely.
22. Result
Left side: the G column is not skipped, so it does not change; the same holds for the gPA row. The nested references for gL1 are skipped in 2D_PWC_NT even though they have a low reuse rate, so the gL1 portion occupies a shorter span in 2D_PWC_NT than in 2D_PWC. Right side: the NTLB eliminates many of the PWC accesses, but it does not eliminate a significant portion of the accesses that carry the highest penalty.
23. Result
- The first data column states that L2 accesses incurred during a 2D page walk using the 2D_PWC_NT configuration generate 2.7-5.5 times more L2 misses than the native page walk.
- This increase is primarily because the native page walk has fewer entries that are difficult to cache (L1 and sometimes L2) compared to the 2D page walk (<G,gL1>, <nL1,gPA>, and sometimes <G,gL2>, <nL2,gPA>, <nL1,gL1>, and <nL2,gL1>).
- The second data column shows the L2 cache miss percentage due only to page entries from
24. Result
The 8096 w/(G, gL1) configuration is unique in that it writes the gL1 guest page entry to the PWC.
25. Result
Large pages allow the TLB to cover a larger data region with fewer translations, which leads to fewer TLB misses (the nL1 references for the gPA, gL1, gL2, gL3, and gL4 levels are all eliminated). The ability to eliminate poor-locality references, such as <nL1,gL1> and <nL1,gPA>, reduces the number of L2 cache misses by 60-64%.
26. Conclusion
- Nested paging is a hardware technique that reduces the complexity of software memory management during system virtualization. Nested page tables, which map GPA to SPA, combine with the guest page tables, resulting in a two-dimensional (2D) page walk (accelerated here by the 2D_PWC and 2D_PWC_NT schemes).
- A hypervisor is no longer required to trap on all guest page table updates, so significant virtualization overhead is eliminated. However, nested paging can introduce new overhead due to the increase in page entry references.
- Therefore, the overall performance of a virtualized system is improved by nested paging when the eliminated hypervisor memory management overhead is greater than the new 2D page walk overhead.