Title: Accelerating Two-Dimensional Page Walks for Virtualized Systems
1. Accelerating Two-Dimensional Page Walks for Virtualized Systems
2. Introduction
- Native (non-virtualized) system
  - An OS runs directly on a physical system.
  - The OS communicates with the physical hardware directly.
- Address mapping
  - Virtual Address (VA): the address used by the OS and application software.
  - Physical Address (PA): the address used by the physical machine.
  - In a native system the translation is VA -> PA.
3. Introduction
- Virtualization
  - Multiple OSes can run simultaneously, but separately, on one physical system.
  - Hypervisor: the underlying software that inserts abstractions into the virtualized system and manages the communication between the guest OSes and the physical system.
4. Introduction
- Virtualization
  - Address mapping for a virtual machine:
    - Guest OS: Guest Virtual Address (GVA) and Guest Physical Address (GPA).
    - Physical system: System Physical Address (SPA).
  - Address translation: GVA -> GPA -> SPA.
5. Introduction
- Virtualization
  - Traditional approach: memory translation is managed by the hypervisor.
    - Drawback: the hypervisor intercepts the operation, exits the guest, emulates the operation, performs the memory translation, and then returns to the guest -> high overhead.
  - Alternative approach: let hardware perform the translation.
    - The hypervisor is no longer involved in every translation, saving that overhead.
6. Background
- x86 Native Page Translation
  - Page table: a hierarchy of address-translation tables that maps VA to PA.
  - Page walk: an iterative process.
    - To obtain the final PA for a VA, the hardware must walk the page table, traversing every level of the hierarchy.
7. Background
- x86 Native Page Translation
  - The walk proceeds from level 4 (L4) down to level 1 (L1).
  - At each level, the physical address from the level above is used as the table base, and a 9-bit field of the VA is used as the index into that table (see the sketch below).
  - The TLB (translation look-aside buffer) caches the final physical address to reduce the frequency of page walks.
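To make the iteration concrete, here is a minimal C sketch of the native 4-level walk. It assumes a simplified entry format and a hypothetical read_phys() helper standing in for a physical-memory read; real x86 entries carry permission and status bits that are omitted here.

    #include <stdint.h>

    /* Hypothetical helper: stands in for a read of physical memory. */
    uint64_t read_phys(uint64_t pa);

    /* Native x86-64 4-level walk: each level consumes 9 bits of the VA,
     * from L4 down to L1, then the low 12 bits are the page offset. */
    uint64_t native_page_walk(uint64_t cr3, uint64_t va)
    {
        uint64_t base = cr3;                                      /* L4 table base (physical) */
        for (int level = 4; level >= 1; level--) {
            uint64_t idx   = (va >> (12 + 9 * (level - 1))) & 0x1FF; /* 9-bit index */
            uint64_t entry = read_phys(base + idx * 8);              /* one memory reference */
            base = entry & ~0xFFFULL;                                /* base of next level (or final page) */
        }
        return base | (va & 0xFFF);                               /* final PA = page base + offset */
    }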
8. Background
- Memory Management for Virtualization
  - Without hardware support, the hypervisor must manage this translation, typically with shadow page tables that map GVA directly to SPA. This is a major source of hypervisor overhead.
  - Hardware mechanism:
    - Same idea as native x86 page walking, extended to two dimensions (the 2D page walk).
    - Nested page tables map GPA to SPA.
9. Background
- Memory Management for Virtualization
  - The guest page table is traversed to translate GVA to GPA. At each guest level (gL), the guest-physical address of the table must first be translated to an SPA by walking the nested page table before the guest entry can be read (see the sketch below).
  - The TLB caches the final SPA to reduce page walk overhead.
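A minimal sketch of the 2D walk follows, reusing the hypothetical read_phys() helper from the previous sketch. It is only meant to show why each guest level triggers a full nested walk; with 4-level tables in both dimensions this comes to 4 x (4 + 1) + 4 = 24 page entry references.

    #include <stdint.h>

    uint64_t read_phys(uint64_t pa);   /* hypothetical physical-memory read, as before */

    /* Nested walk: translate a guest-physical address (GPA) to a system-physical
     * address (SPA) by walking the 4-level nested page table rooted at nCR3. */
    uint64_t nested_walk(uint64_t ncr3, uint64_t gpa)
    {
        uint64_t base = ncr3;
        for (int level = 4; level >= 1; level--) {                   /* nL4 .. nL1 */
            uint64_t idx   = (gpa >> (12 + 9 * (level - 1))) & 0x1FF;
            uint64_t entry = read_phys(base + idx * 8);
            base = entry & ~0xFFFULL;
        }
        return base | (gpa & 0xFFF);
    }

    /* 2D walk: the guest walk operates on guest-physical addresses, so every
     * guest table pointer must itself be translated by a full nested walk
     * before the guest entry can be read. */
    uint64_t two_dim_page_walk(uint64_t gcr3_gpa, uint64_t ncr3, uint64_t gva)
    {
        uint64_t gpa_base = gcr3_gpa;                                /* guest CR3 holds a GPA */
        for (int glevel = 4; glevel >= 1; glevel--) {                /* gL4 .. gL1 */
            uint64_t spa_base = nested_walk(ncr3, gpa_base);         /* nL4..nL1 for this gL */
            uint64_t idx   = (gva >> (12 + 9 * (glevel - 1))) & 0x1FF;
            uint64_t entry = read_phys(spa_base + idx * 8);          /* the guest entry itself */
            gpa_base = entry & ~0xFFFULL;                            /* next level, still a GPA */
        }
        /* The final data GPA must also be translated before the data access. */
        return nested_walk(ncr3, gpa_base | (gva & 0xFFF));
    }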
10. Background
- Large page size advantages
  - Memory saving
    - With 4 KB pages, the OS needs an entire L1 table, which is itself 4 KB. If the 512 4 KB pages it maps are combined into one contiguous 2 MB block, the L1 table can be skipped, saving that 4 KB.
  - Reduction in TLB pressure
    - One large-page entry occupies a single TLB entry, whereas the corresponding regular pages require 512 4 KB TLB entries to map the same 2 MB range of virtual addresses.
  - Shorter page walk
    - Skipping the entire L1 level makes the page walk shorter and therefore saves some overhead (see the sketch below).
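A small illustrative sketch of the 2 MB case, assuming the usual x86-64 layout (the helper below is an example, not something from the paper): the walk stops at L2, so the L1 table is never touched.

    #include <stdint.h>

    #define LARGE_PAGE_SHIFT 21                                    /* 2 MB = 2^21 bytes */
    #define LARGE_PAGE_MASK  ((1ULL << LARGE_PAGE_SHIFT) - 1)

    /* With a 2 MB page the L2 entry supplies the 2 MB-aligned physical base and
     * the low 21 bits of the VA are used directly as the offset, so the L1 level
     * (512 entries * 8 bytes = one 4 KB table) is skipped entirely. */
    uint64_t large_page_pa(uint64_t l2_entry, uint64_t va)
    {
        return (l2_entry & ~LARGE_PAGE_MASK) | (va & LARGE_PAGE_MASK);
    }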
11. Page walk characterization
- Page walk cost
  - Perfect TLB Opportunity is the performance improvement that could be achieved with a perfect TLB, one that eliminates cold misses as well as conflict and capacity misses (one assumed formulation is sketched below).
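One plausible way to quantify the metric (an assumed formulation for illustration; the paper's exact accounting may differ), comparing run time with the real TLB against run time with an ideal TLB:

    % Assumed formulation: T_real is execution time with the real TLB,
    % T_perfect is execution time if every TLB access were a hit.
    \[
    \text{Perfect TLB Opportunity} \;\approx\; \frac{T_{\text{real}} - T_{\text{perfect}}}{T_{\text{real}}}
    \]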
12. Page walk characterization
13. Page walk characterization
14. Page walk characterization
- Page entry reuse
  - Nested page tables have much higher reuse than guest page tables, in part due to the inherent redundancy of the nested page walk.
  - There are many more nested accesses than guest accesses in a 2D page walk: every level of the nested page table hierarchy must be accessed for each guest level, so the same nested page entries are often accessed multiple times within a single 2D page walk (a high reuse rate).
15. Page walk characterization
<gL1, G> and <gPA, nL1> both have a high number of unique page entries because both map guest data into their respective address spaces: <gL1, G> maps GVA -> GPA and <gPA, nL1> maps GPA -> SPA. These two entry types are therefore the most difficult to cache.
16. Page Walk Acceleration
- AMD Opteron Translation Caching
  - Page walk cache (PWC)
    - Stores page entries from all page table levels except L1, whose translations are held in the TLB instead.
    - All page entries are initially brought into the L2 cache; on a PWC miss, the page entry data may still reside in the L2 cache or the L3 cache (if present). A lookup sketch follows below.
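A minimal sketch of how such a lookup might be ordered, with hypothetical pwc_lookup(), l2_lookup(), l3_lookup(), and read_dram() helpers standing in for the real hardware structures:

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical helpers: the PWC is probed by the system physical address of
     * the page entry; the L2/L3 lookups model the data cache hierarchy. */
    bool pwc_lookup(uint64_t entry_spa, uint64_t *entry);
    bool l2_lookup(uint64_t entry_spa, uint64_t *entry);
    bool l3_lookup(uint64_t entry_spa, uint64_t *entry);
    uint64_t read_dram(uint64_t entry_spa);

    /* Fetch one page table entry during a walk. */
    uint64_t fetch_page_entry(uint64_t entry_spa, int level)
    {
        uint64_t entry;
        /* L2/L3/L4 entries may hit in the PWC; L1 entries are not cached there
         * because the final translation already lives in the TLB. */
        if (level > 1 && pwc_lookup(entry_spa, &entry))
            return entry;
        /* PWC miss: the entry may still reside in the L2 or L3 data cache. */
        if (l2_lookup(entry_spa, &entry) || l3_lookup(entry_spa, &entry))
            return entry;
        return read_dram(entry_spa);   /* worst case: a memory access */
    }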
17. Page Walk Acceleration
- Translation caching for 2D page walks
18. Page Walk Acceleration
- Translation caching for 2D page walks
  - One-Dimensional PWC (1D_PWC)
    - Only page entry data from the guest dimension are stored in the PWC, and the entries are tagged by their system physical address.
    - The lowest-level guest page table entry, <G, gL1>, is not cached in the PWC because of its low reuse rate.
  - Two-Dimensional PWC (2D_PWC)
    - Extends 1D_PWC into the nested dimension of the 2D page walk, turning the 20 unconditional cache hierarchy accesses into 16 likely PWC hits (dark-filled references in Figure 5(b)) and four possible PWC hits (checkered references). Like 1D_PWC, all page entries are tagged with their system physical address and <G, gL1> is not cached (see the cacheability sketch below).
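A small sketch of the cacheability rule under each scheme; the encoding of a reference as a (guest level, nested level) pair is an illustrative convention, not the paper's notation.

    #include <stdbool.h>

    typedef enum { PWC_1D, PWC_2D } pwc_scheme_t;

    /* A 2D-walk reference is identified here by (gl_level, nl_level):
     *   nl_level == 0  -> a guest entry <G, gLx>
     *   gl_level == 0  -> a nested entry for the final data address <nLx, gPA>
     * In both schemes, entries are tagged by their system physical address. */
    bool pwc_cacheable(pwc_scheme_t scheme, int gl_level, int nl_level)
    {
        bool is_guest_entry = (nl_level == 0);

        if (scheme == PWC_1D)
            /* Guest dimension only, and never the low-reuse <G, gL1> entry. */
            return is_guest_entry && gl_level > 1;

        /* 2D_PWC also caches nested entries; <G, gL1> still stays out. */
        return !(is_guest_entry && gl_level == 1);
    }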
19. Page Walk Acceleration
- Translation caching for 2D page walks
  - Two-Dimensional PWC with Nested Translations (2D_PWC_NT)
    - Augments 2D_PWC with a dedicated GPA-to-SPA translation buffer, the Nested TLB (NTLB), which reduces the average number of page entry references during a 2D page walk.
    - The NTLB uses the guest physical address of a guest page entry to cache the corresponding nL1 entry.
    - The page walk begins by probing the NTLB with the guest physical address of <G, gL4>, producing the data of <nL1, gL4> and allowing nested references 1-4 to be skipped. On an NTLB hit, the system physical address of <G, gL4> needed for the PWC access is computed directly (see the sketch below).
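A minimal sketch of the NTLB fast path, assuming the nested_walk() helper from the earlier 2D-walk sketch and hypothetical ntlb_lookup() / ntlb_insert() routines:

    #include <stdbool.h>
    #include <stdint.h>

    uint64_t nested_walk(uint64_t ncr3, uint64_t gpa);              /* from the earlier sketch */
    bool ntlb_lookup(uint64_t guest_entry_gpa, uint64_t *spa_page); /* hypothetical NTLB probe */
    void ntlb_insert(uint64_t guest_entry_gpa, uint64_t spa_page);

    /* Translate the GPA of a guest page entry to its SPA. On an NTLB hit the
     * whole 4-reference nested walk for that guest level is skipped and the SPA
     * needed for the PWC access is computed directly; on a miss, the nested walk
     * runs and its result is installed in the NTLB. */
    uint64_t guest_entry_spa(uint64_t ncr3, uint64_t guest_entry_gpa)
    {
        uint64_t spa_page;
        if (ntlb_lookup(guest_entry_gpa, &spa_page))
            return spa_page | (guest_entry_gpa & 0xFFF);            /* nested refs 1-4 skipped */

        uint64_t spa = nested_walk(ncr3, guest_entry_gpa);          /* full nested walk */
        ntlb_insert(guest_entry_gpa, spa & ~0xFFFULL);
        return spa;
    }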
20. Result
- The benchmarks used in the following slides.
21. Result
The three hardware-only page walk caching schemes improve performance by turning page entry memory hierarchy references into lower-latency PWC accesses and, in the case of 2D_PWC_NT, by skipping some page entry references entirely.
22. Result
Left side: the G column is not skipped, so it does not change; the same holds for the gPA row. The nested references for gL1 are skipped in 2D_PWC_NT even though they have a low reuse rate, so the gL1 portion occupies a shorter span in 2D_PWC_NT than in 2D_PWC. Right side: the NTLB eliminates many of the PWC accesses, but it does not eliminate a significant portion of the accesses that carry the highest penalty.
23. Result
- The first data column states that L2 accesses incurred during a 2D page walk using the 2D_PWC_NT configuration generate 2.7-5.5 times more L2 misses than the native page walk.
- This increase is primarily because the native page walk has fewer entries that are difficult to cache (L1 and sometimes L2) compared to the 2D page walk (<G,gL1>, <nL1,gPA>, and sometimes <G,gL2>, <nL2,gPA>, <nL1,gL1>, and <nL2,gL1>).
- The second data column shows the L2 cache miss percentage due only to page entries from
24. Result
The 8096 w/(G, gL1) configuration is unique in that it writes the gL1 guest page entry to the PWC.
25. Result
Large pages allow the TLB to cover a larger data region with fewer translations, which leads to fewer TLB misses (the nL1 references for the gPA, gL1, gL2, gL3, and gL4 levels are all eliminated). The ability to eliminate poor-locality references, such as <nL1,gL1> and <nL1,gPA>, reduces the number of L2 cache misses by 60-64%.
26. Conclusion
- Nested paging is a hardware technique that reduces the complexity of software memory management during system virtualization. Nested page tables, which map GPA to SPA, combine with the guest page tables, resulting in a two-dimensional (2D) page walk (accelerated here by the 2D_PWC and 2D_PWC_NT schemes).
- A hypervisor is no longer required to trap on all guest page table updates, so significant virtualization overhead is eliminated. However, nested paging can introduce new overhead due to the increase in page entry references.
- Therefore, the overall performance of a virtualized system is improved by nested paging when the eliminated hypervisor memory management overhead is greater than the new 2D page walk overhead.