shared address translation revisited · 2016-04-20 · limitations of current shared memory...
SHARED ADDRESS TRANSLATION REVISITED
Xiaowan Dong, University of Rochester
Sandhya Dwarkadas, University of Rochester
Alan L. Cox, Rice University
Limitations of Current Shared Memory Management
• Physical memory sharing is common
• However, address translation is private per process
  • Page tables and Translation Lookaside Buffer (TLB) entries
• Potential for duplicate translation information
• Scalability problem: O(# of processes)
• Inefficient utilization of shared caches
[Slide annotation: as much as 58% on Android]
[Figure: Process 1 and Process 2 each keep their own page table entries and TLB entries for the same physical memory.]
Previous Work
• Previous work shares page tables for applications handling large amounts of contiguous data
  • E.g., PostgreSQL database systems
• Limitations:
  • Overlook code at smaller granularity (such as shared libraries)
  • Ignore duplication in the TLB
• New opportunities on Android, where shared libraries are used intensively
Android Process Creation Model
All applications share the same physical and virtual addresses for the preloaded libraries
Goal: Shared Address Translation: Page Tables and TLB Entries
• Sharing address translation for the zygote-preloaded shared libraries
• Implemented at the OS level with existing hardware support
  • Mostly machine-independent
• Benefits
  • Reduce soft page faults
  • Improve cache and TLB performance
[Figure: Process 1 and Process 2 share one page table entry and one TLB entry for a physical page.]
Impact of Shared Libraries on Instruction Footprint
• Number of shared libraries per application:
  • Loaded: 88 to 107 (zygote-preloaded: 88)
  • Invoked: 24 to 68 (zygote-preloaded: 21 to 46)
[Figure: two bar charts, "% of inst pages accessed" and "% of inst fetched", each split into zygote-preloaded shared lib and other shared lib; y-axis 0% to 100%. Callouts: 93%, 98%, 68%, 72%.]
Shared Library Instruction Footprint Intersection
• Considerable overlap in the shared library code accessed across different applications
• 46% of total inst pages accessed are in common for each pair of applications
• Zygote-preloaded: 38%
[Figure: the % of inst footprint overlapped among Laya Music Player, Adobe Reader, and MX Player. Callouts: 91%, 72%, 85%.]
SHARING ADDRESS TRANSLATION
Sharing Page Tables
• The ARM architecture defines a two-level hierarchical page table
• L2 page table pages are shared at fork time between the zygote and its child processes
  • Supports private writable memory regions
• Shared page table pages and physical pages should both be managed in a copy-on-write (COW) manner
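The fork-time sharing and copy-on-write behavior in the bullets above can be illustrated with a small user-space model. This is only a sketch under simplifying assumptions: the names `PageTablePage`, `Process`, `fork`, and `write_fault` are hypothetical and are not taken from the authors' kernel changes, and an L2 page table page is reduced to a dictionary with a reference count.

```python
from dataclasses import dataclass, field


@dataclass
class PageTablePage:
    """Toy L2 page table page: virtual page number -> (frame, writable)."""
    ptes: dict = field(default_factory=dict)
    refcount: int = 1  # number of L1 entries pointing at this page


@dataclass
class Process:
    """Toy address space: L1 index -> L2 page table page."""
    l1: dict = field(default_factory=dict)


def fork(parent: Process) -> Process:
    """Share every L2 page table page with the child instead of copying PTEs."""
    child = Process()
    for idx, ptp in parent.l1.items():
        if ptp.refcount == 1:
            # First time this page is shared: write-protect its PTEs so that
            # a later write faults and triggers unsharing.
            ptp.ptes = {vpn: (frame, False) for vpn, (frame, _) in ptp.ptes.items()}
        ptp.refcount += 1
        child.l1[idx] = ptp
    return child


def write_fault(proc: Process, idx: int, vpn: int, new_frame: int) -> None:
    """Unshare (COW) the L2 page if needed, then install a private writable PTE."""
    ptp = proc.l1[idx]
    if ptp.refcount > 1:
        ptp.refcount -= 1
        ptp = PageTablePage(ptes=dict(ptp.ptes))  # private copy, refcount 1
        proc.l1[idx] = ptp
    # new_frame stands for the private copy of the physical page.
    ptp.ptes[vpn] = (new_frame, True)


zygote = Process(l1={0: PageTablePage(ptes={0: (100, False), 1: (101, True)})})
app = fork(zygote)
shared_after_fork = app.l1[0] is zygote.l1[0]
write_fault(app, idx=0, vpn=1, new_frame=200)
shared_after_write = app.l1[0] is zygote.l1[0]
print(shared_after_fork, shared_after_write)
```

In this model the fork copies no PTEs at all; the cost moves to the first write fault, which copies one page table page for the writing process only.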
[Figure: L1 PTEs and L2 PTEs of the zygote and of an Android application.]
Maintaining Shared Page Tables
• A shared page table page needs to be unshared (COWed) in the following cases:
• Page fault with write access
• A process creates, destroys, or modifies a memory region within the range of a shared page table page
• A process tries to free a shared page table page
• Modifying any memory region loses sharing for the entire shared page table page
  • Mitigation: map the page table entries of a shared library's code segment and data segment into different page table pages
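Why separating code and data helps can be seen with a little address arithmetic. The sketch below assumes that one page table page maps a 2MB-aligned range of virtual addresses, which matches the "2MB alignment" mentioned in the launch results but is stated here as an assumption; the library addresses and the helper names `ptp_index` and `segments_share_ptp` are hypothetical.

```python
PTP_SPAN = 2 * 1024 * 1024  # assumed virtual range mapped by one page table page


def ptp_index(vaddr: int) -> int:
    """Index of the page table page that maps this virtual address."""
    return vaddr // PTP_SPAN


def segments_share_ptp(code_start: int, code_end: int,
                       data_start: int, data_end: int) -> bool:
    """True if some page table page maps part of both segments (ends exclusive)."""
    code_ptps = set(range(ptp_index(code_start), ptp_index(code_end - 1) + 1))
    data_ptps = set(range(ptp_index(data_start), ptp_index(data_end - 1) + 1))
    return bool(code_ptps & data_ptps)


# Hypothetical layout: 512KB of code followed directly by 64KB of data.
code = (0x40000000, 0x40080000)
data_packed = (0x40080000, 0x40090000)
# Same library with the data segment moved to the next 2MB boundary.
data_aligned = (0x40200000, 0x40210000)

packed = segments_share_ptp(*code, *data_packed)
aligned = segments_share_ptp(*code, *data_aligned)
print(packed, aligned)
```

In the packed layout a write to the library's data unshares the page table page that also maps its code; in the aligned layout the page table page for the code can stay shared.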
Sharing TLB Entries
• Global bit
  • We set the global bit in the page table entries of the zygote-preloaded shared libraries' code segments
  • Overrides Address Space Identifier (ASID) in TLB
• Domain protection model of 32-bit ARM
  • Prevents processes not forked from the zygote from accessing the shared global TLB entries
  • E.g., system services and daemons
Leveraging the domain protection model
[Figure: address space split into user space and kernel space, with Domain 1, Domain 2, and Domain 3; the zygote-preloaded shared libraries are in Domain 3.]
• DACR field for Domain 3
  • Non-zygote processes: 00 (no access permission)
  • Zygote-like processes: 01 (access based on permission bits listed in the TLB entry)
• TLB entry: VPN | ASID | global bit = 1 | domain field = 0011 | permission bits
• Memory abort handler
  • Trap into kernel
  • Check fault status register: domain fault?
  • Flush all TLB entries with the faulting address
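The domain check on a TLB hit can be modeled in a few lines. This is a sketch, not the hardware's or the authors' implementation: the 2-bit-per-domain DACR layout and the 00/01/11 encodings follow the 32-bit ARM domain model, while the function names, the ASID values, and the use of client access for domains 0 and 1 are assumptions made for illustration.

```python
NO_ACCESS, CLIENT, MANAGER = 0b00, 0b01, 0b11
SHARED_DOMAIN = 3  # domain field 0011 in the TLB entry on the slide


def domain_access(dacr: int, domain: int) -> int:
    """Extract the 2-bit access field for one of the 16 domains from a DACR value."""
    return (dacr >> (2 * domain)) & 0b11


def lookup(dacr: int, entry_domain: int, entry_global: bool,
           entry_asid: int, current_asid: int) -> str:
    """Outcome of matching one TLB entry under the domain protection model."""
    if not entry_global and entry_asid != current_asid:
        return "miss"  # non-global entries must match the current ASID
    access = domain_access(dacr, entry_domain)
    if access == NO_ACCESS:
        return "domain fault"
    if access == CLIENT:
        return "check permission bits"
    if access == MANAGER:
        return "access allowed"  # permission bits are not checked
    raise ValueError("reserved domain access encoding")


# Hypothetical DACR values: domains 0 and 1 are client for every process,
# and only zygote-like processes get client access to the shared domain.
base = (CLIENT << 0) | (CLIENT << 2)
dacr_zygote_like = base | (CLIENT << (2 * SHARED_DOMAIN))
dacr_other = base | (NO_ACCESS << (2 * SHARED_DOMAIN))

# A global entry created by ASID 7 is hit by a process running with ASID 9.
zygote_like_result = lookup(dacr_zygote_like, SHARED_DOMAIN, True, 7, 9)
other_result = lookup(dacr_other, SHARED_DOMAIN, True, 7, 9)
print(zygote_like_result, "/", other_result)
```

The global bit lets the entry match regardless of ASID, and the DACR is what keeps a non-zygote process from using it: the access raises a domain fault, which the memory abort handler resolves by flushing the TLB entries for the faulting address.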
EVALUATION
Evaluation Platforms
• Nexus 7 (2012)
  • 1.2GHz Nvidia Tegra 3 processor with four ARM Cortex-A9 cores
  • A private 2-level TLB
    • I/D micro TLB (flushed on context switch)
    • 128-entry main TLB
  • 32KB/32KB L1 cache (I/D)
  • 1MB shared L2 cache
• Android KitKat 4.4.4 OS
  • New Android Runtime (ART)
• Benchmarks:
  • Most popular application in each category on Google Play Store
Zygote Fork
• Sharing page table improves execution time of a zygote fork by 2.1x
• Trade-off between cost of fork and # of page faults experienced by child processes
  • Sharing page table is the best of both worlds

| | Kernel execution cycles (× 10^6) | # of PTPs allocated | # of PTEs copied |
|---|---|---|---|
| Stock Android | 2.9 | 38 | 3,900 |
| Copied PTEs | 4.6 | 51 | 9,800 |
| Shared PTPs | 1.4 | 1 | 7 |
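The 2.1x figure follows directly from the kernel cycle counts reported above. The short check below only restates that arithmetic; the dictionary layout and the `speedup` helper are illustrative.

```python
# Reported rows: (kernel cycles x 10^6, PTPs allocated, PTEs copied)
results = {
    "Stock Android": (2.9, 38, 3900),
    "Copied PTEs": (4.6, 51, 9800),
    "Shared PTPs": (1.4, 1, 7),
}


def speedup(baseline: str, variant: str) -> float:
    """Ratio of kernel execution cycles, baseline over variant."""
    return results[baseline][0] / results[variant][0]


fork_speedup = round(speedup("Stock Android", "Shared PTPs"), 1)
vs_copied = round(speedup("Copied PTEs", "Shared PTPs"), 1)
print(fork_speedup, vs_copied)  # 2.1 3.3
```

Relative to stock Android, sharing cuts fork cycles by about 2.1x; relative to eagerly copying PTEs, by about 3.3x.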
Application Launch Performance
• Every application follows the same launch procedure before it loads its application-specific Java classes
• Launch time improved by 7% (10% with 2MB alignment)
  • 94% fewer page faults for creating PTEs that map shared code and data
  • 15% reduction in L1 Icache stall cycles
  • 68% fewer page table pages allocated
Over The Course of Execution
38% fewer page faults for creating PTEs that map shared code and data on average (maximum 78%)
35% fewer page table pages allocated (maximum 58%)
[Figure: PTP allocation normalized to stock Android; y-axis 0% to 100%.]
Android IPC Performance
• Inter-process communication (IPC) is common on Android
• Developed a microbenchmark using the Android Binder IPC mechanism
• Inst main TLB stall cycles are reduced by:
  • Client: 36%
  • Server: 19%
Conclusion
• Android presents opportunities for shared library address translation sharing
• We eliminated the duplication of address translation on Android
• Android’s application launch, steady-state, and context switch efficiency are improved
• Speed up a zygote fork by 2.1x
• Improve application launch by 10%
• Our shared address translation infrastructure should be portable to other platforms
Large Pages Are Inefficient for Zygote-preloaded Shared Libraries
• Using large pages (64KB pages, for example) will waste physical memory compared to 4KB base pages:
  • 2.6x memory consumption on average
  • 94% more memory consumption for the union set
• Linux does not support the use of large pages for code
• Our design can complement large pages
  • A 64KB page on ARM also requires a 2-level page table, as a 4KB page does
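The source of the waste is that a 64KB page must be resident in full even if only a few of its sixteen 4KB pieces are ever touched. The sketch below computes the resulting memory ratio for a made-up access pattern; the touched page numbers are hypothetical and are not the measured data behind the 2.6x figure.

```python
BASE = 4 * 1024
LARGE = 64 * 1024
PAGES_PER_LARGE = LARGE // BASE  # sixteen 4KB pages per 64KB page


def memory_ratio(touched_4k_pages: set) -> float:
    """Memory needed with 64KB pages divided by memory needed with 4KB pages."""
    if not touched_4k_pages:
        raise ValueError("at least one touched page is required")
    large_pages = {p // PAGES_PER_LARGE for p in touched_4k_pages}
    return (len(large_pages) * LARGE) / (len(touched_4k_pages) * BASE)


# Hypothetical set of touched 4KB page numbers within one library.
touched = {0, 1, 2, 3, 20, 21, 47, 48}
ratio = memory_ratio(touched)
print(PAGES_PER_LARGE, ratio)
```

Here eight touched 4KB pages fall into four different 64KB pages, so the large-page mapping needs eight times the memory; sparser access patterns make the ratio worse, denser ones bring it toward 1.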
[Figure: CDF of # of 4KB pages untouched within a 64KB large page of zygote-preloaded shared libraries.]
Sharing TLB
[Flowchart]
• exec, zygote: Task_struct.zygote = 1
• Zygote mmaps the code segment of a shared library: Vma.global = 1
• fork: child has Task_struct.zygote_like = 1 and inherits Vma.global = 1
• Page fault on a zygote-preloaded shared library:
  • Task_struct.zygote = 1 or zygote_like = 1? If yes:
  • Vma.global = 1? If yes: set global bit in PTE
• Note: the global bit is used for kernel pages in stock Linux
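The decision in this flowchart can be written as a small predicate. The sketch below is a user-space model, not kernel code: the field names mirror the slide (`zygote`, `zygote_like`, `global`), but the classes and helper functions are hypothetical, `global_` is spelled with an underscore because `global` is a Python keyword, and marking children of zygote-like processes as zygote-like is an assumption.

```python
from dataclasses import dataclass


@dataclass
class Task:
    zygote: bool = False
    zygote_like: bool = False


@dataclass
class Vma:
    global_: bool = False  # stands for Vma.global on the slide


def mmap_code_segment(task: Task) -> Vma:
    """Code segments mapped by the zygote are flagged as global."""
    return Vma(global_=task.zygote)


def fork(parent: Task) -> Task:
    """Children of the zygote become zygote-like (assumed to be transitive)."""
    return Task(zygote_like=parent.zygote or parent.zygote_like)


def set_global_bit_on_fault(task: Task, vma: Vma) -> bool:
    """Page fault path: set the global bit in the PTE only if both checks pass."""
    return (task.zygote or task.zygote_like) and vma.global_


zygote = Task(zygote=True)
lib_code = mmap_code_segment(zygote)
app = fork(zygote)       # in the real flow the child inherits the parent's VMAs
daemon = Task()          # a process that was not forked from the zygote

app_sets_bit = set_global_bit_on_fault(app, lib_code)
daemon_sets_bit = set_global_bit_on_fault(daemon, lib_code)
print(app_sets_bit, daemon_sets_bit)
```

Only PTEs installed for a zygote or zygote-like task on a region flagged as global get the global bit, so private mappings keep their ASID-tagged TLB entries.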
Sharing Page Table at Fork
[Figure: the parent's and the child's address spaces each contain vma1, vma2, and vma3 and an L1 PTP with L1 PTE1, L1 PTE2, and L1 PTE3; an L2 PTP holds L2 PTE1, L2 PTE2, and L2 PTE3.]
• L2 PTP is shared? If no: write-protect every writable L2 PTE, giving a shared PTP
• Virtual memory area (VMA): a memory region
• If ARM supported write protection in L1 PTEs as x86 does, we could avoid write-protecting every L2 PTE