Codesigned Virtual Machines
Shin Gyu Kim, 2006. 10. 16
Slide 2: Codesigned VM
[Figure: three HW/SW stacks compared. (a) Conventional HW/SW interface: the application binary runs directly on hardware via the native ISA. (b) Conventional virtual machine interface: the application binary (source ISA) runs on VM software, which runs on hardware via the target ISA. (c) HW/SW codesigned virtual machine: the application binary (source ISA) runs on VM software plus VM hardware, which together implement the target ISA.]
• The SW becomes part of the HW.
• The implementation is divided between HW and SW in an optimal way.
Slide 3: Codesigned VM & System VM (1/2)
• Codesigned VM & System VM
  o Both support an entire system (OS + applications), so a codesigned VM has the form of a system VM.
  o But in codesigned VMs:
    • They are not intended to virtualize HW resources.
    • They are not intended to support multiple VM environments.
  o The goals include performance, power efficiency, and design simplicity.
Slide 4: Codesigned VM & System VM (2/2)
• We refer to the VM SW as a VM Monitor (VMM).
[Figure: the guest stack (Application and OS, source ISA: IA-32) occupies visible memory; the VMM with its translator and code cache (target ISA: Crusoe) occupies concealed memory; both sit above the hardware.]
Slide 5: Codesigned VM & Process VM
• Codesigned VM vs. Process VM
  o Similarity: both emulate the source ISA with dynamic translation and a code cache.
  o But in codesigned VMs:
    1. Intrinsic compatibility is at the ISA level (not the ABI level): both the user-level and the system-level ISA must be emulated.
    2. The goals are improved performance, power efficiency, and design simplicity; compatibility is just a requirement, not a motivation.
Slide 6: Codesigned VM & Superscalar Processor
• Codesigned VM vs. Superscalar processor
  o Similarity: both perform translation from the source ISA to the target ISA.
  o But in codesigned VMs, the translation is done in SW:
    • Lower cost, smaller size, design simplicity, many more optimization opportunities, low power consumption.
    • Inter-instruction optimization is possible.
Slide 7: Code Translation Methods
[Figure: two translation schemes compared. Code translation by HW is context-free: each source instruction (instr. 1 .. instr. n) is expanded independently into micro-ops (micro-op a, b, c, ...). Code translation by SW is context-sensitive: the whole source sequence (instr. 1 .. instr. n) is translated together into a target sequence (instr. A .. instr. M).]
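The context-free vs. context-sensitive distinction in the figure can be sketched in a few lines. This is a toy illustration with a hypothetical one-instruction mini-ISA (not from the slides): HW-style translation expands each instruction in isolation, while SW-style translation sees the whole block and can optimize across instructions.

```python
# Per-instruction expansion table: one source instruction -> fixed micro-ops,
# the way decode hardware would expand it (context-free).
EXPANSION = {
    "inc r1": ["load t, r1", "add t, 1", "store r1, t"],
}

def translate_context_free(block):
    # Each instruction is translated independently of its neighbors.
    out = []
    for instr in block:
        out.extend(EXPANSION[instr])
    return out

def translate_context_sensitive(block):
    # The SW translator sees the whole block: fuse n increments into one add.
    n = sum(1 for instr in block if instr == "inc r1")
    return ["load t, r1", f"add t, {n}", "store r1, t"]

src = ["inc r1", "inc r1", "inc r1"]
hw_style = translate_context_free(src)       # 9 micro-ops, no optimization
sw_style = translate_context_sensitive(src)  # 3 target instructions
```

The point of the sketch is only the shape of the two translators: the context-free one cannot shrink the output, the context-sensitive one can.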
Slide 8: Contents
• Memory & Register State Mapping
• Self-Modifying Code & Self-Referencing Code
• Support for Code Caching
• Implementing Precise Traps
• Input/Output
Slide 9: Register State Mapping
• Register state mapping is the easier part.
  o The host register file can be made large enough to accommodate the guest's.
[Figure: PowerPC-to-Daisy-host register mapping: r0-r31 map to R0-R31; the counter to R32; the link register to R33; MQ to R34; constant 0 to R35. The remaining host registers R36-R63 hold scratch values, speculative results, constants, and pointers.]
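The fixed mapping in the table above can be written down directly. A minimal sketch (register names follow the figure; the exact Daisy conventions are as shown on the slide):

```python
def build_register_map():
    # PowerPC guest registers mapped one-to-one onto Daisy host registers.
    m = {f"r{i}": f"R{i}" for i in range(32)}  # r0-r31 -> R0-R31
    m.update({
        "counter": "R32",   # count register
        "linkreg": "R33",   # link register
        "MQ": "R34",
        "const0": "R35",    # architected constant zero
    })
    return m

REG_MAP = build_register_map()

# R36-R63 are left free for the VMM: scratch, speculative results,
# constants, and pointers.
FREE_HOST_REGS = [f"R{i}" for i in range(36, 64)]
```

Because the mapping is fixed and total, the translator never needs to spill guest register state to memory, which is the point of making the host file larger than the guest's.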
Slide 10: Memory State Mapping
• Concealed Memory
  o A reserved region for the VMM, the code cache, and other data used by the VMM.
  o Never visible to guest SW.
    • This is possible because the VMM takes control from the boot process.
  o Fixed size, normally diskless (to simplify the system design).
  o The VMM may be stored in ROM.
Slide 11: Concealed Memory (1)
• Memory system in a codesigned VM:
  o The I-cache only holds target ISA instructions.
[Figure: the processor core's I-cache is fed only from concealed memory (VMM code, VMM data, code cache), while the D-cache is fed from both concealed memory and conventional memory (source ISA code and data).]
Slide 12: Concealed Memory (2)
• Memory mapping for concealed memory, option 1: concealed logical memory shares an address space with the guest.
  o The host address space must be enlarged.
[Figure: one enlarged logical address space; the concealed memory mapping translates concealed logical addresses to concealed real memory, and the guest memory mapping translates conventional logical addresses to conventional real memory.]
Slide 13: Concealed Memory (3)
• Memory mapping for concealed memory, option 2: two separate logical address spaces.
  o Loads/Stores must select the mapping.
  o This can be controlled by the VMM.
[Figure: separate concealed and conventional logical address spaces, translated by the concealed memory mapping and the guest memory mapping to concealed and conventional real memory respectively.]
Slide 14: Concealed Memory (4)
• Memory mapping for concealed memory, option 3: use real addressing for concealed memory.
  o A special case of option 2.
  o Requires a separate set of Loads/Stores, or a mode bit.
[Figure: the concealed real address space accesses concealed real memory directly, while conventional logical addresses go through the guest memory mapping to conventional real memory.]
Slide 15: Self-Modifying Code (1)
• Basically, use the same technique as in a process VM (Ch. 3).
  o It is easiest to keep the guest OS's virtual-to-real page mapping intact.
  o Write-protect the guest code region: any attempt to write into that region will cause a trap, which the VM can then handle.
• But in codesigned VMs:
  o We cannot use a system call to write-protect, because it is the guest OS that manages the page tables.
Slide 16: Self-Modifying Code (2)
• TLB
  o The TLB is managed by the VMM.
  o An additional bit indicates "write-protect".
  o The VMM sets the write-protect bit whenever an entry for a code page is loaded into the TLB.
  o The VMM maintains a table of all the guest virtual pages that contain translated code.
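The TLB-based scheme above can be simulated in a few lines. This is a minimal sketch (class and exception names are illustrative, not from the slides): the VMM consults its table of translated-code pages when filling a TLB entry, and a store to a write-protected page traps back into the VMM.

```python
class WriteProtectTrap(Exception):
    """Raised when a store hits a page holding translated guest code."""

class SoftwareTLB:
    def __init__(self, translated_code_pages):
        # The VMM's table of guest virtual pages containing translated code.
        self.translated = set(translated_code_pages)
        self.entries = {}  # virtual page -> (physical page, write-protect bit)

    def load_entry(self, vpage, ppage):
        # On a TLB fill, the VMM sets the WP bit for known code pages.
        self.entries[vpage] = (ppage, vpage in self.translated)

    def store(self, vpage):
        # A store to a write-protected page traps; the VMM would then
        # flush the stale translations for that page.
        ppage, wp = self.entries[vpage]
        if wp:
            raise WriteProtectTrap(hex(vpage))
        return ppage
```

A data page passes through unchanged; only pages the VMM has translated code from pay the trap cost.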
Slide 17: Self-Modifying Code (3)
• Special hardware support
  o In the Transmeta Crusoe, a special hardware structure is added to speed up fine-grained write-protection checking.
    • Goal: find out whether a store is really a write to a translated code region.
    • Virtual address → (TLB) → real address → (filtered by the write-protect table) → write fault or not.
Slide 18: Self-Modifying Code (4)
[Figure: the TLB holds, per entry, a virtual page number, a physical page number, and a write-protect bit. For a source store address, the matching physical page indexes a write-protect table of per-page bit masks; comparison logic uses the page offset bits to combine the page-level write-protect fault with the bit mask, raising a source-code write fault only for stores that actually hit translated code.]
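The bit-mask filtering in the figure can be sketched as follows. This is an illustrative model (the sub-page granularity and table layout are assumptions, not Crusoe's actual parameters): each protected physical page carries a bit mask with one bit per sub-page region, so only stores into regions that really hold translated code fault.

```python
REGION_SIZE = 256   # assumed sub-page granularity (one mask bit per region)
PAGE_SIZE = 4096

class WriteProtectTable:
    def __init__(self):
        self.masks = {}  # physical page number -> bit mask of protected regions

    def protect(self, ppage, offset):
        # Mark the sub-page region containing a translated instruction.
        bit = offset // REGION_SIZE
        self.masks[ppage] = self.masks.get(ppage, 0) | (1 << bit)

    def is_code_write(self, ppage, offset):
        # Filter: fault only if the store lands in a protected region.
        mask = self.masks.get(ppage, 0)
        return bool(mask & (1 << (offset // REGION_SIZE)))
```

Compared with the page-level WP bit alone, this avoids spurious faults when code and writable data share a page, which is common in self-modifying guest workloads.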
Slide 19: Self-Modifying Code (5)
• I/O writes to guest code memory must also be caught.
  o For translated code in the code cache, keep track of all the real guest pages.
  o Maintain a hardware table for I/O writes, with entries for all the real pages that hold guest code.
  o A store to any of these pages causes an interrupt to the VMM; the VMM then flushes the affected translated code.
Slide 20: Support for Code Caching (1)
• Code cache performance is the most important.
  o SPC → (hash) → TPC (if hit) → access the code cache.
  o This involves multiple memory accesses plus an indirect jump.
  o For direct jumps and branches:
    • Superblock chaining eliminates the table lookup.
  o But what about indirect jumps?
Slide 21: Support for Code Caching (2)
• To reduce table lookup overhead, use SW-based jump target prediction:

    if (Rx == #addr_1) goto #target_1
    else if (Rx == #addr_2) goto #target_2
    else map_lookup(Rx)

• But:
  o If the SW prediction is incorrect, time is wasted.
  o Many indirect jumps are difficult to predict (e.g. returns).
• Hardware support for code caching:
  o JTLB (Jump Translation Lookaside Buffer)
  o D-RAS (Dual-address Return Address Stack)
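The inlined compare chain above can be modeled directly. A minimal sketch (function names and the map-table format are illustrative): the translator bakes the two most likely SPC targets into the dispatch code, and only a misprediction pays for the full map-table lookup.

```python
MAP_TABLE = {}  # SPC -> TPC, filled in as blocks are translated

def map_lookup(spc):
    # Slow path: full hashed map-table lookup; on a miss the VMM
    # would invoke the translator.
    return MAP_TABLE.get(spc, "call_translator")

def make_predicting_dispatch(addr_1, target_1, addr_2, target_2):
    # Emulates the if/else chain the translator emits for one
    # indirect jump site, with two inlined predictions.
    def dispatch(rx):
        if rx == addr_1:
            return target_1      # prediction 1 correct: no table lookup
        if rx == addr_2:
            return target_2      # prediction 2 correct: no table lookup
        return map_lookup(rx)    # both predictions wrong: slow path
    return dispatch
```

This makes the trade-off on the slide concrete: each wrong inline compare is wasted work before the slow path runs, which is why hard-to-predict jumps (returns) motivate the JTLB and D-RAS hardware below.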
Slide 22: JTLB (1)
• "A specially designed HW cache of map table entries."
[Figure: the SPC is hashed to select a JTLB set; each entry holds a tag and a TPC. A tag compare against the SPC produces hit or miss, and a MUX selects the matching entry's TPC.]
Slide 23: JTLB (2)
• JTLB_Lookup instruction (operands: SPC, hit/miss, TPC):

    JTLB_Lookup Ri, Rj, Rk
    Jump Ri, Rj == 0
    Jump map_lookup

• Lookup_Jump instruction and prediction:
  o Predict using the BTB (branch target buffer).
    • JTLB hit and prediction correct → OK.
    • JTLB hit but misprediction → redirect fetch to the jump target TPC from the JTLB.
    • JTLB miss → redirect fetch to the fall-through address.
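The JTLB's behavior as a cache of map-table entries can be sketched as follows. This models it as a small direct-mapped cache (the size and the modulo indexing are illustrative assumptions; the real structure is hashed as in the figure): a lookup returns the (hit, TPC) pair that JTLB_Lookup delivers to the pipeline.

```python
JTLB_SETS = 64  # assumed number of sets (illustrative)

class JTLB:
    def __init__(self):
        # One (tag, TPC) entry per set; direct-mapped for simplicity.
        self.sets = [None] * JTLB_SETS

    def insert(self, spc, tpc):
        # Fill the set selected by the SPC, evicting any previous entry.
        self.sets[spc % JTLB_SETS] = (spc, tpc)

    def lookup(self, spc):
        # Tag compare: returns (hit, TPC), the JTLB_Lookup result pair.
        entry = self.sets[spc % JTLB_SETS]
        if entry is not None and entry[0] == spc:
            return True, entry[1]
        return False, None
```

On a hit the pipeline can verify or redirect the BTB's prediction; on a miss, fetch falls through to the map_lookup path, exactly the three cases listed above.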
Slide 24: JTLB (3)
[Figure: a Lookup_Jump instruction in the pipeline. The BTB (tag, TPC) predicts the next fetch TPC from the instruction's PC. In parallel, the jump's register identifier reads the SPC from the register file, which indexes the JTLB (tag, TPC). If the BTB-predicted target matches the JTLB's jump-destination TPC, fetch continues. On a BTB misprediction with a JTLB hit, fetch is redirected to the jump target TPC from the JTLB. On a JTLB miss, fetch is redirected to the fall-through address.]
Slide 25: D-RAS (1)
• The RAS (return address stack) helps solve the return-jump problem.
  o Push the fall-through PC onto a stack at each call.
• But in a codesigned VM:
  o We need the TPC (not the SPC).
  o If the procedure call is at the end of a translated superblock, the pushed return address may not be correct.
[Figure: a call at the end of translation block A is returned to from translation block X, but the fall-through return address is unknown.]
Slide 26: D-RAS (2)
• A specialized dual-address RAS is used.
[Figure: a Push_DRAS instruction (opcode, SPC, TPC) pushes an SPC/TPC pair onto the dual-address return address stack; a return instruction (opcode, SPC) pops the pair, yielding both the predicted SPC and the predicted TPC.]
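The dual-address RAS in the figure is simple enough to write out. A minimal sketch (class name illustrative): Push_DRAS records both addresses of the point after the call, so a return can be predicted directly in translated-code (TPC) terms.

```python
class DualRAS:
    """Dual-address return address stack: each entry pairs the source
    return address (SPC) with its translated counterpart (TPC)."""

    def __init__(self):
        self.stack = []

    def push(self, spc, tpc):
        # Executed by Push_DRAS at a translated procedure call.
        self.stack.append((spc, tpc))

    def pop(self):
        # Executed at a return: yields (predicted SPC, predicted TPC),
        # so fetch can be redirected into the code cache immediately.
        return self.stack.pop()
```

Keeping the TPC in the stack is what removes the map-table lookup from the return path; the SPC is still needed to verify the prediction.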
Slide 27: Implementing Precise Traps
• Use techniques similar to those in Chapters 3 and 4:
  o Maintain SW checkpoints.
  o Allow code motion by extending register live ranges.
  o When a trap occurs, interpret from the checkpoint to establish the correct state.
• In a codesigned VM:
  o There are enough registers, so live ranges can be extended with less register pressure.
  o Restrictions on code motion are relaxed.
Slide 28: HW Support for Checkpoints (1)
• Use HW to set a checkpoint when each translation block is entered.
[Figure: translation blocks A, B, C, ..., N; a checkpoint is set on entry to each block.]
Slide 29: HW Support for Checkpoints (2)
• If a trap occurs:
  o HW restores the state at the beginning of the block.
  o Then interpretation of the source code is used to provide the precise exception state.
[Figure: a trap inside translation block B restores the checkpoint at the start of B; the corresponding source code is then interpreted up to the trapping instruction.]
Slide 30: HW Support for Checkpoints (3)
• When a new translation block is entered:
  o The state from the previous block is "committed".
  o A new checkpoint is set.
• Setting register checkpoints:
  o When a checkpoint is set, the registers are copied to shadow registers.
  o When a trap occurs, they are copied back from the shadow registers to the working registers.
  o This copying is done very fast.
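The shadow-register mechanism above can be sketched in software (in real HW the copies are fast parallel operations, not loops; names are illustrative):

```python
class CheckpointedRegs:
    def __init__(self, n=64):
        self.working = [0] * n            # registers the translated code uses
        self.shadow = list(self.working)  # checkpoint copy

    def set_checkpoint(self):
        # On entering a new translation block: the previous block's state
        # is committed by copying working -> shadow.
        self.shadow = list(self.working)

    def restore_checkpoint(self):
        # On a trap inside the block: discard speculative updates by
        # copying shadow -> working, restoring the block-entry state.
        self.working = list(self.shadow)
```

After the restore, interpretation from the checkpoint (as on the previous slide) recreates the precise state at the trapping instruction.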
Slide 31: HW Support for Checkpoints (4)
• Checkpointing memory: the gated store buffer.
  o Store operations are buffered until the current translation block is exited (committed).
  o If an exception occurs, the buffered stores are flushed.
  o Restrictions on code motion are relaxed:
    • The code inside a translation block can be reordered by software in any fashion.
  o The fixed size of the store buffer constrains the translation block size.
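The gated store buffer can be sketched the same way (names illustrative; the capacity limit models the fixed HW buffer size noted above):

```python
class GatedStoreBuffer:
    def __init__(self, memory, capacity=8):
        self.memory = memory        # backing memory (here: a dict)
        self.capacity = capacity    # fixed HW size constrains block length
        self.pending = []           # stores held back until commit

    def store(self, addr, value):
        # Stores inside a translation block are buffered, not performed.
        if len(self.pending) >= self.capacity:
            raise RuntimeError("translation block exceeds store buffer")
        self.pending.append((addr, value))

    def commit(self):
        # Block exited normally: release all buffered stores to memory.
        for addr, value in self.pending:
            self.memory[addr] = value
        self.pending.clear()

    def flush(self):
        # Exception inside the block: discard the buffered stores so
        # memory still reflects the checkpoint.
        self.pending.clear()
```

Because no store reaches memory until commit, the translator may reorder stores freely within the block, which is exactly the relaxation the slide describes.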
Slide 32: HW Support for Checkpoints (5)
[Figure: guest registers (plus scratch, speculative results, and constants) alongside shadow guest registers. When a checkpoint is committed, the guest registers are copied into the shadow registers; when a trap is detected, the shadow registers are copied back.]
Slide 33: Page Fault Compatibility (1)
• The guest OS must observe exactly the same page faults as on a native platform.
• If the guest OS manages conventional memory:
  o Page faults for data regions are detected naturally.
  o During interpretation, page faults for code regions are also detected.
  o But executing translated code does not fetch any code from guest memory.
Slide 34: Page Fault Compatibility (2)
• When a translated instruction is fetched from the code cache, we must trigger a page fault if the corresponding guest instruction would have caused a page fault on a native platform.
• Two approaches:
  o Active approach
  o Lazy approach
Slide 35: Active Page Fault Detection (1)
• Monitor potential page replacement by the guest OS.
  o Assuming an architected page table, the VMM can identify the memory region holding the page table.
  o The VMM monitors the guest OS's modifications to the architected page table.
  o By write-protecting the page table, the VMM can monitor any change to a virtual page mapping.
  o The VMM keeps a table recording which virtual pages each translated source instruction is contained in.
Slide 36: Active Page Fault Detection (2)
• If the page table is modified:
  o The VMM flushes all the translations in the code cache derived from the modified page.
  o Table 1 maps each source page to all the translation blocks derived from it (the blocks that must be flushed).
  o Table 2 keeps track of any link backpointers:
    • Links for removed pages are changed to point to the VMM emulation manager.
    • The emulation process will then detect the instruction page fault.
Slide 37: Lazy Page Fault Detection (1)
• Code cache flushing is postponed until the replaced code is actually used.
  o Every time translated code crosses a source page boundary, check the page table.
  o A Verify_Translation instruction is inserted at each boundary crossing.
  o It checks the page mapping:
    • Page mapped correctly → proceed.
    • Page not mapped → page fault.
Lazy Page Fault Detection (2)
ABC
DE
FG
HI
J
K
L
ABC
DE
FG
HI
J
K
L
ABC
DE
FG
HIJ
KL
Probe page tablePage
correctlymapped?
Yes
No Jump to VMM
continue execution
Guest Pages Code Cache
Verify_Translationinstruction
Slide 39: Input/Output (1)
• If the VMM does not itself use any I/O:
  o All the guest device drivers can be run as-is.
  o Any I/O instructions or memory-mapped I/O are simply passed through.
• Volatile memory inhibits optimization, so we need to identify accesses to volatile memory.
  o Use an access-protect bit: a load/store to such a page traps, and the VMM deoptimizes to restore the correct access sequence.
  o Or provide special volatile versions of loads/stores.
Slide 40: Input/Output (2)
• Using a disk in the VMM:
  o For a disk-based code cache approach: a large, persistent code cache.
  o Requires relaxed transparency.
  o "Concealed secondary storage."
  o A VMM-aware special disk driver.
[Figure: the guest OS sits above the VMM; a special disk driver gives the VMM access to a concealed disk region.]