oleg goldshmidt - the eye · cooperative linux (colinux) a port of linux kernel that allows it to...
TRANSCRIPT
Computers and Operating Systems
the familiar picture: one computer, one OS
multiprocessor picture: one CPU, one OSrecall, e.g., the master-slave configuration
multicomputer picture: one node, one OS
SMP picture: many CPUs, one OS
where is: one CPU, many OSes?
Can we run more than one OS on a single CPU (or,more generally, on a single computer, e.g., several SMPOSes on an SMP machine) simultaneously? And whywould we want to do it?
– p.2/43
Server Consolidation
IT needs of enterprizes growmore computing powermore new servers with different roles, disparateoperating environmentsvery high acquisition, provisioning, maintenancecosts
servers are provisioned on a “just in case” basis, thusunderutilized
workloads change
granularity of resource control is low
– p.3/43
Server Consolidation II
example: IT infrastructure servers (file servers, printservers, directory infrastructure servers, firewalls, NAT,DNS, DHCP, etc.)
typical utilization:
� � � �� �
architectural, security, compatibility issues lead todeployment on separate machines
hardware expensesoperational expensesspace, facilities, power, cooling, etc.
example: mission critical serverstypical utilization:
� � � � � �
varying workload: today need more web servershere, tomorrow need more DB servers there, etc.maintenance, high availability, recovery time
– p.4/43
Desktop Consolidation
typical desktop utilization:
� �
today: investment bank branch200 users with different requirementsIT operation, personnel, etc to maintain 200desktops + spares“I need more memory, faster CPU, bigger disk”
temporarily
tomorrowusers only have peripheral equipment (KVM)one physical platform runs 4 different operatingenvironments50 consolidated machines + 5 sparesnew environment is created as need arises
– p.5/43
Capacity Optimization
treat the available hardware as a single resource pool
flexible capacity planning
provisioning to optimize resource utilization
“just in time” provisioning
moving functionality around the resource pool
dealing with dynamic workloads
– p.6/43
Testing, Upgrading, Debugging
test multiple operating environments without the needfor separate physical platforms
e.g., test a new version of software (including OSkernels) before upgrading
live update: shut down an old version and bring up anew one while the rest of the environment keepsworking
bring up a new version alongside the old one andswitch!
rollback: shut down the new version and restore the oldone if needed
debugging does not affect the rest of the system
– p.7/43
Research and Development
what happens if there is a serious bug in your brandnew OS?
your computer freezes, catches fire, or becomesgenerally unusable
how do you debug a running OS?you do it from a different computer connected to thetarget via, e.g., a serial cable
wouldn’t it be great ifyou could develop, test, debug a new OS using onlyone computer?a bug would not render your computer unusable?you had control over the OS behaviour from theoutside?
– p.8/43
Security
a virtual machine provides an isolated environment thatcan be destroyed without permanent damage to thesystem
recent idea: run applications such as Internet Explorerin a virtual machine
even if the machine gets infected with a virus, or0WNed, if it is detected in time it can be broughtdown and restarted afresh without affecting the restof the systemprovided there is appropriate protection for thepermanent storage
– p.9/43
Using The Best Tools
I want more then one OS on my machine!simultaneously!
my main development environment is Linux
I may need Windows for specific applications or taskswithout rebooting
terminal server solutions (e.g., Citrix) are not alwaysavailable
need to be on the network
example: there is some problem with using Linux withour projector, so I use Windows — but I’d like todemonstrate some things on Linux during the lecture!
– p.10/43
A Bit Of History
from plex86 FAQ (by Kevin Lawton):Q: Who came up with the idea for this so that I canleave all my worldly posessions to them when I passon?A: IBM essentially. They’ve been doing this since the1960’s. I believe the concept of having a virtualizablesystem, and the HAL idea is theirs.
the idea of virtualization goes back to the 60ies, toSystem 360 and 370: each user got his/her ownoperating environment, hardware was partitioned, I/Oabstractions provided to each OS.
mainframes are very much alive now, run most of theworld’s big businesses, and are popular as virtual Linuxfarms
– p.11/43
So What Seems To Be The Problem?
the problem is that OS likes to think it is the sole ownerof all the system’s resources
resource management is a major OS task, and one ofthe main resons for OS existence
OS manages sharing the system resources betweenprocesses
kernel (system) space vs. user spaceIntel CPUs have a special register2 bits — 4 protection ringsonly 2 are used: “ring 0” for kernel, “ring 3” foruserspaceno OS used rings 1 or 2 since the late OS/2
– p.12/43
What Are the Options?
cooperative virtual machine: multiple OSes share ring 0
emulation: run one host in ring 0, other OSes in ring 3(as applications)
usually slow — not native execution
para-virtualization: add a protection level, e.g., run a“hypervisor” in ring 0, OSes (“partitions”) — in ring 1,applications — in ring 3
good performanceneed to modify the OS!
binary translation (full virtualization): try to figure outwhat an OS does, change it on the fly
significant performance overheadlots of headaches you would not have otherwise
– p.13/43
Cooperative Linux (coLinux)
a port of Linux kernel that allows it to run alongsideother operating systems (e.g., Windows) on a singlemachine (135K patch, compile time parameter)
runs in the driver context of another OS
runs at the same protection level as the host kernel
the driver gives the guest kernel a fixed amout ofmemory
guest kernel has its own page tables, etc.
there is a “passage page” shared between the hostdriver and the guest kernel — 4K of host and guestCPU state
– p.14/43
Cooperative Linux II
in the host OS, a userspace daemon ioctl()’s the driver,the driver switches to guest while preservinf the fullCPU state
when the guest needs data from the host or an interruptoccurs the driver switches back
drawbacks: stability and security
measures can be taken to gracefully shut down on thefirst “oops” or panic
the two kernels must be trusted and cooperative
an attack on one of them leads to a compromise of theother
more details: http://www.colinux.org
– p.15/43
Emulation vs. Virtualization
emulation: providing the functionality of the targethardware completely in software
you can emulate an architecture using a completelydifferent architecturetends to be slow
virtualization: partitioning real hardware into multiplecontexts, which (usually) timeshare the real hardware
typically faster than emulationneed the right type of hardware
a lot of confusion, especially since both words are sooverloaded
– p.16/43
Hardware Emulation
emulate the computer hardware in an application
run an application that emulates the hardware, run anOS and an application stack inside this application
can be used for cross-platform development
user mode emulationused to run processes compiled for one architectureon a different architecture
full system emulationused to emulate a full system used to run an OS andapplications
– p.17/43
Hardware Emulation: bochs
emulates an x86 PC386, 486, Pentium, PentiumPro, AMD64, includingMMX, SSE, etc.common I/O devices: disk, floppy, CD, NIC, etc.includes BIOS and VGA BIOS
runs on UNIX, Linux, Windows, MacOS X, BeOS, OS/2,etc.
runs Linux, DOS, Windows (95, NT), MacOS X
a C++ application providing a faithful HW emulation
the BIOS and OS are stored in files on host
emulates every instruction — slow
more details: http://bochs.sourceforge.net/
– p.18/43
Hardware Emulation: QEMU
open source CPU emulator
full system emulation for x86, x86_64, PowerPC
user mode emulation for Linux for x86, x86_64,PowerPC, SPARC
Accelerator Module for PC — runs most of the targetcode directly on the host CPU to achieve goodperformance
lots of OSes, including Linux, Windows, QNX, NetBSD,L4, Solaris, Minix, etc.
slowdownwithout acceleration: factor of 5 to 10with acceleration: factor of 1 to 2
– p.19/43
QEMU: Dynamic Translation
QEMU converts target code to host instruction setsplits, e.g., x86 instructions into fewer simplerinstructionseach simple instruction is implemented in Ca compile time tool (dyngen ) dynamically generatescode by concatenating simple instructions
a lot of optimizations — condition code, translationcache
MMU emulation via mmap()
QEMU can even emulate itself (self-virtualization)
more details:http://fabrice.bellard.free.fr/qemu/
– p.20/43
CPU Virtualization
we need a “hypervisor” (a.k.a. “virtual machinemonitor”) to control the many OSes running on a singlesystem
“one ring to rule them all...”
to maintain complete control, the OS must be kept outof ring 0
if one of the OSes can mess with the hypervisor itwill defeat the original purpose
but OSes are used to running in ring 0 and havingcomplete control themselves!
the 0/1/3 model: change the OS so it plays nice
the 0/3 model: push the OS to ring 3, but do everythingelse as if it were 0/1/3 (differently from emulation)
– p.21/43
CPU Virtualization Woes I
problem: some instructions will only work from ring 0
bigger problem: some instructions will behave “oddly” ifissued from a wrong ring
long-term problem: modern architectures (x86_64) tendto “clean up” by getting rid of rings 1 and 2, under theassumption that no one cares
but virtualization folks do!
categories of “odd behaviour”instructions that check which ring they are ininstructions that do not save the CPU state correctlywhen in a wrong ringinstructions that do not fault when they shouldinstructions that fault when they should not
– p.22/43
CPU Virtualization Woes II
imagine an OS checking which ring it is init finds it is in ring 1, not ring 0returns 1, not 0 — treated as (severe?) errorBSOD, panic, core dump, or other unpleasantriesbinary translation can notice and fake 0, but it willtake lots of instructions rather than one —performance hit
saving CPU statesome CPU registers are not easy to save on contextswitch once loaded into memorymemory-resident portions and the real CPU stateare not the sameworkarounds are tricky and expensive
– p.23/43
CPU Virtualization Woes III
things that do not fail when they shouldnothing happens when you expect a faultyou cannot catch the fault when you want towhat if it something at protection boundary?
things that fail when they should notexample: writing to CR0 or CR4 (on Intel)
CR0: cache control, paging control, taskswitching...e.g., on every HW context switch the TS flag is setin CR0will fault if not performed in the correct ringeither crash or lots of effort and overhead
– p.24/43
Para-Virtualization: Xen
http://www.cl.cam.ac.uk/netos/papers/2003-xensosp.pdf
– p.25/43
Control Mechanisms
hypercallssynchronous calls from a guest OS (“domain,”“partition”) to the hypervisoranalogous to system calls in regular OSessynchronous software trap into the hypervisor toperform a priviledged operationexample: request for page table updates
asynchronous eventsreplace device interrupts on real hardwareexamples: data received over network, virtual diskrequest has been completedevents can be deferred (like disabling interrupts)quite similar to UNIX signals, handlers
– p.26/43
Memory Virtualization: TLB
software-managed TLB is relatively easy
tagged TLB (supported by Alpha, MIPS, SPARC, andother RISC arcitectures) also helps
an address space tag is associated with each TLBentrythe hypervisor and the OSes co-exist in separateaddress spacesno need to flush the entire TLB when execution istransferred
Intel did it “wrong”:TLB is hardware-managed by walking through HWpage tables in case of missTLB is not tagged, so it must be flushed at addressspace switch
– p.27/43
Memory Virtualization: Page Tables
shadow page tables approach: each OS keeps a set ofpage tables distinct from hardware
hypervisor is responsible for trapping updates,validation, updating the real hardware page tables
direct page table access: each OS has read-onlyaccess to the real page tables
hypervisor ensures isolation and protection
– p.28/43
Memory Virtualization In Xen
hypervisor is mapped into the top 64MB of everyaddress space, so no TLB flushes when entering andexiting
guest OSes register the page tables with Xen
each OS allocates and initializes pages from its ownmemory reservation
all subsequent write access must be validated by XenOS can map only pages it ownsno writeable mappings of page tableXen’s 64MB is not mappable (not required by anyx86 ABI)update requests to Xen may be batched
– p.29/43
Segmentation Virtualization In Xen
segmentation support is required by thread librarieswith TLS (Thread-Local Storage) support
similar approach to virtualization of hardware segmentdescriptor tables
hypervisor validates updates
Xen itself is protected by a segment, so there can be nosegments that overlap with it
NPT (New POSIX Thread library) TLS usessegmentation in a way that conflicts with Xen
Xen needs binary rewriting to cope
– p.30/43
Page Frame Types In Xen
each machine page frame has a type and a referencecount
page types are mutually exclusive:PD — page directoryPT — page tableGDT — global descriptor tableLDT — local descriptor tableRW — writeable
e.g., writeable mappings are disallowed: a page framecannot be PT and RW simultaneously
a page frame may be retasked only if its referencecount is 0
– p.31/43
Page Validation And Pinning In Xen
OS indicates when a page is allocated for page tableuse
page frame type is pinned as PD or PT, asappropriateuntil a subsequent unpin request from the OS
safety check: a page cannot be retasked until it is bothunpinned and its reference count is reduced to 0
the hypervisor does not trust the guest OSes, andwants to protect against circumventing the referencecounting mechanism by unpinning a used page
– p.32/43
Processing Page Table Updates
OSes can queue update requests locallyvery useful optimization, e.g., when creating a newaddress spacerequests must be committed early enough forcorrecness
OS flushes TLB before using a new mappingto ensure that cached translations are invalidatedcommitting pending updates before the flush will beenough to ensure correctness
OSes doesn’t flush TLB if there are no stale entriesfirst use of new mapping will generate a page faultthe page fault handler must check for outstandingupdates before retrying the faulting instruction
– p.33/43
Ballooning
most OSes happily utilize all the memory they havethey don’t release unused memory until someprocess needs it
we would like to manage memory dynamicallyincluding taking memory from a guest OS when it isneeded elsewhere
“balloon driver”:tells the OS it needs so many MB of memoryOS allocates the memory which is not really usedthe rest of the domain makes do with a smalleramount of memorythe hypervisor knows it can give the memoryallocated to but unused by the balloon driver toanother domain
– p.34/43
I/O Virtualization In Xen
Xen starts a priviledged control domain (dom0) that cantouch all hardware devices in the system
dom0 exports subsets of devices to other, “guest”domains (domU)
mechanism: asynchronous shared memorychannelseven rings for interrupt handling
dom0 runs a “backend”, exports “frontends” to domUhigh level interface: block class, network class
virtual PCI configuration space, virtual interrupts
domU may be granted physical device access, securely
– p.35/43
Time, Timers, And Scheduling In Xen
real time: nanoseconds since boot of physical machinemaintained to the accuracy of the real timercan be synced with a remote source (NTP)
virtual time: advances only while a domain is executingto ensure fair scheduling between the domain’sprocesses
wall-clock time: offset to be added to real timecan be adjusted without affecting the flow of real time
each guest OS programs a pair of alarm timers: realand virtual
there are numerous schedulers to time-share betweendomains
– p.36/43
Full Virtualization: VMWare
what if we must run unmodified OS (e.g., Windows)?
use full virtualization:employ binary code translationdetect privileged instructions executed by a guestOS, change them on the fly
can do what para-virtualization cannot
suffers from performance degradation, because of theoverhead of the virtualization mechanism
– p.37/43
VMWare Workstation and GSX
http://www.vmware.com/products/server/gsx_features.html
– p.38/43
VMWare ESX Server
http://www.vmware.com/products/server/esx_features.html
– p.39/43
VMWare vs. Xen (or Full vs. Para)
VMWare: 0/3, Xen: 0/1/3
Xen is quite a bit faster:Xen claim 97% of native performanceVMWare: slowdown of dozens of per cent
VMWare: shadow page tables, Xen: direct page tableaccess
balloon support — both
migration supported — both
no modifications of application stack
VMWare does not modify guest OSclosed source, licensing issuese.g., Xen cannot run Windows as guest OS
– p.40/43
Hardware Support I
how to avoid modifying guest OS and avoid thecomplexity and performance problems of fullvirtualization?
support virtualization in hardware!
keep the OS in ring 0, add another protection ring ( � �
?)below it
solves all problems
Intel’s VT-x (for x86) and VT-i (Itanium) supportannounced
preparing release of desktop-grade CPUs this year
AMD’s Pacifica — full spec released on 25/05/2005
– p.41/43
Hearsay: Some Pacifica Features
tagged TLB
multiple device domains (up to 4 per Northbridge)in each device domain the physical pages devicescan DMA to can be limited via a bitmap
nested pages (optional)guest translates guest virtual addresses to guestphysical addressesthe host completes the translation from guestphysical to host (real) physical addresswhen a translation is needed the hardware isallowed to consult both page tables and store theend-to-end mapping from guest virtual to realphysical address in the TLB
– p.42/43
Hardware Support II
Xen has support for both VT and Pacifica built in(partial) specs have been available for quite a whilewill be able to run Windows as guest “real soon now”
other architectures (e.g., Power) have had it sinceforever
trivia: Apple’s G5 is Power 970 with virtualizationsupport disabled
catch up with VMWare on OS support, beat it onperformance?
– p.43/43