Operating System Support for Planetary-Scale Network Services
TRANSCRIPT
Operating System Support for Planetary-Scale Network Services
Andy Bavier, Larry Peterson
PlanetLab
• PlanetLab is a geographically distributed overlay network designed to support the deployment and evaluation of planetary-scale network services.
• 580 machines spanning 275 sites and 30 countries; nodes within a LAN-hop of > 2M users.
• Supports distributed virtualization: each of 425 network services runs in its own slice.
• PlanetLab services and applications run in a slice of the platform: a set of nodes on which the service receives a fraction of each node's resources, in the form of a virtual machine (VM).
PlanetLab
[Map of PlanetLab node sites across U.S. cities: Seattle, Sunnyvale, LA, San Diego, Denver, Chicago, Pittsburgh, New York, Washington DC, Atlanta, Houston, and others]
Slices
Per-Node View
VMM: Linux++
NodeMgr  LocalAdmin  VM1  VM2  …  VMn
Requirements of PlanetLab
• Distributed Virtualization
– Isolating slices
– Isolating PlanetLab
• Unbundled Management
– Minimize the functionality subsumed by the PlanetLab OS.
– To maximize the opportunity for services to compete with each other on a level playing field, the interface between the OS and these infrastructure services must be sharable, and hence, without special privilege.
Isolating Slices
• OS requirements for slice isolation:
– It must allocate and schedule node resources (cycles, bandwidth, memory, and storage) so that the runtime behavior of one slice on a node does not adversely affect the performance of another on the same node.
– It must either partition or contextualize the available name spaces (network addresses, file names, etc.) to prevent one slice from interfering with another, or gaining access to information in another slice.
– It must provide a stable programming base that cannot be manipulated by code running in one slice in a way that negatively affects another slice.
Isolating PlanetLab
• It must thoroughly account resource usage, and make it possible to place limits on resource consumption so as to mitigate the damage a service can inflict on the Internet.
• It must make it easy to audit resource usage, so that actions (rather than just resources) can be accounted to slices after the fact.
PlanetLab OS
• Node Virtualization
• Isolation and Resource Allocation
• Network Virtualization
• Monitoring
Node Virtualization
• Low-level virtualization
– Full hypervisors like VMware
• Cons: performance and scalability
– Paravirtualization like Xen
• Cons: scalability and immaturity
• System-call-level virtualization, such as UML (User Mode Linux) and Linux Vservers
– Good performance
– Reasonable assurance of isolation
Node Virtualization
• PlanetLab's implementation
– Linux Vserver, a system-call-level virtualization. Vservers are the principal mechanism in PlanetLab for providing virtualization on a single node, and contextualization of name spaces, e.g., user identifiers and files.
– The chroot utility is used to provide file system isolation. Vserver extends the non-reversible isolation provided by chroot for filesystems to other operating system resources, such as processes and SysV IPC.
Node Virtualization
• Drawbacks
– Weaker guarantees on isolation
– Challenges in eliminating QoS crosstalk
Isolation and Resource Allocation
• The node manager provides a low-level interface for obtaining resources on a node and binding them to a local VM that belongs to some slice.
• A bootstrap brokerage service running in a privileged slice implements the resource allocation policy.
• Non-renewable resources, such as memory pages, disk space, and file descriptors, are isolated using per-slice reservations and limits.
• For renewable resources such as CPU cycles and link bandwidth, the OS supports two approaches to providing isolation: fairness and guarantees.
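The per-slice reservations and limits on non-renewable resources can be sketched with standard POSIX resource limits. The slice name and numbers below are illustrative, not PlanetLab's actual policy, and the real system applies such limits per vserver in the kernel:

```python
import resource

# Illustrative per-slice caps on non-renewable resources (file
# descriptors and address space); the slice name and values are invented.
SLICE_LIMITS = {
    "princeton_codeen": {"open_files": 1024, "address_space_mb": 256},
}

def slice_rlimits(slice_name):
    """Translate a slice's limits into (resource, (soft, hard)) pairs."""
    limits = SLICE_LIMITS[slice_name]
    nofile = limits["open_files"]
    as_bytes = limits["address_space_mb"] * 1024 * 1024
    return [(resource.RLIMIT_NOFILE, (nofile, nofile)),
            (resource.RLIMIT_AS, (as_bytes, as_bytes))]

def apply_slice_limits(slice_name):
    """Apply the limits to the current process (e.g., a VM's init)."""
    for res, pair in slice_rlimits(slice_name):
        resource.setrlimit(res, pair)
```

Setting the hard limit equal to the soft limit makes the cap non-reversible from inside the slice, matching the isolation goal.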
Isolation and Resource Allocation
• Fairness and guarantees
– The Hierarchical Token Bucket (htb) queuing discipline of the Linux Traffic Control facility (tc) is used to cap the total outgoing bandwidth of a node, cap per-vserver output, and provide bandwidth guarantees and fair service among vservers.
– CPU scheduling is implemented by the SILK kernel module, which leverages Scout [28] to provide vservers with CPU guarantees and fairness.
– PlanetLab's CPU scheduler uses a proportional sharing (PS) scheduling policy to fairly share the CPU. It incorporates the resource container abstraction and maps each vserver onto a resource container that possesses some number of shares.
Network Virtualization
• The PlanetLab OS supports network virtualization by providing a “safe” version of Linux raw sockets that services can use to send and receive IP packets without root privileges.
• It intercepts all incoming IP packets using Linux’s netfilter interface and demultiplexes each to a Linux socket or to a safe raw socket.
Network Virtualization
• Send
– SILK intercepts the packet and performs the security checks.
– If the packet passes these checks, it is handed off to the Linux protocol stack via the standard raw socket sendmsg routine.
• Receive
– Packets that demultiplex to a Linux socket are returned to Linux's protocol stack for further processing;
– those that demultiplex to a safe raw socket are placed directly in the per-socket queue maintained by SILK.
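The receive-path split can be sketched as a flow-table lookup: if a safe raw socket has claimed the flow, the packet goes into its per-socket queue; otherwise it falls through to the regular stack. The flow keys and class names here are invented; the real classification happens in the kernel via netfilter:

```python
from collections import deque

class SafeRawSocket:
    """A stand-in for a safe raw socket with its per-socket queue."""
    def __init__(self):
        self.queue = deque()

def demux(flow_table, flow_key, packet, linux_stack_queue):
    """Deliver a packet to a safe raw socket if one claims the flow,
    otherwise hand it back to the regular protocol stack."""
    sock = flow_table.get(flow_key)
    if sock is not None:
        sock.queue.append(packet)      # safe raw socket path
        return "safe_raw"
    linux_stack_queue.append(packet)   # default Linux path
    return "linux"
```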
Monitoring
• Defined a low-level sensor interface for uniformly exporting data from the underlying OS and network, as well as from individual services.
• Sensor semantics are divided into two types: snapshot and streaming.
– Snapshot sensors maintain a finite-size table of tuples, and immediately return the table (or some subset of it) when queried.
– Streaming sensors follow an event model, and deliver their data asynchronously, a tuple at a time, as it becomes available.
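The two sensor styles can be sketched as a bounded table versus a publish-subscribe stream. The class and method names are invented for illustration; PlanetLab's actual sensors were exposed over a node-local interface:

```python
class SnapshotSensor:
    """Keeps a finite-size table of tuples and returns it on demand."""
    def __init__(self, max_rows):
        self.max_rows = max_rows
        self.table = []

    def record(self, row):
        self.table.append(row)
        self.table = self.table[-self.max_rows:]  # bound the table size

    def query(self):
        return list(self.table)  # immediate snapshot of current state

class StreamingSensor:
    """Delivers each tuple to subscribers as it becomes available."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def emit(self, row):
        for cb in self.subscribers:
            cb(row)  # a real system would deliver this asynchronously
```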
Evaluation
• Vserver scalability
– The scalability of vservers is primarily determined by disk space for vserver root filesystems and service-specific storage. 508 MB of disk is required for each VM; copy-on-write techniques are adopted by the PlanetLab OS to reduce disk usage.
– Kernel resource limits are a secondary factor in the scalability of vservers. Reconfiguring kernel options and recompiling the kernel is an option.
Evaluation
• Slice creation
– Vserver creation and initialization takes an additional 9.66 seconds on average.
• Slice initialization
– It takes on average 11.2 seconds to perform an empty update on a node;
– when a new Sophia "core" package is found and needs to be upgraded, the time increases to 25.9 seconds per node;
– when run on 180 nodes, the average update time for the whole system (corresponding to the slowest node) is 228.0 seconds.
Discussion
• Instead of high-level virtualization, can some low-level virtualization be deployed in the PlanetLab OS?
– Performance?
– Scalability?
– Isolation?
– Security?
– Maturity?
Conclusion
• The PlanetLab OS supports distributed virtualization and unbundled management.
• The PlanetLab OS provides only local (per-node) abstractions, with as much global (network-wide) functionality as possible pushed onto infrastructure services running in their own slices.
Q&A
Scout & SILK
• Scout is a modular, configurable, communication-oriented operating system developed for small network appliances. Scout was designed around the needs of data-centric applications, with particular attention given to networking.
• SILK stands for Scout In the Linux Kernel, and is a port of the Scout operating system to run as a Linux kernel module.
Scout Design
• Early demultiplexing
– This allows the system to isolate flows as early as possible, in order to prioritize packet processing and accurately account for resources.
• Early dropping
– When flow queues are full, the server can avoid overload by dropping packets before investing many resources in them.
• Accounting
– Accounting of the resources used by each data flow, including CPU, memory, and bandwidth.
• Explicit scheduling
– Scheduling and accounting are combined to provide resource guarantees to flows.
• Extensibility
– This makes it easy to add new protocols and construct new network services.
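Early demultiplexing and early dropping combine naturally: a packet is classified to its flow immediately on arrival, and discarded at once when that flow's bounded queue is full, before any protocol processing is invested. A minimal sketch, with invented names and queue sizes:

```python
from collections import deque

class Flow:
    """A flow with a bounded queue and a drop counter."""
    def __init__(self, max_queue):
        self.queue = deque()
        self.max_queue = max_queue
        self.dropped = 0

    def enqueue(self, packet):
        if len(self.queue) >= self.max_queue:
            self.dropped += 1   # early drop: no further resources invested
            return False
        self.queue.append(packet)
        return True

def classify_and_enqueue(flows, flow_key, packet):
    """Early demultiplexing: pick the flow before any protocol processing."""
    return flows[flow_key].enqueue(packet)
```

Because the drop happens per flow, an overloaded flow sheds its own load without consuming the CPU and memory that fair service to other flows requires.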
SILK Network Path
QoS Crosstalk
• We would like to "reserve" some amount of resource (e.g., CPU time) for each application so it can meet its QoS requirements.
• Problem: not all CPU time is properly accounted for!
– E.g., when the TCP stack is processing incoming packets, the CPU time is being used by the kernel, not any specific application.
– The same goes for handling file system requests, disk I/O, page faults, etc.
• It is sometimes very hard to tell which application the OS is doing work for!
• Result: the activity of one application can impact the performance of another.
– This is called QoS crosstalk.
– Applications are not properly isolated from one another.
Linux Raw Socket
• The basic concept of low-level sockets is to send a single packet at a time, with all the protocol headers filled in by the program (instead of the kernel).
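"Filled in by the program" means constructing, for example, the 20-byte IPv4 header by hand, checksum included. A sketch with illustrative addresses; actually sending the result would additionally require a SOCK_RAW socket and root (or CAP_NET_RAW) privileges:

```python
import struct
import socket

def ip_checksum(header):
    """Standard Internet checksum: one's-complement sum of 16-bit words."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) + header[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries
    return ~total & 0xFFFF

def build_ipv4_header(src, dst, payload_len, proto=socket.IPPROTO_UDP):
    """Build a minimal 20-byte IPv4 header with a correct checksum."""
    ver_ihl = (4 << 4) | 5                # IPv4, 5 x 32-bit header words
    total_len = 20 + payload_len
    header = struct.pack("!BBHHHBBH4s4s",
                         ver_ihl, 0, total_len,
                         0, 0,            # identification, flags/fragment
                         64, proto, 0,    # TTL, protocol, checksum placeholder
                         socket.inet_aton(src), socket.inet_aton(dst))
    csum = ip_checksum(header)
    return header[:10] + struct.pack("!H", csum) + header[12:]
```

A handy property for checking the result: recomputing the checksum over a correctly checksummed header yields zero.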