Operating System Support for Planetary-Scale Network Services
TRANSCRIPT
Operating System Support for Planetary-Scale Network Services
Andy Bavier, Larry Peterson
PlanetLab
• PlanetLab is a geographically distributed overlay network designed to support the deployment and evaluation of planetary-scale network services.
• 580 machines spanning 275 sites and 30 countries; nodes within a LAN-hop of > 2M users.
• Supports distributed virtualization: each of 425 network services runs in its own slice.
• PlanetLab services and applications run in a slice of the platform: a set of nodes on which the service receives a fraction of each node's resources, in the form of a virtual machine (VM).
PlanetLab
[Map of PlanetLab node sites across U.S. cities: Seattle, Sunnyvale, LA, San Diego, Denver, Chicago, Pittsburgh, New York, Washington DC, Atlanta, Houston, and others]
Slices
Per-Node View
VMM: Linux++
NodeMgr  LocalAdmin  VM1  VM2  …  VMn
Requirements of PlanetLab
• Distributed Virtualization
– Isolating slices
– Isolating PlanetLab
• Unbundled Management
– Minimize the functionality subsumed by the PlanetLab OS.
– To maximize the opportunity for services to compete with each other on a level playing field, the interface between the OS and these infrastructure services must be sharable, and hence, without special privilege.
Isolating Slices
• OS requirements for slice isolation:
– It must allocate and schedule node resources (cycles, bandwidth, memory, and storage) so that the runtime behavior of one slice on a node does not adversely affect the performance of another on the same node.
– It must either partition or contextualize the available name spaces (network addresses, file names, etc.) to prevent one slice from interfering with another, or gaining access to information in another slice.
– It must provide a stable programming base that cannot be manipulated by code running in one slice in a way that negatively affects another slice.
Isolating PlanetLab
• It must thoroughly account resource usage, and make it possible to place limits on resource consumption so as to mitigate the damage a service can inflict on the Internet.
• It must make it easy to audit resource usage, so that actions (rather than just resources) can be accounted to slices after the fact.
PlanetLab OS
• Node Virtualization
• Isolation and Resource Allocation
• Network Virtualization
• Monitoring
Node Virtualization
• Low-level virtualization
– Full hypervisors like VMware
• Cons: performance and scalability
– Paravirtualization like Xen
• Cons: scalability and immaturity
• System-call-level virtualization, such as UML (User Mode Linux) and Linux Vservers
– Good performance
– Reasonable assurance of isolation
Node Virtualization
• PlanetLab's implementation
– Linux Vserver, a system-call-level virtualization. Vservers are the principal mechanism in PlanetLab for providing virtualization on a single node, and contextualization of name spaces, e.g., user identifiers and files.
– The chroot utility is used to provide file system isolation. Vserver extends the non-reversible isolation provided by chroot for filesystems to other operating system resources, such as processes and SysV IPC.
Node Virtualization
• Drawbacks
– Weaker guarantees on isolation
– Challenges in eliminating QoS crosstalk
Isolation and Resource Allocation
• The node manager provides a low-level interface for obtaining resources on a node and binding them to a local VM that belongs to some slice.
• A bootstrap brokerage service running in a privileged slice implements the resource allocation policy.
• Non-renewable resources, such as memory pages, disk space, and file descriptors, are isolated using per-slice reservations and limits.
• For renewable resources such as CPU cycles and link bandwidth, the OS supports two approaches to providing isolation: fairness and guarantees.
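The per-slice reservations and limits on non-renewable resources can be sketched with standard POSIX resource limits. The slice name and numbers below are illustrative, not PlanetLab's actual policy, and the real system applies such limits per vserver in the kernel:

```python
import resource

# Illustrative per-slice caps on non-renewable resources (file
# descriptors and address space); the slice name and values are invented.
SLICE_LIMITS = {
    "princeton_codeen": {"open_files": 1024, "address_space_mb": 256},
}

def slice_rlimits(slice_name):
    """Translate a slice's limits into (resource, (soft, hard)) pairs."""
    limits = SLICE_LIMITS[slice_name]
    nofile = limits["open_files"]
    as_bytes = limits["address_space_mb"] * 1024 * 1024
    return [(resource.RLIMIT_NOFILE, (nofile, nofile)),
            (resource.RLIMIT_AS, (as_bytes, as_bytes))]

def apply_slice_limits(slice_name):
    """Apply the limits to the current process (e.g., a VM's init)."""
    for res, pair in slice_rlimits(slice_name):
        resource.setrlimit(res, pair)
```

Setting the hard limit equal to the soft limit makes the cap non-reversible from inside the slice, matching the isolation goal.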
Isolation and Resource Allocation
• Fairness and guarantees
– The Hierarchical Token Bucket (htb) queuing discipline of the Linux Traffic Control facility (tc) is used to cap the total outgoing bandwidth of a node, cap per-vserver output, and provide bandwidth guarantees and fair service among vservers.
– CPU scheduling is implemented by the SILK kernel module, which leverages Scout [28] to provide vservers with CPU guarantees and fairness.
– PlanetLab's CPU scheduler uses a proportional sharing (PS) scheduling policy to fairly share the CPU. It incorporates the resource container abstraction and maps each vserver onto a resource container that possesses some number of shares.
Network Virtualization
• The PlanetLab OS supports network virtualization by providing a “safe” version of Linux raw sockets that services can use to send and receive IP packets without root privileges.
• It intercepts all incoming IP packets using Linux’s netfilter interface and demultiplexes each to a Linux socket or to a safe raw socket.
Network Virtualization
• Send
– SILK intercepts the packet and performs the security checks.
– If the packet passes these checks, it is handed off to the Linux protocol stack via the standard raw socket sendmsg routine.
• Receive
– Packets that demultiplex to a Linux socket are returned to Linux's protocol stack for further processing;
– those that demultiplex to a safe raw socket are placed directly in the per-socket queue maintained by SILK.
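The receive-path split can be sketched as a flow-table lookup: if a safe raw socket has claimed the flow, the packet goes into its per-socket queue; otherwise it falls through to the regular stack. The flow keys and class names here are invented; the real classification happens in the kernel via netfilter:

```python
from collections import deque

class SafeRawSocket:
    """A stand-in for a safe raw socket with its per-socket queue."""
    def __init__(self):
        self.queue = deque()

def demux(flow_table, flow_key, packet, linux_stack_queue):
    """Deliver a packet to a safe raw socket if one claims the flow,
    otherwise hand it back to the regular protocol stack."""
    sock = flow_table.get(flow_key)
    if sock is not None:
        sock.queue.append(packet)      # safe raw socket path
        return "safe_raw"
    linux_stack_queue.append(packet)   # default Linux path
    return "linux"
```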
Monitoring
• Defined a low-level sensor interface for uniformly exporting data from the underlying OS and network, as well as from individual services.
• Sensor semantics are divided into two types: snapshot and streaming.
– Snapshot sensors maintain a finite-size table of tuples, and immediately return the table (or some subset of it) when queried.
– Streaming sensors follow an event model, and deliver their data asynchronously, a tuple at a time, as it becomes available.
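The two sensor styles can be sketched as a bounded table versus a publish-subscribe stream. The class and method names are invented for illustration; PlanetLab's actual sensors were exposed over a node-local interface:

```python
class SnapshotSensor:
    """Keeps a finite-size table of tuples and returns it on demand."""
    def __init__(self, max_rows):
        self.max_rows = max_rows
        self.table = []

    def record(self, row):
        self.table.append(row)
        self.table = self.table[-self.max_rows:]  # bound the table size

    def query(self):
        return list(self.table)  # immediate snapshot of current state

class StreamingSensor:
    """Delivers each tuple to subscribers as it becomes available."""
    def __init__(self):
        self.subscribers = []

    def subscribe(self, callback):
        self.subscribers.append(callback)

    def emit(self, row):
        for cb in self.subscribers:
            cb(row)  # a real system would deliver this asynchronously
```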
Evaluation
• Vserver scalability
– The scalability of vservers is primarily determined by disk space for vserver root filesystems and service-specific storage. 508 MB of disk is required for each VM; copy-on-write techniques are adopted by the PlanetLab OS to reduce disk usage.
– Kernel resource limits are a secondary factor in the scalability of vservers. Reconfiguring kernel options and recompiling the kernel is an option.
Evaluation
• Slice creation
– Vserver creation and initialization takes an additional 9.66 seconds on average.
• Slice initialization
– It takes on average 11.2 seconds to perform an empty update on a node;
– when a new Sophia "core" package is found and needs to be upgraded, the time increases to 25.9 seconds per node;
– when run on 180 nodes, the average update time for the whole system (corresponding to the slowest node) is 228.0 seconds.
Discussion
• Instead of high-level virtualization, can some low-level virtualization be deployed in the PlanetLab OS?
– Performance?
– Scalability?
– Isolation?
– Security?
– Maturity?
Conclusion
• The PlanetLab OS supports distributed virtualization and unbundled management.
• The PlanetLab OS provides only local (per-node) abstractions, with as much global (network-wide) functionality as possible pushed onto infrastructure services running in their own slices.
Q&A
Scout & SILK
• Scout is a modular, configurable, communication-oriented operating system developed for small network appliances. Scout was designed around the needs of data-centric applications, with particular attention given to networking.
• SILK stands for Scout In the Linux Kernel, and is a port of the Scout operating system to run as a Linux kernel module.
Scout Design
• Early demultiplexing
– This allows the system to isolate flows as early as possible, in order to prioritize packet processing and accurately account for resources.
• Early dropping
– When flow queues are full, the server can avoid overload by dropping packets before investing many resources in them.
• Accounting
– Accounting of the resources used by each data flow, including CPU, memory, and bandwidth.
• Explicit scheduling
– Scheduling and accounting are combined to provide resource guarantees to flows.
• Extensibility
– This makes it easy to add new protocols and construct new network services.
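Early demultiplexing and early dropping combine naturally: a packet is classified to its flow immediately on arrival, and discarded at once when that flow's bounded queue is full, before any protocol processing is invested. A minimal sketch, with invented names and queue sizes:

```python
from collections import deque

class Flow:
    """A flow with a bounded queue and a drop counter."""
    def __init__(self, max_queue):
        self.queue = deque()
        self.max_queue = max_queue
        self.dropped = 0

    def enqueue(self, packet):
        if len(self.queue) >= self.max_queue:
            self.dropped += 1   # early drop: no further resources invested
            return False
        self.queue.append(packet)
        return True

def classify_and_enqueue(flows, flow_key, packet):
    """Early demultiplexing: pick the flow before any protocol processing."""
    return flows[flow_key].enqueue(packet)
```

Because the drop happens per flow, an overloaded flow sheds its own load without consuming the CPU and memory that fair service to other flows requires.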
SILK Network Path
QoS Crosstalk
• We would like to "reserve" some amount of resource (e.g., CPU time) for each application so it can meet its QoS requirements.
• Problem: not all CPU time is properly accounted for!
– E.g., when the TCP stack is processing incoming packets, the CPU time is being used by the kernel, not any specific application.
– The same goes for handling file system requests, disk I/O, page faults, etc.
• It is sometimes very hard to tell which application the OS is doing work for!
• Result: the activity of one application can impact the performance of another.
– This is called QoS crosstalk.
– Applications are not properly isolated from one another.
Linux Raw Socket
• The basic concept of low-level sockets is to send a single packet at a time, with all the protocol headers filled in by the program (instead of the kernel).
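"Filled in by the program" means constructing, for example, the 20-byte IPv4 header by hand, checksum included. A sketch with illustrative addresses; actually sending the result would additionally require a SOCK_RAW socket and root (or CAP_NET_RAW) privileges:

```python
import struct
import socket

def ip_checksum(header):
    """Standard Internet checksum: one's-complement sum of 16-bit words."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) + header[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)  # fold carries
    return ~total & 0xFFFF

def build_ipv4_header(src, dst, payload_len, proto=socket.IPPROTO_UDP):
    """Build a minimal 20-byte IPv4 header with a correct checksum."""
    ver_ihl = (4 << 4) | 5                # IPv4, 5 x 32-bit header words
    total_len = 20 + payload_len
    header = struct.pack("!BBHHHBBH4s4s",
                         ver_ihl, 0, total_len,
                         0, 0,            # identification, flags/fragment
                         64, proto, 0,    # TTL, protocol, checksum placeholder
                         socket.inet_aton(src), socket.inet_aton(dst))
    csum = ip_checksum(header)
    return header[:10] + struct.pack("!H", csum) + header[12:]
```

A handy property for checking the result: recomputing the checksum over a correctly checksummed header yields zero.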