Enabling Fast, Dynamic Network Processing with ClickOS
Joao Martins*, Mohamed Ahmed*, Costin Raiciu§, Roberto Bifulco*, Vladimir Olteanu§, Michio Honda*, Felipe Huici*
* NEC Labs Europe, Heidelberg, Germany
§ University Politehnica of Bucharest
The Idealized Network
[Diagram: two end hosts run the full five-layer stack (Physical, Datalink, Network, Transport, Application); intermediate nodes process only the lower layers, a router up to the Network layer and a link-layer device up to the Datalink layer]
A Middlebox World
[Diagram: a network cluttered with middleboxes: carrier-grade NAT, load balancer, DPI, QoE monitor, ad insertion, BRAS, session border controller, transcoder, WAN accelerator, DDoS protection, firewall, IDS]
Hardware Middleboxes - Drawbacks
▐ Middleboxes are useful, but…
– Expensive
– Difficult to add new features, lock-in
– Difficult to manage
– Cannot be scaled with demand
– Cannot share a device among different tenants
– Hard for new players to enter the market
▐ Clearly, shifting middlebox processing to a software-based, multi-tenant platform would address these issues
– But can it be built using commodity hardware while still achieving high performance?
▐ ClickOS: tiny Xen-based virtual machine that runs Click
Click Runtime
▐ Modular architecture for network processing
▐ Based around the concept of “elements”
▐ Elements are connected in a configuration file
▐ A configuration is installed via a command-line executable (e.g., click-install router.click)
▐ An element:
– Can be configured with parameters (e.g., Queue::length)
– Can expose read and write variables available via sockets or the /proc system under Linux (e.g., Counter::reset, Counter::count)
▐ Compiled 262/300 elements
▐ Programmers can write new ones to extend the Click runtime
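As an illustrative sketch of these concepts (not from the deck; the element names and filename are made up), a stand-alone configuration that counts packets and exposes Counter's handlers:

src :: TimedSource(0.1);   // emit one packet every 0.1 seconds
cnt :: Counter;            // exposes 'count' (read) and 'reset' (write) handlers
src -> cnt -> Discard;

Once installed (e.g., click-install counter.click), the handlers become accessible through the /proc-based interface, e.g. cat /click/cnt/count.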
A simple (Click-based) firewall example
in :: FromNetFront(DEVMAC 00:11:22:33:44:55, BURST 1024);
out :: ToNetFront(DEVMAC 00:11:22:33:44:55, BURST 1);
filter :: IPFilter(
allow src host 10.0.0.1 && dst host 10.1.0.1 && udp,
drop all);
in -> CheckIPHeader(14) -> filter;
filter[0] -> Print("allow") -> out;
filter[1] -> Print("drop") -> Discard();
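A note on the ports (not on the slide): in IPFilter, allow is shorthand for output port 0, and when the element has more than one output, packets matching a drop rule are emitted on the last port rather than silently discarded; this is why filter[1] can feed Print("drop") -> Discard() here.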
What's ClickOS?

[Diagram: a conventional Xen domU runs applications on a general-purpose guest OS over paravirtualized drivers; a ClickOS VM instead runs Click directly on MiniOS over the same paravirtualized interface]
▐ Work consisted of:
– Build system to create ClickOS images (5 MB in size)
– Emulating a Click control plane over MiniOS/Xen
– Reducing boot times (roughly 30 milliseconds)
– Optimizations to the data plane (10 Gb/s for almost all pkt sizes)
Performance analysis
[Diagram: baseline packet path. The driver domain (or dom0) runs the NW driver, a Linux/OVS bridge, a vif, and netback; the ClickOS domain runs netfront feeding Click's FromNetfront/ToNetfront elements. The two are connected through the Xen bus/store, an event channel, and the Xen ring API (data). Measured rates at successive stages of this path, for maximum-sized packets: 300 Kp/s, 350 Kp/s, and 225 Kp/s]
pkt size (bytes)   10 Gb/s line rate
64                 14.8 Mp/s
128                8.4 Mp/s
256                4.5 Mp/s
512                2.3 Mp/s
1024               1.2 Mp/s
1500               810 Kp/s
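For context (not on the slide): a minimum-sized 64-byte frame occupies 84 bytes on the wire once the 7-byte preamble, 1-byte start-of-frame delimiter, and 12-byte inter-frame gap are added, so line rate at 10 Gb/s is 10×10^9 / (84×8) ≈ 14.88 Mp/s. The rows up to 1024 bytes follow the same size + 20 formula; the 1500-byte row presumably also counts the 18-byte Ethernet header and CRC, since 1500 is the MTU rather than the frame size.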
Main issues
▐ Backend switches (bridge/openvswitch) are slow
▐ Copying pages between domains (grant copy) greatly affects packet I/O
– These copies are done in batches, but are still expensive
▐ Packet metadata (skb or mbuf) allocations
▐ MiniOS netfront is not as good as Linux's
– 225 Kpps vs. 430 Kpps TX
– only 8 Kpps RX
Optimizing Network I/O – Backend Switch
[Diagram: the Linux/OVS bridge in the driver domain (or dom0) is replaced by a VALE switch port, with the NW driver in netmap mode; netback still connects via the Xen bus/store, event channel, and Xen ring API (data) to the ClickOS domain's netfront and Click's FromNetfront/ToNetfront elements]
▐ Introduce VALE as the backend switch
– NIC switches to netmap-mode
▐ Slight modifications to the netback driver only
▐ Batch more I/O requests through multi-page rings
▐ Removed packet metadata manipulation
▐ 625 Kpps (1500-byte packets, 2.7x improvement) and 1.2 Mpps (64-byte packets, 4.2x improvement)
Background - Netmap
▐ Fast packet I/O framework
– 14.88 Mpps on 1 core at 900 MHz
▐ Available in FreeBSD 9+
– Also runs on Linux
▐ Minimal device driver modifications
– Critical resources (NIC registers, physical buffer addresses, and descriptors) not exposed to the user
– NIC works in special mode, bypassing the host stack
▐ Amortizes syscall cost by using large batches
▐ Packet buffers are preallocated and memory-mapped to userspace
Netmap: a novel framework for fast packet I/O
Luigi Rizzo, Università di Pisa
http://info.iet.unipi.it/~luigi/netmap/
Background - VALE Software Switch
▐ High performance switch based on netmap API (18 Mpps between virtual ports, one CPU core)
▐ Packet processing is “modular”
– Default as learning bridge
– Modules are independent kernel modules
▐ Applications use the netmap API
VALE, a Virtual Local Ethernet
Luigi Rizzo, Giuseppe Lettieri, Università di Pisa
http://info.iet.unipi.it/~luigi/vale/
Optimizing Network I/O
[Diagram: the netmap API now runs end-to-end. In the driver domain (or dom0), the NW driver feeds the VALE switch and netback is reduced to a control role; in the ClickOS domain, netfront maps the netmap rings and serves Click's FromNetfront/ToNetfront elements. The Xen bus/store and TX/RX event channels carry control and notifications, while data moves over the netmap API]
▐ No longer need the extra copy between domains
▐ Netmap rings (in the VALE switch) are mapped all the way to the guest
▐ An I/O request doesn't require a response to be consumed by the guest
▐ Event channels are used to proxy netmap operations from/to guest and VALE
▐ Breaks other (non-MiniOS) guests :(
– But we have implemented a netmap-based Linux netfront driver
Optimizing Network I/O – Initialization and Memory Usage

▐ Netmap buffers are contiguous pages in guest memory
▐ Buffers are 2 KB in size; each page fits 2 buffers
▐ The ring fits in 1 page for 64 and 128 slots (2+ pages for 256+ slots)

[Diagram: initialization sequence between netback/VALE in the driver domain and the netfront/app (netmap API) in the Mini-OS guest:
1. the backend opens the netmap device
2. registers a VALE port
3. ring/buffer pages are granted to the guest
4. ring grant refs are read from the xenstore; buffer refs are read from the mapped ring slots]

slots   KB (per ring)   # grants (per ring)
64      135             33
128     266             65
256     528             130
512     1056            259
1024    2117            516
2048    4231            1033
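An observation on the table (not on the slide): the grant counts match the buffer layout above. With 2 KB buffers, a ring's slots occupy slots/2 4 KB buffer pages plus the ring page(s), e.g. 64 slots → 32 buffer pages + 1 ring page = 33 grants, and 256 slots → 128 buffer pages + 2 ring pages = 130 grants.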
Optimizing Network I/O – Synchronization

[Diagram: the netfront/app in the Mini-OS guest and netback/VALE in Domain-0 share the mapped ring and its buffer slots; a "packets to transmit" notification from the guest and a "backend finished" notification from the backend travel over the TX event channel]
▐ As in a netmap application, the operation is done in the sender's context
▐ The backend's/frontend's private copy is not included in the shared ring page(s)
▐ Event channels used for synchronization
EVALUATION
ClickOS Base Performance
[Graphs: RX and TX throughput vs. packet size]
Intel Xeon E1220 4-core 3.2GHz, 16GB RAM, dual-port Intel x520 10Gb/s NIC. One CPU core assigned to VM, the rest to dom0
Scaling out – Multiple NICs/VMs
Intel Xeon E1650 6-core 3.2GHz, 16GB RAM, dual-port Intel x520 10Gb/s NIC. 3 cores assigned to VMs, 3 cores for dom0
Linux Guest Performance
ClickOS (virtualized) Middlebox Performance
ClickOS Delay vs. Other Systems
Conclusions
▐ Presented ClickOS:
– Tiny (5 MB) Xen VM tailored to network processing
– Can be booted (on demand) in 30 milliseconds
– Can achieve 10 Gb/s throughput using only a single core
– Can run a varied range of middleboxes with high throughput
▐ Future work:
– Improving performance on NUMA systems
– High consolidation of ClickOS VMs (thousands)
– Service chaining
MiniOS (pkt-gen) Performance
[Graphs: RX and TX throughput vs. packet size]
Scaling Out – Multiple VMs
[Graph: TX throughput as the number of VMs increases]
ClickOS VM and Middlebox Boot Time
[Graph: two measurements, 30 milliseconds (VM boot) and 220 milliseconds (VM boot plus middlebox instantiation)]