Deep Dive in Docker Overlay Networks
Laurent Bernaille (@lbernail)

Posted on 22-Jan-2018

TRANSCRIPT

Page 1: Deep Dive in Docker Overlay Networks


Deep dive in Docker Overlay Networks
Laurent Bernaille

@lbernail

Page 2: Deep Dive in Docker Overlay Networks

Agenda

• The Docker Overlay Network

– Getting started

– Under the hood

• Building our Overlay

– Starting from scratch

– Making it dynamic

Page 3: Deep Dive in Docker Overlay Networks

The Docker Overlay

Page 4: Deep Dive in Docker Overlay Networks

Environment

docker0 docker1

consul

10.0.0.10 10.0.0.11

10.0.0.5

dockerd -H fd:// --cluster-store=consul://consul0:8500 --cluster-advertise=eth0:2376

What is in consul? Not much for now, just a metadata tree.

Page 5: Deep Dive in Docker Overlay Networks

Let's create an Overlay Network

docker0:~$ docker network create --driver overlay \

--internal \

--subnet 192.168.0.0/24 linuxcon

c4305b67cda46c2ed96ef797e37aed14501944a1fe0096dacd1ddd8e05341381

docker1:~$ docker network ls

NETWORK ID NAME DRIVER SCOPE

bec777b6c1f1 bridge bridge local

c4305b67cda4 linuxcon overlay global

3a4e16893b16 host host local

c17c1808fb08 none null local

Page 6: Deep Dive in Docker Overlay Networks

Does it work?

docker0:~$ docker run -d --ip 192.168.0.100 --net linuxcon --name C0 debian sleep infinity

docker1:~$ docker run --net linuxcon debian ping 192.168.0.100

PING 192.168.0.100 (192.168.0.100): 56 data bytes

64 bytes from 192.168.0.100: seq=0 ttl=64 time=1.153 ms

64 bytes from 192.168.0.100: seq=1 ttl=64 time=0.807 ms

docker1:~$ ping 192.168.0.100

PING 192.168.0.100 (192.168.0.100) 56(84) bytes of data.

^C--- 192.168.0.100 ping statistics ---

4 packets transmitted, 0 received, 100% packet loss, time 3024ms

Page 7: Deep Dive in Docker Overlay Networks

What did we build?

[Diagram: C0 (192.168.0.100) on docker0 (10.0.0.10) and C1 (192.168.0.Y) on docker1 (10.0.0.11), joined by the overlay network; consul stores the cluster metadata. The ping travels over the overlay, not directly between the hosts' eth0 interfaces.]

Page 8: Deep Dive in Docker Overlay Networks

The Docker Overlay
Under the hood

Page 9: Deep Dive in Docker Overlay Networks

How does it work? Let's look inside containers

docker0:~$ docker exec C0 ip addr show

58: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP

inet 192.168.0.100/24 scope global eth0

docker0:~$ docker exec C0 ip -details link show dev eth0

58: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default

veth

Page 10: Deep Dive in Docker Overlay Networks

Container network configuration

[Diagram: inside each host, the container namespace (C0 with eth0 192.168.0.100, C1 with eth0 192.168.0.Y) holds one end of a veth pair; consul stores the metadata; the ping flows between the container namespaces.]

Page 11: Deep Dive in Docker Overlay Networks

Where is the other end of the veth?

docker0:~$ ip link show >> Nothing, it must be in another Namespace

docker0:~$ sudo ls -l /var/run/docker/netns

8-c4305b67cd

docker0:~$ docker network inspect linuxcon -f {{.Id}}

c4305b67cda46c2ed96ef797e37aed14501944a1fe0096dacd1ddd8e05341381

docker0:~$ overns=/var/run/docker/netns/8-c4305b67cd

docker0:~$ sudo nsenter --net=$overns ip -d link show

2: br0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1450 qdisc noqueue state UP mode DEFAULT group default

bridge

62: vxlan1: <..> mtu 1450 qdisc noqueue master br0 state UNKNOWN mode DEFAULT group default

vxlan id 256 srcport 10240 65535 dstport 4789 proxy l2miss l3miss ageing 300

59: veth2: <...> mtu 1450 qdisc noqueue master br0 state UP mode DEFAULT group default

Page 12: Deep Dive in Docker Overlay Networks

Update on connectivity

[Diagram: inside each host, a dedicated overlay namespace contains a bridge br0 with two attached interfaces: a vxlan interface and the veth peer of the container's eth0; the vxlan interfaces tunnel between the hosts over the consul-backed overlay.]

Page 13: Deep Dive in Docker Overlay Networks

What is VXLAN?

• Tunneling technology over UDP (L2 in UDP)

• Developed for cloud SDN to create multi-tenancy

• Without the need for L2 connectivity

• Without the normal VLAN limit (4096 VLAN Ids)

• Easy to encrypt: IPSEC

• Overhead: 50 bytes

• In Linux

• Started with Open vSwitch

• Native with Kernel >= 3.7 and >=3.16 for Namespace support

VXLAN: Virtual eXtensible LAN
VNI: VXLAN Network Identifier
VTEP: VXLAN Tunnel Endpoint

Encapsulation: Outer IP packet | UDP (dst: 4789) | VXLAN Header | Original L2 frame
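The 50-byte overhead and the 1450-byte MTU seen on the container interfaces are two views of the same arithmetic. A quick sketch, assuming plain IPv4 with no options and no VLAN tags:

```python
# Byte counts for VXLAN encapsulation over IPv4 (no options, no VLAN tags).
OUTER_ETHERNET = 14  # outer dst MAC + src MAC + ethertype
OUTER_IPV4 = 20      # minimal IPv4 header
OUTER_UDP = 8        # src port, dst port (4789), length, checksum
VXLAN = 8            # flags, 24-bit VNI, reserved bits
INNER_ETHERNET = 14  # the original frame's own Ethernet header

# Total bytes added in front of the original L2 frame: the "50 bytes" above.
overhead = OUTER_ETHERNET + OUTER_IPV4 + OUTER_UDP + VXLAN
print(overhead)  # 50

# The 1450 MTU on the container interfaces: the outer Ethernet header does
# not count against the underlay's 1500-byte IP MTU, but the inner one does.
overlay_mtu = 1500 - (OUTER_IPV4 + OUTER_UDP + VXLAN + INNER_ETHERNET)
print(overlay_mtu)  # 1450
```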

Page 14: Deep Dive in Docker Overlay Networks

Let's have a look

docker0:~$ sudo tcpdump -nn -i eth0 "port 4789"

docker1:~$ docker run -it --rm --net linuxcon debian ping 192.168.0.100

PING 192.168.0.100 (192.168.0.100): 56 data bytes

64 bytes from 192.168.0.100: seq=0 ttl=64 time=1.153 ms

64 bytes from 192.168.0.100: seq=1 ttl=64 time=0.807 ms

docker0:~$

13:35:12.796941 IP 10.0.0.11.60916 > 10.0.0.10.4789: VXLAN, flags [I] (0x08), vni 256

IP 192.168.0.2 > 192.168.0.100: ICMP echo request, id 1, seq 0, length 64

13:35:12.797035 IP 10.0.0.10.54953 > 10.0.0.11.4789: VXLAN, flags [I] (0x08), vni 256

IP 192.168.0.100 > 192.168.0.2: ICMP echo reply, id 1, seq 0, length 64
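The capture shows flags [I] (0x08) and vni 256. For illustration, here is how the 8-byte VXLAN header behind that tcpdump line can be decoded by hand (a minimal sketch; `parse_vxlan_header` is a hypothetical helper, not part of tcpdump or Docker):

```python
import struct

def parse_vxlan_header(payload: bytes) -> dict:
    """Decode the 8-byte VXLAN header at the start of a UDP payload.

    Layout: 8 flag bits (0x08 = 'I', meaning the VNI is valid),
    24 reserved bits, 24-bit VNI, 8 reserved bits.
    """
    word1, word2 = struct.unpack("!II", payload[:8])
    return {"i_flag": bool((word1 >> 24) & 0x08), "vni": word2 >> 8}

# The header behind the capture above: flags [I] (0x08), vni 256
header = bytes([0x08, 0x00, 0x00, 0x00, 0x00, 0x01, 0x00, 0x00])
print(parse_vxlan_header(header))  # {'i_flag': True, 'vni': 256}
```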

Page 15: Deep Dive in Docker Overlay Networks

Full connectivity with VXLAN

[Diagram: same topology as the previous overview (br0 + vxlan + veth in the overlay namespace on each host), with the encapsulated ping shown on the wire:]

IP src: 10.0.0.11, dst: 10.0.0.10
UDP src: X, dst: 4789
VXLAN Header
Original L2 src: 192.168.0.Y, dst: 192.168.0.100

Page 16: Deep Dive in Docker Overlay Networks

How do containers find each other?

• VXLAN Data plane

– Sending data between hosts

– Tunneling using UDP

• VXLAN Control plane

– Distribution of VXLAN endpoints ("VTEP")

– Distribution of MAC to VTEP mappings

– ARP offloading (optional; lets the VTEP answer ARP locally instead of flooding it over the overlay)

Page 17: Deep Dive in Docker Overlay Networks

VXLAN Control Plane - Option 1: Multicast

[Diagram: three hosts whose vxlan interfaces join a multicast group (239.x.x.x); ARP ("Who has 192.168.0.2?") and L2 discovery ("Where is 02:42:c0:a8:00:02?") queries for unknown addresses are sent to the group.]

Use a multicast group to send traffic for unknown L3/L2 addresses

• PROS: simple and efficient

• CONS: Multicast connectivity not always available (on public clouds for instance)

Page 18: Deep Dive in Docker Overlay Networks

VXLAN Control Plane- Option 2: Point-to-point

[Diagram: two hosts whose vxlan interfaces are configured with each other's address as remote IP; traffic for unknown addresses is sent point-to-point to that remote.]

Configure a remote IP address where to send traffic for unknown addresses

• PROS: simple, no need for multicast, works well for two hosts

• CONS: difficult to manage with more than 2 hosts

Page 19: Deep Dive in Docker Overlay Networks

VXLAN Control Plane- Option 3: User-land

[Diagram: each host runs a daemon next to its vxlan interface; the daemon answers ARP ("MAC address of 192.168.0.2?") and L2 ("Which VTEP/host has 02:42:c0:a8:00:02?") misses by modifying the ARP/FDB tables.]

Do nothing, provide ARP / FDB information from outside

• PROS: very flexible

• CONS: requires a daemon and a centralized database of addresses

Page 20: Deep Dive in Docker Overlay Networks

How is it done by Docker?

docker0:~$ sudo nsenter --net=$overns ip neighbor show

docker0:~$ sudo nsenter --net=$overns bridge fdb show

docker1:~$ docker run -d --ip 192.168.0.200 --net linuxcon --name C1 debian sleep infinity

docker0:~$ sudo nsenter --net=$overns ip neighbor show

192.168.0.200 dev vxlan0 lladdr 02:42:c0:a8:00:c8 PERMANENT

docker0:~$ sudo nsenter --net=$overns bridge fdb show

02:42:c0:a8:00:c8 dev vxlan0 dst 10.0.0.11 self permanent

Page 21: Deep Dive in Docker Overlay Networks

Where is this information stored?

docker0:~$ net=$(docker network inspect linuxcon -f {{.Id}})

docker0:~$ curl -s http://consul:8500/v1/kv/docker/network/v1.0/network/${net}/

docker0:~$ python/dump_endpoints.py

Endpoint Name: C1

IP address: 192.168.0.200/24

MAC address: 02:42:c0:a8:00:c8

Locator: 10.0.0.11

Endpoint Name: C0

IP address: 192.168.0.100/24

MAC address: 02:42:c0:a8:00:64

Locator: 10.0.0.10
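A detail worth noticing in these endpoints: each MAC address is just 02:42 followed by the four bytes of the container's IP (c0:a8:00:c8 is 192.168.0.200). A small sketch of that observed convention (`overlay_mac` is an illustrative helper, not a Docker API):

```python
import ipaddress

def overlay_mac(ip: str) -> str:
    """Rebuild the MAC Docker derives for an overlay endpoint:
    the fixed prefix 02:42 followed by the four IPv4 address bytes.
    (An observed convention in the outputs above, not a documented API.)
    """
    return "02:42:" + ":".join(f"{b:02x}" for b in ipaddress.IPv4Address(ip).packed)

print(overlay_mac("192.168.0.100"))  # 02:42:c0:a8:00:64  (C0)
print(overlay_mac("192.168.0.200"))  # 02:42:c0:a8:00:c8  (C1)
```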

Page 22: Deep Dive in Docker Overlay Networks

How is it distributed?

docker0:~$ serf agent -join 10.0.0.10:7946 -node demo -event-handler=./serf.sh

docker1:~$ docker run -d --net linuxcon debian sleep infinity

docker1:~$ docker rm -f $(docker ps -aq)

docker0:~$

New event: user

join 192.168.0.2 255.255.255.0 02:42:c0:a8:00:02

New event: user

leave 192.168.0.2 255.255.255.0 02:42:c0:a8:00:02

New event: user

leave 192.168.0.200 255.255.255.0 02:42:c0:a8:00:c8
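Each user event carries a payload of the form "&lt;ip&gt; &lt;netmask&gt; &lt;mac&gt;". A handler like the serf.sh script above only has to split those fields; sketched in Python for illustration (`parse_member_event` is a hypothetical name):

```python
def parse_member_event(event_type: str, payload: str) -> dict:
    """Split a join/leave payload ("<ip> <netmask> <mac>") into fields.
    (Illustrative helper mirroring the event payloads printed above.)
    """
    ip, netmask, mac = payload.split()
    return {"event": event_type, "ip": ip, "netmask": netmask, "mac": mac}

event = parse_member_event("join", "192.168.0.2 255.255.255.0 02:42:c0:a8:00:02")
print(event["mac"])  # 02:42:c0:a8:00:02
```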

Page 23: Deep Dive in Docker Overlay Networks

Overview

[Diagram: the complete picture: each host's dockerd populates the ARP and FDB tables of the overlay namespace (br0 + vxlan + veth); endpoint data is stored in consul and propagated between daemons via Serf/Gossip. The encapsulated ping on the wire:]

IP src: 10.0.0.11, dst: 10.0.0.10
UDP src: X, dst: 4789
VXLAN Header
Original L2 src: 192.168.0.Y, dst: 192.168.0.100

Page 24: Deep Dive in Docker Overlay Networks

Building our Overlay
From scratch

Page 25: Deep Dive in Docker Overlay Networks

Clean up

docker0:~$ docker rm -f $(docker ps -aq)

docker0:~$ docker network rm linuxcon

docker1:~$ docker rm -f $(docker ps -aq)

Page 26: Deep Dive in Docker Overlay Networks

Start from scratch

docker0 docker1

10.0.0.10 10.0.0.11

Page 27: Deep Dive in Docker Overlay Networks

Step 1: Overlay Namespace

[Diagram: on each host (docker0 10.0.0.10, docker1 10.0.0.11), a dedicated namespace containing a bridge br42 with a vxlan42 interface attached; eth0 carries the tunnel traffic between the hosts.]

Page 28: Deep Dive in Docker Overlay Networks

Creating the Overlay Namespace

ip netns add overns

ip netns exec overns ip link add dev br42 type bridge

ip netns exec overns ip addr add dev br42 192.168.0.1/24

ip link add dev vxlan42 type vxlan id 42 proxy dstport 4789

ip link set vxlan42 netns overns

ip netns exec overns ip link set vxlan42 master br42

ip netns exec overns ip link set vxlan42 up

ip netns exec overns ip link set br42 up

Create overlay NS
Create bridge in NS
Create VXLAN interface, move it to NS, add it to bridge
Bring all interfaces up

(setup_vxlan script)

Page 29: Deep Dive in Docker Overlay Networks

Step 2: Attach containers

[Diagram: same as step 1, plus a container namespace on each host (eth0 192.168.0.10 in C0 on docker0, eth0 192.168.0.20 in C1 on docker1) connected to br42 by a veth pair.]

Page 30: Deep Dive in Docker Overlay Networks

Create containers and attach them

docker0:

docker run -d --net=none --name=demo debian sleep infinity

ctn_ns_path=$(docker inspect --format="{{ .NetworkSettings.SandboxKey}}" demo)

ctn_ns=${ctn_ns_path##*/}

ip link add dev veth1 mtu 1450 type veth peer name veth2 mtu 1450

ip link set dev veth1 netns overns

ip netns exec overns ip link set veth1 master br42

ip netns exec overns ip link set veth1 up

ip link set dev veth2 netns $ctn_ns

ip netns exec $ctn_ns ip link set dev veth2 name eth0 address 02:42:c0:a8:00:10

ip netns exec $ctn_ns ip addr add dev eth0 192.168.0.10/24

ip netns exec $ctn_ns ip link set dev eth0 up

docker1

Same with 192.168.0.20 / 02:42:c0:a8:00:20

Create container without network
Get NS for container
Create veth pair
Send veth1 to overlay NS, attach it to overlay bridge
Send veth2 to container, rename & configure

(plumb script)

Page 31: Deep Dive in Docker Overlay Networks

Does it ping?

docker0:~$ docker exec -it demo ping 192.168.0.20

PING 192.168.0.20 (192.168.0.20): 56 data bytes

92 bytes from 192.168.0.10: Destination Host Unreachable

docker0:~$ sudo ip netns exec overns ip neighbor show

docker0:~$ sudo ip netns exec overns ip neighbor add 192.168.0.20 lladdr 02:42:c0:a8:00:20 dev vxlan42

docker0:~$ sudo ip netns exec overns bridge fdb add 02:42:c0:a8:00:20 dev vxlan42 self dst 10.0.0.11 \

vni 42 port 4789

docker1: Same with 192.168.0.10, 02:42:c0:a8:00:10 and 10.0.0.10

Page 32: Deep Dive in Docker Overlay Networks

Result

[Diagram: the hand-built overlay: br42 + vxlan42 + veth on each host, container namespaces at 192.168.0.10 and 192.168.0.20, and manually populated ARP and FDB tables; the ping now succeeds.]

Page 33: Deep Dive in Docker Overlay Networks

Building our Overlay
Making it dynamic

Page 34: Deep Dive in Docker Overlay Networks

Catching network events: NETLINK

• Kernel interface for communication between Kernel and userspace

• Designed to transfer networking info (used by iproute2)

• Several protocols

– NETLINK_ROUTE

– NETLINK_FIREWALL

• Several notification types, for NETLINK_ROUTE for instance:

– LINK

– NEIGHBOR

• Many events

– LINK: NEWLINK, GETLINK

– NEIGHBOR: GETNEIGH <= information on ARP, L2 discovery queries
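A listener for these events reads raw messages from a NETLINK_ROUTE socket; every message starts with a fixed 16-byte struct nlmsghdr, and filtering on GETNEIGH is just a check of the type field. A decoding sketch (pure parsing, no socket; `parse_nlmsghdr` is an illustrative helper):

```python
import struct

RTM_GETNEIGH = 30  # message type constant from <linux/rtnetlink.h>

def parse_nlmsghdr(buf: bytes) -> dict:
    """Decode struct nlmsghdr: u32 length, u16 type, u16 flags, u32 seq, u32 pid.
    Netlink uses host byte order, hence '=' (native order, no padding).
    """
    length, msg_type, flags, seq, pid = struct.unpack("=IHHII", buf[:16])
    return {"len": length, "type": msg_type, "flags": flags, "seq": seq,
            "pid": pid, "is_getneigh": msg_type == RTM_GETNEIGH}

# A crafted 16-byte header announcing a GETNEIGH message:
raw = struct.pack("=IHHII", 16, RTM_GETNEIGH, 0, 1, 0)
print(parse_nlmsghdr(raw)["is_getneigh"])  # True
```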

Page 35: Deep Dive in Docker Overlay Networks

Using ip monitor

docker0:~$ ip monitor link

docker0:~$ sudo ip link add dev veth1 type veth peer name veth2

32: veth2: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default

link/ether b6:95:d6:b4:21:e9 brd ff:ff:ff:ff:ff:ff

33: veth1: <BROADCAST,MULTICAST> mtu 1500 qdisc noop state DOWN group default

link/ether a6:e0:7a:da:a9:ea brd ff:ff:ff:ff:ff:ff

docker0:~$ ip monitor route

docker0:~$ sudo ip route add 8.8.8.8 via 10.0.0.1

8.8.8.8 via 10.0.0.1 dev eth0

Page 36: Deep Dive in Docker Overlay Networks

What about neighbor events?

docker0:~$ echo 1 | sudo tee -a /proc/sys/net/ipv4/neigh/eth0/app_solicit

docker0:~$ ip monitor neigh

docker0:~$ ping 10.0.0.100

10.0.0.100 dev eth0 FAILED

app_solicit: generate Netlink message on L2/L3 miss

Page 37: Deep Dive in Docker Overlay Networks

With containers

docker0:~$ sudo ip netns del overns

docker0:~$ sudo ./setup_vxlan 42 overns proxy l2miss l3miss dstport 4789

docker0:~$ sudo ./plumb br42@overns demo 192.168.0.10/24 02:42:c0:a8:00:10

docker0:~$ docker exec demo ip monitor neigh

docker0:~$ docker exec demo ping 192.168.0.20

192.168.0.20 dev eth0 FAILED

Retest from the overns namespace:

docker0:~$ sudo ip netns exec overns ip monitor neigh

miss 192.168.0.20 dev vxlan42 STALE

Add the ARP entry and retry:

miss dev vxlan42 lladdr 02:42:c0:a8:00:20 STALE

Add the FDB entry => ping OK

l2miss / l3miss: generate Netlink messages from the VXLAN interface

Page 38: Deep Dive in Docker Overlay Networks

Using Netlink & Consul to dynamically find containers

listen to Netlink events in overns Namespace

only act on GETNEIGH events

If l3miss: look up the ARP entry in consul and add the neighbor info

If l2miss: look up the MAC's location in consul and add the FDB info
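The steps above can be sketched as a pure decision function, with the consul lookup abstracted as a dict and the returned strings standing in for the ip neighbor add / bridge fdb add commands used earlier (illustrative code, not the actual arpd-consul.py):

```python
def handle_miss(miss_type: str, key: str, catalog: dict):
    """Map an l3miss/l2miss to the command that repairs the overlay tables.

    catalog stands in for the consul data: IP -> MAC entries answer l3
    misses, MAC -> VTEP (host IP) entries answer l2 misses.
    Returns None when the address is unknown (nothing to populate).
    """
    value = catalog.get(key)
    if value is None:
        return None
    if miss_type == "l3miss":  # kernel asked: who has this IP?
        return f"ip neighbor add {key} lladdr {value} dev vxlan42"
    if miss_type == "l2miss":  # kernel asked: which VTEP has this MAC?
        return f"bridge fdb add {key} dev vxlan42 self dst {value} vni 42 port 4789"
    return None

catalog = {
    "192.168.0.20": "02:42:c0:a8:00:20",  # ARP: IP -> MAC
    "02:42:c0:a8:00:20": "10.0.0.11",     # FDB: MAC -> host
}
print(handle_miss("l3miss", "192.168.0.20", catalog))
print(handle_miss("l2miss", "02:42:c0:a8:00:20", catalog))
```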

Page 39: Deep Dive in Docker Overlay Networks

Let's try!

Clean up:

docker0:~$ sudo ip netns del overns

docker0:~$ sudo ./setup_vxlan 42 overns proxy l2miss l3miss dstport 4789

docker0:~$ sudo ./plumb br42@overns demo 192.168.0.10/24 02:42:c0:a8:00:10

Add data to consul

Test

docker0:~$ sudo python/arpd-consul.py

docker0:~$ docker exec -it demo ping 192.168.0.20

INFO Starting new HTTP connection (1): consul1

INFO L3Miss on vxlan42: Who has IP: 192.168.0.20?

INFO Populating ARP table from Consul: IP 192.168.0.20 is 02:42:c0:a8:00:20

INFO L2Miss on vxlan42: Who has Mac Address: 02:42:c0:a8:00:20?

INFO Populating FIB table from Consul: MAC 02:42:c0:a8:00:20 is on host 10.0.0.11

Page 40: Deep Dive in Docker Overlay Networks

Overview

[Diagram: the final setup: on each host, the overlay namespace (br42 + vxlan42 + veth to the container) emits GETNEIGH Netlink events on l2/l3 miss; a daemon listens for these events, looks up the answers in consul, and populates the ARP and FDB tables.]

Page 41: Deep Dive in Docker Overlay Networks

Thank you! Questions?

• Commands / code on github

https://github.com/lbernail/dockercon2017

• Recorded at Dockercon Austin (a few improvements today)

• Detailed blog post

http://techblog.d2-si.eu/2017/04/25/deep-dive-into-docker-overlay-networks-part-1.html

• Do not hesitate to ping me on twitter

@lbernail
