PHD 2013, Lyamin: High packet rates on x86-64, reaching the 14.88 Mpps mark


High packet rates on x86-64: reaching the 14.88 Mpps mark

<la@highloadlab.com>

01.2012 - 05.2013: 7594 incidents total

[Chart: share of incidents by month, 2012-01 through 2013-05, 0-100%; legend: spoof vs. full-connect]

What's trendy?

• UDP flood and amplification
• TCP ( SYN (open|closed|firewalled) | ACK )
• ICMP flood (smurf)

L7 is out of style

Who's to blame?

Damn aliens:

static unsigned int tcp_timeouts[TCP_CONNTRACK_TIMEOUT_MAX] __read_mostly = {
    [TCP_CONNTRACK_SYN_SENT]    = 2 MINS,
    [TCP_CONNTRACK_SYN_RECV]    = 60 SECS,
    [TCP_CONNTRACK_ESTABLISHED] = 5 DAYS,
    [TCP_CONNTRACK_FIN_WAIT]    = 2 MINS,
    [TCP_CONNTRACK_CLOSE_WAIT]  = 60 SECS,
    [TCP_CONNTRACK_LAST_ACK]    = 30 SECS,
    [TCP_CONNTRACK_TIME_WAIT]   = 2 MINS,
    [TCP_CONNTRACK_CLOSE]       = 10 SECS,
    [TCP_CONNTRACK_SYN_SENT2]   = 2 MINS,
/* RFC1122 says the R2 limit should be at least 100 seconds.
   Linux uses 15 packets as limit, which corresponds
   to ~13-30min depending on RTO. */
    [TCP_CONNTRACK_RETRANS]     = 5 MINS,
    [TCP_CONNTRACK_UNACK]       = 5 MINS,
};
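A back-of-the-envelope check shows why these defaults hurt under flood. The flood rate below is an assumed figure for illustration, not from the slides; even the short SYN_RECV timeout forces conntrack to hold enormous state:

```shell
# Sketch: with TCP_CONNTRACK_SYN_RECV at 60 seconds, a spoofed SYN flood
# keeps (timeout * rate) half-open entries alive in conntrack at any moment.
syn_recv_timeout=60      # seconds, from tcp_timeouts[] above
flood_rate=100000        # spoofed SYNs per second (assumption)
entries=$((syn_recv_timeout * flood_rate))
echo "live conntrack entries needed: $entries"
```

Six million live entries from a modest 100 kpps flood: the table and its locks choke long before the NIC does.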

Who else is to blame?

top - 08:16:23 up 39 min,  1 user,  load average: 0.44, 0.16, 0.79
Tasks: 158 total,   2 running, 156 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 89.3%id,  0.0%wa,  0.0%hi, 10.7%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.0%sy,  0.0%ni, 49.8%id,  0.0%wa,  0.0%hi, 50.2%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  1.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32921100k total,  4598792k used, 28322308k free,    15496k buffers
Swap:        0k total,        0k used,        0k free,    83252k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  39 root      20   0     0    0    0 R  100  0.0   0:27.91 [ksoftirqd/8]
1401 root      20   0     0    0    0 S    8  0.0   0:03.05 [kpktgend_8]
5346 root      20   0     0    0    0 S    2  0.0   0:00.34 [kworker/8:0]
5740 root      20   0 19356 1472 1076 R    1  0.0   0:00.12 top

Who else is to blame?

YOU (forgot to tune the network stack)

A spherical server in a vacuum

Intel(R) Xeon(R) CPU E5-2670 x2
X520-DA2 (Intel® 82599ES)

Vanilla Linux 3.7.9

modprobe ixgbe (3.13.10), RSS=8

What can be done?

AFFINITY > BALANCER

% /etc/init.d/irqbalance stop

% grep eth8 /proc/interrupts
 123:   19    0    0    0    0    0    0    0    1    0    0    0    0    0    0    0  PCI-MSI-edge  eth8-TxRx-0
 124:    0   15    0    0    0    0    0    0    1    0    0    0    0    0    0    0  PCI-MSI-edge  eth8-TxRx-1
 125:    0    0   15    0    0    0    0    0    1    0    0    0    0    0    0    0  PCI-MSI-edge  eth8-TxRx-2
 126:    0    0    0   15    0    0    0    0    1    0    0    0    0    0    0    0  PCI-MSI-edge  eth8-TxRx-3
 127:    0    0    0    0   15    0    0    0    1    0    0    0    0    0    0    0  PCI-MSI-edge  eth8-TxRx-4
 128:    0    0    0    0    0   15    0    0    1    0    0    0    0    0    0    0  PCI-MSI-edge  eth8-TxRx-5
 129:    0    0    0    0    0    0   17    0    1    0    0    0    0    0    0    0  PCI-MSI-edge  eth8-TxRx-6
 130:    0    0    0    0    0    0    0   15    1    0    0    0    0    0    0    0  PCI-MSI-edge  eth8-TxRx-7
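Pinning the queue IRQs can be scripted rather than done by hand. A minimal sketch, assuming the queue IRQs are numbered consecutively as above and that queue i should land on CPU i; the one-hot affinity mask arithmetic is the part worth getting right:

```shell
# Compute the one-hot smp_affinity mask for a given CPU number. In real use
# the mask is written to /proc/irq/<irq>/smp_affinity (requires root), so
# this sketch only prints the commands instead of executing them.
cpu_mask() {
    printf '%x' $((1 << $1))
}

irq=123                  # first eth8-TxRx IRQ from /proc/interrupts above
for cpu in 0 1 2 3 4 5 6 7; do
    echo "echo $(cpu_mask $cpu) > /proc/irq/$irq/smp_affinity"
    irq=$((irq + 1))
done
```

With one queue per core, each softirq stays on the CPU whose cache already holds that queue's ring and flow state.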

Better?

top - 07:40:25 up 3 min,  1 user,  load average: 4.61, 1.29, 0.44
Tasks: 164 total,   9 running, 155 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni, 49.8%id,  0.0%wa,  0.0%hi, 50.2%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni, 49.8%id,  0.0%wa,  0.0%hi, 50.2%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni, 49.8%id,  0.0%wa,  0.0%hi, 50.2%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni, 49.8%id,  0.0%wa,  0.0%hi, 50.2%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni, 49.8%id,  0.0%wa,  0.0%hi, 50.2%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni, 49.8%id,  0.0%wa,  0.0%hi, 50.2%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni, 49.8%id,  0.0%wa,  0.0%hi, 50.2%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni, 49.8%id,  0.0%wa,  0.0%hi, 50.2%si,  0.0%st
Cpu8  :  0.0%us,  0.0%sy,  0.0%ni, 90.2%id,  0.0%wa,  0.0%hi,  9.8%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  1.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32921100k total,  4597288k used, 28323812k free,    15340k buffers
Swap:        0k total,        0k used,        0k free,    83240k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
  15 root      20   0     0    0    0 R   96  0.0   0:46.06 [ksoftirqd/2]
  23 root      20   0     0    0    0 R   96  0.0   0:46.04 [ksoftirqd/4]
  11 root      20   0     0    0    0 R   95  0.0   0:46.04 [ksoftirqd/1]
  19 root      20   0     0    0    0 R   95  0.0   0:46.03 [ksoftirqd/3]
  27 root      20   0     0    0    0 R   95  0.0   0:46.02 [ksoftirqd/5]
  31 root      20   0     0    0    0 R   95  0.0   0:46.08 [ksoftirqd/6]
  35 root      20   0     0    0    0 R   95  0.0   0:46.04 [ksoftirqd/7]
   3 root      20   0     0    0    0 R   93  0.0   0:45.23 [ksoftirqd/0]

Even better?

# ethtool -K eth8 ntuple on
# ethtool -U eth8 flow-type udp4 action -1
Added rule with ID 8189
# ethtool -u eth8
8 RX rings available
Total 1 rules
Filter: 8189
  Rule Type: UDP over IPv4
  Src IP addr: 0.0.0.0 mask: 255.255.255.255
  Dest IP addr: 0.0.0.0 mask: 255.255.255.255
  TOS: 0x0 mask: 0xff
  Src port: 0 mask: 0xffff
  Dest port: 0 mask: 0xffff
  VLAN EtherType: 0x0 mask: 0xffff
  VLAN: 0x0 mask: 0xffff
  User defined: 0x0 mask: 0xffffffffffffffff
  Action: Drop
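The rule above drops every udp4 packet in hardware; ethtool -U also accepts narrower matches. A hypothetical variant (the port number is an arbitrary example, not from the talk) that drops only one destination port, built as a string so it can be inspected before running:

```shell
# Build an ethtool -U invocation that hardware-drops UDP to a single
# destination port (action -1 means drop in ethtool's ntuple syntax).
drop_udp_port_cmd() {
    echo "ethtool -U $1 flow-type udp4 dst-port $2 action -1"
}

drop_udp_port_cmd eth8 123
```

This keeps legitimate UDP flowing while the flood target port dies in the NIC, before a single CPU cycle is spent.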

Even better! (~14.88 Mpps of UDP at this point)

Tasks: 163 total,   1 running, 162 sleeping,   0 stopped,   0 zombie
Cpu0  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu1  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu3  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu5  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu7  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu9  :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 :  1.0%us,  0.0%sy,  0.0%ni, 99.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu11 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu13 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu15 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  32921100k total,  4374344k used, 28546756k free,     7700k buffers
Swap:        0k total,        0k used,        0k free,    24036k cached

 PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
4348 root      20   0 19356 1476 1076 R    1  0.0   0:00.03 top
   1 root      20   0  4120  688  588 S    0  0.0   0:01.22 init [3]
   2 root      20   0     0    0    0 S    0  0.0   0:00.00 [kthreadd]

Say hello to Flow Director

The flow director filters identify specific flows or sets of flows and route them to specific queues. The flow director filters are programmed via FDIRCTRL and the other FDIR registers. The 82599 shares the Rx packet buffer for the storage of these filters.

What Flow Director can do

• Perfect match filters — The hardware checks for a match between the masked fields of the received packets and the programmed filters. Masked fields should be programmed as zeros in the filter context. The 82599 supports up to 8K-2 perfect match filters.

• Signature filters — The hardware checks for a match between hash-based signatures of the masked fields of the received packet and of the programmed filters. The 82599 supports up to 32K-2 signature filters.

What perfect filters can match (instantaneously)

• VLAN
• proto
• src_ip/mask
• src_port
• dst_ip/mask
• dst_port
• Flexible 2-byte tuple anywhere in the first 64 bytes of the packet (FRAME!)

Not so perfect

(Flow Director's birth defects)

• They eat RX buffer memory (256/512)
• No IF-THEN logic
• Masks are GLOBAL for signature filters
• 64 bytes is painfully little
• Supported by ethtool (perfect filters, buggy) and PF_RING (signature only)

But even for that: SPASIBO, Intel!

Flex Filters

(Stillborn leftovers of the RSS implementation)

• 128 bytes of the packet (FRAME!)
• 6 filters
• Briefly switch off during (R|W) access
• No publicly available userland configuration tool

How to deal with TCP SYN?

• SYN without a Seq Number
• SYN without MSS
• …and other blunders where a signature can be derived from the first 128 bytes

What about a perfect TCP SYN?

It hurts to die at a mere 400 kPPS…

Post mortem

# ========
# Samples: 19K of event 'cycles'
# Event count (approx.): 12923232073
#
# Overhead  Command      Shared Object      Symbol
# ........  ...........  .................  ..................
    78.74%  ksoftirqd/0  [kernel.kallsyms]  [k] _raw_spin_lock
            |
            --- _raw_spin_lock
                |
                |--98.84%-- tcp_v4_rcv
                |           ip_local_deliver_finish
                |           ip_local_deliver
                |           ip_rcv_finish
                |           ip_rcv
                |           __netif_receive_skb
                |           netif_receive_skb
                |           napi_skb_finish
                |           napi_gro_receive
                |           0xffffffffa005c134
                |           net_rx_action
                |           __do_softirq
                |           run_ksoftirqd
                |           smpboot_thread_fn
                |           kthread
                |           ret_from_fork

net/ipv4/tcp_ipv4.c

process:
	if (sk->sk_state == TCP_TIME_WAIT)
		goto do_time_wait;

	if (unlikely(iph->ttl < inet_sk(sk)->min_ttl)) {
		NET_INC_STATS_BH(net, LINUX_MIB_TCPMINTTLDROP);
		goto discard_and_relse;
	}

	if (!xfrm4_policy_check(sk, XFRM_POLICY_IN, skb))
		goto discard_and_relse;
	nf_reset(skb);

	if (sk_filter(sk, skb))
		goto discard_and_relse;

	skb->dev = NULL;

	bh_lock_sock_nested(sk);
	ret = 0;
	if (!sock_owned_by_user(sk)) {

		[dd]

	}
	bh_unlock_sock(sk);
	sock_put(sk);
	return ret;

Someone else is to blame again…

• SYNFLOOD detector
• TCP Cookie Transactions
• MD5SUM

Let's build an Innovative Kludge!

*rawpost
:POSTROUTING ACCEPT [15:1548]
-A POSTROUTING -s 10.1.0.0/24 -o eth8 -j RAWSNAT --to-source 10.10.40.3/32
COMMIT
# Completed on Mon May 20 04:47:30 2013
# Generated by iptables-save v1.4.16.3 on Mon May 20 04:47:30 2013
*raw
:PREROUTING ACCEPT [28:2128]
:OUTPUT ACCEPT [18:2056]
-A PREROUTING -d 10.10.40.3/32 -m cpu --cpu 0 -j RAWDNAT --to-destination 10.1.0.1/32
-A PREROUTING -d 10.10.40.3/32 -m cpu --cpu 1 -j RAWDNAT --to-destination 10.1.0.2/32
-A PREROUTING -d 10.10.40.3/32 -m cpu --cpu 2 -j RAWDNAT --to-destination 10.1.0.3/32
-A PREROUTING -d 10.10.40.3/32 -m cpu --cpu 3 -j RAWDNAT --to-destination 10.1.0.4/32
-A PREROUTING -d 10.10.40.3/32 -m cpu --cpu 4 -j RAWDNAT --to-destination 10.1.0.5/32
-A PREROUTING -d 10.10.40.3/32 -m cpu --cpu 5 -j RAWDNAT --to-destination 10.1.0.6/32
-A PREROUTING -d 10.10.40.3/32 -m cpu --cpu 6 -j RAWDNAT --to-destination 10.1.0.7/32
-A PREROUTING -d 10.10.40.3/32 -m cpu --cpu 7 -j RAWDNAT --to-destination 10.1.0.8/32
COMMIT
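The eight RAWDNAT rules follow a simple pattern: the xt_cpu match classifies a packet by the CPU that received it, and cpu N maps to the private address 10.1.0.(N+1). Rather than typing them by hand, a loop can emit the rule bodies; a sketch using the same addressing as the ruleset above:

```shell
# Emit one RAWDNAT rule per CPU. Each CPU gets its own private destination
# address, so the TCP stack sees eight independent listeners instead of one
# contended socket, sidestepping the _raw_spin_lock pile-up from the profile.
public_ip=10.10.40.3
for cpu in 0 1 2 3 4 5 6 7; do
    echo "-A PREROUTING -d $public_ip/32 -m cpu --cpu $cpu -j RAWDNAT --to-destination 10.1.0.$((cpu + 1))/32"
done
```

Pipe the output into the *raw section of an iptables-restore file, as in the ruleset shown above.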

WIN!

[Chart: kPPS handled vs. number of RSS queues (1-16), y-axis 0-3000]

Questions?

For dessert

What happens if you send a packet to a port nobody listens on?

And what if you send many, many such packets?

Linux 3.5.7

[Chart: KPPS handled vs. number of RSS queues (1-16), y-axis 0-600]

net/ipv4/ip_output.c

	bh_lock_sock(sk);
	inet->tos = arg->tos;
	sk->sk_priority = skb->priority;
	sk->sk_protocol = ip_hdr(skb)->protocol;
	sk->sk_bound_dev_if = arg->bound_dev_if;
	ip_append_data(sk, &fl4, ip_reply_glue_bits, arg->iov->iov_base, len, 0,
		       &ipc, &rt, MSG_DONTWAIT);
	if ((skb = skb_peek(&sk->sk_write_queue)) != NULL) {
		if (arg->csumoffset >= 0)
			*((__sum16 *)skb_transport_header(skb) +
			  arg->csumoffset) = csum_fold(csum_add(skb->csum,
								arg->csum));
		skb->ip_summed = CHECKSUM_NONE;
		ip_push_pending_frames(sk, &fl4);
	}
	bh_unlock_sock(sk);

Thank you, Eric!

commit be9f4a44e7d41cee50ddb5f038fc2391cbbb4046
Author: Eric Dumazet <edumazet@google.com>
Date:   Thu Jul 19 07:34:03 2012 +0000

    ipv4: tcp: remove per net tcp_sock

    tcp_v4_send_reset() and tcp_v4_send_ack() use a single socket
    per network namespace.

This leads to bad behavior on multiqueue NICs, because many cpus contend for the socket lock, and once the socket lock is acquired, extra false sharing on various socket fields slows down the operations.

To better resist attacks, we use a percpu socket. Each cpu can run without contention, using appropriate memory (local node).

Thank you, Eric!

[Chart: after the fix, KPPS handled vs. number of RSS queues (1-16), y-axis 0-4500]

To hell with the musketeers, it's FRIDAY!
