Bugless #53
Access to k0 pod network from routing fabric is borked
Description
For example, from boston:
$ curl 10.10.25.14:9092 # matrix metrics
will sometimes work and sometimes get stuck.
10.10.25.0/26 is ECMP'd across all k0 hosts:
dcsw01.hswaw.net#show ip route 10.10.25.0/26 Codes: C - connected, S - static, K - kernel, O - OSPF, IA - OSPF inter area, E1 - OSPF external type 1, E2 - OSPF external type 2, N1 - OSPF NSSA external type 1, N2 - OSPF NSSA external type2, B I - iBGP, B E - eBGP, R - RIP, I - ISIS, A B - BGP Aggregate, A O - OSPF Summary, NG - Nexthop Group Static Route B E 10.10.25.0/26 [200/0] via 185.236.240.35, Vlan2001 via 185.236.240.36, Vlan2001 via 185.236.240.39, Vlan2001 via 185.236.240.40, Vlan2001
However, it's a pod IP, so that's only really handled by one node - in this case, dcr01s24 / 185.236.240.40. And it seems like it only works when it gets ECMPd directly to that node, but not otherwise.
But even still, it should be properly bounced off if it hits other nodes, what's going on?
Updated by q3k over 2 years ago
Here's an example of when it hits 10.10.25.14 (on dcr01s24) through bc01n01:
SYN through bc01n01:
bc01n01 $ tcpdump -i eno1 -vv -n -e host 10.10.25.14 and tcp port 9092 00:46:19.295852 00:1c:73:11:8a:83 > 00:23:ae:fe:83:c4, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 61506, offset 0, flags [DF], proto TCP (6), length 60) 185.236.240.38.39384 > 10.10.25.14.9092: Flags [S], cksum 0xf440 (correct), seq 1319813603, win 64240, options [mss 1460,sackOK,TS val 3893286313 ecr 0,nop,wscale 7], length 0
SYN and SYN/ACK on dcr01s24
dcr01s24 $ tcpdump -i enp130s0f0 -vv -n -e host 10.10.25.14 and port 9092 00:46:15.039772 90:1b:0e:08:12:b8 > 90:1b:0e:31:bb:6a, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 62, id 2005, offset 0, flags [DF], proto TCP (6), length 60) 185.236.240.38.39380 > 10.10.25.14.9092: Flags [S], cksum 0x1453 (correct), seq 2397532729, win 64240, options [mss 1460,sackOK,TS val 3893282056 ecr 0,nop,wscale 7], length 0 00:46:15.039917 90:1b:0e:31:bb:6a > 00:23:ae:fe:45:8c, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60) 10.10.25.14.9092 > 185.236.240.38.39380: Flags [S.], cksum 0xcd59 (incorrect -> 0x365b), seq 1341792797, ack 2397532730, win 65236, options [mss 1400,sackOK,TS val 434785853 ecr 3893282056,nop,wscale 7], length 0
And SYN, SYN/ACK, ACK on boston (different flow):
00:50:36.884468 00:23:ae:fe:45:8c > 00:1c:73:11:8a:83, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 64, id 19559, offset 0, flags [DF], proto TCP (6), length 60) 185.236.240.38.39386 > 10.10.25.14.9092: Flags [S], cksum 0xcd59 (incorrect -> 0x7951), seq 1150572718, win 64240, options [mss 1460,sackOK,TS val 3893543902 ecr 0,nop,wscale 7], length 0 00:50:36.884823 90:1b:0e:31:bb:6a > 00:23:ae:fe:45:8c, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60) 10.10.25.14.9092 > 185.236.240.38.39386: Flags [S.], cksum 0x54a5 (correct), seq 2657360781, ack 1150572719, win 65236, options [mss 1400,sackOK,TS val 435047699 ecr 3893543902,nop,wscale 7], length 0 00:50:36.884867 00:23:ae:fe:45:8c > 00:1c:73:11:8a:83, ethertype IPv4 (0x0800), length 66: (tos 0x0, ttl 64, id 19560, offset 0, flags [DF], proto TCP (6), length 52) 185.236.240.38.39386 > 10.10.25.14.9092: Flags [.], cksum 0xcd51 (incorrect -> 0x8014), seq 1, ack 1, win 502, options [nop,nop,TS val 3893543902 ecr 435047699], length 0
So that looks fine so far - SYN from boston goes through intermediary host, SYN/ACK from other side, and boston sends an ACK. But the first sign of trouble is if we look further down at dcr01s24 logs:
00:46:15.039917 90:1b:0e:31:bb:6a > 00:23:ae:fe:45:8c, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60) 10.10.25.14.9092 > 185.236.240.38.39380: Flags [S.], cksum 0xcd59 (incorrect -> 0x365b), seq 1341792797, ack 2397532730, win 65236, options [mss 1400,sackOK,TS val 434785853 ecr 3893282056,nop,wscale 7], length 0 00:46:16.095067 90:1b:0e:31:bb:6a > 00:23:ae:fe:45:8c, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 63, id 0, offset 0, flags [DF], proto TCP (6), length 60) 10.10.25.14.9092 > 185.236.240.38.39380: Flags [S.], cksum 0xcd59 (incorrect -> 0x323c), seq 1341792797, ack 2397532730, win 65236, options [mss 1400,sackOK,TS val 434786908 ecr 3893282056,nop,wscale 7], length 0
that's a SYN/ACK retransmit! And then even further down:
00:46:17.013760 00:23:ae:fe:83:20 > 90:1b:0e:31:bb:6a, ethertype IPv4 (0x0800), length 74: (tos 0x0, ttl 62, id 58262, offset 0, flags [DF], proto TCP (6), length 60) 185.236.240.38.39382 > 10.10.25.14.9092: Flags [S], cksum 0x0ae5 (correct), seq 3264954427, win 64240, options [mss 1460,sackOK,TS val 3893284030 ecr 0,nop,wscale 7], length 0
another SYN received from boston for a different flow.
So it seems like dcr01s24 never gets the ACK from boston, and retransmits SYN/ACKs, while boston attempts another connection.
That's odd.
Updated by q3k over 2 years ago
Can this be because the SYN/ACK (dcr01s24 -> boston) is a direct server return (bypassing bc01n01), and that means that bc01n01 is dropping the ACK (also going through it, because ECMP is stable across the same TCP flow 5-tuple) as it can't follow that flow anymore? Do we have some iptables FORWARD rules that would reject unknown TCP flow packets?
Updated by q3k over 2 years ago
Chain KUBE-FORWARD (1 references) pkts bytes target prot opt in out source destination 32 3344 DROP all -- * * 0.0.0.0/0 0.0.0.0/0 ctstate INVALID 0 0 ACCEPT all -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes forwarding rules */ mark match 0x4000/0x4000 242 97612 ACCEPT all -- * * 10.10.16.0/20 0.0.0.0/0 /* kubernetes forwarding conntrack pod source rule */ ctstate RELATED,ESTABLISHED 255 13330 ACCEPT all -- * * 0.0.0.0/0 10.10.16.0/20 /* kubernetes forwarding conntrack pod destination rule */ ctstate RELATED,ESTABLISHED
Hmmm.... And that DROP target counter certainly is increasing if I spam curl on boston...
Updated by q3k over 2 years ago
Seems like setting a sysfs tunable should be able to help us with this asymmetric routing issue:
https://github.com/kubernetes/kubernetes/issues/94861#issuecomment-796779433
I'll manually flip the following on our machines:
sysctl net.netfilter.nf_conntrack_tcp_be_liberal=1
And if that helps, we'll deploy it properly.
Updated by q3k over 2 years ago
Flipped all machines:
$ for m in bc01n{01,02} dcr01s{22,24}; do ssh root@$m.hswaw.net sysctl -a | grep conntrack | grep liberal; done net.netfilter.nf_conntrack_tcp_be_liberal = 1 net.netfilter.nf_conntrack_tcp_be_liberal = 1 net.netfilter.nf_conntrack_tcp_be_liberal = 1 net.netfilter.nf_conntrack_tcp_be_liberal = 1
But that doesn't seem to have helped...
Updated by q3k over 2 years ago
Yeah, I don't think that's gonna help.
We have to either hack kube-proxy to remove that rule (although removing it might cause us to trigger https://github.com/kubernetes/kubernetes/issues/74839), or to stop all nodes from announcing all pod networks from every machine.
I'm now leaning towards the second, as I think that's the right solution (why would all machines announce all pod networks? it's probably a bug in our calico hacks that I didn't see) - but it will require mangling some more calico bird rule templates...
Updated by q3k over 2 years ago
- Status changed from New to Assigned
- Assignee set to implr
Punting this over to implr, as he's holding the backlog of calico things to be fixed or aware of wrt. upgrading it.