I. Pod access to external services is intermittent

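1 Symptoms

Environment:

Kubernetes with the Calico network plugin in BGP mode

Redis, single node

Problem:

Some pods cannot communicate with Redis, and telnet to the Redis port fails. Two cases were observed: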

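Pod b could reach Redis at first. After I deleted and recreated it (it may be scheduled on node1 or on node2), telnet to Redis no longer worked.

Pod d could never reach Redis. After I deleted and recreated it (again on either node1 or node2), telnet to Redis worked.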
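The telnet check can be scripted so that the same probe runs identically from every pod. The following is a minimal sketch, assuming bash and the coreutils `timeout` command are available in the container; `can_connect` and `report` are hypothetical helper names, not part of any tool used above.

```shell
# can_connect HOST PORT: succeed if a TCP connection opens within 3 seconds.
# Uses bash's /dev/tcp pseudo-device, so no telnet or nc binary is required.
can_connect() {
    timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$1" "$2" 2>/dev/null
}

# report HOST PORT: print a one-line verdict for the probe.
report() {
    if can_connect "$1" "$2"; then
        echo "$1:$2 reachable"
    else
        echo "$1:$2 unreachable"
    fi
}
```

Running `report <redis-ip> <redis-port>` inside each affected pod gives a result that can be compared across pods and nodes.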

2 Troubleshooting

2.1 Routing

2.1.1 Look up the pod that cannot communicate

kubectl get pod -n test -owide
NAME                             READY   STATUS    RESTARTS   AGE     IP               NODE
service-21189-54f674b658-ksj2b   1/1     Running   1          13h     172.18.239.103   node1

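2.1.2 Check the routing table on the corresponding node

Something looks odd here, and I am not sure whether it is a cluster problem: in principle the pod IP range should be routed via bond1. I also do not know why the 169.254 routes are present (169.254.0.0/16 is the IPv4 link-local range), but they do not affect this case.

Access path: pod -> bond0 -> redis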

route -n
Kernel IP routing table
Destination     Gateway         Genmask         Flags Metric Ref    Use Iface
0.0.0.0         10.2.94.254     0.0.0.0         UG    0      0        0 bond0
10.2.94.0       0.0.0.0         255.255.255.0   U     0      0        0 bond0
169.254.0.0     0.0.0.0         255.255.0.0     U     1009   0        0 bond0
169.254.0.0     0.0.0.0         255.255.0.0     U     1010   0        0 bond1
172.16.58.0     0.0.0.0         255.255.255.0   U     0      0        0 docker0
172.18.239.103  0.0.0.0         255.255.255.255 UH    0      0        0 cali1b3b587d00a
......
192.168.2.0     0.0.0.0         255.255.255.0   U     0      0        0 virbr0

2.1.3 Verify with traceroute

The trace confirms the path guessed above.


traceroute -n -T -p 6380 10.2.1.94
traceroute to 10.2.1.94 (10.2.1.94), 30 hops max, 52 byte packets
 1  10.2.12.8  6.269 ms  6.967 ms  5.402 ms
 2  10.2.12.254  2.269 ms  1.967 ms  1.402 ms
 3  10.9.10.2  1.458 ms  1.347 ms  1.215 ms
 4  10.2.205.2  1.059 ms  0.888 ms  0.828 ms
 5  10.2.205.54  1.175 ms  1.438 ms  1.197 ms
 6  10.2.1.254  1.956 ms  1.836 ms  1.834 ms   # There is also a problem here: some pods fail to find the next hop
 7  10.2.1.94  1.207 ms  1.212 ms  1.196 ms
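When comparing traces from several pods, it helps to pull out the hops that never answered. traceroute prints such a hop as `* * *`, so a small filter can list them. This is only a sketch; `silent_hops` is a hypothetical helper name, and it assumes the default three probes per hop.

```shell
# silent_hops: read traceroute output on stdin and print the number of every
# hop where all three probes timed out (shown by traceroute as "* * *").
silent_hops() {
    awk '$2 == "*" && $3 == "*" && $4 == "*" { print $1 }'
}
```

Usage: `traceroute -n -T -p 6380 10.2.1.94 | silent_hops`. An empty result means every hop replied to at least one probe.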

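2.2 Test access to another service from a pod that cannot reach Redis

The request succeeds, which shows that the pod's outbound network is fine, so the investigation should focus on Redis.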

curl -v  www.baidu.com
* Rebuilt URL to: www.baidu.com/
*   Trying 110.242.68.4...
* TCP_NODELAY set
* Connected to www.baidu.com (110.242.68.4) port 80 (#0)
> GET / HTTP/1.1
> Host: www.baidu.com
> User-Agent: curl/7.52.1
> Accept: */*
>
< HTTP/1.1 200 OK
< Accept-Ranges: bytes
< Cache-Control: private, no-cache, no-store, proxy-revalidate, no-transform
< Connection: keep-alive
< Content-Length: 2381
< Content-Type: text/html
< Date: Fri, 15 Apr 2022 02:37:57 GMT
< Etag: "588604c1-94d"
< Last-Modified: Mon, 23 Jan 2017 13:27:29 GMT
< Pragma: no-cache
< Server: bfe/1.0.8.18
< Set-Cookie: BDORZ=27315; max-age=86400; domain=.baidu.com; path=/

2.3 Capture packets on the Redis node

tcpdump -i eth0 port 6379 -w redis.pcap

2.3.1 Inspect the capture in Wireshark

The capture shows the pod sending the TCP handshake request (SYN), but Redis never replies.
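The same observation can be made without Wireshark by capturing only handshake packets. The sketch below builds pcap filter expressions for initial SYN packets and for SYN+ACK replies; `syn_filter` and `synack_filter` are hypothetical helper names, while the `tcp[tcpflags]` syntax itself is standard pcap-filter syntax.

```shell
# syn_filter PORT: print a pcap filter that matches only initial SYN packets
# (SYN set, ACK clear) to or from PORT.
syn_filter() {
    printf 'tcp port %s and tcp[tcpflags] & (tcp-syn|tcp-ack) == tcp-syn' "$1"
}

# synack_filter PORT: print a pcap filter that matches only SYN+ACK replies.
synack_filter() {
    printf 'tcp port %s and tcp[tcpflags] & (tcp-syn|tcp-ack) == (tcp-syn|tcp-ack)' "$1"
}
```

Usage: `tcpdump -nn -i eth0 "$(syn_filter 6379)"` in one terminal and `tcpdump -nn -i eth0 "$(synack_filter 6379)"` in another. SYN packets arriving with no matching SYN+ACK reproduce the finding above.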

3 Result

Further investigation showed that the Linux host had run out of available TCP connections.
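These notes do not record which limit was hit, so the following is only one way to look for connection exhaustion: summarise the socket states reported by `ss -tan`. `count_tcp_states` is a hypothetical helper name, and it assumes the state is the first column, as in iproute2's `ss` output.

```shell
# count_tcp_states: read `ss -tan` output on stdin and print one line per
# TCP state with the number of sockets in that state, sorted by state name.
count_tcp_states() {
    # NR > 1 skips the header row; $1 is the state column.
    awk 'NR > 1 { n[$1]++ } END { for (s in n) print s, n[s] }' | sort
}
```

Usage: `ss -tan | count_tcp_states`. Unusually large counts can then be compared against kernel settings such as `sysctl net.core.somaxconn net.ipv4.tcp_max_syn_backlog`; which setting matters depends on the state that is piling up.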

II. Application pods alert with No route to host

1 Test name resolution first: the first lookup is slightly slow, but later lookups are normal

nslookup test.ccops.cc 10.1.1.1
Server:         10.1.1.1
Address:        10.1.1.1#53

Non-authoritative answer:
test.ccops.cc name = bn1--test.ccops.cc.
Name:   bn1--test.ccops.cc
Address: 10.1.1.2
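To put numbers on "the first lookup is slow", each lookup can be timed. This is a minimal sketch assuming GNU `date`, which supports `%N` for nanoseconds; `elapsed_ms` is a hypothetical helper name.

```shell
# elapsed_ms CMD [ARGS...]: run CMD with its output discarded and print the
# wall-clock time it took, in milliseconds.
# Assumes GNU date; %N is not available in every date implementation.
elapsed_ms() (
    start=$(date +%s%N)
    "$@" >/dev/null 2>&1
    end=$(date +%s%N)
    echo $(( (end - start) / 1000000 ))
)

# Example (run manually inside the pod):
#   for i in 1 2 3 4 5; do
#       echo "lookup $i: $(elapsed_ms nslookup test.ccops.cc 10.1.1.1) ms"
#   done
```

A first value far above the rest would match the symptom described in this step.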

2 Troubleshoot CoreDNS

2.1 Check the configuration

The configuration looks correct.

forward . 10.1.1.1:53

2.2 Check the logs

The logs show CoreDNS timing out when it forwards queries to the upstream server.

kubectl logs -f -n kube-system -l k8s-app=kube-dns
2022/11/22 05:57:48 [ERROR] 2 test.ccops.cc. A: read udp 172.16.71.6:46422->10.1.1.1:53: i/o timeout
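To see how often each upstream times out, the log can be reduced to a count per upstream address. This is a sketch that assumes the log line format shown above; `timeouts_by_upstream` is a hypothetical helper name.

```shell
# timeouts_by_upstream: read CoreDNS log lines on stdin and print
# "UPSTREAM COUNT" for every upstream address that produced an i/o timeout.
timeouts_by_upstream() {
    grep 'i/o timeout' |
        # Keep only the address after "->", e.g. 10.1.1.1:53.
        sed -n 's/.*->\([0-9.]*:[0-9]*\): i\/o timeout.*/\1/p' |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Usage: `kubectl logs -n kube-system -l k8s-app=kube-dns | timeouts_by_upstream`. Dropping `-f` lets the pipeline finish instead of following the log.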