问题
请求网关转发接口长时间无响应,服务无任何错误日志;但是直接请求服务接口正常。
问题定位
TCP问题排查
- 通过nestat查看tcp连接
bash-4.3$ netstat -tap
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:http-alt 0.0.0.0:* LISTEN 1/java
tcp 0 0 trade-application-gateway-6cbfcc888b-69t52:55230 service-admin.default.svc.cluster.local:http-alt ESTABLISHED 1/java
tcp 0 0 trade-application-gateway-6cbfcc888b-69t52:60050 service-admin.default.svc.cluster.local:http-alt ESTABLISHED 1/java
tcp 0 0 trade-application-gateway-6cbfcc888b-69t52:35014 service-admin.default.svc.cluster.local:http-alt ESTABLISHED 1/java
tcp 0 0 trade-application-gateway-6cbfcc888b-69t52:40334 service-admin.default.svc.cluster.local:http-alt ESTABLISHED 1/java
tcp 1 0 localhost:http-alt localhost:55520 CLOSE_WAIT 1/java
........
发现有大量到service-admin.default.svc.cluster.local:http-alt的连接未释放,总共50个
- 查看配置文件
bash-4.3$ cat application.yml
spring:
application:
name: trade-application-gateway
.....
ribbon:
MaxTotalConnections: 500
MaxConnectionsPerHost: 50
....
发现每个服务最大连接为50
- 查看连接创建时间
bash-4.3$ cat /proc/1/net/tcp
sl local_address rem_address st tx_queue rx_queue tr tm->when retrnsmt uid timeout inode
0: 00000000:1F90 00000000:0000 0A 00000000:00000000 00:00000000 00000000 2 0 4116105505 1 ffff91f1b766cd80 100 0 0 10 0
1: E5A2050A:A80E 41068898:0050 08 00000000:00000001 00:00000000 00000000 2 0 4155820713 1 ffff91f1a59fae80 20 4 30 10 -1
2: E5A2050A:E1D0 1F08040A:2387 01 00000000:00000000 02:0009C872 00000000 2 0 4116089078 2 ffff91f1b7668f80 20 4 24 10 -1
3: E5A2050A:9220 52C5EA3E:0050 08 00000000:00000001 00:00000000 00000000 2 0 4117890325 1 ffff91f1ee5445c0 21 4 28 10 -1
4: E5A2050A:E4DA 91BA050A:1F90 01 00000000:00000000 00:00000000 00000000 2 0 4116594566 1 ffff91f15587be00 20 4 30 10 -1
5: E5A2050A:E490 91BA050A:1F90 01 00000000:00000000 00:00000000 00000000 2 0 4116586431 1 ffff91f5fe57f440 20 4 30 10 -1
.......
获取到某个连接的inode,然后查看操作符
bash-4.3$ ls -all /proc/1/fd/ |grep 4116594566
lrwx------ 1 daemon daemon 64 Jun 11 17:29 154 -> socket:[4116594566]
发现该inode是一个socket连接,创建时间为 6.11 17:29
- 查看连接默认超时时间
sysctl -a |grep keepalive
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75
TCP连接超时时间正常
抓包-抓取service-admin的tcp报文
- 查找该容器对应的进程id
docker inspect dac3c06a1c78
......
"State": {
"Status": "running",
"Running": true,
"Paused": false,
"Restarting": false,
"OOMKilled": false,
"Dead": false,
"Pid": 28510,
"ExitCode": 0,
"Error": "",
"StartedAt": "2020-06-12T06:05:46.85956101Z",
"FinishedAt": "0001-01-01T00:00:00Z"
},
......
- 进入该进程network namespace
nsenter -n -t 28510
- 查看tcp连接
netstat -t
- 抓包
tcpdump
- 发现service-admin一直在向trade-application-gateway不停的发送报文,导致连接一直未断开
14:47:15.819812 IP (tos 0x0, ttl 64, id 61117, offset 0, flags [DF], proto TCP (6), length 69)
VM_1_10_centos.webcache > 10.5.162.229.58814: Flags [P.], cksum 0x5c72 (incorrect -> 0xdb55), seq 711925729:711925746, ack 811435878, win 276, options [nop,nop,TS val 297204448 ecr 297518045], length 17: HTTP
0x0000: 4500 0045 eebd 4000 4006 efba 0a05 a54b E..E..@.@......K
0x0010: 0a05 a2e5 1f90 e5be 2a6f 1fe1 305d 8766 ........*o..0].f
0x0020: 8018 0114 5c72 0000 0101 080a 11b6 fae0 ....\r..........
0x0030: 11bb c3dd 360d 0a3a 7069 6e67 0a0d 0a31 ....6..:ping...1
0x0040: 0d0a 0a0d 0a .....
14:47:15.819959 IP (tos 0x0, ttl 62, id 39060, offset 0, flags [DF], proto TCP (6), length 52)
10.5.162.229.58814 > VM_1_10_centos.webcache: Flags [.], cksum 0x08ea (correct), seq 1, ack 17, win 229, options [nop,nop,TS val 297528045 ecr 297204448], length 0
0x0000: 4500 0034 9894 4000 3e06 47f5 0a05 a2e5 E..4..@.>.G.....
0x0010: 0a05 a54b e5be 1f90 305d 8766 2a6f 1ff2 ...K....0].f*o..
0x0020: 8010 00e5 08ea 0000 0101 080a 11bb eaed ................
0x0030: 11b6 fae0
上述过程让我们知道了tcp未关闭的原因,但是服务器为什么一直向浏览器发送报文呢,我们继续探究问题本质
通过模拟spring boot admin ui请求定位问题
- 在spring boot admin ui上我们观测到request header中存在keep-alive和Accept: text/event-stream 比较奇特
- 通过wrk工具首次模拟service admin ui的请求,添加上Proxy-Connection:keep-alive;发现连接能够正常释放
wrk --latency -H "Proxy-Connection: keep-alive" -t50 -c50 http://b2c-qa.kanche.com/admin/applications
- 通过wrk工具第二次模板请求,添加所有的header,通过
netstat -atp
发现连接一直未释放
wrk --latency -H "Host: b2c-qa.xxx.com" -H "Proxy-Connection: keep-alive" -H "Accept: text/event-stream" -H "Cache-Control: no-cache" -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36" -H "Referer: http://b2c-qa.kanche.com/admin" -H "Accept-Encoding: gzip, deflate" -H "Accept-Language: zh-CN,zh;q=0.9,en;q=0.8" -H "Cookie: _ga=GA1.2.2133472313.1590731594; _fbp=fb.1.1590731632442.287467853" -t1 -c1 http://b2c-qa.xxx.com/admin/applications
为了结果更准确,建议在模拟请求前,重启服务以关闭其它所有的tcp连接
- 通过wrk工具第三次模板请求,在所有header的基础上去掉
Accept: text/event-stream
,通过netstat -atp
发现连接能够正常释放
wrk --latency -H "Host: b2c-qa.kanche.com" -H "Proxy-Connection: keep-alive" -H "Cache-Control: no-cache" -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36" -H "Referer: http://b2c-qa.kanche.com/admin" -H "Accept-Encoding: gzip, deflate" -H "Accept-Language: zh-CN,zh;q=0.9,en;q=0.8" -H "Cookie: _ga=GA1.2.2133472313.1590731594; _fbp=fb.1.1590731632442.287467853" -t1 -c1 http://b2c-qa.kanche.com/admin/applications
故问题在于text/event-stream
,经查询资料发现该技术为EventSource,可实现服务器向页面推送消息
问题总结
由于ui请求使用的是EventSource,所以service-admin服务会不停的向前置服务trade-application-gateway发送tcp报文,而trade-application-gateway并无此实现,所以并不会向前置nginx等发送报文,故trade-application-gateway前置的tcp连接可正常断开,而和service-admin则无法断开连接。
解决方案
在zuul转发时,添加sensitiveHeaders
配置,忽略accept;再通过wrk测试发现正常;ui请求也已正常
service-admin:
sensitiveHeaders: Accept
path: /admin/**
serviceId: service-admin
stripPrefix: false