问题

请求网关转发接口长时间无响应,服务无任何错误日志;但是直接请求服务接口正常。

问题定位

TCP问题排查

  • 通过nestat查看tcp连接
bash-4.3$ netstat -tap
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address           Foreign Address         State       PID/Program name
tcp        0      0 0.0.0.0:http-alt        0.0.0.0:*               LISTEN      1/java
tcp        0      0 trade-application-gateway-6cbfcc888b-69t52:55230 service-admin.default.svc.cluster.local:http-alt ESTABLISHED 1/java
tcp        0      0 trade-application-gateway-6cbfcc888b-69t52:60050 service-admin.default.svc.cluster.local:http-alt ESTABLISHED 1/java
tcp        0      0 trade-application-gateway-6cbfcc888b-69t52:35014 service-admin.default.svc.cluster.local:http-alt ESTABLISHED 1/java
tcp        0      0 trade-application-gateway-6cbfcc888b-69t52:40334 service-admin.default.svc.cluster.local:http-alt ESTABLISHED 1/java
tcp        1      0 localhost:http-alt      localhost:55520         CLOSE_WAIT  1/java
........

发现有大量到service-admin.default.svc.cluster.local:http-alt的连接未释放,总共50个

  • 查看配置文件
bash-4.3$ cat application.yml
spring:
  application:
    name: trade-application-gateway
.....
ribbon:
  MaxTotalConnections: 500
  MaxConnectionsPerHost: 50
....

发现每个服务最大连接为50

  • 查看连接创建时间
bash-4.3$ cat /proc/1/net/tcp
  sl  local_address rem_address   st tx_queue rx_queue tr tm->when retrnsmt   uid  timeout inode
   0: 00000000:1F90 00000000:0000 0A 00000000:00000000 00:00000000 00000000     2        0 4116105505 1 ffff91f1b766cd80 100 0 0 10 0
   1: E5A2050A:A80E 41068898:0050 08 00000000:00000001 00:00000000 00000000     2        0 4155820713 1 ffff91f1a59fae80 20 4 30 10 -1
   2: E5A2050A:E1D0 1F08040A:2387 01 00000000:00000000 02:0009C872 00000000     2        0 4116089078 2 ffff91f1b7668f80 20 4 24 10 -1
   3: E5A2050A:9220 52C5EA3E:0050 08 00000000:00000001 00:00000000 00000000     2        0 4117890325 1 ffff91f1ee5445c0 21 4 28 10 -1
   4: E5A2050A:E4DA 91BA050A:1F90 01 00000000:00000000 00:00000000 00000000     2        0 4116594566 1 ffff91f15587be00 20 4 30 10 -1
   5: E5A2050A:E490 91BA050A:1F90 01 00000000:00000000 00:00000000 00000000     2        0 4116586431 1 ffff91f5fe57f440 20 4 30 10 -1
.......

获取到某个连接的inode,然后查看操作符

bash-4.3$ ls -all /proc/1/fd/ |grep 4116594566
lrwx------    1 daemon   daemon          64 Jun 11 17:29 154 -> socket:[4116594566]

发现该inode是一个socket连接,创建时间为 6.11 17:29

  • 查看连接默认超时时间
sysctl -a |grep keepalive
net.ipv4.tcp_keepalive_time = 7200
net.ipv4.tcp_keepalive_probes = 9
net.ipv4.tcp_keepalive_intvl = 75

TCP连接超时时间正常

抓包-抓取service-admin的tcp报文

  • 查找该容器对应的进程id
docker inspect dac3c06a1c78
......
         "State": {
            "Status": "running",
            "Running": true,
            "Paused": false,
            "Restarting": false,
            "OOMKilled": false,
            "Dead": false,
            "Pid": 28510,
            "ExitCode": 0,
            "Error": "",
            "StartedAt": "2020-06-12T06:05:46.85956101Z",
            "FinishedAt": "0001-01-01T00:00:00Z"
        },
......
  • 进入该进程network namespace
nsenter -n -t 28510
  • 查看tcp连接
netstat -t
  • 抓包
tcpdump
  • 发现service-admin一直在向trade-application-gateway不停的发送报文,导致连接一直未断开
14:47:15.819812 IP (tos 0x0, ttl 64, id 61117, offset 0, flags [DF], proto TCP (6), length 69)
    VM_1_10_centos.webcache > 10.5.162.229.58814: Flags [P.], cksum 0x5c72 (incorrect -> 0xdb55), seq 711925729:711925746, ack 811435878, win 276, options [nop,nop,TS val 297204448 ecr 297518045], length 17: HTTP
	0x0000:  4500 0045 eebd 4000 4006 efba 0a05 a54b  E..E..@.@......K
	0x0010:  0a05 a2e5 1f90 e5be 2a6f 1fe1 305d 8766  ........*o..0].f
	0x0020:  8018 0114 5c72 0000 0101 080a 11b6 fae0  ....\r..........
	0x0030:  11bb c3dd 360d 0a3a 7069 6e67 0a0d 0a31  ....6..:ping...1
	0x0040:  0d0a 0a0d 0a                             .....
14:47:15.819959 IP (tos 0x0, ttl 62, id 39060, offset 0, flags [DF], proto TCP (6), length 52)
    10.5.162.229.58814 > VM_1_10_centos.webcache: Flags [.], cksum 0x08ea (correct), seq 1, ack 17, win 229, options [nop,nop,TS val 297528045 ecr 297204448], length 0
	0x0000:  4500 0034 9894 4000 3e06 47f5 0a05 a2e5  E..4..@.>.G.....
	0x0010:  0a05 a54b e5be 1f90 305d 8766 2a6f 1ff2  ...K....0].f*o..
	0x0020:  8010 00e5 08ea 0000 0101 080a 11bb eaed  ................
	0x0030:  11b6 fae0

上述过程让我们知道了tcp未关闭的原因,但是服务器为什么一直向浏览器发送报文呢,我们继续探究问题本质

通过模拟spring boot admin ui请求定位问题

  • 在spring boot admin ui上我们观测到request header中存在keep-alive和Accept: text/event-stream 比较奇特

WX20200614-173651@2x

  • 通过wrk工具首次模拟service admin ui的请求,添加上Proxy-Connection:keep-alive;发现连接能够正常释放
wrk --latency  -H "Proxy-Connection: keep-alive" -t50 -c50 http://b2c-qa.kanche.com/admin/applications
  • 通过wrk工具第二次模板请求,添加所有的header,通过netstat -atp发现连接一直未释放
wrk --latency  -H "Host: b2c-qa.xxx.com"   -H "Proxy-Connection: keep-alive"   -H "Accept: text/event-stream"   -H "Cache-Control: no-cache"   -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"   -H "Referer: http://b2c-qa.kanche.com/admin"   -H "Accept-Encoding: gzip, deflate"   -H "Accept-Language: zh-CN,zh;q=0.9,en;q=0.8"   -H "Cookie: _ga=GA1.2.2133472313.1590731594; _fbp=fb.1.1590731632442.287467853" -t1 -c1 http://b2c-qa.xxx.com/admin/applications

WX20200614-174307@2x

为了结果更准确,建议在模拟请求前,重启服务以关闭其它所有的tcp连接

  • 通过wrk工具第三次模板请求,在所有header的基础上去掉Accept: text/event-stream,通过netstat -atp发现连接能够正常释放
wrk --latency  -H "Host: b2c-qa.kanche.com"   -H "Proxy-Connection: keep-alive"     -H "Cache-Control: no-cache"   -H "User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.61 Safari/537.36"   -H "Referer: http://b2c-qa.kanche.com/admin"   -H "Accept-Encoding: gzip, deflate"   -H "Accept-Language: zh-CN,zh;q=0.9,en;q=0.8"   -H "Cookie: _ga=GA1.2.2133472313.1590731594; _fbp=fb.1.1590731632442.287467853" -t1 -c1 http://b2c-qa.kanche.com/admin/applications

WX20200614-174522@2x

故问题在于text/event-stream,经查询资料发现该技术为EventSource,可实现服务器向页面推送消息

问题总结

由于ui请求使用的是EventSource,所以service-admin服务会不停的向前置服务trade-application-gateway发送tcp报文,而trade-application-gateway并无此实现,所以并不会向前置nginx等发送报文,故trade-application-gateway前置的tcp连接可正常断开,而和service-admin则无法断开连接。

解决方案

在zuul转发时,添加sensitiveHeaders配置,忽略accept;再通过wrk测试发现正常;ui请求也已正常

    service-admin:
      sensitiveHeaders: Accept
      path: /admin/**
      serviceId: service-admin
      stripPrefix: false