20220720-记录一次生产导出问题

问题

线上导出订单时,节点未重启,但是调接口不通,原因是调用三方接口超时错误,查看gc、cpu&内存监控,发现

  • cpu idle 40%
  • mem.used 3730m,容器是4g 堆大小为3g
  • gc time 大幅增加

故问题原因是由于频繁gc导致 stw,故无法在timeout时间内处理请求

导出订单数量3w+

临时方案

修改网关接口路由配置,导出接口路由至单独的小流量集群

模拟环境

  • 由于sim环境数据较少,故通过模拟的方式导出,即调整代码逻辑,一直导出第一页订单,循环1000次
  • 导出六万条订单数据正常
  • 导出正常的原因是因为导出时使用了重复的订单,导致所有的字符串使用的均为常量字符串比线上正常的导出节省了大量的内存空间,故在模拟导出时,针对字符串添加随机数,发现导出时调用三方依赖超时
  • 结果由于一直full gc stw导致无法处理程序功能

分析

分析gclog

  • 通过-Xloggc:logs/gc.log打印出gc.log
  • 通过gceasy分析gc情况

具体gclog见附

jstat

  • gcutil
$ jstat -gcutil 1207
  S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
100.00   0.00 100.00 100.00  95.38  93.65     43   11.052    81  367.720  378.771
  • gc
$  jstat -gc 1207
 S0C    S1C    S0U    S1U      EC       EU        OC         OU       MC     MU    CCSC   CCSU   YGC     YGCT    FGC    FGCT     GCT
104832.0 104832.0 23583.2  0.0   838912.0 838912.0 2097152.0  2097151.9  101368.0 96694.6 12440.0 11649.5     43   11.052  84    388.013  399.065

发现年老代垃圾回收时间非常长

具体释义见附

jmap

  • histo.sh
#!/bin/bash
while [ true ]; do
/bin/sleep 1
echo $(date "+%Y-%m-%d %H:%M:%S") >> jmap.histo.log
# jmap -histo:live $1 |head -n 10 >> jmap.histo.log
jmap -histo $1 |head -n 10 >> jmap.histo.log
echo -e "\n\n\n" >> jmap.histo.log
done
  • heap.sh
#!/bin/bash
while [ true ]; do
/bin/sleep 0.5
echo $(date "+%Y-%m-%d %H:%M:%S") >> jmap.heap.log
jmap -heap $1 >> jmap.heap.log
echo -e "\n\n\n" >> jmap.heap.log
done
  • 执行
./heap.sh 12111
./histo.sh 12111
  • 发现gc均为full gc ,原因为Heap Inspection Initiated GC
    Heap Inspection Initiated GC表示由于jmap -histo:live导致频繁的full gc ,由于在执行时添加了live参数,所以在执行该命令前均会执行一次full gc ;故应避免使用jmap -histo:live因为会频繁的full gc,故在执行jmap时,去掉live
  • 另外 执行heap.sh会严重影响系统的执行,故放弃打印heap
  • histo分析
    • 发现前几个均和打印有关,org.apache.xmlbeans.impl.store.Xobj$ElementXobj占用内存621M
    • [C 占用堆 528M
    • 故一直在stw中,无法处理程序
 num     #instances         #bytes  class name
----------------------------------------------
   1:       6791518      651985728  org.apache.xmlbeans.impl.store.
Xobj$ElementXobj
   2:      10467798      554512856  [C
   3:       5451636      523357056  org.apache.xmlbeans.impl.store.
Xobj$AttrXobj
   4:      10239032      245736768  java.lang.String
   5:       3442905      137716200  java.util.LinkedHashMap$Entry
   6:       2853572      114142880  java.util.TreeMap$Entry
   7:       2788002       89216064  org.apache.poi.xssf.usermodel.X
SSFCell

功能优化

功能分析

Xobj$ElementXobj为入口,通过查阅资料,发现,poi操作,有两种模式,XSSFWorkbook模式 和 SXSSFWorkbook模式。第一种模式,就是现在使用的方式,此方式仅适用与小数据量的poi操作;第二种SXSSFWorkbook,适用与较大数据量的poi操作

调整XSSFWorkbook

调整之前,操作excel

XSSFWorkbook wb = new XSSFWorkbook();
Sheet sheet = wb.createSheet("Sheet1");

调整之后

XSSFWorkbook workbook = new XSSFWorkbook();
SXSSFWorkbook wb = new SXSSFWorkbook(workbook);
Sheet sheet = wb.createSheet("Sheet1");

观察效果

  • histo.sh 1228
    • jmap 虽然有占内存较大的情况,但是能够正常通过YGC回收
  • gc log
    • gceasy分析正常
  • jstat -gcutil 1228
    • 导出期间jstat导出操作前及完成后对比full gc count 为0; YGC 为11
  • 文件导出成功

参考

jvm 启动配置

JVM_ARGS=" -Xms6g -Xmx6g -XX:NewRatio=2 -Xmn2g -XX:+UseConcMarkSweepGC -XX:+UseParNewGC -XX:ParallelGCThreads=4 -XX:+UseCMSInitiatingOccupancyOnly -XX:+PrintTenuringDistribution -XX:-OmitStackTraceInFastThrow -XX:+HeapDumpOnOutOfMemoryError -XX:+DisableExplicitGC -XX:+UseCMSCompactAtFullCollection -XX:CMSInitiatingOccupancyFraction=75 -XX:+UseFastAccessorMethods -Xloggc:logs/gc.log -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -XX:+PrintReferenceGC -XX:+ParallelRefProcEnabled "

jstat gcutil

  • S0:幸存1区当前使用比例
  • S1:幸存2区当前使用比例
  • E:伊甸园区使用比例
  • O:老年代使用比例
  • M:元数据区使用比例
  • CCS:压缩使用比例
  • YGC:年轻代垃圾回收次数
  • FGC:老年代垃圾回收次数
  • FGCT:老年代垃圾回收消耗时间
  • GCT:垃圾回收消耗总时间

jstat gc

S0C:第一个幸存区的大小
S1C:第二个幸存区的大小
S0U:第一个幸存区的使用大小
S1U:第二个幸存区的使用大小
EC:伊甸园区的大小
EU:伊甸园区的使用大小
OC:老年代大小
OU:老年代使用大小
MC:方法区大小
MU:方法区使用大小
CCSC:压缩类空间大小
CCSU:压缩类空间使用大小
YGC:年轻代垃圾回收次数
YGCT:年轻代垃圾回收消耗时间
FGC:老年代垃圾回收次数
FGCT:老年代垃圾回收消耗时间
GCT:垃圾回收消耗总时间

gc log


2022-07-20T20:08:19.253+0800: 1092.135: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2097151K(2097152K)] 2788234K(3040896K), 0.7885338 secs] [Times: user=0.76 sys=0.01, real=0.79 secs]
2022-07-20T20:08:20.042+0800: 1092.924: [CMS-concurrent-mark-start]
2022-07-20T20:08:23.708+0800: 1096.590: [Full GC (Allocation Failure) 2022-07-20T20:08:23.708+0800: 1096.590: [CMS2022-07-20T20:08:25.979+0800: 1098.861: [CMS-concurrent-mark: 3.406/5.937 secs] [Times: user=6.55 sys=0.13, real=5.94 secs]
 (concurrent mode failure)2022-07-20T20:08:28.750+0800: 1101.632: [SoftReference, 265 refs, 0.0000806 secs]2022-07-20T20:08:28.751+0800: 1101.632: [WeakReference, 809 refs, 0.0000928 secs]2022-07-20T20:08:28.751+0800: 1101.632: [FinalReference, 35 refs, 0.0000284 secs]2022-07-20T20:08:28.751+0800: 1101.632: [PhantomReference, 1 refs, 12 refs, 0.0000085 secs]2022-07-20T20:08:28.751+0800: 1101.632: [JNI Weak Reference, 0.0008604 secs]: 2097151K->2097151K(2097152K), 9.2485729 secs] 3040893K->2794721K(3040896K), [Metaspace: 96452K->96452K(1138688K)], 9.2487076 secs] [Times: user=8.86 sys=0.24, real=9.25 secs]
2022-07-20T20:08:35.439+0800: 1108.320: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2097151K(2097152K)] 2800798K(3040896K), 0.7889991 secs] [Times: user=0.76 sys=0.00, real=0.79 secs]
2022-07-20T20:08:36.228+0800: 1109.109: [CMS-concurrent-mark-start]
2022-07-20T20:08:40.075+0800: 1112.957: [Full GC (Allocation Failure) 2022-07-20T20:08:40.075+0800: 1112.957: [CMS2022-07-20T20:08:42.671+0800: 1115.553: [CMS-concurrent-mark: 3.803/6.443 secs] [Times: user=6.77 sys=0.47, real=6.44 secs]
 (concurrent mode failure)2022-07-20T20:08:45.471+0800: 1118.353: [SoftReference, 265 refs, 0.0000841 secs]2022-07-20T20:08:45.472+0800: 1118.353: [WeakReference, 809 refs, 0.0000919 secs]2022-07-20T20:08:45.472+0800: 1118.353: [FinalReference, 36 refs, 0.0000345 secs]2022-07-20T20:08:45.472+0800: 1118.353: [PhantomReference, 0 refs, 12 refs, 0.0000090 secs]2022-07-20T20:08:45.472+0800: 1118.353: [JNI Weak Reference, 0.0009569 secs]: 2097151K->2097151K(2097152K), 9.4983397 secs] 3040895K->2806905K(3040896K), [Metaspace: 96452K->96452K(1138688K)], 9.4985187 secs] [Times: user=8.98 sys=0.39, real=9.49 secs]
2022-07-20T20:08:51.962+0800: 1124.844: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2097151K(2097152K)] 2812304K(3040896K), 0.8163836 secs] [Times: user=0.78 sys=0.01, real=0.82 secs]
2022-07-20T20:08:52.778+0800: 1125.660: [CMS-concurrent-mark-start]
2022-07-20T20:08:56.767+0800: 1129.649: [Full GC (Allocation Failure) 2022-07-20T20:08:56.767+0800: 1129.649: [CMS2022-07-20T20:08:59.049+0800: 1131.931: [CMS-concurrent-mark: 3.743/6.271 secs] [Times: user=6.51 sys=0.14, real=6.27 secs]
 (concurrent mode failure)2022-07-20T20:09:01.937+0800: 1134.819: [SoftReference, 265 refs, 0.0000930 secs]2022-07-20T20:09:01.937+0800: 1134.819: [WeakReference, 809 refs, 0.0001021 secs]2022-07-20T20:09:01.937+0800: 1134.819: [FinalReference, 33 refs, 0.0000242 secs]2022-07-20T20:09:01.937+0800: 1134.819: [PhantomReference, 0 refs, 12 refs, 0.0000107 secs]2022-07-20T20:09:01.937+0800: 1134.819: [JNI Weak Reference, 0.0008914 secs]: 2097151K->2097151K(2097152K), 9.3654498 secs] 3040895K->2818529K(3040896K), [Metaspace: 96455K->96455K(1138688K)], 9.3656169 secs] [Times: user=8.96 sys=0.28, real=9.36 secs]
2022-07-20T20:09:08.522+0800: 1141.404: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2097151K(2097152K)] 2824277K(3040896K), 0.8004286 secs] [Times: user=0.79 sys=0.01, real=0.80 secs]
2022-07-20T20:09:09.323+0800: 1142.205: [CMS-concurrent-mark-start]
2022-07-20T20:09:13.013+0800: 1145.895: [Full GC (Allocation Failure) 2022-07-20T20:09:13.013+0800: 1145.895: [CMS2022-07-20T20:09:15.262+0800: 1148.144: [CMS-concurrent-mark: 3.445/5.939 secs] [Times: user=6.57 sys=0.17, real=5.94 secs]
 (concurrent mode failure)2022-07-20T20:09:17.980+0800: 1150.862: [SoftReference, 265 refs, 0.0000896 secs]2022-07-20T20:09:17.980+0800: 1150.862: [WeakReference, 810 refs, 0.0000974 secs]2022-07-20T20:09:17.980+0800: 1150.862: [FinalReference, 35 refs, 0.0000276 secs]2022-07-20T20:09:17.980+0800: 1150.862: [PhantomReference, 1 refs, 12 refs, 0.0000092 secs]2022-07-20T20:09:17.980+0800: 1150.862: [JNI Weak Reference, 0.0008680 secs]: 2097151K->2097151K(2097152K), 9.0346167 secs] 3040895K->2829316K(3040896K), [Metaspace: 96475K->96475K(1138688K)], 9.0347581 secs] [Times: user=8.81 sys=0.07, real=9.04 secs]
2022-07-20T20:09:24.470+0800: 1157.351: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2097151K(2097152K)] 2834568K(3040896K), 0.8374841 secs] [Times: user=0.81 sys=0.01, real=0.84 secs]
2022-07-20T20:09:25.307+0800: 1158.189: [CMS-concurrent-mark-start]
2022-07-20T20:09:28.899+0800: 1161.781: [Full GC (Allocation Failure) 2022-07-20T20:09:28.899+0800: 1161.781: [CMS2022-07-20T20:09:31.233+0800: 1164.115: [CMS-concurrent-mark: 3.391/5.926 secs] [Times: user=6.48 sys=0.17, real=5.92 secs]
 (concurrent mode failure)2022-07-20T20:09:34.241+0800: 1167.123: [SoftReference, 265 refs, 0.0000771 secs]2022-07-20T20:09:34.241+0800: 1167.123: [WeakReference, 809 refs, 0.0001021 secs]2022-07-20T20:09:34.241+0800: 1167.123: [FinalReference, 36 refs, 0.0000288 secs]2022-07-20T20:09:34.241+0800: 1167.123: [PhantomReference, 1 refs, 12 refs, 0.0000099 secs]2022-07-20T20:09:34.241+0800: 1167.123: [JNI Weak Reference, 0.0008873 secs]: 2097151K->2097151K(2097152K), 9.5285769 secs] 3040894K->2839927K(3040896K), [Metaspace: 96497K->96497K(1138688K)], 9.5287399 secs] [Times: user=9.08 sys=0.29, real=9.53 secs]
2022-07-20T20:09:40.918+0800: 1173.800: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2097151K(2097152K)] 2844533K(3040896K), 0.8230864 secs] [Times: user=0.81 sys=0.01, real=0.82 secs]
2022-07-20T20:09:41.741+0800: 1174.623: [CMS-concurrent-mark-start]
2022-07-20T20:09:45.400+0800: 1178.282: [Full GC (Allocation Failure) 2022-07-20T20:09:45.400+0800: 1178.282: [CMS2022-07-20T20:09:47.807+0800: 1180.688: [CMS-concurrent-mark: 3.420/6.066 secs] [Times: user=6.55 sys=0.25, real=6.07 secs]
 (concurrent mode failure)2022-07-20T20:09:50.648+0800: 1183.530: [SoftReference, 265 refs, 0.0000962 secs]2022-07-20T20:09:50.648+0800: 1183.530: [WeakReference, 809 refs, 0.0000959 secs]2022-07-20T20:09:50.649+0800: 1183.530: [FinalReference, 33 refs, 0.0000237 secs]2022-07-20T20:09:50.649+0800: 1183.530: [PhantomReference, 1 refs, 12 refs, 0.0000095 secs]2022-07-20T20:09:50.649+0800: 1183.530: [JNI Weak Reference, 0.0009076 secs]: 2097151K->2097151K(2097152K), 9.3567976 secs] 3040890K->2850005K(3040896K), [Metaspace: 96500K->96500K(1138688K)], 9.3569471 secs] [Times: user=9.01 sys=0.20, real=9.36 secs]
2022-07-20T20:09:57.381+0800: 1190.263: [GC (CMS Initial Mark) [1 CMS-initial-mark: 2097151K(2097152K)] 2854755K(3040896K), 0.8517348 secs] [Times: user=0.82 sys=0.00, real=0.85 secs]
2022-07-20T20:09:58.233+0800: 1191.115: [CMS-concurrent-mark-start]
2022-07-20T20:10:01.885+0800: 1194.766: [Full GC (Allocation Failure) 2022-07-20T20:10:01.885+0800: 1194.766: [CMS2022-07-20T20:10:04.235+0800: 1197.116: [CMS-concurrent-mark: 3.454/6.001 secs] [Times: user=6.51 sys=0.15, real=6.01 secs]
 (concurrent mode failure)2022-07-20T20:10:07.001+0800: 1199.883: [SoftReference, 265 refs, 0.0000871 secs]2022-07-20T20:10:07.001+0800: 1199.883: [WeakReference, 808 refs, 0.0000942 secs]2022-07-20T20:10:07.001+0800: 1199.883: [FinalReference, 33 refs, 0.0000272 secs]2022-07-20T20:10:07.001+0800: 1199.883: [PhantomReference, 1 refs, 12 refs, 0.0000095 secs]2022-07-20T20:10:07.001+0800: 1199.883: [JNI Weak Reference, 0.0010078 secs]