首页 > 解决方案 > GC CMS Final Remark - 耗时太长

问题描述

我们有..a GC..issue,VM 将在哪里运行非常长的 GC(1 秒以上..我们的 SLA 要求 <200ms)。

(这是在 Java 8 上,我们从版本 11 低谷到最新都看到了这一点)。

我们的 GC 设置 atm:

-XX:+PrintStringTableStatistics
-XX:+PrintAdaptiveSizePolicy
-XX:+ParallelRefProcEnabled
-XX:+CMSIncrementalMode
-XX:+CMSIncrementalPacing
-XX:+UseThreadPriorities
-XX:ThreadPriorityPolicy=42
-Xms3000M
-Xmx3000M
-Xmn1500M
-XX:+HeapDumpOnOutOfMemoryError
-Xss256k
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:+CMSParallelRemarkEnabled
-XX:+ParallelRefProcEnabled
-XX:SurvivorRatio=8
-XX:MaxTenuringThreshold=1
-XX:CMSInitiatingOccupancyFraction=75
-XX:+UseCMSInitiatingOccupancyOnly
-XX:+UseTLAB
-XX:+PerfDisableSharedMem
-XX:CompileCommandFile=./../conf/hotspot_compiler
-XX:CMSWaitDuration=10000
-XX:+CMSParallelInitialMarkEnabled
-XX:+CMSEdenChunksRecordAlways
-XX:CMSWaitDuration=10000
-XX:+UseCondCardMark
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
-Xloggc:/opt/openmarket/sqs/logs/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=5
-XX:GCLogFileSize=100M

我们尝试过 G1,但情况更糟,所以我们“坚持”使用 CMS atm。

无论如何,这是我们 GC 日志的片段。

应该注意的是,在大约 24 小时的运行中,它执行了大约 4 个,其中 2 个......是“长”的......(事实上,我们在过去的 prod 中看到了更长的时间,超过 12 秒!!)

我不明白的一件事是为什么我们看到 -> 擦洗字符串表,0.6001941 秒花了这么长时间,我们没有“实习”很多我们的字符串,启用 -XX:+PrintStringTableStatistics 没有显示有趣的活动,从我们可以看出,我们总共只有不到 1000 个实习字符串,所以这不应该是因素。

gc 日志的高级摘要可以在这里看到:

http://gceasy.io/my-gc-report.jsp?p=c2hhcmVkLzIwMTgvMDcvMjYvLS1ub2RlLTAwMS4zLS0xNS0yMi0zNA==

2018-07-25T18:15:52.608+0000: 30.543: Total time for which application threads were stopped: 0.0046391 seconds
 CMS: abort preclean due to time 2018-07-25T18:15:53.553+0000: 31.488: [CMS-concurrent-abortable-preclean: 0.326/20.683 secs] [Times: user=25.25 sys=3.89, real=20.68 secs] 
2018-07-25T18:15:53.557+0000: 31.492: [GC (CMS Final Remark) [YG occupancy: 588025 K (1382400 K)]31.492: [Rescan (parallel) , 0.0559721 secs]31.548: [weak refs processing, 0.0035595 secs]31.552: [class unloading, 0.0409634 secs]31.593: [scrub symbol table, 0.0051572 secs]31.598: [scrub string table, 0.0016212 secs][1 CMS-remark: 12246K(1536000K)] 600272K(2918400K), 0.1150145 secs] [Times: user=0.34 sys=0.00, real=0.11 secs] 
2018-07-25T18:15:53.673+0000: 31.608: Total time for which application threads were stopped: 0.1195510 seconds
2018-07-25T18:15:53.673+0000: 31.608: [CMS-concurrent-sweep-start]
--
2018-07-25T18:18:45.477+0000: 203.412: Total time for which application threads were stopped: 0.0044451 seconds
 CMS: abort preclean due to time 2018-07-25T18:18:51.809+0000: 209.743: [CMS-concurrent-abortable-preclean: 3.023/67.120 secs] [Times: user=18.52 sys=6.47, real=67.11 secs] 
2018-07-25T18:18:51.812+0000: 209.747: [GC (CMS Final Remark) [YG occupancy: 595711 K (1382400 K)]209.747: [Rescan (parallel) , 0.1225090 secs]209.870: [weak refs processing, 0.0048343 secs]209.875: [class unloading, 0.0408876 secs]209.916: [scrub symbol table, 0.0047264 secs]209.920: [scrub string table, 0.0042791 secs][1 CMS-remark: 54877K(1536000K)] 650589K(2918400K), 0.1843075 secs] [Times: user=0.54 sys=0.00, real=0.18 secs] 
2018-07-25T18:18:51.997+0000: 209.932: Total time for which application threads were stopped: 0.1885355 seconds
2018-07-25T18:18:52.002+0000: 209.937: Total time for which application threads were stopped: 0.0041075 seconds
--
2018-07-26T00:06:31.084+0000: 21069.019: Total time for which application threads were stopped: 0.0033592 seconds
 CMS: abort preclean due to time 2018-07-26T00:06:56.329+0000: 21094.264: [CMS-concurrent-abortable-preclean: 3.400/86.919 secs] [Times: user=15.48 sys=7.48, real=86.91 secs] 
2018-07-26T00:06:56.339+0000: 21094.274: [GC (CMS Final Remark) [YG occupancy: 604677 K (1382400 K)]21094.274: [Rescan (parallel) , 0.1252722 secs]21094.399: [weak refs processing, 0.0084571 secs]21094.408: [class unloading, 0.1359008 secs]21094.544: [scrub symbol table, 0.0037596 secs]21094.548: [scrub string table, 0.5171769 secs][1 CMS-remark: 1274484K(1536000K)] 1879161K(2918400K), 0.7978841 secs] [Times: user=1.13 sys=0.00, real=0.80 secs] 
2018-07-26T00:06:57.138+0000: 21095.072: Total time for which application threads were stopped: 0.8018055 seconds
2018-07-26T00:06:57.138+0000: 21095.073: [CMS-concurrent-sweep-start]
--
2018-07-26T09:05:41.291+0000: 53419.226: Total time for which application threads were stopped: 0.0053130 seconds
 CMS: abort preclean due to time 2018-07-26T09:06:08.087+0000: 53446.021: [CMS-concurrent-abortable-preclean: 2.304/99.556 secs] [Times: user=15.49 sys=8.46, real=99.54 secs] 
2018-07-26T09:06:08.090+0000: 53446.025: [GC (CMS Final Remark) [YG occupancy: 605003 K (1382400 K)]53446.025: [Rescan (parallel) , 0.1410524 secs]53446.166: [weak refs processing, 0.0134749 secs]53446.180: [class unloading, 1.0563894 secs]53447.236: [scrub symbol table, 0.0137832 secs]53447.250: [scrub string table, 0.6001941 secs][1 CMS-remark: 1431566K(1536000K)] 2036569K(2918400K), 1.8335188 secs] [Times: user=1.28 sys=0.00, real=1.83 secs] 
2018-07-26T09:06:09.924+0000: 53447.859: Total time for which application threads were stopped: 1.8372996 seconds

更新:这里是不同运行的结果,标志更少(但基本上相同或更糟)(在 A 侧节点上..我们也一直在使用 parallelOld 进行测试,看起来很有希望)

旗帜:

-XX:+CMSClassUnloadingEnabled
-Xms3000M
-Xmx3000M
-Xmn1500M
-XX:+HeapDumpOnOutOfMemoryError
-Xss256k
-XX:+UseParNewGC
-XX:+UseConcMarkSweepGC
-XX:MaxTenuringThreshold=10
-XX:TargetSurvivorRatio=90
-XX:+UseTLAB
-XX:+UseCondCardMark
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintTenuringDistribution
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintPromotionFailure
-Xloggc:/opt/openmarket/sqs/logs/gc.log
-XX:+UseGCLogFileRotation
-XX:NumberOfGCLogFiles=5
-XX:GCLogFileSize=100M

GC 日志片段:

018-07-29T09:22:01.752+0000:38922.964:[CMS-concurrent-abortable-preclean-start] CMS:由于时间 2018-07-29T09:22:06.888+0000:38928.100:[CMS-concurrent] 中止预清理-abortable-preclean: 5.118/5.136 secs] [Times: user=5.09 sys=0.10, real=5.14 secs] 2018-07-29T09:22:06.890+0000: 38928.102: [GC (CMS Final Remark) [YG 占用: 184100 K (1382400 K)]2018-07-29T09:22:06.890+0000:38928.102:[重新扫描(并行),0.0609614 秒]2018-07-29T09:22:06.951+0000:38928.163:[6s28.163:600000000000000000000 secs]2018-07-29T09:22:06.954+0000:38928.166:[类卸载,0.0369107 secs]2018-07-29T09:22:06.991+0000:38928.203:[scrub 符号表,0.0024052 secs0.0024052 secs]9018-2 :22:06.993+0000: 38928.205: [scrub string table, 1.0810044 secs][1 CMS-remark: 1414399K(1536000K)] 1598500K(2918400K), 1.1918919 secs] [Times: user=1.17 sys=0.0019 秒] 2018-07-29T09:22:08.082+0000: 38929.294:应用程序线程停止的总时间:1.1943563 秒,停止线程花费:0.0001797 秒

2018-07-29T18:52:55.303+0000:73176.515:[CMS-concurrent-abortable-preclean-start] CMS:由于时间原因中止预清理 2018-07-29T18:53:00.352+0000:73181.564:[CMS-并发-abortable-preclean: 5.029/5.049 secs] [Times: user=5.83 sys=0.31, real=5.05 secs] 2018-07-29T18:53:00.355+0000: 73181.567: [GC (CMS Final Remark) [YG 占用: 185778 K (1382400 K)]2018-07-29T18:53:00.355+0000:73181.567:[重新扫描(并行),0.0650261 秒]2018-07-29T18:53:00.420+0000:73181.632:[5sref51.632:[弱处理secs]2018-07-29T18:53:00.423+0000:73181.635:[类卸载,0.0257172 secs]2018-07-29T18:53:00.448+0000:73181.660:[scrub 符号表,0.0020711 secs18-29]2018-29 :53:00.451+0000: 73181.663: [scrub string table, 0.9664343 secs][1 CMS-remark: 1414468K(1536000K)] 1600246K(2918400K), 1.0629217 secs] [Times: user=1.11 sys=0.0006 秒] 2018-07-29T18:53:01.418+0000: 73182.630: 应用程序线程停止的总时间:1.0657259 秒,停止线程花费:0.0002660 秒

2018-07-30T12:30:12.498+0000:136613.710:[CMS-concurrent-abortable-preclean-start] CMS:由于时间原因中止预清理 2018-07-30T12:30:17.698+0000:136618.910:[CMS-concurrent -abortable-preclean: 5.185/5.200 secs] [Times: user=5.72 sys=0.20, real=5.20 secs] 2018-07-30T12:30:17.701+0000: 136618.913: [GC (CMS Final Remark) [YG occupancy: 179068 K (1382400 K)]2018-07-30T12:30:17.701+0000:136618.913:[重新扫描(并行),0.0632593 秒]2018-07-30T12:30:17.764+0000:136618.976:[微处理]秒]2018-07-30T12:30:17.767+0000:136618.979:[类卸载,0.1576453 秒]2018-07-30T12:30:17.924+0000:136619.136:[擦洗符号表,0.008-07-250 秒]2 :30:17.929+0000: 136619.141: [scrub string table, 1.0465825 secs][1 CMS-remark: 1414152K(1536000K)] 1593220K(2918400K), 1.2744338 secs] [Times: user=1.21 sys=0.28 秒] 2018-07-30T12:30:18.976+0000: 136620.188:应用程序线程停止的总时间:1.2772222 秒,停止线程花费:0.0001826 秒

标签: javagarbage-collection

解决方案


推荐阅读