How to locate GC/memory problems from spark.log

Problem description

I wrote code that reads from a large table and uses many withColumn calls to modify columns. I deploy the application in cluster mode on YARN; once one stage finishes, the job stalls and the computation for the next stage never starts.
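Roughly, the code follows this pattern (the table name, column names, and loop bound are placeholders, not my real code). Each withColumn wraps the previous plan in another projection, so the driver-side logical plan deepens with every call:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object ManyWithColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("many-withColumn").getOrCreate()

    // Placeholder source: a large table with a numeric column "value".
    var df = spark.read.table("some_big_table")

    // Each iteration stacks one more projection on top of the plan;
    // hundreds of these make analysis/optimization very expensive
    // on the driver.
    for (i <- 1 to 300) {
      df = df.withColumn(s"col_$i", col("value") * i)
    }

    df.write.mode("overwrite").saveAsTable("some_output_table")
  }
}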

After some reading, I understand that using withColumn many times can cause a serious memory problem, so I checked the driver's gc.log, which shows:

2019-11-26T17:31:51.105+0800: 537.017: [GC (Allocation Failure) [PSYoungGen: 3034528K->239328K(3514368K)] 18929536K->16368376K(20291584K), 0.3332585 secs] [Times: user=1.05 sys=0.24, real=0.33 secs] 
2019-11-26T17:31:51.438+0800: 537.350: [Full GC (Ergonomics) [PSYoungGen: 239328K->0K(3514368K)] [ParOldGen: 16129048K->15788587K(16777216K)] 16368376K->15788587K(20291584K), [Metaspace: 98385K->98385K(100352K)], 102.4043873 secs] [Times: user=398.46 sys=4.14, real=102.41 secs] 
2019-11-26T17:33:47.348+0800: 653.260: [GC (Allocation Failure) [PSYoungGen: 2824192K->241022K(3504128K)] 18612779K->16029609K(20281344K), 0.2878782 secs] [Times: user=0.79 sys=0.31, real=0.29 secs] 
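(For reference: in cluster mode the driver JVM's options must be set at submit time, so a driver gc.log like the one above is typically enabled through spark.driver.extraJavaOptions; a minimal sketch, assuming JDK 8 GC-logging flags:)

spark-submit --master yarn --deploy-mode cluster \
  --conf "spark.driver.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCTimeStamps -Xloggc:gc.log" \
  <rest of the submit command>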

Clearly memory is exhausted: a Full GC occurred that took 102 seconds, and it freed almost nothing; the old generation only dropped from 16129048K to 15788587K out of 16777216K, roughly 330 MB reclaimed from a 16 GB region, so live objects nearly fill the old generation. When I replaced withColumn with createDataFrame, my application ran smoothly.
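In sketch form, the general shape of this createDataFrame workaround is the following (placeholder names, not my exact code). Rebuilding the DataFrame from its RDD and schema keeps the data but discards the deep logical plan accumulated by the withColumn chain:

import org.apache.spark.sql.{DataFrame, SparkSession}

// Rebuild the DataFrame from its RDD and schema: same rows, same
// schema, but a fresh logical plan, so the optimizer no longer walks
// one projection node per withColumn call.
def truncatePlan(spark: SparkSession, df: DataFrame): DataFrame =
  spark.createDataFrame(df.rdd, df.schema)

As I understand it, df.checkpoint() achieves a similar plan truncation, at the cost of materializing the data to the checkpoint directory.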

Although the problem has been identified and resolved, I want to pinpoint it more precisely, so that the next time someone hits this I can recognize it from patterns in spark.log and gc.log without needing access to his or her source code. However, when I check spark.log, I cannot find anything directly related to a memory-allocation problem:

19/11/26 17:31:43 DEBUG Client: IPC Client (xxx) connection to urlA: closed
19/11/26 17:31:43 DEBUG Client: IPC Client (xxx) connection to urlA: stopped, remaining connections 1
19/11/26 17:31:45 DEBUG ApplicationMaster: Number of pending allocations is 0. Slept for 3000/3000.
19/11/26 17:31:45 DEBUG ApplicationMaster: Sending progress
19/11/26 17:31:45 DEBUG YarnAllocator: Updating resource requests, target: 1, pending: 0, running: 1, executorsStarting: 0
19/11/26 17:31:45 TRACE ProtobufRpcEngine: 148: Call -> urlB: allocate {blacklist_request { } response_id: 218 progress: 0.1}
19/11/26 17:31:45 DEBUG Client: IPC Client (xxx) connection to urlB sending #646
19/11/26 17:31:45 DEBUG Client: IPC Client (xxx) connection to urlB got value #646
19/11/26 17:31:45 DEBUG ProtobufRpcEngine: Call: allocate took 1ms
19/11/26 17:31:45 TRACE ProtobufRpcEngine: 148: Response <- urlB: allocate {response_id: 219 limit { } num_cluster_nodes: 960 preempt { strictContract { } }}
19/11/26 17:31:48 DEBUG ApplicationMaster: Number of pending allocations is 0. Slept for 3000/3000.
19/11/26 17:31:48 DEBUG ApplicationMaster: Sending progress
19/11/26 17:31:48 DEBUG YarnAllocator: Updating resource requests, target: 1, pending: 0, running: 1, executorsStarting: 0
19/11/26 17:31:48 TRACE ProtobufRpcEngine: 148: Call -> urlB: allocate {blacklist_request { } response_id: 219 progress: 0.1}
19/11/26 17:31:48 DEBUG Client: IPC Client (xxx) connection to urlB sending #647
19/11/26 17:31:48 DEBUG Client: IPC Client (xxx) connection to urlB got value #647
19/11/26 17:31:48 DEBUG ProtobufRpcEngine: Call: allocate took 1ms
19/11/26 17:31:48 TRACE ProtobufRpcEngine: 148: Response <- urlB: allocate {response_id: 220 limit { } num_cluster_nodes: 960 preempt { strictContract { } }}
19/11/26 17:33:33 DEBUG Client: IPC Client (xxx) connection to urlB : closed
19/11/26 17:33:33 DEBUG DFSClient: DataStreamer block BlockA sending packet packet seqno:-1 offsetInBlock:0 lastPacketInBlock:false lastByteOffsetInBlock: 0
19/11/26 17:33:33 DEBUG Client: IPC Client (xxx) connection to urlB : stopped, remaining connections 0

My question is: given that I see a GC problem in the driver's gc.log, can spark.log be used to locate the problem more precisely?
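For what it's worth, the one indirect pattern visible in the excerpt above is that spark.log falls silent between 17:31:48 and 17:33:33, about 105 seconds, which lines up with the 102-second Full GC: a driver frozen in a stop-the-world collection logs nothing. A hypothetical sketch that scans for such gaps (the timestamp format and the 30-second threshold are my assumptions):

import java.time.LocalTime
import java.time.temporal.ChronoUnit
import scala.io.Source

// Hypothetical scanner: flag long silent gaps between consecutive
// spark.log entries. A gap close to a Full GC's "real" time is a
// useful fingerprint even without any explicit memory message.
object LogGapScanner {
  // Assumes the "19/11/26 17:31:45" prefix format seen above.
  private val Ts = """^\d{2}/\d{2}/\d{2} (\d{2}:\d{2}:\d{2})""".r.unanchored

  def main(args: Array[String]): Unit = {
    val threshold = 30L // assumed threshold, in seconds
    val times = Source.fromFile(args(0)).getLines()
      .collect { case Ts(t) => LocalTime.parse(t) }
      .toList
    times.zip(times.drop(1)).foreach { case (a, b) =>
      val gap = ChronoUnit.SECONDS.between(a, b) // ignores day rollover
      if (gap >= threshold) println(s"silent gap of ${gap}s: $a -> $b")
    }
  }
}

Run against the spark.log excerpt above, this would flag the 17:31:48 -> 17:33:33 gap.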

Tags: apache-spark, apache-spark-sql, hadoop-yarn
