apache-spark - Unable to apply the gpfdist protocol when using Spark
Problem description
I am trying to read data from Greenplum into HDFS using Spark. For this I am using the jar file greenplum-spark_2.11-1.6.0.jar
The spark.read call looks like this:
import org.apache.spark.sql.functions.{col, lit}

// splitSeq is defined elsewhere in the job; it is assumed here to be a Seq[String] of column names.
val yearDF = spark.read
  .format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
  .option("url", "jdbc:postgresql://1.2.3.166:5432/finance?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory")
  .option("server.port", "8020")
  .option("dbtable", "tablename")
  .option("dbschema", "schema")
  .option("user", "123415")
  .option("password", "etl_123")
  .option("partitionColumn", "je_id")
  .option("partitions", 3)
  .load()
  .where("period_year=2017 and period_num=12 and source_system_name='SSS'")
  .select(splitSeq.map(col): _*)
  .withColumn("flagCol", lit(0))

// Write the filtered rows to HDFS as CSV.
yearDF.write.format("csv").save("hdfs://dev/apps/hive/warehouse/header_test_data/")
When I run the code above, I get the following exception:
Exception in thread "qtp1438055710-505" java.lang.OutOfMemoryError: GC overhead limit exceeded
19/03/05 12:29:08 WARN QueuedThreadPool:
java.lang.OutOfMemoryError: GC overhead limit exceeded
19/03/05 12:29:08 WARN QueuedThreadPool: Unexpected thread death: org.eclipse.jetty.util.thread.QueuedThreadPool$3@16273740 in qtp1438055710{STARTED,8<=103<=200,i=19,q=0}
19/03/05 12:36:03 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 8)
org.postgresql.util.PSQLException: ERROR: error when writing data to gpfdist http://1.2.3.8:8020/spark_6ca7d983d07129f2_db5510e67a8a6f78_driver_370, quit after 2 tries (url_curl.c:584) (seg7 ip-1-3-3-196.ec2.internal:40003 pid=4062) (cdbdisp.c:1322)
at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2310)
at org.postgresql.core.v3.QueryExecutorImpl.processResults(QueryExecutorImpl.java:2023)
at org.postgresql.core.v3.QueryExecutorImpl.execute(QueryExecutorImpl.java:217)
at org.postgresql.jdbc.PgStatement.execute(PgStatement.java:421)
at org.postgresql.jdbc.PgStatement.executeWithFlags(PgStatement.java:318)
at org.postgresql.jdbc.PgStatement.executeUpdate(PgStatement.java:294)
at com.zaxxer.hikari.pool.ProxyStatement.executeUpdate(ProxyStatement.java:120)
at com.zaxxer.hikari.pool.HikariProxyStatement.executeUpdate(HikariProxyStatement.java)
at io.pivotal.greenplum.spark.jdbc.Jdbc$$anonfun$2.apply(Jdbc.scala:81)
at io.pivotal.greenplum.spark.jdbc.Jdbc$$anonfun$2.apply(Jdbc.scala:79)
at resource.AbstractManagedResource$$anonfun$5.apply(AbstractManagedResource.scala:88)
at scala.util.control.Exception$Catch$$anonfun$either$1.apply(Exception.scala:125)
at scala.util.control.Exception$Catch$$anonfun$either$1.apply(Exception.scala:125)
at scala.util.control.Exception$Catch.apply(Exception.scala:103)
at scala.util.control.Exception$Catch.either(Exception.scala:125)
at resource.AbstractManagedResource.acquireFor(AbstractManagedResource.scala:88)
at resource.ManagedResourceOperations$class.apply(ManagedResourceOperations.scala:26)
at resource.AbstractManagedResource.apply(AbstractManagedResource.scala:50)
at resource.DeferredExtractableManagedResource$$anonfun$tried$1.apply(AbstractManagedResource.scala:33)
at scala.util.Try$.apply(Try.scala:192)
at resource.DeferredExtractableManagedResource.tried(AbstractManagedResource.scala:33)
at io.pivotal.greenplum.spark.jdbc.Jdbc$.copyTable(Jdbc.scala:83)
at io.pivotal.greenplum.spark.externaltable.GreenplumRowIterator.liftedTree1$1(GreenplumRowIterator.scala:105)
at io.pivotal.greenplum.spark.externaltable.GreenplumRowIterator.<init>(GreenplumRowIterator.scala:104)
at io.pivotal.greenplum.spark.GreenplumRDD.compute(GreenplumRDD.scala:49)
I followed the steps described in the official documentation.
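Earlier I used the jar greenplum.jar, which worked fine but was slow because it pulls the data through the GP master. The jar greenplum-spark_2.11-1.6.0.jar is a connector jar that uses the gpfdist protocol to pull the data to HDFS.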
The IP address in the exception message also changes. You can see that 1.2.3.166:5432 becomes 1.2.3.8:8020, and then seg7 ip-1-3-3-196.ec2.internal:40003 pid=4062.
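With the same number of executors and the same executor memory, I can read the data using greenplum.jar. But keeping everything else unchanged and only switching the jar to greenplum-spark_2.11-1.6.0.jar, I run into this exception. I have been trying to resolve it, but I do not understand this behavior at all. Can anyone tell me how to fix it?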
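Solution
Can you increase the number of partitions? Depending on the size of the table, you may need more of them. Try raising the partition count to 30 and see whether you still run out of memory.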
val yearDF = spark.read
  .format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
  .option("url", "jdbc:postgresql://1.2.3.166:5432/finance?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory")
  .option("server.port", "8020")
  .option("dbtable", "tablename")
  .option("dbschema", "schema")
  .option("user", "123415")
  .option("password", "etl_123")
  .option("partitionColumn", "je_id")
  .option("partitions", 30)
  .load()
  .where("period_year=2017 and period_num=12 and source_system_name='SSS'")
  .select(splitSeq map col:_*)
  .withColumn("flagCol", lit(0))

yearDF.write.format("csv").save("hdfs://dev/apps/hive/warehouse/header_test_data/")
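If you want to experiment with different partition counts without editing the read call each time, something like the following might help. It is only a sketch: the object name GreenplumReadOptions, the helper partitionsFor, and the idea of targeting a fixed number of rows per partition are my own assumptions, not something the connector provides. The option keys and values are the ones from the snippet above.

```scala
object GreenplumReadOptions {

  // Hypothetical heuristic (not part of the connector): aim for roughly
  // `targetRowsPerPartition` rows in each partition, never fewer than
  // `minPartitions` partitions in total.
  def partitionsFor(estimatedRows: Long,
                    targetRowsPerPartition: Long,
                    minPartitions: Int = 1): Int = {
    require(estimatedRows >= 0, "estimatedRows must be non-negative")
    require(targetRowsPerPartition > 0, "targetRowsPerPartition must be positive")
    // Ceiling division, so a partial last chunk still gets its own partition.
    val needed = (estimatedRows + targetRowsPerPartition - 1) / targetRowsPerPartition
    // Clamp to the Int range before converting.
    math.min(Int.MaxValue.toLong, math.max(minPartitions.toLong, needed)).toInt
  }

  // Same option keys and values as in the question; only "partitions" varies.
  def readerOptions(partitions: Int): Map[String, String] = Map(
    "url" -> "jdbc:postgresql://1.2.3.166:5432/finance?ssl=true&sslfactory=org.postgresql.ssl.NonValidatingFactory",
    "server.port" -> "8020",
    "dbtable" -> "tablename",
    "dbschema" -> "schema",
    "user" -> "123415",
    "password" -> "etl_123",
    "partitionColumn" -> "je_id",
    "partitions" -> partitions.toString
  )
}

// Possible usage inside the Spark job (needs a SparkSession named `spark`):
//   val n = GreenplumReadOptions.partitionsFor(estimatedRows = 30000000L, targetRowsPerPartition = 1000000L)
//   val df = spark.read
//     .format("io.pivotal.greenplum.spark.GreenplumRelationProvider")
//     .options(GreenplumReadOptions.readerOptions(n))
//     .load()
```

The row estimate and the rows-per-partition target are values you would have to pick for your own table. The point is only that each partition should be small enough to fit comfortably in executor memory.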