pyspark - sending HTTP requests from pyspark foreach/foreachPartition fails
Problem description
I am using urllib.request to send HTTP requests inside foreach/foreachPartition in pyspark. The following error is thrown:
objc[74094]: +[__NSPlaceholderDate initialize] may have been in progress in another thread when fork() was called. We cannot safely call it or ignore it in the fork() child process. Crashing instead. Set a breakpoint on objc_initializeAfterForkError to debug.
20/07/20 19:05:58 ERROR Executor: Exception in task 7.0 in stage 0.0 (TID 7)
org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:536)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator$$anonfun$1.applyOrElse(PythonRunner.scala:525)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:643)
at org.apache.spark.api.python.PythonRunner$$anon$3.read(PythonRunner.scala:621)
at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:456)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at org.apache.spark.InterruptibleIterator.foreach(InterruptibleIterator.scala:28)
at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
at org.apache.spark.InterruptibleIterator.to(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
at org.apache.spark.InterruptibleIterator.toBuffer(InterruptibleIterator.scala:28)
at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
at org.apache.spark.InterruptibleIterator.toArray(InterruptibleIterator.scala:28)
at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1004)
at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2133)
This happens when I call rdd.foreach(send_http) with rdd = sc.parallelize(["http://192.168.1.1:5000/index.html"]), where send_http is defined as follows:
import urllib.request

def send_http(url):
    req = urllib.request.Request(url)
    resp = urllib.request.urlopen(req)
Can anyone tell me what the problem is? Thanks.
Solution
This is @tmylt's answer from the comments above, but I can also confirm that using http.client instead of requests.get does work. I am sure there is a reason this happens, but using Python's http.client is a quick fix. (A likely explanation, based on the objc fork() message in the trace: on macOS, urllib's proxy auto-detection calls into the system's Objective-C frameworks, which is not safe in a forked Python worker; http.client issues the request directly and avoids that path.)
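A minimal sketch of the http.client workaround. The endpoint URL is taken from the question; the function name `send_http_partition`, the timeout, and the GET-only handling are my assumptions, not the answerer's exact code:

```python
import http.client
from urllib.parse import urlparse

def send_http_partition(urls):
    # Called once per partition: iterate the URLs in this partition and
    # issue a plain GET with http.client for each one.
    for url in urls:
        parsed = urlparse(url)
        conn = http.client.HTTPConnection(parsed.hostname, parsed.port or 80,
                                          timeout=10)  # timeout is an assumption
        try:
            conn.request("GET", parsed.path or "/")
            resp = conn.getresponse()
            resp.read()  # drain the response body before closing
        finally:
            conn.close()

# Driver side (assumes an existing SparkContext `sc`):
# rdd = sc.parallelize(["http://192.168.1.1:5000/index.html"])
# rdd.foreachPartition(send_http_partition)
```

Using foreachPartition rather than foreach also means the function is invoked once per partition instead of once per record, which is the usual pattern when each call carries connection overhead.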