pySpark take() returns an error: TypeError: 'int' object is not iterable

Problem description

I am trying to learn pySpark. I run the following commands:

from pyspark.sql import SparkSession

spark = SparkSession.builder \
   .master("local") \
   .appName("Linear Regression Model - 2") \
   .config("spark.executor.memory", "1gb") \
   .getOrCreate()

sc = spark.sparkContext

print(sc) 

That works fine and returns:

SparkContext master=local appName=Linear Regression Model - 2

Then I read a text file and display its contents:

header = sc.textFile('cal_housing.domain')
header.collect()

I got the input file from here: link

It works, with the following output:

['longitude: continuous.', 'latitude: continuous.', 'housingMedianAge: continuous.', 'totalRooms: continuous.', 'totalBedrooms: continuous.', 'population: continuous.', 'households: continuous.', 'medianIncome: continuous.', 'medianHouseValue: continuous.']

As far as I know, if our data is large we should use take() instead of collect(). But when I use the take() function, it returns an error:

header.take(2)

I am trying to figure out why it returns TypeError: 'int' object is not iterable, but I can't. As far as I understand, header is an RDD object containing a list of header lines, so it should be iterable, not an int.
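
As an aside (a minimal isolation sketch, not part of the original setup: the tiny RDD below is made up for the test), one way to narrow this down is to call take() on an RDD that does not come from the file at all. If even this trivial case raises the same TypeError, the problem lies in the environment that pickles the Python function for the workers, not in cal_housing.domain:

tiny = sc.parallelize(['a', 'b', 'c'])   # small in-memory RDD, independent of any file

print(tiny.take(2))    # expected: ['a', 'b']
print(header.take(2))  # the failing call from above, for comparison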

How can I fix this error?

Error message:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 1 times, most recent failure: Lost task 0.0 in stage 3.0 (TID 3, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 167, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 56, in read_command
    command = serializer._read_with_length(file)
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py",

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:194)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:235)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:153)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1533)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1521)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1520)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1520)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1748)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1703)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1692)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
	at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:463)
	at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
	at sun.reflect.GeneratedMethodAccessor54.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 167, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 56, in read_command
    command = serializer._read_with_length(file)
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 455, in loads
    return pickle.loads(obj, encoding=encoding)
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 784, in _make_skel_func
    closure = _reconstruct_closure(closures) if closures else None
  File "/usr/local/spark-2.2.2-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/cloudpickle.py", line 776, in _reconstruct_closure
    return tuple([_make_cell(v) for v in values])
TypeError: 'int' object is not iterable

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:194)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:235)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:153)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:64)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:109)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more

Tags: apache-spark, pyspark

Solution
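
The traceback ends inside pyspark/cloudpickle.py (_make_skel_func / _reconstruct_closure), i.e. the worker crashes while unpickling the Python command it was sent, before it ever touches the data. Failures at that point are typically reported when the driver and the workers run different Python interpreters, or when a pip-installed pyspark does not match the local Spark 2.2.2 installation. Below is a minimal diagnostic sketch under that assumption; the interpreter path is an example, not taken from the question:

import os
import sys

print(sys.version)   # Python actually running the driver
print(sc.pythonVer)  # Python major.minor version PySpark reports for this context

# If the two major.minor versions disagree, point both sides at the same
# interpreter *before* the SparkSession/SparkContext is created
# (or export these variables in the shell):
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3'         # example path, adjust
os.environ['PYSPARK_DRIVER_PYTHON'] = '/usr/bin/python3'  # example path, adjust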

