Scala NullPointerException while creating a DataFrame

Problem description

I am trying to read files from a location and load them into a Spark DataFrame. The code below works fine:

val tempDF: DataFrame = spark.read.orc(targetDirectory)
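
For reference, the schema that Spark infers from the ORC files can be printed and later compared with a hand-written one; a small sketch, assuming tempDF is the DataFrame read above:

// Print the schema Spark infers from the ORC footers,
// useful for comparing against a hand-written StructType later.
tempDF.printSchema()

// The same information as StructField objects (name, type, nullability).
tempDF.schema.fields.foreach(f => println(s"${f.name}: ${f.dataType}, nullable=${f.nullable}"))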

When I try to provide a schema for the same data, the code fails with the following error:

Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 3, brdn6136.target.com, executor 25): java.lang.NullPointerException
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnVector.getDouble(OrcColumnVector.java:152)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$10$$anon$1.hasNext(WholeStageCodegenExec.scala:614)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:253)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:247)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:836)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Here is the code I am using:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.types._

// Explicit schema for the ORC data; all columns are nullable.
val schema = StructType(
  List(
    StructField("Col1", DoubleType, true),
    StructField("Col2", StringType, true),
    StructField("Col3", DoubleType, true),
    StructField("Col4", DoubleType, true),
    StructField("Col5", DoubleType, true),
    StructField("Col6", StringType, true),
    StructField("Col7", StringType, true),
    StructField("Col8", StringType, true),
    StructField("Col9", StringType, true),
    StructField("Col10", StringType, true),
    StructField("Col11", StringType, true),
    StructField("Col12", StringType, true)
  )
)

val df: DataFrame = spark.read.format("orc")
  .schema(schema)
  .load(targetReadDirectory)
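
Since the stack trace fails inside OrcColumnVector.getDouble, one thing worth checking is whether the hand-written field names and types actually match what is stored in the ORC files; a mismatch can leave a column the reader cannot populate. A small diagnostic sketch, assuming targetReadDirectory is the same path as above:

// Compare the schema stored in the ORC files with the supplied one.
// Differences in the names or types of the Double columns are a likely
// place to look for the source of the NullPointerException.
val inferredSchema = spark.read.orc(targetReadDirectory).schema
println(inferredSchema.treeString)  // schema read from the ORC footers
println(schema.treeString)          // hand-written schema defined above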

Can anyone help me resolve this issue?

Tags: scala, apache-spark, orc

Solution
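
Two workarounds that are commonly tried for this kind of ORC NullPointerException, sketched here under the assumption that the failure comes from the native ORC reader combined with the user-supplied schema; neither is a confirmed fix for this exact dataset:

// Sketch 1: let Spark infer the schema from the ORC files and cast the
// columns afterwards, instead of pushing a hand-written schema into the reader.
// Assumes the column names in the files match the names used in `schema`.
import org.apache.spark.sql.functions.col

val inferredDF = spark.read.orc(targetReadDirectory)
val castedDF = schema.fields.foldLeft(inferredDF) { (df, field) =>
  df.withColumn(field.name, col(field.name).cast(field.dataType))
}

// Sketch 2: fall back to the Hive-based ORC reader instead of the native one
// (spark.sql.orc.impl defaults to "native" in Spark 2.4+).
spark.conf.set("spark.sql.orc.impl", "hive")
val hiveReaderDF = spark.read.schema(schema).orc(targetReadDirectory)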

