Adding a py file to Spark Scala

Problem description

I am trying to execute a Python script from a Scala Spark job (Spark 2.3), like this:

val pyScript = "wasb://scripts@myAccount.blob.core.windows.net/print.py"
val pyScriptName = "print.py"
// Ship the script to every node of the job
spark.sparkContext.addFile(pyScript)

val ipData = spark.sparkContext.parallelize(List("abc", "def", "ghi"))
// Pipe each element through the script and read back its stdout
val opData = ipData.pipe(org.apache.spark.SparkFiles.get(pyScriptName))
opData.foreach(println)
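
For context, RDD.pipe forks the given command once per partition, writes the partition's elements to the process's stdin (one per line), and returns the process's stdout lines as a new RDD. The command string itself, however, is built on the driver, so SparkFiles.get(pyScriptName) is resolved there before anything runs on the executors. A minimal sketch that makes this driver-side resolution explicit (it reuses pyScriptName and ipData from above; the path in the comment is only illustrative):

import org.apache.spark.SparkFiles

// Evaluated on the driver: returns a path under the driver's local
// spark-.../userFiles-... temp directory (the path seen in the exception below)
val driverSidePath: String = SparkFiles.get(pyScriptName)

// The already-resolved string is what gets serialized into the piped RDD,
// so every executor tries to launch exactly this driver-local path.
val opData2 = ipData.pipe(driverSidePath)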

But I get the following exception. Any idea what might be wrong?

org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 2.0 failed 4 times, most recent failure: Lost task 2.3 in stage 2.0 (TID 141, wn4-novakv.oetevw42cdoe3jls1dzdeclktg.ex.internal.cloudapp.net, executor 2): java.io.IOException: Cannot run program "/mnt/resource/hadoop/yarn/local/usercache/livy/appcache/application_1539905356890_0003/spark-5da59a08-a6a4-443d-b3b1-c31643e195c5/userFiles-e3ca8a1b-44f6-4804-9c95-625b1742fb77/print.py": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:111)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.io.IOException: error=2, No such file or directory

Tags: scala, apache-spark

Solution
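
Judging from the path in the stack trace, one likely cause is that SparkFiles.get is evaluated on the driver when the pipe command is built: the resulting command points at the driver's userFiles-... directory, which does not exist on the executor nodes (hence error=2, No such file or directory). A commonly suggested workaround, sketched below under the assumption that files added with addFile are fetched into each YARN executor's working directory and that a python binary is on every worker's PATH, is to refer to the script by a relative name and launch it through python explicitly:

val pyScript = "wasb://scripts@myAccount.blob.core.windows.net/print.py"
val pyScriptName = "print.py"
spark.sparkContext.addFile(pyScript)

val ipData = spark.sparkContext.parallelize(List("abc", "def", "ghi"))

// Use the bare file name: on the executors the script is fetched into the
// task's working directory, so a relative path resolves there. Prefixing
// with "python" avoids needing an executable bit or shebang on print.py.
val opData = ipData.pipe(s"python ./$pyScriptName")

// Collect to the driver so the script's output is visible in the driver log.
opData.collect().foreach(println)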

