java - Spark Multiple sources found for text
Problem Description
I have a Java program that works well when I run it locally within IntelliJ IDEA.
After compiling the program into a jar file, running it as
java -cp jarFileName.jar com.pathToclass.ClassName inputArguments
also works well.
However, when I run it as
spark-submit --master local[4] --class com.pathToclass.ClassName jarFileName.jar inputArguments
I get the following error when the code reaches the read.textFile call.
The code is as follows:
// Read the files under inputPath as lines of text
DataFrameReader read = spark.read();
JavaRDD<String> stringJavaRDD = read.textFile(inputPath).javaRDD();
The inputPath contains some CSV files. The error message when running with spark-submit
is as follows:
org.apache.spark.sql.AnalysisException: Multiple sources found for text (org.apache.spark.sql.execution.datasources.v2.text.TextDataSourceV2, org.apache.spark.sql.execution.datasources.text.TextFileFormat), please specify the fully qualified class name.;
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:707)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:733)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:248)
at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:843)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:880)
at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:852)
at com.three2three.bigfoot.vola.NormalizeSnapshotSigmaAxisImpliedVola.main(NormalizeSnapshotSigmaAxisImpliedVola.java:306)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
I tried to debug locally within IntelliJ IDEA; when running in IDEA, the data source resolves to org.apache.spark.sql.execution.datasources.text.TextFileFormat.
It seems that when running via spark-submit,
this.source()
is text
, and the Scala lookup code finds two data sources:
org.apache.spark.sql.execution.datasources.v2.text.TextDataSourceV2
org.apache.spark.sql.execution.datasources.text.TextFileFormat
Why does this happen? Why does the code fail only when run via spark-submit, and succeed in the other modes? How can I fix the error for spark-submit?
I tested running with spark-submit on several machines. It worked on one Linux server, but failed on my Windows PC and on another Linux server (which has different versions of Hadoop and Spark).
Update: some posts claim that specifying the format explicitly avoids such "Multiple sources found for ..."
errors.
E.g., in this post: https://github.com/AbsaOSS/ABRiS/issues/147, they hard-coded
df = (
spark
.readStream
.schema(stream_schema)
.format("org.apache.spark.sql.execution.datasources.json.JsonFileFormat")
.load("path_to_stream_directory")
)
With that change, the "Multiple sources found for json" error was gone. Similarly, I saw posts about specifying the format for CSV. But when I tried a hard-coded format in my case, it did not work either.
Solution
I have found a solution.
The "Multiple sources found for ..." error indicates that multiple packages capable of reading text/csv files were found when submitting the Spark job with spark-submit.
So it is likely that multiple versions of the library used for reading text/csv files are on the classpath.
I assume the cause is as follows:
I compiled my Java code with Gradle on my Windows PC against a particular Hadoop/Spark version, and I ran spark-submit --someConfiguration myjar.jar --some parameters
locally on my Windows PC and on different Linux servers. The version specified in the build.gradle file may not match the version installed on each machine. Luckily, it matched the version on one of the Linux servers, but differed from the version on the other one. That is why the spark-submit
job succeeded only on one of the Linux servers and failed on the other one and on my Windows PC.
After realizing it could potentially be a problem of version conflicts, I re-installed the most recent versions on my PC and Linux servers, and now spark-submit
works well, without the "Multiple sources found for ..." error.
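Besides aligning installed versions, a common way to avoid this class of conflict, assuming the application jar was bundling its own copy of Spark, is to declare the Spark dependencies as compileOnly in build.gradle so they are not packed into the jar and the single copy provided by spark-submit is used at runtime. A hypothetical fragment (artifact coordinates and versions are examples; they should match the Spark installed where spark-submit runs):

```groovy
// build.gradle -- hypothetical fragment
dependencies {
    // compileOnly keeps Spark out of the fat jar, so at runtime only the
    // cluster's copy of the text/csv data sources is on the classpath.
    compileOnly 'org.apache.spark:spark-core_2.12:3.1.1'
    compileOnly 'org.apache.spark:spark-sql_2.12:3.1.1'
}
```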
The versions I am currently using are as follows:
Hadoop: hadoop-3.2.2
Spark: spark-3.1.1-bin-hadoop3.2
java: openjdk version "1.8.0_282" (Java 8)
Flume: apache-flume-1.9.0-bin
Kafka: kafka_2.13-2.7.0
Scala: scala-2.12.13.deb
sbt: sbt-1.5.0.tgz
I am not sure whether my answer is indeed the correct one, as I am relatively new to Hadoop/Spark/Java. If someone knows the reason in detail, please post your answer.