Spark Multiple sources found for text

Problem Description

I have a Java jar built from a Java program. If I run the program locally within IntelliJ IDEA, it works well.

After compiling the Java program into a jar file, running it as java -cp jarFileName.jar com.pathToclass.ClassName inputArguments also works well.

However, when I run it as spark-submit --master local[4] --class com.pathToclass.ClassName jarFileName.jar inputArguments, I get the following error when the code reaches the read.textFile call.

The code is as follows:

// spark is an existing SparkSession
DataFrameReader read = spark.read();
JavaRDD<String> stringJavaRDD = read.textFile(inputPath).javaRDD();

The inputPath directory contains some CSV files. The error when running with spark-submit is as follows:

org.apache.spark.sql.AnalysisException: Multiple sources found for text (org.apache.spark.sql.execution.datasources.v2.text.TextDataSourceV2, org.apache.spark.sql.execution.datasources.text.TextFileFormat), please specify the fully qualified class name.;
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:707)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:733)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:248)
    at org.apache.spark.sql.DataFrameReader.text(DataFrameReader.scala:843)
    at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:880)
    at org.apache.spark.sql.DataFrameReader.textFile(DataFrameReader.scala:852)
    at com.three2three.bigfoot.vola.NormalizeSnapshotSigmaAxisImpliedVola.main(NormalizeSnapshotSigmaAxisImpliedVola.java:306)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
    at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
    at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
    at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
    at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
    at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
    at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
    at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)

I tried debugging locally within IntelliJ IDEA; when running in the IDE, the data source resolves to org.apache.spark.sql.execution.datasources.text.TextFileFormat only.

It seems that when running under spark-submit, this.source() is text, and the Scala lookup code finds two data sources:

org.apache.spark.sql.execution.datasources.v2.text.TextDataSourceV2
org.apache.spark.sql.execution.datasources.text.TextFileFormat

Why is this happening? Why does the code fail only in spark-submit mode, while it succeeds in the other two ways of running? And how can the error be fixed for spark-submit mode?

I also tested spark-submit on other machines: it worked on one Linux server, but failed on my Windows PC and on another Linux server (which has different versions of Hadoop and Spark).

Update: some posts claim that explicitly specifying the format avoids such "Multiple sources found for ..." errors. For example, in this post: https://github.com/AbsaOSS/ABRiS/issues/147, they hard-coded

    df = (
        spark
        .readStream
        .schema(stream_schema)
        .format("org.apache.spark.sql.execution.datasources.json.JsonFileFormat")
        .load("path_to_stream_directory")
    )

and the "Multiple sources found for json" error was gone. Similarly, I saw posts about specifying the format for CSV. But when I tried a hard-coded format in my case, it did not work either.


Tags: java, scala, apache-spark, apache-spark-sql

Solution


I have found a solution. The "Multiple sources found for ..." error indicates that multiple providers for reading text/CSV files were found on the classpath when the Spark job was submitted with spark-submit.

So it is likely that multiple versions of the library used for reading text/CSV files were present at once.
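To see why duplicates are fatal: Spark resolves a short format name like "text" by scanning the classpath for registered data source providers (via java.util.ServiceLoader) and refuses to pick one when the match is ambiguous. The sketch below is a simplified model of that disambiguation step, not Spark's actual code; the provider class names are just the ones from the error message, used as plain strings:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class LookupSketch {

    // Simplified model of Spark's DataSource.lookupDataSource: collect every
    // provider matching the requested short name and fail if the result is
    // ambiguous. (Spark really discovers providers via java.util.ServiceLoader.)
    public static String lookup(String source, List<String> providers) {
        List<String> matches = providers.stream()
                .filter(p -> p.toLowerCase().contains("." + source + "."))
                .collect(Collectors.toList());
        if (matches.size() > 1) {
            throw new IllegalStateException("Multiple sources found for " + source
                    + " (" + String.join(", ", matches)
                    + "), please specify the fully qualified class name.");
        }
        return matches.get(0);
    }

    public static void main(String[] args) {
        // A clean classpath: exactly one provider registers "text".
        List<String> clean = Arrays.asList(
                "org.apache.spark.sql.execution.datasources.text.TextFileFormat");
        System.out.println(lookup("text", clean));

        // A mixed classpath: providers from two Spark builds both match "text".
        List<String> mixed = Arrays.asList(
                "org.apache.spark.sql.execution.datasources.text.TextFileFormat",
                "org.apache.spark.sql.execution.datasources.v2.text.TextDataSourceV2");
        try {
            lookup("text", mixed);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage()); // same wording as the error above
        }
    }
}
```

With one registered provider the lookup succeeds; with two, it throws the familiar "Multiple sources found for text ..." message.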

I assume the cause is as follows: I compiled my Java code with Gradle on my Windows PC against a particular Hadoop/Spark version, then ran spark-submit --someConfiguration myjar.jar --some parameters locally on my Windows PC and on different Linux servers. The version specified in the build.gradle file may not match the Spark version installed on my Windows PC. Luckily it matched the version on one of the Linux servers, but differed from the version on the other. That is why the spark-submit job succeeded on only one of the Linux servers and failed on the other one and on the Windows PC.
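One way to make the jar independent of any particular machine's Spark install is to declare the Spark artifacts as compileOnly in build.gradle, so spark-submit supplies them at runtime and the jar never carries its own copy. A sketch of such a dependency block (the coordinates and versions below are illustrative, matching the Spark 3.1.1/Scala 2.12 setup listed later, not a verified fix for this exact build):

```
dependencies {
    // Provided by spark-submit at runtime; do not bundle them into the jar.
    compileOnly 'org.apache.spark:spark-core_2.12:3.1.1'
    compileOnly 'org.apache.spark:spark-sql_2.12:3.1.1'
}
```

This avoids the situation where the fat jar ships one Spark version while the cluster provides another, which is exactly the kind of duplicate-provider classpath that triggers the error.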

After realizing it was potentially a version-conflict problem, I re-installed the most recent matching versions on my PC and Linux servers, and spark-submit now works well, without the "Multiple sources found for ..." error.
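A quick way to check a classpath for this kind of conflict is to list every copy of the service registration file that Spark uses for data source discovery; seeing it contributed by more than one Spark jar means two Spark versions are visible at once. A small diagnostic sketch (the resource path is Spark's standard registration file; it only tells you something when run on the same classpath spark-submit assembles):

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.URL;
import java.util.Collections;
import java.util.List;

public class ServiceScan {

    // Return every copy of a classpath resource visible to the current class
    // loader. Spark discovers data sources through
    // META-INF/services/org.apache.spark.sql.sources.DataSourceRegister;
    // if two different Spark jars each contribute that file, short names
    // like "text" end up registered twice.
    public static List<URL> registrations(String resource) {
        try {
            return Collections.list(
                    Thread.currentThread().getContextClassLoader().getResources(resource));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        List<URL> urls = registrations(
                "META-INF/services/org.apache.spark.sql.sources.DataSourceRegister");
        System.out.println(urls.size() + " registration file(s) found:");
        for (URL url : urls) {
            System.out.println("  " + url); // one line per jar registering data sources
        }
    }
}
```

More than one line of output pointing into different Spark jars would confirm the version-conflict diagnosis above.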

The versions I am currently using are as follows:

Hadoop: hadoop-3.2.2

Spark: spark-3.1.1-bin-hadoop3.2

java: openjdk version "1.8.0_282" (Java 8)

Flume: apache-flume-1.9.0-bin

Kafka: kafka_2.13-2.7.0

Scala: scala-2.12.13.deb

sbt: sbt-1.5.0.tgz

I am not sure whether my answer is truly correct, as I am relatively new to Hadoop/Spark/Java. If someone knows the reason in detail, please post your answer.

