pyspark - 当我厌倦了在 pyspark 中加载 csv 时出现错误
问题描述
我已经导入了 mmlspark 来使用 LightGBM,如果我不这样做,任何事情都很好。
spark = pyspark.sql.SparkSession.builder.appName("MyApp") \
.config("spark.jars.packages", "com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc3") \
.config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven") \
.getOrCreate()
train_df = spark.read.csv('/content/drive/My Drive/BDCproj/train.csv', header=True, inferSchema=True)
test_df = spark.read.csv('/content/drive/My Drive/BDCproj/test.csv', header=True, inferSchema=True)
然后我的错误:
Py4JJavaError Traceback (most recent call last)
<ipython-input-55-ba0da364400e> in <module>()
----> 1 train_df = spark.read.csv('/content/drive/My Drive/BDCproj/train.csv', header=True, inferSchema=True)
2 test_df = spark.read.csv('/content/drive/My Drive/BDCproj/test.csv', header=True, inferSchema=True)
3 frames
/usr/local/lib/python3.6/dist-packages/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling o214.csv.
: java.util.ServiceConfigurationError: org.apache.spark.sql.sources.DataSourceRegister: Provider org.apache.spark.sql.avro.AvroFileFormat could not be instantiated
at java.base/java.util.ServiceLoader.fail(ServiceLoader.java:581)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:803)
at java.base/java.util.ServiceLoader$ProviderImpl.get(ServiceLoader.java:721)
at java.base/java.util.ServiceLoader$3.next(ServiceLoader.java:1394)
at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:44)
at scala.collection.Iterator.foreach(Iterator.scala:941)
at scala.collection.Iterator.foreach$(Iterator.scala:941)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
at scala.collection.IterableLike.foreach(IterableLike.scala:74)
at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
at scala.collection.AbstractIterable.foreach(Iterable.scala:56)
at scala.collection.TraversableLike.filterImpl(TraversableLike.scala:255)
at scala.collection.TraversableLike.filterImpl$(TraversableLike.scala:249)
at scala.collection.AbstractTraversable.filterImpl(Traversable.scala:108)
at scala.collection.TraversableLike.filter(TraversableLike.scala:347)
at scala.collection.TraversableLike.filter$(TraversableLike.scala:347)
at scala.collection.AbstractTraversable.filter(Traversable.scala:108)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:649)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:733)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:248)
at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:723)
at jdk.internal.reflect.GeneratedMethodAccessor16.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Thread.java:834)
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/sql/execution/datasources/FileFormat$class
at org.apache.spark.sql.avro.AvroFileFormat.<init>(AvroFileFormat.scala:44)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at java.base/jdk.internal.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.base/java.lang.reflect.Constructor.newInstance(Constructor.java:490)
at java.base/java.util.ServiceLoader$ProviderImpl.newInstance(ServiceLoader.java:779)
... 29 more
我的火花是 3.0.1
解决方案
尝试使用此语法一次。如果它有帮助,请给它一个绿色检查。
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
spark = SparkSession.builder.appName("MyApp").config("spark.jars.packages","com.microsoft.ml.spark:mmlspark_2.11:1.0.0-rc3").config("spark.jars.repositories", "https://mmlspark.azureedge.net/maven").getOrCreate()
train = spark.read.option("header",True).csv("/complete/path/to/train.csv")
test = spark.read.option("header",True).csv("/complete/path/to/test.csv")
希望这有效!
推荐阅读
- google-chrome - 从所有浏览器到 Android 上的 google chrome 的深层链接
- javascript - 使用jquery单击加号时如何展开最近的div?
- postgresql - PostgreSQL JDBC 驱动程序从日志记录中隐藏准备好的语句参数
- azure - 使用默认 Databricks 文件系统或默认 DBFS 或 DBFS 根的哪种数据加密?
- javascript - 无法将文件对象从 reactjs 发送到 nodejs
- node.js - 在 node.js 中使用来自控制器的 fastify-redis
- oracle - 在 Oracle 12c 中创建具有分区年和子分区月的表
- android - 最适合创建闹钟应用程序的语言/框架?
- swift - 在切换到 UITabbarController 中的不同选项卡之前关闭当前选项卡上的推送视图控制器
- sap-cloud-platform - 错误 - 无法为本地连接添加“SAP-Connectivity-Authentication”标头