How do you read Avro files in a Jupyter notebook? (PySpark)

Problem description

I can't read Avro files in a Jupyter notebook. When I use these commands:

import pyspark
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
path = "C:/Users/hp/avrofile/"
x = spark.read.format("com.databricks.spark.avro").load(path)

I get this huge error:

> --------------------------------------------------------------------------- Py4JJavaError                             Traceback (most recent call
> last) <ipython-input-6-16978c1d2487> in <module>
>       1 path = "C:/Users/hp/avrofile/"
> ----> 2 x = spark.read.format("com.databricks.spark.avro").load(path)
> 
> c:\users\hp\appdata\local\programs\python\python37\lib\site-packages\pyspark\sql\readwriter.py
> in load(self, path, format, schema, **options)
>     164         self.options(**options)
>     165         if isinstance(path, basestring):
> --> 166             return self._df(self._jreader.load(path))
>     167         elif path is not None:
>     168             if type(path) != list:
> 
> c:\users\hp\appdata\local\programs\python\python37\lib\site-packages\py4j\java_gateway.py
> in __call__(self, *args)    1255         answer =
> self.gateway_client.send_command(command)    1256         return_value
> = get_return_value(
> -> 1257             answer, self.gateway_client, self.target_id, self.name)    1258     1259         for temp_arg in temp_args:
> 
> c:\users\hp\appdata\local\programs\python\python37\lib\site-packages\pyspark\sql\utils.py
> in deco(*a, **kw)
>      61     def deco(*a, **kw):
>      62         try:
> ---> 63             return f(*a, **kw)
>      64         except py4j.protocol.Py4JJavaError as e:
>      65             s = e.java_exception.toString()
> 
> c:\users\hp\appdata\local\programs\python\python37\lib\site-packages\py4j\protocol.py
> in get_return_value(answer, gateway_client, target_id, name)
>     326                 raise Py4JJavaError(
>     327                     "An error occurred while calling {0}{1}{2}.\n".
> --> 328                     format(target_id, ".", name), value)
>     329             else:
>     330                 raise Py4JError(
> 
> Py4JJavaError: An error occurred while calling o62.load. :
> java.lang.ClassNotFoundException: Failed to find data source:
> org.apache.spark.sql.avro.AvroFileFormat. Please find packages at
> http://spark.apache.org/third-party-projects.html     at
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:657)
>   at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:194)
>   at
> org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)  at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>   at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>   at java.lang.reflect.Method.invoke(Method.java:498)     at
> py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)  at
> py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)    at
> py4j.Gateway.invoke(Gateway.java:282)     at
> py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
>   at py4j.commands.CallCommand.execute(CallCommand.java:79)   at
> py4j.GatewayConnection.run(GatewayConnection.java:238)    at
> java.lang.Thread.run(Thread.java:748) Caused by:
> java.lang.ClassNotFoundException:
> org.apache.spark.sql.avro.AvroFileFormat.DefaultSource    at
> java.net.URLClassLoader.findClass(URLClassLoader.java:382)    at
> java.lang.ClassLoader.loadClass(ClassLoader.java:424)     at
> java.lang.ClassLoader.loadClass(ClassLoader.java:357)     at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
>   at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20$$anonfun$apply$12.apply(DataSource.scala:634)
>   at scala.util.Try$.apply(Try.scala:192)     at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
>   at
> org.apache.spark.sql.execution.datasources.DataSource$$anonfun$20.apply(DataSource.scala:634)
>   at scala.util.Try.orElse(Try.scala:84)  at
> org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:634)
>   ... 13 more

Now, I realize that when I launch PySpark from a cmd window with the following command:

pyspark --packages org.apache.spark:spark-avro_2.11:2.4.0

I can read Avro files just fine:

x = spark.read.format("avro").load("C:\\Users\\avrofile\\")
x.show(5)

The question is: inside a Jupyter notebook, what is the equivalent of launching Spark with the command `pyspark --packages org.apache.spark:spark-avro_2.11:2.4.0`? I feel like this is a really silly question, but sorry, I'm very new to this.

Thank you very much.

Tags: python, apache-spark, pyspark, jupyter-notebook, avro

Solution


Check whether this solution works for you:

  • Download the required jar, spark-avro_2.11-3.2.0.jar, and place it in a suitable folder. Here I assume it sits at c:\users\hp\spark-avro_2.11-3.2.0.jar, for example.
import os
# Set this BEFORE the SparkSession is created (restart the kernel if one
# already exists). Use a raw string so the backslashes in the Windows path
# are not treated as escape sequences, and end with `pyspark-shell` so
# spark-submit knows what to launch.
os.environ['PYSPARK_SUBMIT_ARGS'] = r'--jars c:\users\hp\spark-avro_2.11-3.2.0.jar pyspark-shell'
x = spark.read.format("avro").load("C:\\Users\\avrofile\\")
x.show(5)
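If you would rather reuse the `--packages` coordinates from your cmd-line command instead of downloading a jar by hand, the same environment variable can carry them. This is a minimal sketch, assuming the package coordinates from the question; the trailing `pyspark-shell` token is what tells spark-submit to launch a PySpark shell:

```python
import os

# Notebook equivalent of `pyspark --packages ...`: set PYSPARK_SUBMIT_ARGS
# before any SparkSession exists (restart the kernel if one is running).
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--packages org.apache.spark:spark-avro_2.11:2.4.0 pyspark-shell"
)

# Only now create the session, so the package lands on the classpath:
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.getOrCreate()
# x = spark.read.format("avro").load("C:\\Users\\avrofile\\")
```

Alternatively, `SparkSession.builder.config("spark.jars.packages", "org.apache.spark:spark-avro_2.11:2.4.0")` achieves the same download without touching environment variables, again provided no session has been created yet.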
