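apache-spark - Spark-Avro error in PyCharm [TypeError: 'RecordSchema' object is not iterable]
Question
I am trying to run a simple Spark program that reads an Avro file in PyCharm. I keep getting the error below and have not been able to resolve it.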
from environment_variables import *
import avro.schema
from pyspark.sql import SparkSession
Schema = avro.schema.parse(open(SCHEMA_PATH, "rb").read())
print(Schema)
spark = SparkSession.builder.appName("indu").getOrCreate()
df = spark.read.format("avro").load(list(Schema))
print(df)
The printed schema looks like this:
{"type": "record", "name": "DefaultEventRecord", "namespace": "io.divolte.record", "fields": [{"type": "boolean", "name": "detectedDuplicate"}, {"type": "boolean", "name": "detectedCorruption"}, {"type": "boolean", "name": "firstInSession"}, {"type": "long", "name": "clientTimestamp"}, {"type": "long", "name": "timestamp"}, {"type": "string", "name": "remoteHost"}, {"type": ["null", "string"], "name": "referer", "default": null}, {"type": ["null", "string"], "name": "location", "default": null}, {"type": ["null", "int"], "name": "devicePixelRatio", "default": null}, {"type": ["null", "int"], "name": "viewportPixelWidth", "default": null}, {"type": ["null", "int"], "name": "viewportPixelHeight", "default": null}, {"type": ["null", "int"], "name": "screenPixelWidth", "default": null}, {"type": ["null", "int"], "name": "screenPixelHeight", "default": null}, {"type": ["null", "string"], "name": "partyId", "default": null}, {"type": ["null", "string"], "name": "sessionId", "default": null}, {"type": ["null", "string"], "name": "pageViewId", "default": null}, {"type": ["null", "string"], "name": "eventId", "default": null}, {"type": "string", "name": "eventType", "default": "unknown"}, {"type": ["null", "string"], "name": "userAgentString", "default": null}, {"type": ["null", "string"], "name": "userAgentName", "default": null}, {"type": ["null", "string"], "name": "userAgentFamily", "default": null}, {"type": ["null", "string"], "name": "userAgentVendor", "default": null}, {"type": ["null", "string"], "name": "userAgentType", "default": null}, {"type": ["null", "string"], "name": "userAgentVersion", "default": null}, {"type": ["null", "string"], "name": "userAgentDeviceCategory", "default": null}, {"type": ["null", "string"], "name": "userAgentOsFamily", "default": null}, {"type": ["null", "string"], "name": "userAgentOsVersion", "default": null}, {"type": ["null", "string"], "name": "userAgentOsVendor", "default": null}, {"type": ["null", "int"], "name": "cityIdField", 
"default": null}, {"type": ["null", "string"], "name": "cityNameField", "default": null}, {"type": ["null", "string"], "name": "countryCodeField", "default": null}, {"type": ["null", "int"], "name": "countryIdField", "default": null}, {"type": ["null", "string"], "name": "countryNameField", "default": null}]}
The error I get is:
21/03/02 16:06:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
File "X:\Git_repo\Project_Red\spark_streaming\spark_scripting.py", line 15, in <module>
df = spark.read.format("avro").load(list(jsonFormatSchema))
TypeError: 'RecordSchema' object is not iterable
Any help is appreciated.
Answer
Your code needs three corrections:
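- You do not need to load the schema file separately, because every Avro data file already carries its schema in the file header.
- The load() call in spark.read.format("avro").load(list(Schema)) expects the path to the Avro data file, not the schema.
- print(df) does not produce any meaningful output. Use df.show() if you want to see the data in the Avro file.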
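To illustrate the first point, here is a minimal standard-library sketch of how the schema sits in the header of an Avro object container file. It is not how Spark or the avro package read files; the helper names (write_long, read_long, write_header, read_header_metadata) and the shortened two-field schema are assumptions made up for this example. It writes only a header to an in-memory buffer and then parses the metadata map back out.

```python
import io
import json

MAGIC = b"Obj\x01"  # first four bytes of every Avro object container file


def write_long(buf, n):
    """Write an Avro long: zigzag encoding followed by a base-128 varint."""
    n = (n << 1) ^ (n >> 63)
    while n & ~0x7F:
        buf.write(bytes([(n & 0x7F) | 0x80]))
        n >>= 7
    buf.write(bytes([n]))


def read_long(buf):
    """Read an Avro long (inverse of write_long)."""
    shift, result = 0, 0
    while True:
        byte = buf.read(1)[0]
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:
            break
        shift += 7
    return (result >> 1) ^ -(result & 1)


def write_header(buf, schema_json):
    """Write only the container header: magic, metadata map, sync marker."""
    meta = {"avro.schema": schema_json.encode("utf-8"), "avro.codec": b"null"}
    buf.write(MAGIC)
    write_long(buf, len(meta))           # one map block holding len(meta) pairs
    for key, value in meta.items():
        key_bytes = key.encode("utf-8")
        write_long(buf, len(key_bytes))
        buf.write(key_bytes)
        write_long(buf, len(value))
        buf.write(value)
    write_long(buf, 0)                   # a zero count ends the map
    buf.write(b"\x00" * 16)              # sync marker (random in a real file)


def read_header_metadata(buf):
    """Return the header metadata map as {str: bytes}."""
    if buf.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(buf)
        if count == 0:
            break
        if count < 0:                    # negative count: a byte size follows
            read_long(buf)
            count = -count
        for _ in range(count):
            key = buf.read(read_long(buf)).decode("utf-8")
            meta[key] = buf.read(read_long(buf))
    return meta


# Shortened, illustrative version of the schema from the question.
schema_json = json.dumps({
    "type": "record",
    "name": "DefaultEventRecord",
    "namespace": "io.divolte.record",
    "fields": [
        {"type": "long", "name": "timestamp"},
        {"type": "string", "name": "remoteHost"},
    ],
})

buffer = io.BytesIO()
write_header(buffer, schema_json)
buffer.seek(0)

embedded = json.loads(read_header_metadata(buffer)["avro.schema"])
print(embedded["name"], [f["name"] for f in embedded["fields"]])
```

Because the schema travels with the data in this way, a reader only needs the path to the data file, which is why load() is given a path rather than a schema object.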
With that said, you probably already have an idea of what needs to change in your code:
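from pyspark.sql import SparkSession

# DATA_PATH is assumed to be defined elsewhere and must point to the
# Avro data file (or a directory of them), not to the schema file.
spark = SparkSession.builder.appName("indu").getOrCreate()
df = spark.read.format("avro").load(DATA_PATH)
df.printSchema()         # schema is taken from the Avro file header
df.show(truncate=False)  # print rows without truncating long values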
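As for the original TypeError: list() can only consume objects that support iteration, and the traceback shows that the parsed schema object does not. The sketch below reproduces the same kind of failure without Spark or the avro package; FakeRecordSchema is a hypothetical stand-in class invented for this example, not the real avro class.

```python
class FakeRecordSchema:
    """Hypothetical stand-in for a parsed schema object.

    It defines neither __iter__ nor __getitem__, so it is not iterable.
    """

    def __init__(self, name):
        self.name = name


schema = FakeRecordSchema("DefaultEventRecord")

try:
    list(schema)  # same pattern as load(list(Schema)) in the question
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(error_message)
```

The failure happens in list() before load() is ever called, so Spark never sees the argument.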