Spark-Avro error in PyCharm [TypeError: 'RecordSchema' object is not iterable]

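Problem description

I am trying to run a simple Spark program that reads an Avro file in PyCharm. I keep getting the error below and cannot resolve it. Any help is appreciated.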

from environment_variables import *
import avro.schema
from pyspark.sql import SparkSession

Schema = avro.schema.parse(open(SCHEMA_PATH, "rb").read())
print(Schema)
spark = SparkSession.builder.appName("indu").getOrCreate()
df = spark.read.format("avro").load(list(Schema))
print(df)

The printed schema looks like this:

{"type": "record", "name": "DefaultEventRecord", "namespace": "io.divolte.record", "fields": [{"type": "boolean", "name": "detectedDuplicate"}, {"type": "boolean", "name": "detectedCorruption"}, {"type": "boolean", "name": "firstInSession"}, {"type": "long", "name": "clientTimestamp"}, {"type": "long", "name": "timestamp"}, {"type": "string", "name": "remoteHost"}, {"type": ["null", "string"], "name": "referer", "default": null}, {"type": ["null", "string"], "name": "location", "default": null}, {"type": ["null", "int"], "name": "devicePixelRatio", "default": null}, {"type": ["null", "int"], "name": "viewportPixelWidth", "default": null}, {"type": ["null", "int"], "name": "viewportPixelHeight", "default": null}, {"type": ["null", "int"], "name": "screenPixelWidth", "default": null}, {"type": ["null", "int"], "name": "screenPixelHeight", "default": null}, {"type": ["null", "string"], "name": "partyId", "default": null}, {"type": ["null", "string"], "name": "sessionId", "default": null}, {"type": ["null", "string"], "name": "pageViewId", "default": null}, {"type": ["null", "string"], "name": "eventId", "default": null}, {"type": "string", "name": "eventType", "default": "unknown"}, {"type": ["null", "string"], "name": "userAgentString", "default": null}, {"type": ["null", "string"], "name": "userAgentName", "default": null}, {"type": ["null", "string"], "name": "userAgentFamily", "default": null}, {"type": ["null", "string"], "name": "userAgentVendor", "default": null}, {"type": ["null", "string"], "name": "userAgentType", "default": null}, {"type": ["null", "string"], "name": "userAgentVersion", "default": null}, {"type": ["null", "string"], "name": "userAgentDeviceCategory", "default": null}, {"type": ["null", "string"], "name": "userAgentOsFamily", "default": null}, {"type": ["null", "string"], "name": "userAgentOsVersion", "default": null}, {"type": ["null", "string"], "name": "userAgentOsVendor", "default": null}, {"type": ["null", "int"], "name": "cityIdField", 
"default": null}, {"type": ["null", "string"], "name": "cityNameField", "default": null}, {"type": ["null", "string"], "name": "countryCodeField", "default": null}, {"type": ["null", "int"], "name": "countryIdField", "default": null}, {"type": ["null", "string"], "name": "countryNameField", "default": null}]}

The error I get is:

21/03/02 16:06:50 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
Traceback (most recent call last):
  File "X:\Git_repo\Project_Red\spark_streaming\spark_scripting.py", line 15, in <module>
    df = spark.read.format("avro").load(list(jsonFormatSchema))
TypeError: 'RecordSchema' object is not iterable

Thanks in advance for any help.

Tags: apache-spark, pyspark, apache-spark-sql, spark-streaming, avro

Solution


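Your code needs three corrections:

  1. You do not need to load the schema file separately, because every Avro data file already carries its schema in the file header.
  2. The load() call in spark.read.format("avro").load(list(Schema)) expects the path to an Avro data file, not a schema.
  3. print(df) does not show the data; it prints only the column names and types. Use df.show() if you want to see the rows in the Avro file.

With that in mind, you probably already have an idea of what has to change in the code: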
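The TypeError in the traceback is raised by list(...) itself, before load() is ever called: the parsed schema object does not support iteration. A minimal sketch of that behaviour, using a hypothetical stand-in class rather than the real avro.schema.RecordSchema (the helper try_list is also made up for this illustration):

```python
class RecordSchema:
    """Hypothetical stand-in for avro.schema.RecordSchema; it defines no __iter__."""

    def __init__(self, name):
        self.name = name


def try_list(obj):
    """Return list(obj), or the TypeError message if obj is not iterable."""
    try:
        return list(obj)
    except TypeError as exc:
        return str(exc)


schema = RecordSchema("DefaultEventRecord")
message = try_list(schema)
print(message)  # 'RecordSchema' object is not iterable
```

Even if the schema object were iterable, load() would still be handed schema contents instead of a file path, which is why correction 2 applies independently of this error.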

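from environment_variables import *  # assumed to define DATA_PATH
from pyspark.sql import SparkSession

# DATA_PATH must point to the Avro data file (or directory), not to the schema file.
spark = SparkSession.builder.appName("indu").getOrCreate()
df = spark.read.format("avro").load(DATA_PATH)
df.printSchema()         # schema comes from the Avro file header
df.show(truncate=False)  # print rows without truncating long values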
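To check the first correction without Spark, the schema can be pulled straight out of the file header. The following is a rough sketch based on the Avro object container file layout (magic bytes, then a metadata map holding the key avro.schema); the function names are invented for this example, and in practice a library such as avro would normally do this for you.

```python
import io
import json

MAGIC = b"Obj\x01"  # first four bytes of an Avro object container file


def read_long(stream):
    """Decode one Avro long (zigzag-encoded, variable-length)."""
    shift = 0
    raw = 0
    while True:
        chunk = stream.read(1)
        if not chunk:
            raise EOFError("unexpected end of stream")
        raw |= (chunk[0] & 0x7F) << shift
        if not chunk[0] & 0x80:  # high bit clear: last byte of this number
            break
        shift += 7
    return (raw >> 1) ^ -(raw & 1)  # undo zigzag encoding


def read_bytes(stream):
    """Read a length-prefixed byte string."""
    return stream.read(read_long(stream))


def read_header_metadata(stream):
    """Return the header metadata map as {str: bytes}."""
    if stream.read(4) != MAGIC:
        raise ValueError("not an Avro object container file")
    meta = {}
    while True:
        count = read_long(stream)
        if count == 0:  # a zero count terminates the map
            break
        if count < 0:  # negative count is followed by the block size in bytes
            count = -count
            read_long(stream)
        for _ in range(count):
            key = read_bytes(stream).decode("utf-8")
            meta[key] = read_bytes(stream)
    return meta


def embedded_schema(stream):
    """Return the writer schema stored in the header, parsed from JSON."""
    return json.loads(read_header_metadata(stream)["avro.schema"])
```

Assuming DATA_PATH points to a single Avro data file, opening it in binary mode and passing the file object to embedded_schema should return the same record definition that was printed in the question.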
