PySpark - Reading a JSON file with read.format('json') gives a DataFrame that contains only the first row. Why does this happen?

Problem description

I am reading a JSON file in the following format:

{"username": "robert87", "currency": "BZD", "amount": 143472}
{"username": "taylorrobert", "currency": "TZS", "amount": 183074}
{"username": "ascott", "currency": "LRD", "amount": 154351}
{"username": "julie29", "currency": "JPY", "amount": 128404}
{"username": "rachelrogers", "currency": "CUP", "amount": 46338}
{"username": "tiffanyschmidt", "currency": "GBP", "amount": 88392}

Although the JSON file contains 6 rows, when I run the following commands:

df = spark.read.format('json').load('file.json')

df.printSchema()
df.show()

I get back only the first row:

df:pyspark.sql.dataframe.DataFrame
amount:long
currency:string
username:string
Session created
root
 |-- amount: long (nullable = true)
 |-- currency: string (nullable = true)
 |-- username: string (nullable = true)

+------+--------+--------+
|amount|currency|username|
+------+--------+--------+
|143472|     BZD|robert87|
+------+--------+--------+

Why don't I see all the other records in their respective columns? Is it related to the format I am using?

Tags: python, json, pyspark

Solution

I can see all the rows with the same data. Try the following:

df = spark.read.json("/FileStore/tables/a_json.json")
df.show()
+------+--------+--------------+
|amount|currency|      username|
+------+--------+--------------+
|143472|     BZD|      robert87|
|183074|     TZS|  taylorrobert|
|154351|     LRD|        ascott|
|128404|     JPY|       julie29|
| 46338|     CUP|  rachelrogers|
| 88392|     GBP|tiffanyschmidt|
+------+--------+--------------+
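
The thread does not identify why the original file produced a single row. Spark's JSON reader by default treats each line as one JSON object, so one thing worth checking is whether the file on disk really holds one valid object per line. The following is a minimal sketch of such a check using only the Python standard library. The helper name count_json_lines is hypothetical, and the path is assumed to be a local file like the file.json from the question.

```python
import json


def count_json_lines(path):
    """Count lines that parse as standalone JSON objects.

    Returns (good, bad), where good is the number of valid object lines
    and bad is a list of (line_number, reason) for lines that did not qualify.
    """
    good = 0
    bad = []
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            text = line.strip()
            if not text:
                continue  # blank lines are ignored
            try:
                record = json.loads(text)
            except json.JSONDecodeError as exc:
                bad.append((line_no, str(exc)))
                continue
            if isinstance(record, dict):
                good += 1
            else:
                bad.append((line_no, "not a JSON object"))
    return good, bad
```

Calling count_json_lines('file.json') and comparing the first returned value with df.count() shows whether the rows are missing from the file itself or are being lost when Spark reads it.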
