python - Read each JSON object as a single row in a DataFrame using PySpark?
Problem description
I have the following JSON file:
{"name":"John", "age":31, "city":"New York"}
{"name":"Henry", "age":41, "city":"Boston"}
{"name":"Dave", "age":26, "city":"New York"}
I need to read each JSON line as a single row of a DataFrame.
Here is the expected output:
I tried the following code:
from pyspark.sql import SparkSession
spark = SparkSession \
    .builder \
    .appName('Read Json') \
    .getOrCreate()
df = spark.read.format('json').load('sample_json')
df.show()
But I only get the following output:
Please help me. Thanks in advance.
Solution
Read the file as JSON, then use the to_json function to create Json_column.
1. Using the to_json function:
from pyspark.sql.functions import col, struct, to_json

# Read the JSON Lines file, pack the three columns into a struct,
# and serialize that struct back to a JSON string.
spark.read.json("sample.json") \
    .withColumn("Json_column", to_json(struct(col("age"), col("city"), col("name")))) \
    .show(10, False)
#+---+--------+-----+------------------------------------------+
#|age|city    |name |Json_column                               |
#+---+--------+-----+------------------------------------------+
#|31 |New York|John |{"age":31,"city":"New York","name":"John"}|
#|41 |Boston  |Henry|{"age":41,"city":"Boston","name":"Henry"} |
#|26 |New York|Dave |{"age":26,"city":"New York","name":"Dave"}|
#+---+--------+-----+------------------------------------------+

# Or, more dynamically, build the struct from whatever columns the file has.
df = spark.read.json("sample.json")
df.withColumn("Json_column", to_json(struct([col(c) for c in df.columns]))).show(10, False)
#+---+--------+-----+------------------------------------------+
#|age|city    |name |Json_column                               |
#+---+--------+-----+------------------------------------------+
#|31 |New York|John |{"age":31,"city":"New York","name":"John"}|
#|41 |Boston  |Henry|{"age":41,"city":"Boston","name":"Henry"} |
#|26 |New York|Dave |{"age":26,"city":"New York","name":"Dave"}|
#+---+--------+-----+------------------------------------------+
2. Another approach, using the get_json_object function:
Read the JSON file as text, then extract name, age, and city from each JSON object.
from pyspark.sql.functions import col, get_json_object

# Each line of the file becomes one row in a single string column named "value".
spark.read.text("sample.json") \
    .withColumn("name", get_json_object(col("value"), "$.name")) \
    .withColumn("city", get_json_object(col("value"), "$.city")) \
    .withColumn("age", get_json_object(col("value"), "$.age")) \
    .withColumnRenamed("value", "Json_column") \
    .select("age", "city", "name", "Json_column") \
    .show(10, False)
#+---+--------+-----+--------------------------------------------+
#|age|city    |name |Json_column                                 |
#+---+--------+-----+--------------------------------------------+
#|31 |New York|John |{"name":"John", "age":31, "city":"New York"}|
#|41 |Boston  |Henry|{"name":"Henry", "age":41, "city":"Boston"} |
#|26 |New York|Dave |{"name":"Dave", "age":26, "city":"New York"}|
#+---+--------+-----+--------------------------------------------+