json - Get all values from a JSON-typed column in a Spark dataframe, regardless of key, using Spark with Scala
Problem description
I am trying to load a TSV file containing movie metadata using Spark. The file stores each movie's genre information as JSON [the last column of every row].
Sample file
975900 /m/03vyhn Ghosts of Mars 2001-08-24 14010832 98.0 {"/m/02h40lc": "English Language"} {"/m/09c7w0": "United States of America"} {"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction", "/m/03npn": "Horror", "/m/03k9fj": "Adventure", "/m/0fdjb": "Supernatural", "/m/02kdv5l": "Action", "/m/09zvmj": "Space western"}
3196793 /m/08yl5d Getting Away with Murder: The JonBenét Ramsey Mystery 2000-02-16 95.0 {"/m/02h40lc": "English Language"} {"/m/09c7w0": "United States of America"} {"/m/02n4kr": "Mystery", "/m/03bxz7": "Biographical film", "/m/07s9rl0": "Drama", "/m/0hj3n01": "Crime Drama"}
I tried the code below, which lets me access a specific value from the JSON column:
val ss = SessionCreator.createSession("DataCleaning", "local[*]") // helper function that creates and returns a SparkSession
val headerInfoRb = ResourceBundle.getBundle("conf.headerInfo")
val movieDF = DataReader.readFromTsv(ss, "D:/Utility/Datasets/MovieSummaries/movie.metadata.tsv")
  .toDF(headerInfoRb.getString("metadataReader").split(',').toSeq: _*) // DataReader.readFromTsv is a helper that takes a SparkSession and a file path and uses the session's read function to return a DataFrame
movieDF.select("wiki_mv_id", "mv_nm", "mv_genre")
  .withColumn("genre_frmttd", get_json_object(col("mv_genre"), "$./m/02kdv5l"))
  .show(1, false)
Output
+----------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|wiki_mv_id|mv_nm |mv_genre |genre_frmttd|
+----------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
|975900 |Ghosts of Mars|{"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction", "/m/03npn": "Horror", "/m/03k9fj": "Adventure", "/m/0fdjb": "Supernatural", "/m/02kdv5l": "Action", "/m/09zvmj": "Space western"}|Action |
+----------+--------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+------------+
only showing top 1 row
For each row in the dataframe, I would like the genre_frmttd column to be displayed as shown below [the snippet below is for the first sample row]:
[Thriller,Fiction,Horror,Adventure,Supernatural,Action,Space Western]
I am new to Scala and Spark; please suggest a way to list out the values.
Solution
- Parse the JSON with from_json
- Cast it to MapType(StringType, StringType)
- Extract only the values with map_values
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{MapType, StringType}
movieDF.select("wiki_mv_id", "mv_nm", "mv_genre")
  .withColumn("genre_frmttd", map_values(from_json(col("mv_genre"), MapType(StringType, StringType))))
  .show(1, false)
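The core idea of the transformation above is key-independent value extraction from a flat JSON object. Outside of Spark, the same idea can be sketched in plain Scala; the object name `GenreValues` and the regex are illustrative assumptions, and the regex only handles flat objects with simple string values (no escaped quotes or nesting), as in the genre column shown in the sample file:

```scala
// Pulls every value out of a flat JSON object string, ignoring the keys.
// Simplified sketch: assumes string values with no escaped quotes or nesting.
object GenreValues {
  // Matches one "key": "value" pair and captures key and value.
  private val Pair = """"([^"]*)"\s*:\s*"([^"]*)"""".r

  def values(json: String): Seq[String] =
    Pair.findAllMatchIn(json).map(_.group(2)).toSeq
}

val sample =
  """{"/m/01jfsb": "Thriller", "/m/06n90": "Science Fiction", "/m/03npn": "Horror"}"""
println(GenreValues.values(sample))
// List(Thriller, Science Fiction, Horror)
```

In the Spark version, `from_json` with `MapType(StringType, StringType)` plays the role of the parser and `map_values` discards the keys, doing this per row as a column expression rather than on the driver.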