How do I read RDD elements (a list of dictionary-format strings) and convert them into tuples readable by the PySpark API?

Problem description

I have a homework problem that I have to solve using the PySpark RDD API.

I have a log file that contains:

{"ID": "John", "Product": "955250", "Review": "I think its good enough, though it could be more END", "Rating": 3.0, "Date": "08 9, 1997"}','{"ID": "Smith", "Product": "B002KXA", "Review": "This is really hard to use and crappy implementation. That's what you get for using crap ingredients!! END", "Rating": 1.0, "Date": "08 10, 1997"}','{"ID": "Wayne", "Product": "630AA9", "Review": "...I go crazy over this movie!!... END", "Rating": 5.0, "Date": "06 21, 1997"}

When I run:

product_review = sc.textFile("product_review.log")
product_review.collect()

I can see the output as:

['{"ID": "John", "Product": "955250", "Review": "I think its good enough, though it could be more END", "Rating": 3.0, "Date": "08 9, 1997"}','{"ID": "Smith", "Product": "B002KXA", "Review": "This is really hard to use and crappy implementation. That's what you get for using crap ingredients!! END", "Rating": 1.0, "Date": "08 10, 1997"}','{"ID": "Wayne", "Product": "630AA9", "Review": "...I go crazy over this movie!!... END", "Rating": 5.0, "Date": "06 21, 1997"}']

But when I try to do a groupByKey, the RDD elements are not dictionaries.

How can I encode and parse each line so that PySpark reads the list as rows of keys and values, allowing me to apply transformations like distinct to count the number of unique IDs?

I would like the output converted into something that PySpark's groupByKey or reduceByKey can work with, for example:

[[("ID", "John"),("Product", "955250"), ("Review", "I think its good enough, though it could be more END"), ("Rating", 3.0), ("Date", "08 9, 1997")], [("ID", "Smith"), ("Product", "B002KXA"), ("Review", "This is really hard to use and crappy implementation. That's what you get for using crap ingredients!! END"), ("Rating", 1.0), ("Date", "08 10, 1997")], [("ID", "Wayne"), ("Product", "630AA9"), ("Review", "...I go crazy over this movie!!... END"), ("Rating", 5.0), ("Date", "06 21, 1997")]]

Thanks for any suggestions on how to move forward!

Tags: api, apache-spark, pyspark, rdd

Solution
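
One possible approach, as a sketch only: it assumes the literal ',' separator never occurs inside a review, and that the sc from the question (a running PySpark shell) is available. Split each line on ',', strip the leftover quotes, parse each chunk with json.loads, and then key the records by ID so that groupByKey / reduceByKey and distinct behave as usual.

import json

# Split each line on the literal ',' separator, strip stray quotes,
# and parse every chunk as JSON so each RDD element becomes a dict.
records = (
    sc.textFile("product_review.log")
      .flatMap(lambda line: line.split("','"))
      .map(lambda chunk: json.loads(chunk.strip("'")))
)

# The shape asked for above: a list of (key, value) tuples per record
tuple_rows = records.map(lambda d: list(d.items()))

# Pair RDD keyed by ID, usable with groupByKey / reduceByKey
by_id = records.map(lambda d: (d["ID"], d))

unique_ids = by_id.keys().distinct().count()   # number of unique IDs
grouped = by_id.groupByKey()                   # all records for each ID

From here, distinct on the keys gives the unique reviewer count, and groupByKey collects all reviews per reviewer; keeping the parsed dicts as values means any other field (Product, Rating, ...) can be chosen as the key with the same map pattern.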

