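api - How do I read RDD elements (a list of dictionary-formatted strings) and convert them into tuples the PySpark API can work with?
Problem description
I have a homework problem that I must solve using the PySpark RDD API.
I have a log file that contains: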
{"ID": "John", "Product": "955250", "Review": "I think its good enough, though it could be more END", "Rating": 3.0, "Date": "08 9, 1997"}
{"ID": "Smith", "Product": "B002KXA", "Review": "This is really hard to use and crappy implementation. That's what you get for using crap ingredients!! END", "Rating": 1.0, "Date": "08 10, 1997"}
{"ID": "Wayne", "Product": "630AA9", "Review": "...I go crazy over this movie!!... END", "Rating": 5.0, "Date": "06 21, 1997"}
When I run:
product_review = sc.textFile("product_review.log")
product_review.collect()
I see this output:
['{"ID": "John", "Product": "955250", "Review": "I think its good enough, though it could be more END", "Rating": 3.0, "Date": "08 9, 1997"}','{"ID": "Smith", "Product": "B002KXA", "Review": "This is really hard to use and crappy implementation. That's what you get for using crap ingredients!! END", "Rating": 1.0, "Date": "08 10, 1997"}','{"ID": "Wayne", "Product": "630AA9", "Review": "...I go crazy over this movie!!... END", "Rating": 5.0, "Date": "06 21, 1997"}']
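But when I try to use groupByKey, it does not work, because the RDD elements are not dictionaries (each one is a plain string).
How can I parse each line so that PySpark reads the list as rows of keys and values, letting me apply transformations such as distinct to count the number of unique IDs?
I would like the output to be converted into something that the PySpark RDD methods groupByKey or reduceByKey can use, for example: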
[[("ID", "John"),("Product", "955250"), ("Review", "I think its good enough, though it could be more END"), ("Rating", 3.0), ("Date", "08 9, 1997")], [("ID", "Smith"), ("Product", "B002KXA"), ("Review", "This is really hard to use and crappy implementation. That's what you get for using crap ingredients!! END"), ("Rating", 1.0), ("Date", "08 10, 1997")], [("ID", "Wayne"), ("Product", "630AA9"), ("Review", "...I go crazy over this movie!!... END"), ("Rating", 5.0), ("Date", "06 21, 1997")]]
Any suggestions on how to proceed are appreciated!
Solution
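One possible approach, offered as a minimal sketch rather than a definitive answer: since `sc.textFile` yields one string per line and each line looks like a JSON object, `json.loads` can turn each string into a dict. Note that `groupByKey` and `reduceByKey` operate on an RDD whose elements are 2-tuples `(key, value)`, so keying each record by its `"ID"` is probably more useful than a list of five tuples per record. The helper names below (`parse_record`, `to_pairs`, `count_distinct_ids`, `rating_sum_by_id`, `main`) and `SAMPLE_LINE` are hypothetical, introduced only for illustration, and the sketch assumes every line in the file is valid JSON.

```python
import json

# One line copied from the log in the question, used to try the helpers without Spark.
SAMPLE_LINE = (
    '{"ID": "John", "Product": "955250", '
    '"Review": "I think its good enough, though it could be more END", '
    '"Rating": 3.0, "Date": "08 9, 1997"}'
)


def parse_record(line):
    """Parse one JSON-formatted log line into a dict."""
    return json.loads(line)


def to_pairs(record):
    """Turn a dict into a list of (key, value) tuples, the shape shown in the question."""
    return list(record.items())


def count_distinct_ids(lines_rdd):
    """Count unique IDs in an RDD of JSON strings."""
    return lines_rdd.map(parse_record).map(lambda r: r["ID"]).distinct().count()


def rating_sum_by_id(lines_rdd):
    """Build a pair RDD of (ID, Rating) and sum the ratings per ID with reduceByKey."""
    return (
        lines_rdd.map(parse_record)
        .map(lambda r: (r["ID"], r["Rating"]))
        .reduceByKey(lambda a, b: a + b)
    )


def main(path="product_review.log"):
    # Imported here so the pure-Python helpers above can be used without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext.getOrCreate()
    lines = sc.textFile(path)

    # Each element becomes a list of (key, value) tuples.
    print(lines.map(parse_record).map(to_pairs).collect())

    # Number of unique IDs.
    print(count_distinct_ids(lines))

    # Group whole records by ID; mapValues(list) makes the grouped values printable.
    by_id = lines.map(parse_record).map(lambda r: (r["ID"], r)).groupByKey().mapValues(list)
    print(by_id.collect())

    print(rating_sum_by_id(lines).collect())
```

Calling `main()` in an environment with PySpark installed and `product_review.log` in the working directory should run the whole pipeline. If a flat pair RDD of every field is wanted instead, `lines.map(parse_record).flatMap(to_pairs)` would give elements such as `("ID", "John")`, which `groupByKey` can also consume.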