python - pyspark - Join two RDDs - Missing third column
Problem description
I'm very new to PySpark, so please take that into consideration :)
Basically, I have these two text files:
file1:
1,9,5
2,7,4
3,8,3
file2:
1,g,h
2,1,j
3,k,i
And the Python code:
file1 = sc.textFile("/user/cloudera/training/file1.txt").map(lambda line: line.split(","))
file2 = sc.textFile("/user/cloudera/training/file2.txt").map(lambda line: line.split(","))
Now doing this join:
join_file = file1.join(file2)
I was hoping to get this:
(1,(9,5),(g,h))
(2,(7,4),(1,j))
(3,(8,3),(k,i))
However, I am getting a different result:
(1, (9,g))
(3, (8,k))
(2, (7,1))
Am I missing a parameter on join?
Thanks!
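Solution
join works on (key, value) pairs, so with three-element lists only the first two elements are used and the third is dropped. Reshaping each record into a key plus a list of the remaining fields should do the trick: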
file1 = sc.textFile("/FileStore/tables/f1.txt").map(lambda line: line.split(",")).map(lambda x: (x[0], list(x[1:])))
file2 = sc.textFile("/FileStore/tables/f2.txt").map(lambda line: line.split(",")).map(lambda x: (x[0], list(x[1:])))
join_file = file1.join(file2)
join_file.collect()
This returns Unicode strings (hence the u' prefix):
Out[3]:
[(u'2', ([u'7', u'4'], [u'1', u'j'])),
(u'1', ([u'9', u'5'], [u'g', u'h'])),
(u'3', ([u'8', u'3'], [u'k', u'i']))]
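To see what the reshaping step does without a cluster, here is a minimal pure-Python sketch. It is not PySpark: to_pair and inner_join are hypothetical helper names that only emulate the (key, value) behaviour of a pair join, and the input lines are copied from the two files in the question. The last step flattens each record into the (key, (a, b), (c, d)) shape the question asked for.

```python
# Pure-Python sketch of the reshaping step; no Spark required.
# to_pair and inner_join are hypothetical helpers, not PySpark APIs.

def to_pair(line):
    """Split a CSV line into (key, [remaining fields])."""
    fields = line.split(",")
    return (fields[0], fields[1:])

def inner_join(left, right):
    """Emulate an inner join on two lists of (key, value) pairs."""
    right_by_key = {}
    for key, value in right:
        right_by_key.setdefault(key, []).append(value)
    return [(key, (left_value, right_value))
            for key, left_value in left
            for right_value in right_by_key.get(key, [])]

# Contents of file1 and file2 from the question.
file1_lines = ["1,9,5", "2,7,4", "3,8,3"]
file2_lines = ["1,g,h", "2,1,j", "3,k,i"]

pairs1 = [to_pair(line) for line in file1_lines]
pairs2 = [to_pair(line) for line in file2_lines]
joined = inner_join(pairs1, pairs2)

# Flatten each record into (key, (a, b), (c, d)).
flattened = [(key, tuple(lv), tuple(rv)) for key, (lv, rv) in joined]

for record in flattened:
    print(record)
```

On the real RDD, the same flattening could probably be done with a map over the join result, for example join_file.map(lambda kv: (kv[0], tuple(kv[1][0]), tuple(kv[1][1]))).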