python - Rdd lambda function confusion around rows vs columns
问题描述
I have a spark RDD (full code below) and I am a bit confused.
Given the input data:
385 | 1
291 | 2
If I have the below lambda function why in the reduceByKey do we have x[0]+y[0] = 385+291? Surely X and Y relate to the different columns of the RDD? Or do I take this to mean that they refer to the
totalsByAge = rdd2.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y:(x[0] + y[0], x[1] + y[1]))
Full code:
import findspark
findspark.init()
import pyspark
#UserID | Name | Age | Num_Friends
#r before the filepath converts it to a raw string
lines = sc.textFile(r"c:\Users\kiera\Downloads\fakefriends.csv")
#For each line in the file, split it at the comma
#split 2 is the age
#Split 3 is the number of friends
def splitlines(line):
fields = line.split(',')
age = int(fields[2])
numFriends = int(fields[3])
return (age, numFriends)
rdd2 = lines.map(splitlines)
totalsByAge = rdd2.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y:(x[0] + y[0], x[1] + y[1]))
rdd2 looks something like this
[(33, 385),
(26, 2),
(55, 221),
(40, 465),
(68, 21),
(59, 318),
(37, 220),
(54, 307)....
解决方案
好的,当您执行第一步时:
rdd2 = spark.sparkContext.parallelize([
(33, 385), (26, 2), (55, 221), (40, 465), (68, 21), (59, 318), (37, 220), (54, 307)
])
# Simple count example
# Make a key value pair like ((age, numFriends), 1)
# Now your key is going to be (age, numFriends) and value is going to be 1
# When you say reduceByKey, it will add up all values for the same key
rdd3 = rdd2.map(lambda x: (x, 1)).reduceByKey(lambda x, y: x+y)
totalsByAge = rdd2.mapValues(lambda x: (x, 1)).reduceByKey(lambda x, y:(x[0] + y[0], x[1] + y[1]))
在上述情况下,您正在做的是:
- 创建成对的RDD
(age, (numFriends, 1))
reduceByKey
在哪里,你采取x
并y
执行(x[0] + y[0], x[1] + y[1])
它。在这里,你x
是 RDD 的一个元素,y
是另一个(但按年龄分组)- 您制作年龄组(因为第一个元素是您的键,即
age
),添加x[0]
withy[0]
,它会numFriends
按年龄组相加,并且 addx[1]
withy[1]
会添加我们在每个年龄组的第一步中添加的计数器mapValues
。
推荐阅读
- php - 将返回数组存储到新变量中
- fortran - 是否有与在 Python 中解压缩参数列表的 Fortran 等价物?
- java - 如何将数组中的字符串更改为整数
- r - 如何在R中从一个具有多个条件的数据帧创建多个数据帧
- javascript - 在 waituntil 方法 webdriver.io 上返回 true 或 false
- webpack - 无法使用电子生成器找到依赖关系(node-pre-gyp)
- python - 关于 Python 中决策树代码中的索引的问题
- unit-testing - SilverStripe Sapphire 对 POST 请求的功能测试始终返回 404 状态
- ios - 我收到错误“线程 1:EXC_BREAKPOINT (code=1, subcode=0x102d3c320)”我该如何解决?
- excel - Excel VBA 宏 - Pt2