apache-spark - Spark RDD: find the ratio of each value to the total for key-value pairs
Problem description
My RDD contains key-value pairs such as these:
(key1, 5),
(key2, 10),
(key3, 20),
I want to perform a map operation that associates each key with its value's share of the total over the entire RDD, like this:
(key1, 5/35),
(key2, 10/35),
(key3, 20/35),
I am struggling to find a way to do this with the standard functions. Any help would be appreciated.
Solution
You can calculate the sum and divide each value by the sum:
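from operator import add

# sc is assumed to be an existing SparkContext (e.g. the one in the pyspark shell)
rdd = sc.parallelize([('key1', 5), ('key2', 10), ('key3', 20)])
# reduce is an action: it returns the sum (35) to the driver as a plain number
total = rdd.values().reduce(add)
# float(total) keeps the division fractional on Python 2 as well as Python 3
rdd2 = rdd.mapValues(lambda x: x / float(total))
rdd2.collect()
# [('key1', 0.14285714285714285), ('key2', 0.2857142857142857), ('key3', 0.5714285714285714)]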
In Scala it would be:
val rdd = sc.parallelize(List(("key1", 5), ("key2", 10), ("key3", 20)))
val total = rdd.values.reduce(_+_)
val rdd2 = rdd.mapValues(1.0*_/total)
rdd2.collect
// Array[(String, Double)] = Array((key1,0.14285714285714285), (key2,0.2857142857142857), (key3,0.5714285714285714))
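The two-step logic above (sum all values first, then divide each value by that sum) can be checked without a Spark cluster. The following is only a minimal plain-Python sketch of the same idea: `value_ratios` and its `exact` flag are hypothetical names introduced for illustration and are not part of the Spark API.

```python
from fractions import Fraction
from functools import reduce
from operator import add


def value_ratios(pairs, exact=False):
    """Return (key, value / total) for each (key, value) pair.

    Hypothetical helper mirroring rdd.values().reduce(add) followed by
    rdd.mapValues(...); it works on an ordinary list, not on an RDD.
    """
    # Pass 1: total of all values (the counterpart of the reduce action).
    total = reduce(add, (value for _, value in pairs))
    # Pass 2: divide each value by the total (the counterpart of mapValues).
    if exact:
        # Fraction keeps the ratio exact, e.g. 5/35 reduces to 1/7.
        return [(key, Fraction(value, total)) for key, value in pairs]
    return [(key, value / float(total)) for key, value in pairs]


pairs = [('key1', 5), ('key2', 10), ('key3', 20)]
ratios = value_ratios(pairs)
exact_ratios = value_ratios(pairs, exact=True)
```

Because the total is computed once before the per-value division, the ratios sum to 1 (up to floating-point rounding in the default mode, exactly when `exact=True`).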