pyspark - How to count the words in each line in pyspark
Question
I tried this:
rdd1 = sc.parallelize(["Let's have some fun.",
"To have fun you don't need any plans."])
output = rdd1.map(lambda t: t.split(" ")).map(lambda lists: (lists, len(lists)))
output.foreach(print)
Output:
(["Let's", 'have', 'some', 'fun.'], 4)
(['To', 'have', 'fun', 'you', "don't", 'need', 'any', 'plans.'], 8)
This gives me the total number of words per line, but what I want is a count of each word per line.
Solution
You can try this:
from collections import Counter
output = rdd1.map(lambda t: t.split(" ")).map(lambda lists: dict(Counter(lists)))
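If you would rather keep the token list alongside the counts (mirroring the `(lists, len(lists))` pairs from the question), the second lambda can return both. A minimal plain-Python sketch of what that lambda would do to one line (no Spark needed, since the lambda body is ordinary Python):

```python
from collections import Counter

line = "To have fun you don't need any plans."
words = line.split(" ")

# Same idea as: lambda lists: (lists, dict(Counter(lists)))
pair = (words, dict(Counter(words)))

print(pair[0])  # the token list
print(pair[1])  # per-word counts for this line
```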
Here is a small plain-Python example of what Counter does:
from collections import Counter
example_1 = "Let's have some fun."
Counter(example_1.split(" "))
# Counter({"Let's": 1, 'have': 1, 'some': 1, 'fun.': 1})
example_2 = "To have fun you don't need any plans."
Counter(example_2.split(" "))
# Counter({'To': 1, 'have': 1, 'fun': 1, 'you': 1, "don't": 1, 'need': 1, 'any': 1, 'plans.': 1})
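Putting the two `map` steps together locally (plain Python, no Spark session needed) shows what each element of `output` ends up as:

```python
from collections import Counter

lines = ["Let's have some fun.",
         "To have fun you don't need any plans."]

# Same two steps as rdd1.map(lambda t: t.split(" ")).map(dict∘Counter),
# run on an ordinary list instead of an RDD:
tokenized = [t.split(" ") for t in lines]
counts = [dict(Counter(words)) for words in tokenized]

for c in counts:
    print(c)
# {"Let's": 1, 'have': 1, 'some': 1, 'fun.': 1}
# {'To': 1, 'have': 1, 'fun': 1, 'you': 1, "don't": 1, 'need': 1, 'any': 1, 'plans.': 1}
```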