How to count the words in each row in PySpark

Problem description

I tried this:

rdd1= sc.parallelize(["Let's have some fun.",
  "To have fun you don't need any plans."])
output = rdd1.map(lambda t: t.split(" ")).map(lambda lists: (lists, len(lists)))
output.foreach(print)

Output:

(["Let's", 'have', 'some', 'fun.'], 4)
(['To', 'have', 'fun', 'you', "don't", 'need', 'any', 'plans.'], 8)

This gives me the total number of words in each line, but what I want is a count of each individual word per line.

Tags: pyspark, rdd

Solution


You can try this, mapping each line to a dict of word counts with collections.Counter:

from collections import Counter 

output = rdd1.map(lambda t: t.split(" ")).map(lambda lists: dict(Counter(lists)))
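If you want to verify the result without a running Spark context, the two chained map calls can be simulated with an ordinary list comprehension; the lines list below is just a stand-in for rdd1's contents:

```python
from collections import Counter

# Stand-in for rdd1's contents (the two sentences from the question)
lines = ["Let's have some fun.",
         "To have fun you don't need any plans."]

# Equivalent of rdd1.map(lambda t: t.split(" ")).map(lambda lists: dict(Counter(lists)))
output = [dict(Counter(line.split(" "))) for line in lines]
for row in output:
    print(row)
# {"Let's": 1, 'have': 1, 'some': 1, 'fun.': 1}
# {'To': 1, 'have': 1, 'fun': 1, 'you': 1, "don't": 1, 'need': 1, 'any': 1, 'plans.': 1}
```

On a real RDD you would inspect the same result with `output.collect()` rather than `foreach(print)`.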

Here is a small plain-Python example of what Counter does:

from collections import Counter

example_1 = "Let's have some fun."
Counter(example_1.split(" "))
# Counter({"Let's": 1, 'have': 1, 'some': 1, 'fun.': 1})

example_2 = "To have fun you don't need any plans."
Counter(example_2.split(" "))
# Counter({'To': 1, 'have': 1, 'fun': 1, 'you': 1, "don't": 1, 'need': 1, 'any': 1, 'plans.': 1})
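Every word in the question's two sentences happens to appear exactly once, so all the counts above are 1. A made-up line with repeated words shows that Counter really does tally duplicates:

```python
from collections import Counter

# Hypothetical sentence with repeats, purely for illustration
example_3 = "fun fun and more fun"
counts = dict(Counter(example_3.split(" ")))
print(counts)
# {'fun': 3, 'and': 1, 'more': 1}
```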
