Filtering in a PySpark/Python RDD

Problem description

I have a list like this:

["Dhoni 35 WC 785623", "Sachin 40 Batsman 4500", "Dravid 45 Batsman 50000", "Kumble 41 Bowler 456431", "Srinath 41 Bowler 65465"]

After applying a filter, I want this:

["Dhoni WC", "Sachin Batsman", "Dravid Batsman", "Kumble Bowler", "Srinath Bowler"]

I tried it this way:

m = sc.parallelize(["Dhoni 35 WC 785623","Sachin 40 Batsman 4500","Dravid 45 Batsman 50000","Kumble 41 Bowler 456431","Srinath 41 Bowler 65465"])

n = m.map(lambda k:k.split(' '))

o = n.map(lambda s:(s[0]))

o.collect()

['Dhoni', 'Sachin', 'Dravid', 'Kumble', 'Srinath']

q = n.map(lambda s:s[2])

q.collect()

['WC', 'Batsman', 'Batsman', 'Bowler', 'Bowler']

Tags: python-3.x, pyspark, rdd

Solution


Provided that all of your list items have the same format, one way to achieve this is to use map.

rdd = sc.parallelize(["Dhoni 35 WC 785623","Sachin 40 Batsman 4500","Dravid 45 Batsman 50000","Kumble 41 Bowler 456431","Srinath 41 Bowler 65465"])

rdd.map(lambda x:(x.split(' ')[0]+' '+x.split(' ')[2])).collect()

Output:

['Dhoni WC', 'Sachin Batsman', 'Dravid Batsman', 'Kumble Bowler', 'Srinath Bowler']
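
The same transformation can also be written as a small named function that splits each record only once. This is a minimal sketch, not part of the original answer: the helper name name_and_role is hypothetical, and it assumes every record has at least three whitespace-separated fields. It is shown here on a plain Python list so it runs without a Spark cluster; the closing comment shows how it could be passed to map on an RDD.

```python
def name_and_role(record):
    """Return 'name role' from a 'name age role number' record.

    Hypothetical helper; assumes at least three whitespace-separated fields.
    """
    fields = record.split()  # split once, on any run of whitespace
    return fields[0] + " " + fields[2]


records = [
    "Dhoni 35 WC 785623",
    "Sachin 40 Batsman 4500",
    "Dravid 45 Batsman 50000",
    "Kumble 41 Bowler 456431",
    "Srinath 41 Bowler 65465",
]

# Plain-Python equivalent of mapping the function over the data
result = [name_and_role(r) for r in records]
print(result)

# With an existing SparkContext `sc`, the same function could be used as:
# sc.parallelize(records).map(name_and_role).collect()
```

A named function is easier to test on its own than an inline lambda, and it avoids calling split twice per record.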
