How to correctly use modules in rdd.map with pyspark

Problem description

As the title suggests, I'm trying to create external modules which are afterwards imported and invoked in a simple rdd.map function. An example is below:

## main.py ##
myrdd = spark.sparkContext.parallelize([1,2,3,4,5])
spark.sparkContext.addPyFile("myModule.py")

import myModule as mm

myrdd.map(lambda x: mm.Module.test(x)).collect()

## myModule.py ##
class Module():
    def test(self, x):
        return x * 2

When trying to run this with spark-submit, I get the following error:

test() missing 1 required positional argument: 'x'

Can someone point the error out?

Thank you very much

Tags: python, apache-spark, pyspark, apache-spark-2.0

Solution


test() is an instance method, not a static method, so it cannot be called on the class directly as Module.test(x): x gets bound to the self parameter and the x argument ends up missing, which is exactly the error above.

Instead, create a Module object and call the test() method on it, like this:

myrdd = spark.sparkContext.parallelize([1, 2, 3, 4, 5])
# Ship myModule.py to the executors so it can be imported inside the map
spark.sparkContext.addPyFile("myModule.py")
import myModule as mm

# Instantiate Module and call test() on the instance
myrdd.map(lambda x: mm.Module().test(x)).collect()
# [2, 4, 6, 8, 10]
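
If you would rather keep the Module.test(x) call exactly as written in the question, one alternative (a sketch of my own, not part of the original answer) is to declare test() as a @staticmethod so it no longer expects a self argument:

## myModule.py ##
class Module:
    @staticmethod
    def test(x):
        # No self parameter, so calling Module.test(x) on the class itself works
        return x * 2

With that change, the original myrdd.map(lambda x: mm.Module.test(x)).collect() call should also return [2, 4, 6, 8, 10].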
