Spark kills the job when it reaches the filter function, with the error "Requesting executors is only supported in coarse-grained mode"

Problem description

I'm new to pyspark and have run into a problem. I'm trying to read a table from Hive and build a DataFrame from a specific column, report_d, whose type is bigint (each row is an array of bigint):

[20171101,20180501,20180501,20180601]
[20171001,20140901,20180501,20170901]
[20180501,20180501,20180501]
[20180601]
[20171101,20180501,20180501,20180601]
[20171001,20140901,20180501,20180501,20180501,20170901]
[20171101,20180501,20180501,20180501,20180501,20180501,20180601]
[20171001,20140901,20180501,20170901]

I want to find the number of months between each date in this column and a given date, as an integer. For example, the number of months between 20171101 and June 2018 would be (2018 - 2017) * 12 + (6 - 11) = 7. Can anyone help? This is what I tried with some Python functions:

from pyspark import SparkContext
from pyspark.sql import HiveContext

sc = SparkContext("local", "monthDif")
sqlContext = HiveContext(sc)
df = sqlContext.sql("select report_d from hive_table where date = 20180930")

from operator import add
from pyspark.sql.functions import udf

def monthDiff(df):
    # split each yyyymmdd value into year / month / day parts
    year = [int(val) // 10000 for val in df]
    month = [(int(val) % 10000) // 100 for val in df]
    day = [int(val) % 100 for val in df]
    yy = 2018
    mm = 6  # "06" is not a valid integer literal in Python 3
    yyDif = [int(abs(val - yy)) for val in year]
    mmDif = [int(val - mm) for val in month]
    countyytomm = [int(val) * 12 for val in yyDif]
    return list(map(add, countyytomm, mmDif))

myudf = udf(monthDiff)
newdf = df.withColumn("monthDiff", myudf(df['report_d']))
filteredDf = newdf.filter(newdf['monthDiff'] > 24)

I'm using the UDF above, but when the job reaches the filter I get the error. Can anyone help? I'm using Spark 1.6.0.

Tags: python-3.x, apache-spark, pyspark

Solution
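
Below is a minimal sketch of one possible fix, under two assumptions that the question does not confirm. First, the "Requesting executors is only supported in coarse-grained mode" message is raised when Spark tries to request executors from a scheduler backend that is not coarse-grained, and a local master with dynamic allocation enabled (e.g. via spark-defaults.conf) is one known trigger, so the sketch disables it explicitly. Second, the filter step is also fragile because udf() defaults to StringType and the result is an array, so the sketch declares the return type and compares elements rather than the whole array. Table and column names (hive_table, report_d) and the June 2018 cutoff are taken from the question; the signed month formula assumes the intent is "calendar months from each date up to June 2018".

from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, IntegerType, BooleanType

# Assumption: spark.dynamicAllocation.enabled was turned on in the
# cluster defaults; a local master cannot request executors, which is
# one known trigger for the "coarse-grained mode" error.
conf = (SparkConf()
        .setAppName("monthDif")
        .setMaster("local[*]")
        .set("spark.dynamicAllocation.enabled", "false"))
sc = SparkContext(conf=conf)
sqlContext = HiveContext(sc)

df = sqlContext.sql("select report_d from hive_table where date = 20180930")

def month_diff(dates):
    # calendar months from each yyyymmdd value up to June 2018,
    # e.g. 20171101 -> (2018 - 2017) * 12 + (6 - 11) = 7
    return [(2018 - int(d) // 10000) * 12 + (6 - (int(d) % 10000) // 100)
            for d in dates]

# Declare the element type explicitly; udf() defaults to StringType,
# which makes a numeric comparison on the result meaningless.
month_diff_udf = udf(month_diff, ArrayType(IntegerType()))

# monthDiff is an array column, so test its elements inside a UDF
# instead of comparing the whole array with > 24.
any_over_24 = udf(lambda xs: any(x > 24 for x in xs), BooleanType())

newdf = df.withColumn("monthDiff", month_diff_udf(df["report_d"]))
filteredDf = newdf.filter(any_over_24(newdf["monthDiff"]))
filteredDf.show()

If the job actually runs on YARN rather than locally, dynamic allocation can stay enabled; in that case the relevant changes are only the typed UDF and the element-wise filter.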

