python - TypeError when submitting custom UDFs with spark-submit
Problem description
I get the following TypeError when I submit my job with spark-submit --py-files and pass in my UDF file:
TypeError: 'in <string>' requires string as left operand, not NoneType
I have written all of my UDFs in proj_udf.py:
from pyspark.sql.types import BooleanType

group_1 = ['EAST', 'NORTH', 'SOUTH', 'SOUTHEAST', 'SOUTHWEST']
group_2 = ['AUTORX', 'CAREWORKS', 'CHIROSPORT']
mearged_list = group_1 + group_2
str1 = ''.join(mearged_list)

def search_list(column):
    return any(column in item for item in str1)

sqlContext.udf.register("search_list_udf", search_list, BooleanType())
When I call this function from the pyspark shell, it does not raise any error. When I run the same program with spark-submit, I get the error below.
Error:
File "/hd_data/disk23/hadoop/yarn/local/usercache/hscrsawd/appcache/application_1530205632093_12027/container_1530205632093_12027_01_000007/pyspark.zip/pyspark/worker.py", line 177, in main
process()
File "/hd_data/disk23/hadoop/yarn/local/usercache/hscrsawd/appcache/application_1530205632093_12027/container_1530205632093_12027_01_000007/pyspark.zip/pyspark/worker.py", line 172, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/hd_data/disk23/hadoop/yarn/local/usercache/hscrsawd/appcache/application_1530205632093_12027/container_1530205632093_12027_01_000007/pyspark.zip/pyspark/worker.py", line 104, in <lambda>
func = lambda _, it: map(mapper, it)
File "<string>", line 1, in <lambda>
File "/hd_data/disk23/hadoop/yarn/local/usercache/hscrsawd/appcache/application_1530205632093_12027/container_1530205632093_12027_01_000007/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
return lambda *a: f(*a)
File "NAM_Udfs.py", line 115, in search_list
return any(column in item for item in str1)
File "NAM_Udfs.py", line 115, in <genexpr>
return any(column in item for item in str1)
TypeError: 'in <string>' requires string as left operand, not NoneType
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
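Solution
The traceback shows that the UDF was called with None, which happens when the column value is NULL for a row. You just need to change your UDF to account for NULL, as shown below. You may also want to handle empty strings in the column values.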
def search_list(column):
    if column is None:
        return False
    return any(column in item for item in str1)
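
One further point may be worth checking. Because str1 is a single joined string, iterating over it yields individual characters, so the check above only matches single-character (or empty) values. If the intent was to test whether the value occurs inside one of the group names, a minimal sketch could look like the following. The name search_list_safe is hypothetical, and treating empty strings as "no match" is an assumption rather than something stated in the question.

```python
group_1 = ['EAST', 'NORTH', 'SOUTH', 'SOUTHEAST', 'SOUTHWEST']
group_2 = ['AUTORX', 'CAREWORKS', 'CHIROSPORT']
merged_list = group_1 + group_2


def search_list_safe(column):
    """Null-safe substring check against the group names.

    Hypothetical helper: the name and the empty-string rule are assumptions.
    """
    # None (SQL NULL) and '' both count as "no match" here.
    if not column:
        return False
    # Iterate over the list of names, not over the characters of a joined string.
    return any(column in item for item in merged_list)


# Registration would stay the same as in the question, for example:
# sqlContext.udf.register("search_list_udf", search_list_safe, BooleanType())
```

Note that this is still a substring test, so a value such as 'WEST' matches because it occurs inside 'SOUTHWEST'. If an exact match is wanted instead, `column in merged_list` would be the simpler check.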