How to import a python module that I added to a cluster via --py-files?

Problem description

I have some custom jdbc drivers that I want to use in an application. I include these as --py-files when I spark submit to a Kubernetes spark cluster:

spark-submit --py-files s3a://bucket/pyfiles/pyspark_jdbc.zip my_application.py

This gives me:

java.io.FileNotFoundException: File file:/opt/spark/work-dir/pyspark_jdbc.zip does not exist

As other answers have told me, I need to actually add that zip file to the PYTHONPATH. I find that this is no longer true with at least Spark 2.3+, but let's do it anyway with:

spark.sparkContext.addPyFile("pyspark_jdbc.zip")

Looking into the cluster logs, I see:

19/10/21 22:40:56 INFO Utils: Fetching s3a://bucket/pyfiles/pyspark_jdbc.zip to 
/var/data/spark-52e390f5-85f4-41c4-9957-ff79f1433f64/spark-402e0a00-6806-40a7-a17d-5adf39a5c2d4/userFiles-680c1bce-ad5f-4a0b-9160-2c3037eefc29/fetchFileTemp5609787392859819321.tmp

So the py-files were definitely fetched, but into /var/data/... and not into my working directory. Therefore, when I go to add the location of my .zip file to my Python path, I don't know where it is. Some diagnostics on the cluster right before attempting to add the python files:

> print(sys.path)
[..., 
 '/var/data/spark-52e390f5-85f4-41c4-9957-ff79f1433f64/spark-402e0a00-6806-40a7-a17d-5adf39a5c2d4/userFiles-680c1bce-ad5f-4a0b-9160-2c3037eefc29', 
 '/opt/spark/work-dir/s3a', 
 '//bucket/pyfiles/pyspark_jdbc.zip'
...]
> print(os.getcwd())
/opt/spark/work-dir
> subprocess.run(["ls", "-l"])
total 0

So we see that pyspark did attempt to add the s3a:// file passed via --py-files to the PYTHONPATH, except that it misinterpreted the : and did not add the path correctly. The /var/data/... directory is on the PYTHONPATH, but the specific .zip file is not, so I cannot import from it.
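To illustrate what seems to have happened (this is my reading of the symptom, not Spark's actual code): if the s3a:// URI is treated as a :-separated PYTHONPATH list, splitting it reproduces the two broken sys.path entries shown above.

# Illustration of the symptom only, not Spark's internals.
uri = "s3a://bucket/pyfiles/pyspark_jdbc.zip"
print(uri.split(":", 1))
# ['s3a', '//bucket/pyfiles/pyspark_jdbc.zip']
# 's3a' is a relative path, so it resolves against the working
# directory /opt/spark/work-dir, giving '/opt/spark/work-dir/s3a';
# the remainder shows up verbatim as '//bucket/pyfiles/pyspark_jdbc.zip'.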

How can I solve this problem going forward? The .zip file has not been correctly added to the path, and within my program I do not know either:

a. the path to the s3a:// that pyspark attempted to add to the PYTHONPATH

b. the path to the /var/data/.../ local location of the .zip file. I know it is on the path somewhere, and I suppose I could parse it out, but that would be messy (a sketch of that fallback follows below).
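For completeness, here is what that messy parsing might look like. This is a sketch that assumes Spark's staging directory keeps its userFiles- naming convention; since that is internal behavior, treat it as a fragile last resort.

import os
import sys

# Find Spark's userFiles-... staging directory on sys.path and join
# the zip name onto it, then prepend the result so imports resolve.
user_files = [p for p in sys.path if "userFiles-" in p]
if user_files:
    sys.path.insert(0, os.path.join(user_files[0], "pyspark_jdbc.zip"))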

What is an elegant solution to this?

Tags: apache-spark, import, pyspark, python-import

Solution


A (better) solution is to use the SparkFiles object in pyspark to locate your imports.

from pyspark import SparkFiles

spark.sparkContext.addPyFile(SparkFiles.get("pyspark_jdbc.zip"))
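For context, a minimal end-to-end sketch (the pyspark_jdbc module name inside the zip is hypothetical): SparkFiles.get() resolves the bare file name to its local path under the /var/data/.../userFiles-... directory where Spark fetched it, and addPyFile() then puts that local path onto sys.path so the import succeeds.

from pyspark import SparkFiles
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("jdbc-import-demo").getOrCreate()

# Resolve the basename of the file shipped with --py-files to its
# local path in Spark's staging directory.
local_zip = SparkFiles.get("pyspark_jdbc.zip")

# addPyFile() adds the local zip to sys.path on the driver and
# distributes it to the executors.
spark.sparkContext.addPyFile(local_zip)

import pyspark_jdbc  # hypothetical top-level module inside the zip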
