Submitting a job with pandas in a zip file

Problem description

I have two libraries that I want to import in my code: pandas and utils (my own library). In my tests, importing pandas does not work properly.

Shipping boto3 and requests (which are not pre-installed on the cluster) this way works fine; I create two zip files for them.

So I use a requirements file to pull in pandas and build a zip containing all of pandas' dependencies. I then try to import that zip in my code, like this:

sc.addPyFile("libs.zip")
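For reference, a zip like the one above is typically built from a requirements file roughly as follows (a sketch; the directory and file names are illustrative):

```shell
# Illustrative sketch: install the requirements into a local directory,
# then zip the directory's contents so packages sit at the zip root.
pip install -r requirements.txt -t ./libs
cd libs && zip -r ../libs.zip . && cd ..
```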

And the spark-submit command looks like:

spark-submit --deploy-mode client --py-files s3://${BUCKET_NAME}/libs.zip s3://${BUCKET_NAME}/main.py

I have tried many ways of submitting the Spark job to the EMR cluster, but I always run into this error:

Traceback (most recent call last):
  File "/mnt/tmp/spark-xxxx/main.py", line 20, in <module>
    import pandas as pd
  File "/mnt/tmp/spark-xxxx/userFiles-xxxx/libs.zip/pandas/__init__.py", line 17, in <module>
ImportError: Unable to import required dependencies:
numpy:

IMPORTANT: PLEASE READ THIS FOR ADVICE ON HOW TO SOLVE THIS ISSUE!

Importing the numpy C-extensions failed. This error can happen for
many reasons, often due to issues with your setup or how NumPy was
installed.

We have compiled some common reasons and troubleshooting tips at:

    https://numpy.org/devdocs/user/troubleshooting-importerror.html

Please note and check the following:

  * The Python version is: Python3.7 from "/usr/bin/python3"
  * The NumPy version is: "1.19.4"

and make sure that they are the versions you expect.
Please carefully study the documentation linked above for further help.

Original error was: No module named 'numpy.core._multiarray_umath'

How can I import pandas and another library (created by me) with spark-submit?

Tags: python, pandas, apache-spark, pyspark, spark-submit

Solution
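The traceback points at the likely cause: NumPy ships compiled C extensions (`.so` files such as `numpy.core._multiarray_umath`), and Python's zip import mechanism can only load pure-Python modules from a zip, so NumPy (and therefore pandas) cannot be imported from `libs.zip`. A common workaround is to ship a packed virtual environment with `--archives` instead of a zip of site-packages. The sketch below assumes `venv-pack` is installed and that the driver and executors run the same Python version (3.7 here); the `environment` name after `#` is just the directory the archive is unpacked into:

```shell
# Build a virtualenv that contains pandas (with numpy's .so files intact)
# and pack it into a relocatable archive.
python3 -m venv pyspark_venv
source pyspark_venv/bin/activate
pip install pandas venv-pack
venv-pack -o pyspark_venv.tar.gz

# Driver (client mode) uses the venv activated above; executors use the
# Python from the archive, which Spark unpacks into ./environment.
export PYSPARK_DRIVER_PYTHON=python
export PYSPARK_PYTHON=./environment/bin/python
spark-submit \
  --deploy-mode client \
  --archives s3://${BUCKET_NAME}/pyspark_venv.tar.gz#environment \
  s3://${BUCKET_NAME}/main.py
```

Your own pure-Python `utils` library can still go through `--py-files` as a zip, since it has no C extensions. On EMR specifically, an alternative is a bootstrap action that runs `pip install pandas` on every node at cluster creation.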

