ModuleNotFoundError: No module named 'pyarrow._dataset'

Problem description

I decided to get familiar with the Arrow package, and thought it would be a good idea to run some of its usage examples (https://github.com/apache/arrow/tree/master/python/examples/minimal_build).

docker build -t arrow_ubuntu_minimal -f Dockerfile.ubuntu .
docker run --rm -t -i -v $PWD:/io arrow_ubuntu_minimal /io/build_venv.sh

Unfortunately, running the second command prints the following to the console:


E   ModuleNotFoundError: No module named 'pyarrow._dataset'

pyarrow/dataset.py:23: ModuleNotFoundError
====================================================================================== warnings summary ======================================================================================
pyarrow/tests/test_serialization.py:283
  /root/arrow/python/pyarrow/tests/test_serialization.py:283: PytestDeprecationWarning: @pytest.yield_fixture is deprecated.
  Use @pytest.fixture instead; they are the same.
    @pytest.yield_fixture(scope='session')

pyarrow/tests/test_pandas.py::TestConvertListTypes::test_infer_lists
pyarrow/tests/test_pandas.py::TestConvertListTypes::test_to_list_of_structs_pandas
pyarrow/tests/test_pandas.py::TestConvertListTypes::test_nested_large_list
  /root/venv/lib/python3.6/site-packages/pandas/core/dtypes/missing.py:475: DeprecationWarning: elementwise comparison failed; this will raise an error in the future.
    if np.any(np.asarray(left_value != right_value)):

pyarrow/tests/test_pandas.py::TestConvertListTypes::test_nested_large_list
  /root/venv/lib/python3.6/site-packages/pandas/core/dtypes/missing.py:475: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray
    if np.any(np.asarray(left_value != right_value)):

-- Docs: https://docs.pytest.org/en/stable/warnings.html
================================================================================== short test summary info ===================================================================================
FAILED pyarrow/tests/parquet/test_dataset.py::test_write_to_dataset_filesystem - ModuleNotFoundError: No module named 'pyarrow._dataset'
============================================================ 1 failed, 3168 passed, 689 skipped, 16 xfailed, 5 warnings in 48.01s ============================================================
marcin@marcin-G3-3579: 

Has anyone run into a similar problem, or does anyone know how to solve it?

I am currently using Ubuntu 20.04. That might be the cause, since the example was set up for Ubuntu 18.04, but I don't see a way to check it.

Tags: python, pyarrow

Solution


This looks like a bug in the minimal example. Feel free to file a JIRA.

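The Arrow C++ package has many feature flags that can be turned on (to enable a feature) or off (to speed up the build and reduce dependencies). Python tests that depend on a particular feature should check whether that feature was built and skip themselves if it was not. This test does not do that: it imports pyarrow.dataset, which needs the compiled pyarrow._dataset extension, even though the minimal build never enabled datasets.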
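The kind of guard described above could look roughly like the sketch below. It uses only the standard library (unittest rather than pyarrow's own pytest markers) so that it is self-contained. The helper name has_module and the test class are hypothetical and are not part of pyarrow.

```python
import importlib.util
import unittest


def has_module(name: str) -> bool:
    """Return True if the module `name` can be located.

    find_spec imports parent packages (here: pyarrow) but not the
    submodule itself.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package is missing or fails to import.
        return False


# Expected to be True only when the compiled extension pyarrow._dataset
# exists, i.e. when the dataset feature was enabled at build time.
HAS_DATASET = has_module("pyarrow._dataset")


class DatasetTests(unittest.TestCase):  # hypothetical example test
    @unittest.skipUnless(HAS_DATASET, "pyarrow was built without dataset support")
    def test_dataset_import(self):
        import pyarrow.dataset  # noqa: F401
```

On a build without datasets the test is then reported as skipped instead of failing with ModuleNotFoundError.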

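In the meantime, you can ignore the test failure, change the test so that it is skipped (I think that means adding @pytest.mark.dataset above the test function), or add datasets to your C++ build (probably my preferred option).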

To add datasets to your C++ build, add -DARROW_DATASET=ON next to -DARROW_PARQUET=ON in build_venv.sh.
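If you would rather script that edit than make it by hand, a minimal sketch follows. It assumes the flag appears literally as -DARROW_PARQUET=ON in the script text. The function name add_dataset_flag and the sample excerpt are hypothetical, and the real layout of build_venv.sh may differ. Depending on the Arrow version, the Python build step may also need PYARROW_WITH_DATASET=1 exported alongside the existing PYARROW_WITH_PARQUET setting; treat that as something to verify for your checkout.

```python
PARQUET_FLAG = "-DARROW_PARQUET=ON"
DATASET_FLAG = "-DARROW_DATASET=ON"


def add_dataset_flag(script_text: str) -> str:
    """Insert the dataset flag right after the first Parquet flag."""
    if DATASET_FLAG in script_text:
        return script_text  # already enabled, nothing to change
    if PARQUET_FLAG not in script_text:
        raise ValueError(f"{PARQUET_FLAG} not found; edit the script by hand")
    # Putting both flags on one line keeps any trailing backslash
    # line continuation intact.
    return script_text.replace(PARQUET_FLAG, f"{PARQUET_FLAG} {DATASET_FLAG}", 1)


# Hypothetical excerpt, only for illustration; not the real script contents.
sample = (
    "cmake -GNinja \\\n"
    "      -DARROW_PARQUET=ON \\\n"
    "      -DARROW_PYTHON=ON \\\n"
    "      ..\n"
)
patched = add_dataset_flag(sample)
print(patched)
```

To apply it to the real file, read build_venv.sh with pathlib.Path.read_text, pass the text through add_dataset_flag, write the result back, then rebuild the Docker image and rerun the script.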

