PYTHONPATH error [WinError 123] with Pyspark on Windows — duplicated drive label "C:\\C:\\...spark-core_2.11-2.3.2.jar" when lazily-importing libraries like NLTK or PATTERN

Problem description

The problem involves Windows paths and lazily-imported libraries such as nltk: nltk and pattern only import their dependencies at first use. At that moment, importlib_metadata.py and pathlib.py read a PYTHONPATH entry whose path is malformed (the drive label is duplicated, D:/D:/ style), and the code blows up.
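To illustrate the lazy-import behavior described above, here is a minimal sketch (my own, not nltk's actual mechanism) of a module wrapper that defers the real import until first attribute access — which is why the crash only fires when the function is first used inside the worker, not at `import nltk` time:

```python
import importlib

class LazyModule:
    """Defer the real import until first attribute access,
    mimicking how nltk/pattern load heavy dependencies lazily."""
    def __init__(self, name):
        self._name = name
        self._module = None

    def __getattr__(self, attr):
        if self._module is None:
            # The actual import -- and any sys.path scanning -- happens here
            self._module = importlib.import_module(self._name)
        return getattr(self._module, attr)

json = LazyModule("json")        # nothing imported yet
print(json.dumps({"a": 1}))      # the import happens here, on first use
```

If sys.path is broken at the time of that first attribute access, the failure surfaces deep inside the import machinery, exactly as in the traceback below.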

First, we have a simple function like this:

import nltk

def print_stopwords():
    print(nltk.corpus.stopwords)

In local mode you can run it and get all the stopwords, OK.

If you try to use this function inside a Spark map, to fit it into a Pyspark workflow, the code above does not work. Why? I honestly don't know...

I think the reason it doesn't work is that Spark's Java side uses and modifies PYTHONPATH when executing a map function like this:

import nltk
from pyspark.sql import SQLContext, SparkSession

spark = (SparkSession
         .builder
         .master("local[*]")
         .appName("Nueva")
         .getOrCreate())

sc = spark.sparkContext
sqlContext = SQLContext(sc)

def print_stopwords(x):
    print("\n",x)
    print(nltk.corpus.stopwords.words('english'))
    return x

prueba = sc.parallelize([0,1,2,3])
r = prueba.map(print_stopwords)
r.take(1)

I get the error:

  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\__init__.py", line 143, in <module>
    from nltk.chunk import *
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\chunk\__init__.py", line 157, in <module>
    from nltk.chunk.api import ChunkParserI
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\chunk\api.py", line 13, in <module>
    from nltk.parse import ParserI
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\parse\__init__.py", line 100, in <module>
    from nltk.parse.transitionparser import TransitionParser
  File "C:\ProgramData\Anaconda3\lib\site-packages\nltk\parse\transitionparser.py", line 22, in <module>
    from sklearn.datasets import load_svmlight_file
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\datasets\__init__.py", line 22, in <module>
    from .twenty_newsgroups import fetch_20newsgroups
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\datasets\twenty_newsgroups.py", line 44, in <module>
    from ..feature_extraction.text import CountVectorizer
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\__init__.py", line 10, in <module>
    from . import text
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 28, in <module>
    from ..preprocessing import normalize
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\__init__.py", line 6, in <module>
    from ._function_transformer import FunctionTransformer
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\preprocessing\_function_transformer.py", line 5, in <module>
    from ..utils.testing import assert_allclose_dense_sparse
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\utils\testing.py", line 718, in <module>
    import pytest
  File "C:\ProgramData\Anaconda3\lib\site-packages\pytest.py", line 6, in <module>
    from _pytest.assertion import register_assert_rewrite
  File "C:\ProgramData\Anaconda3\lib\site-packages\_pytest\assertion\__init__.py", line 7, in <module>
    from _pytest.assertion import rewrite
  File "C:\ProgramData\Anaconda3\lib\site-packages\_pytest\assertion\rewrite.py", line 26, in <module>
    from _pytest.assertion import util
  File "C:\ProgramData\Anaconda3\lib\site-packages\_pytest\assertion\util.py", line 8, in <module>
    import _pytest._code
  File "C:\ProgramData\Anaconda3\lib\site-packages\_pytest\_code\__init__.py", line 2, in <module>
    from .code import Code  # noqa
  File "C:\ProgramData\Anaconda3\lib\site-packages\_pytest\_code\code.py", line 23, in <module>
    import pluggy
  File "C:\ProgramData\Anaconda3\lib\site-packages\pluggy\__init__.py", line 16, in <module>
    from .manager import PluginManager, PluginValidationError
  File "C:\ProgramData\Anaconda3\lib\site-packages\pluggy\manager.py", line 11, in <module>
    import importlib_metadata
  File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 549, in <module>
    __version__ = version(__name__)
  File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 511, in version
    return distribution(distribution_name).version
  File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 482, in distribution
    return Distribution.from_name(distribution_name)
  File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 183, in from_name
    dist = next(dists, None)
  File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 425, in <genexpr>
    for path in map(cls._switch_path, paths)
  File "C:\ProgramData\Anaconda3\lib\site-packages\importlib_metadata\__init__.py", line 449, in _search_path
    if not root.is_dir():
  File "C:\ProgramData\Anaconda3\lib\pathlib.py", line 1351, in is_dir
    return S_ISDIR(self.stat().st_mode)
  File "C:\ProgramData\Anaconda3\lib\pathlib.py", line 1161, in stat
    return self._accessor.stat(self)
OSError: [WinError 123] The file name, directory name or volume label syntax is not correct: 'C:\\C:\\Enviroments\\spark-2.3.2-bin-hadoop2.7\\jars\\spark-core_2.11-2.3.2.jar'

I printed the environment variables from inside pathlib.py and importlib_metadata.py, and the PYTHONPATH value looks like this:

'PYTHONPATH': 'C:\\Enviroments\\spark-2.3.2-bin-hadoop2.7\\python\\lib\\pyspark.zip;C:\\Enviroments\\spark-2.3.2-bin-hadoop2.7\\python\\lib\\py4j-0.10.7-src.zip;/C:/Enviroments/spark-2.3.2-bin-hadoop2.7/jars/spark-core_2.11-2.3.2.jar'

I tried editing the path inside the function, outside it, and every other way I could think of... but at some point Spark serializes the function and rewrites PYTHONPATH — not in the Python files but in the Java ones, and I cannot debug that code because Spark runs inside a container and, for a lot of complicated reasons involving my IDE (IntelliJ IDEA), I cannot attach to an IP and port.

The reason it fails is this leading slash -> /C:/Enviroments/spark-2.3.2-bin-hadoop2.7/jars/spark-core_2.11-2.3.2.jar. On Windows, Python interprets a path beginning with a slash as absolute on the current drive and prepends that drive's label: /C: => C:/C:/. Then, at execution time, it raises the error because that path obviously does not exist.
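This drive-doubling can be reproduced on any OS with the stdlib `ntpath` module, which implements Windows path rules. Since '/C:/...' starts with a separator but has no drive of its own, joining treats it as rooted on the current drive and keeps that drive letter in front:

```python
import ntpath  # Windows path semantics, importable from any platform

jar = "/C:/Enviroments/spark-2.3.2-bin-hadoop2.7/jars/spark-core_2.11-2.3.2.jar"

# splitdrive() finds no drive in '/C:/...', so the leading '/' makes it
# "absolute on the current drive" and the 'C:' drive label is kept:
joined = ntpath.join("C:\\", jar)
print(joined)  # C:/C:/Enviroments/... -- the broken path from the traceback
```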

Please help me! Thanks in advance :)

Tags: python, windows, apache-spark, pyspark, nltk

Solution


I ran into the same problem while using pytest. I don't have a proper solution for the malformed paths on Windows.

You can apply a quick fix like this:

import os
import sys

for path in list(sys.path):
    if not os.path.exists(path):
        sys.path.remove(path)

At least that will get rid of the error.
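For the Spark case, the cleanup has to run on each worker before the lazy import fires, so one option (my own sketch, not tested against the asker's cluster) is to wrap the fix in a helper and call it at the top of the mapped function, deferring `import nltk` until afterwards:

```python
import os
import sys

def clean_sys_path():
    """Drop sys.path entries that don't exist on disk,
    e.g. the bogus 'C:/C:/...jar' entry from the traceback."""
    for path in list(sys.path):
        if path and not os.path.exists(path):
            sys.path.remove(path)

# Hypothetical rewrite of the mapped function: sanitize sys.path first,
# then import nltk, so the lazy import scans only valid entries.
def print_stopwords(x):
    clean_sys_path()
    import nltk  # deferred import, runs after the cleanup on each worker
    print(nltk.corpus.stopwords.words('english'))
    return x
```

The empty-string guard keeps the implicit current-directory entry in place; only paths that genuinely fail `os.path.exists` are removed.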
