py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.pythonToJava

Problem description

I am trying to train a Word2Vec model with PySpark on Windows 10. I installed py4j via pip install.

import re  # used by split_code below

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext
from pyspark.mllib.feature import Word2Vec

conf = SparkConf().setAppName("test")
sc = SparkContext(conf=conf)
sqlCtx = SQLContext(sc)
code_lines = sqlCtx.read.option("multiLine", True).option("mode", "PERMISSIVE").json(r"\jsons\hi.json")  # raw string so the backslashes are not treated as escapes
code_lines = code_lines.repartition(300)

def split_code(input):
    strs = " ".join(input)
    patt = re.compile(r"[\w]", re.UNICODE)
    return patt.findall(strs)

words = code_lines\
    .rdd.map(
        lambda thing: (thing[11].split())
    )\
    .map(lambda line: [f.lower() for f in line])\
    .map(lambda line: split_code(line))\
    .filter(lambda line: line != [])

word2vec = Word2Vec()
word2vec.setMinCount(25)    # Default 5
word2vec.setVectorSize(50)  # Default 100
model = word2vec.fit(words)

But I keep getting this error message.

Traceback (most recent call last):
  File "C:/Users/daman/Desktop/Work/hermes-master/src/utils/code_etl/testing.py", line 30, in <module>
model = word2vec.fit(words)
  File "C:\Users\daman\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\mllib\feature.py", line 773, in fit
int(self.minCount), int(self.windowSize))
  File "C:\Users\daman\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\mllib\common.py", line 130, in callMLlibFunc
return callJavaFunc(sc, api, *args)
  File "C:\Users\daman\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\mllib\common.py", line 122, in callJavaFunc
args = [_py2java(sc, a) for a in args]
  File "C:\Users\daman\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\mllib\common.py", line 122, in <listcomp>
args = [_py2java(sc, a) for a in args]
  File "C:\Users\daman\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\mllib\common.py", line 75, in _py2java
obj = _to_java_object_rdd(obj)
  File "C:\Users\daman\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\mllib\common.py", line 69, in _to_java_object_rdd
return rdd.ctx._jvm.org.apache.spark.mllib.api.python.SerDe.pythonToJava(rdd._jrdd, True)
  File "C:\Users\daman\AppData\Local\Programs\Python\Python37-32\lib\site-packages\py4j\java_gateway.py", line 1257, in __call__
answer, self.gateway_client, self.target_id, self.name)
  File "C:\Users\daman\AppData\Local\Programs\Python\Python37-32\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
return f(*a, **kw)
  File "C:\Users\daman\AppData\Local\Programs\Python\Python37-32\lib\site-packages\py4j\protocol.py", line 328, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.mllib.api.python.SerDe.pythonToJava.
: java.lang.IllegalArgumentException
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.xbean.asm5.ClassReader.<init>(Unknown Source)
at org.apache.spark.util.ClosureCleaner$.getClassReader(ClosureCleaner.scala:46)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:449)
at org.apache.spark.util.FieldAccessFinder$$anon$3$$anonfun$visitMethodInsn$2.apply(ClosureCleaner.scala:432)
at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashMap$$anon$1$$anonfun$foreach$2.apply(HashMap.scala:103)
at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap$$anon$1.foreach(HashMap.scala:103)
at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
at org.apache.spark.util.FieldAccessFinder$$anon$3.visitMethodInsn(ClosureCleaner.scala:432)
at org.apache.xbean.asm5.ClassReader.a(Unknown Source)
at org.apache.xbean.asm5.ClassReader.b(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.xbean.asm5.ClassReader.accept(Unknown Source)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:262)
at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$14.apply(ClosureCleaner.scala:261)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:261)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:159)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2299)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:798)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:797)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:363)
at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:797)
at org.apache.spark.mllib.api.python.SerDeBase.pythonToJava(PythonMLLibAPI.scala:1349)
at org.apache.spark.mllib.api.python.SerDe.pythonToJava(PythonMLLibAPI.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.base/java.lang.reflect.Method.invoke(Unknown Source)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:282)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.base/java.lang.Thread.run(Unknown Source)
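
To separate a data problem from an environment problem, the fit call can also be exercised without the JSON file. The snippet below is a minimal sketch with made-up sample sentences (they are not taken from hi.json); if the same Py4JJavaError appears with it, the input data is not the cause.

from pyspark import SparkContext, SparkConf
from pyspark.mllib.feature import Word2Vec

# Self-contained sanity check: train Word2Vec on a tiny in-memory corpus
# instead of the JSON-derived RDD. The sentences below are placeholder data.
conf = SparkConf().setAppName("word2vec-sanity-check")
sc = SparkContext(conf=conf)

toy_corpus = sc.parallelize([
    ["spark", "makes", "big", "data", "simple"],
    ["word2vec", "learns", "word", "vectors"],
    ["spark", "runs", "on", "the", "jvm"],
] * 100)  # repeated so every word clears the minCount threshold

model = Word2Vec().setMinCount(1).setVectorSize(10).fit(toy_corpus)
print(model.transform("spark"))  # expect a 10-dimensional vector
sc.stop()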

I suspect that maybe PySpark is not installed correctly: when I type pyspark in cmd, I get the error "The system cannot find the path specified", even though I have set the path correctly. Interestingly, it prints "The system cannot find the path specified" twice.

Tags: python, pyspark, windows-10

Solution
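
The java.base/jdk.internal.reflect frames in the traceback show the JVM is Java 9 or newer, and an IllegalArgumentException thrown from org.apache.xbean.asm5.ClassReader inside Spark's closure cleaner is the usual symptom of running a Spark 2.x build (which bundles ASM 5) on a Java version whose class files ASM 5 cannot read. A reasonable first step is to confirm which Java and Spark versions are actually being picked up. The sketch below assumes a typical Windows setup where java is on PATH; the environment variable names are the ones Spark conventionally uses.

import os
import subprocess

import pyspark

# Show the environment variables Spark relies on; unset or wrong values
# also explain the "The system cannot find the path specified" message
# printed when launching pyspark from cmd.
for var in ("JAVA_HOME", "SPARK_HOME", "HADOOP_HOME"):
    print(var, "=", os.environ.get(var, "<not set>"))

# Ask the java launcher on PATH for its version (it prints to stderr).
# Spark 2.x is generally built and tested against Java 8 (1.8).
subprocess.run(["java", "-version"])

print("pyspark version:", pyspark.__version__)

If java -version reports 9, 10, or 11, pointing JAVA_HOME and PATH at a Java 8 JDK, or moving to a Spark release built against a newer ASM, is the commonly reported fix for this particular error.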

