python - Running sparknlp DocumentAssembler on EMR
Problem description
I am trying to run sparknlp on EMR. I logged in to my Zeppelin notebook and ran the following commands:
import sparknlp
from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("BBC Text Categorization") \
    .config("spark.driver.memory", "8G") \
    .config("spark.memory.offHeap.enabled", True) \
    .config("spark.memory.offHeap.size", "8G") \
    .config("spark.driver.maxResultSize", "2G") \
    .config("spark.jars.packages", "JohnSnowLabs:spark-nlp:2.4.5") \
    .config("spark.kryoserializer.buffer.max", "1000M") \
    .config("spark.network.timeout", "3600s") \
    .getOrCreate()
from sparknlp.base import DocumentAssembler

documentAssembler = DocumentAssembler() \
    .setInputCol("description") \
    .setOutputCol("document")
This results in the following error:
Fail to execute line 1: documentAssembler = DocumentAssembler()\
Traceback (most recent call last):
File "/tmp/zeppelin_pyspark-4581426413302524147.py", line 380, in <module>
exec(code, _zcUserQueryNameSpace)
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 110, in wrapper
return func(self, **kwargs)
File "/usr/local/lib/python3.6/site-packages/sparknlp/base.py", line 148, in __init__
super(DocumentAssembler, self).__init__(classname="com.johnsnowlabs.nlp.DocumentAssembler")
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/__init__.py", line 110, in wrapper
return func(self, **kwargs)
File "/usr/local/lib/python3.6/site-packages/sparknlp/internal.py", line 72, in __init__
self._java_obj = self._new_java_obj(classname, self.uid)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 67, in _new_java_obj
return java_obj(*java_args)
TypeError: 'JavaPackage' object is not callable
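To understand the problem, I logged in to the master node and ran the same commands in the pyspark console. Everything runs fine, and I do not get the error above, if I start the pyspark console with the following command: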
pyspark --packages JohnSnowLabs:spark-nlp:2.4.5
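However, I get the same error as before when I start the console with just the command pyspark.
How can I make this work in my Zeppelin notebook?
Setup details: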
EMR 5.27.0
spark 2.4.4
openjdk version "1.8.0_272"
OpenJDK Runtime Environment (build 1.8.0_272-b10)
OpenJDK 64-Bit Server VM (build 25.272-b10, mixed mode)
Here is my bootstrap script:
#!/bin/bash
sudo yum install -y python36-devel python36-pip python36-setuptools python36-virtualenv
sudo python36 -m pip install --upgrade pip
sudo python36 -m pip install pandas
sudo python36 -m pip install boto3
sudo python36 -m pip install re
sudo python36 -m pip install spark-nlp==2.7.2
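One detail in the setup above is that the bootstrap script installs the Python package spark-nlp 2.7.2, while the notebook requests the jar JohnSnowLabs:spark-nlp:2.4.5. Whether this mismatch contributes to the error is not established in the question. The sketch below only compares the two version strings. The helper names maven_coordinate_version and versions_match are hypothetical and are not part of Spark NLP.

```python
def maven_coordinate_version(coordinate):
    """Return the version part of a 'group:artifact:version' coordinate."""
    parts = coordinate.split(":")
    if len(parts) != 3:
        raise ValueError("expected 'group:artifact:version', got %r" % coordinate)
    return parts[2]


def versions_match(jar_coordinate, python_package_version):
    """True if the jar coordinate and the Python package share a version."""
    return maven_coordinate_version(jar_coordinate) == python_package_version


# Values copied from the question: the notebook config and the bootstrap script.
jar_coordinate = "JohnSnowLabs:spark-nlp:2.4.5"
pip_version = "2.7.2"

print(versions_match(jar_coordinate, pip_version))  # prints False
```

In a live session the installed Python package version could be read from sparknlp.version() instead of being hard-coded, assuming that function is available in the installed release.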
Solution
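The observations in the question point at the JVM classpath: the same code works in a pyspark console started with --packages and fails without it. py4j resolves a dotted name that it cannot find as a class to a JavaPackage placeholder, and calling that placeholder raises the TypeError shown in the traceback. In Zeppelin the interpreter usually owns the SparkSession already, so getOrCreate() may simply return it, and spark.jars.packages set in the builder may arrive too late to load the jar. That is an assumption about this setup, not something the question confirms. A minimal sketch for checking whether the class is visible to the JVM, using the hypothetical helper names resolve_jvm_name and jvm_class_is_loaded:

```python
def resolve_jvm_name(jvm, qualified_name):
    """Walk a dotted Java name through a py4j JVM view, one attribute at a time."""
    obj = jvm
    for part in qualified_name.split("."):
        obj = getattr(obj, part)
    return obj


def jvm_class_is_loaded(jvm, qualified_name):
    """Return True if the name resolves to a py4j JavaClass.

    py4j hands back a JavaPackage for any name it cannot resolve to a class,
    so checking the type name distinguishes a loaded class from a missing one.
    """
    return type(resolve_jvm_name(jvm, qualified_name)).__name__ == "JavaClass"


# In the notebook, with a live SparkSession named `spark` (not run here):
# jvm_class_is_loaded(spark._jvm, "com.johnsnowlabs.nlp.DocumentAssembler")
```

If the check returns False in the notebook, the package probably has to be supplied through the Zeppelin Spark interpreter's own configuration before the interpreter starts, rather than through the builder in a paragraph.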