pyspark - How to pickle a spaCy model for use in a PySpark function
Problem description
I am running a spaCy Matcher model defined as follows:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")

def spacy_matcher(text):
    doc = nlp(text)
    matcher = Matcher(nlp.vocab)
    matcher.add("NounChunks", None, [{"POS": "NOUN", "OP": "+"}])
    matches = matcher(doc)
    spans = [doc[start:end] for _, start, end in matches]
    return [spacy.util.filter_spans(spans)]

matcher2 = udf(spacy_matcher, ArrayType(StringType()))
When I try to apply this UDF to a new column:
test = reviews.withColumn('chunk',matcher2('SENTENCE'))
test.show()
I get a pickling error:
NotImplementedError: [E112] Pickling a span is not supported, because spans are only views of the parent Doc and can't exist on their own. A pickled span would always have to include its Doc and Vocab, which has practically no advantage over pickling the parent Doc directly. So instead of pickling the span, pickle the Doc it belongs to or use Span.as_doc to convert the span to a standalone Doc object.
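I haven't done much with pickling and am not sure how to handle this, because I do want to keep the spans, since they are what define my chunks. Any idea how to pickle this properly?
Solution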
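The error message most likely concerns the UDF's return value rather than the model itself: the function returns Span objects (wrapped in an extra list), and Spark has to pickle whatever a Python UDF returns. A minimal sketch of one possible fix, assuming plain strings are enough to represent each chunk, is to return span.text so that the result actually matches ArrayType(StringType()). The helper names build_matcher, extract_noun_runs and make_udf and the _PIPELINE cache are illustrative and do not come from the original post; the two-argument matcher.add call assumes spaCy 2.2.2 or later.

```python
# Sketch only: helper names and the cache are assumptions, not the asker's code.
# Third-party imports are done inside the functions so that they run in the
# Python process that actually executes the UDF.

_PIPELINE = {}  # filled on first use so the model is not reloaded for every row


def build_matcher(vocab):
    """Create a Matcher for runs of one or more consecutive nouns."""
    from spacy.matcher import Matcher

    matcher = Matcher(vocab)
    # Patterns are passed as a list of patterns (assumes spaCy >= 2.2.2).
    matcher.add("NounChunks", [[{"POS": "NOUN", "OP": "+"}]])
    return matcher


def extract_noun_runs(doc, matcher):
    """Return the longest non-overlapping matches as plain strings."""
    from spacy.util import filter_spans

    spans = [doc[start:end] for _, start, end in matcher(doc)]
    # Span objects cannot be pickled, so hand Spark their text instead.
    return [span.text for span in filter_spans(spans)]


def spacy_matcher(text):
    """UDF body: takes a string (or None) and returns a list of strings."""
    if text is None:  # null cells reach a Python UDF as None
        return []
    if not _PIPELINE:
        import spacy

        nlp = spacy.load("en_core_web_sm")
        _PIPELINE["nlp"] = nlp
        _PIPELINE["matcher"] = build_matcher(nlp.vocab)
    doc = _PIPELINE["nlp"](text)
    return extract_noun_runs(doc, _PIPELINE["matcher"])


def make_udf():
    """Wrap spacy_matcher as a Spark UDF returning array<string>."""
    from pyspark.sql.functions import udf
    from pyspark.sql.types import ArrayType, StringType

    return udf(spacy_matcher, ArrayType(StringType()))
```

With this in place, matcher2 = make_udf() followed by reviews.withColumn('chunk', matcher2('SENTENCE')) should produce a column of string arrays. If token offsets are also needed, one option is to return (span.start, span.end) pairs alongside the text and declare a matching struct type, since the Span itself cannot leave the worker.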