How to pickle a spaCy model for use in a PySpark function

Problem Description

I'm running a spaCy matcher model defined as follows:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")


def spacy_matcher(text):
    doc = nlp(text)
    matcher = Matcher(nlp.vocab)
    matcher.add("NounChunks", None,  [{"POS": "NOUN", "OP": "+"}] )
    matches = matcher(doc)
    spans = [doc[start:end] for _, start, end in matches]
    return [spacy.util.filter_spans(spans)]

matcher2 = udf(spacy_matcher, ArrayType(StringType()))

When I try to apply this UDF to a new column:

test = reviews.withColumn('chunk',matcher2('SENTENCE'))
test.show()

I get a pickling error:

NotImplementedError: [E112] Pickling a span is not supported, because spans are only views of the parent Doc and can't exist on their own. A pickled span would always have to include its Doc and Vocab, which has practically no advantage over pickling the parent Doc directly. So instead of pickling the span, pickle the Doc it belongs to or use Span.as_doc to convert the span to a standalone Doc object.

I haven't worked with pickling much and am not sure how to handle this. I do want to keep the spans, since they are what define my chunks. Any idea how to pickle this correctly?

Tags: pyspark, pickle, user-defined-functions, spacy

Solution
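
The traceback points at the UDF's return value, not the model itself. PySpark pickles whatever a UDF returns in order to ship it back from the Python worker, and spaCy Span objects refuse to be pickled (error E112) because they are only views into their parent Doc. Since the UDF is already declared as ArrayType(StringType()), the simplest fix is to return the text of each span rather than the Span objects. A minimal sketch of that change, keeping the question's spaCy v2 Matcher.add signature and its reviews DataFrame / SENTENCE column:

from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_sm")


def spacy_matcher(text):
    doc = nlp(text)
    matcher = Matcher(nlp.vocab)
    matcher.add("NounChunks", None, [{"POS": "NOUN", "OP": "+"}])  # spaCy v2 signature
    matches = matcher(doc)
    spans = [doc[start:end] for _, start, end in matches]
    # Return plain strings instead of Span objects: strings pickle fine
    # and match the declared ArrayType(StringType()).
    return [span.text for span in spacy.util.filter_spans(spans)]

matcher2 = udf(spacy_matcher, ArrayType(StringType()))

# Applied to the question's DataFrame:
test = reviews.withColumn('chunk', matcher2('SENTENCE'))

If you genuinely need the spans themselves downstream, the error message already shows the way: pickle the parent Doc, or call Span.as_doc() to turn a span into a standalone Doc. Inside a Spark UDF, though, returning plain strings (or other primitive types) is the practical route.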


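One more thing worth noting: because nlp is a module-level global referenced inside the function, Spark also serializes it into the UDF's closure and ships it to every executor, which is slow for a full pipeline. A common workaround, assuming the en_core_web_sm package is installed on each executor, is to lazy-load the model inside the function so each worker loads it once from disk. A sketch (_get_nlp is a hypothetical helper, not part of the question):

import spacy
from spacy.matcher import Matcher
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType, ArrayType

_NLP = None  # one cached model instance per Python worker process


def _get_nlp():
    # Load the model lazily on the executor instead of shipping it
    # from the driver inside the pickled closure.
    global _NLP
    if _NLP is None:
        _NLP = spacy.load("en_core_web_sm")
    return _NLP


def spacy_matcher(text):
    nlp = _get_nlp()
    matcher = Matcher(nlp.vocab)
    matcher.add("NounChunks", None, [{"POS": "NOUN", "OP": "+"}])  # spaCy v2 signature
    doc = nlp(text)
    spans = [doc[start:end] for _, start, end in matcher(doc)]
    return [span.text for span in spacy.util.filter_spans(spans)]

matcher2 = udf(spacy_matcher, ArrayType(StringType()))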