Python/PySpark: spaCy tokenizer returns a single string instead of an array

Problem Description

I am trying to tokenize text with spaCy and want the tokenized output as an array of strings. Currently I am using:

from pyspark.sql.functions import udf
import spacy
nlp = spacy.load("en_core_web_sm")

def spacy_tokenizer(text):
    doc = nlp(text)
    return [token.text for token in doc]
tokenize = udf(spacy_tokenizer)

df2 = df.withColumn('TOKEN', tokenize('SENTENCE'))

from pyspark.sql.functions import array
df3 = df2.withColumn("TOKEN_ARRAY", array('TOKEN'))
df3.show()
+---------------+---------------------+-----------------------+
|  SENTENCE     |  TOKEN              | TOKEN_ARRAY           |
+---------------+---------------------+-----------------------+
|  Cool to wear.|  [Cool, to, wear, .]| [[Cool, to, wear, .]] |
+---------------+---------------------+-----------------------+

This creates a one-element array whose single element is the entire string, whereas I want an array of 4 elements (each individual token as one element). I tested this with array_contains: it returns true only when I search for the whole string, and false when I search for an individual token.

from pyspark.sql.functions import array_contains
df4=df3.withColumn("test", array_contains("TOKEN_ARRAY", "[Cool, to, wear, .]")).show()
+---------------+---------------------+-----------------------+-------+
|  SENTENCE     |  TOKEN              | TOKEN_ARRAY           | test  |
+---------------+---------------------+-----------------------+-------+
|  Cool to wear.|  [Cool, to, wear, .]| [[Cool, to, wear, .]] | true  |
+---------------+---------------------+-----------------------+-------+


df4=df3.withColumn("test", array_contains("TOKEN_ARRAY", "Cool")).show()
+---------------+---------------------+-----------------------+-------+
|  SENTENCE     |  TOKEN              | TOKEN_ARRAY           | test  |
+---------------+---------------------+-----------------------+-------+
|  Cool to wear.|  [Cool, to, wear, .]| [[Cool, to, wear, .]] | false |
+---------------+---------------------+-----------------------+-------+

Tags: python, arrays, pyspark, apache-spark-sql

Solution


You need to declare the UDF's return type as an array of strings; then it works:

from pyspark.sql.types import ArrayType, StringType

tokenize = udf(spacy_tokenizer, ArrayType(StringType()))
