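apache-spark - How to find the position of a substring column within another column using PySpark?

Problem description

If I have a PySpark DataFrame with two columns, text and subtext, where subtext is guaranteed to occur somewhere within text, how would I calculate the position of subtext in the text column?

Input data: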

+---------------------------+---------+
|           text            | subtext | 
+---------------------------+---------+
| Where is my string?       | is      |
| Hm, this one is different | on      |
+---------------------------+---------+

Expected output:

+---------------------------+---------+----------+
|           text            | subtext | position |
+---------------------------+---------+----------+
| Where is my string?       | is      |       6  |
| Hm, this one is different | on      |       9  |
+---------------------------+---------+----------+

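Note: I can do this with static text/regex without any problem; what I cannot find is any resource on doing it with row-specific text/regex.

Tags: apache-spark, pyspark, apache-spark-sql

You can use locate. You need to subtract 1 because locate returns a 1-based position (the first character is position 1, not 0), while the expected output is 0-based.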

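import pyspark.sql.functions as F

# df is the DataFrame from the question, with columns text and subtext.
# locate(substr, str) is 1-based, so subtract 1 to get a 0-based position.
df2 = df.withColumn('position', F.expr('locate(subtext, text) - 1'))

df2.show(truncate=False)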
+-------------------------+-------+--------+
|text                     |subtext|position|
+-------------------------+-------+--------+
|Where is my string?      |is     |6       |
|Hm, this one is different|on     |9       |
+-------------------------+-------+--------+
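
To sanity-check the index arithmetic without a Spark session, here is a minimal pure-Python sketch. The helper locate_like is a hypothetical stand-in written for this illustration, not part of PySpark; it assumes the documented behaviour of Spark SQL's locate (1-based position, 0 when the substring is absent) and non-null inputs.

```python
def locate_like(substr: str, text: str) -> int:
    """Hypothetical stand-in for Spark SQL's locate():
    1-based position of the first match, 0 if there is no match."""
    # str.find is 0-based and returns -1 when absent, so +1 maps
    # it onto the 1-based / 0-when-absent convention.
    return text.find(substr) + 1


# The two rows from the question.
rows = [
    ("Where is my string?", "is"),
    ("Hm, this one is different", "on"),
]

# Same arithmetic as the SQL expression 'locate(subtext, text) - 1'.
positions = [locate_like(subtext, text) - 1 for text, subtext in rows]
print(positions)  # [6, 9]

# When the substring is absent, the subtraction yields -1.
missing = locate_like("xyz", "Where is my string?") - 1
print(missing)  # -1
```

The question states that subtext always occurs in text, so the -1 case should not come up there. It only matters if that guarantee is later relaxed.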
