regex - Is it possible to know which regular expression matched when using rlike in Apache Spark?
Problem description
Example:
val surveyDF = List(
  "I like pizza",
  "I love French fries",
  "Milkshake is so cute",
  "Icecream is yummy"
).toDF("survey")
val items = List("piz.*", "Ice.*")
I want to know how many people like pizza and ice cream.
With the help of the rlike function available in Apache Spark, I was able to get the result:
// show() returns Unit, so the result is not assigned to a val
surveyDF
  .withColumn(
    "contains_items",
    col("survey").rlike(items.mkString("|"))
  )
  .show(truncate = false)
Result:
+--------------------+--------------+
|survey              |contains_items|
+--------------------+--------------+
|I like pizza        |true          |
|I love French fries |false         |
|Milkshake is so cute|false         |
|Icecream is yummy   |true          |
+--------------------+--------------+
As we know, rlike only returns true or false. I would like to know whether there is any way to find out which regular expression produced the true result.
Expected result:
+--------------------+--------------+-----+
|survey              |contains_items|regex|
+--------------------+--------------+-----+
|I like pizza        |true          |piz.*|
|I love French fries |false         |null |
|Milkshake is so cute|false         |null |
|Icecream is yummy   |true          |Ice.*|
+--------------------+--------------+-----+
Solution
scala> import org.apache.spark.sql.expressions.UserDefinedFunction
scala> import org.apache.spark.sql.functions.{col, lit, udf, when}
scala> surveyDF.show(false)
+-----------------------------+
|survey                       |
+-----------------------------+
|I like pizza                 |
|I love French fries          |
|Milkshake is so cute         |
|Icecream is yummy            |
|pizza and Icecreams are yummy|
+-----------------------------+
scala> def MatchWord: UserDefinedFunction = udf((line: String, pattern: String) => {
     |   // The patterns arrive as a single "~"-separated string, so no pattern may contain "~"
     |   pattern.split("~").toList
     |     // keep every pattern that matches somewhere in the line
     |     .filter(p => p.r.findFirstIn(line).isDefined)
     |     // join the matching patterns with commas (empty string when none match)
     |     .mkString(",")
     | })
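The per-pattern check that the UDF performs can be tried outside Spark first. The following is a minimal plain-Scala sketch, not part of the original answer; the object name `MatchedPatterns` and the helper `matchedPatterns` are hypothetical names chosen for illustration.

```scala
// Hypothetical helper (illustration only): mirrors the UDF's matching logic without Spark.
object MatchedPatterns {
  // Returns every pattern that matches somewhere in `line`, in the order given.
  def matchedPatterns(line: String, patterns: Seq[String]): Seq[String] =
    patterns.filter(p => p.r.findFirstIn(line).isDefined)

  def main(args: Array[String]): Unit = {
    val items = Seq("piz.*", "Ice.*")
    val lines = Seq(
      "I like pizza",
      "I love French fries",
      "pizza and Icecreams are yummy"
    )
    // Print each line next to the comma-joined patterns that matched it
    lines.foreach { line =>
      println(s"$line -> ${matchedPatterns(line, items).mkString(",")}")
    }
  }
}
```

`findFirstIn` succeeds when the pattern is found anywhere in the string, which, as far as I know, is also how rlike treats its pattern, so the two checks should agree on the sample data.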
scala> val items = List("piz.*", "Ice.*")
scala> surveyDF.withColumn("contains_items",col("survey").rlike(items.mkString("|")))
.withColumn("regex", when(col("contains_items"), MatchWord(col("survey"),lit(items.mkString("~")))))
.show(false)
+-----------------------------+--------------+-----------+
|survey                       |contains_items|regex      |
+-----------------------------+--------------+-----------+
|I like pizza                 |true          |piz.*      |
|I love French fries          |false         |null       |
|Milkshake is so cute         |false         |null       |
|Icecream is yummy            |true          |Ice.*      |
|pizza and Icecreams are yummy|true          |piz.*,Ice.*|
+-----------------------------+--------------+-----------+
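As an alternative to the UDF, the same column can probably be built from Spark's built-in functions alone, which avoids packing the patterns into a "~"-separated string. The following is a sketch under assumptions, not the original answerer's method: the object name `MatchedRegexColumn`, the helper `withMatchedRegex`, and the local SparkSession setup are all hypothetical and only there to make the example self-contained.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, concat_ws, lit, when}

// Hypothetical wrapper object (illustration only).
object MatchedRegexColumn {
  // Adds "contains_items" and "regex" columns using built-in functions only.
  def withMatchedRegex(df: DataFrame, items: Seq[String]): DataFrame = {
    // One expression per pattern: the pattern text when it matches, null otherwise
    val perPattern = items.map(p => when(col("survey").rlike(p), lit(p)))
    df
      .withColumn("contains_items", col("survey").rlike(items.mkString("|")))
      // concat_ws skips nulls; the outer when() leaves null for rows with no match
      .withColumn("regex", when(col("contains_items"), concat_ws(",", perPattern: _*)))
  }

  def main(args: Array[String]): Unit = {
    // Assumed local session; in spark-shell the existing `spark` value would be used instead
    val spark = SparkSession.builder().master("local[*]").appName("matched-regex").getOrCreate()
    import spark.implicits._

    val surveyDF = List(
      "I like pizza",
      "I love French fries",
      "Milkshake is so cute",
      "Icecream is yummy",
      "pizza and Icecreams are yummy"
    ).toDF("survey")

    withMatchedRegex(surveyDF, List("piz.*", "Ice.*")).show(truncate = false)
    spark.stop()
  }
}
```

Because each pattern is tested by its own rlike expression, this version is expected to produce the same regex column as the UDF approach while letting Spark optimize the expressions itself.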