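scala - Updating a DataFrame column by comparing it against existing data in another column using the Levenshtein algorithm
Problem description
How can I use the Levenshtein algorithm to update the m_name column so that its null values are replaced?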
+--------------------+--------------------+-------------------+
|       original_name|              m_name|            created|
+--------------------+--------------------+-------------------+
|            New York|            New York|2017-08-01 09:33:40|
|            new york|                null|2017-08-01 15:15:06|
|       New York city|                null|2017-08-01 15:15:06|
|          california|          California|2017-09-01 09:33:40|
| California,000IU...|                null|2017-09-01 01:40:00|
|         Californiya|          California|2017-09-01 11:38:21|
+--------------------+--------------------+-------------------+
For each "original_name" value, the first closest "m_name" value, as determined by an algorithm based on Levenshtein distance (edit distance), should be used.
similarity(s1,s2) = [max(len(s1), len(s2)) − editDistance(s1,s2)] / max(len(s1), len(s2))
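The formula above can be sketched in plain Scala as follows. This is only an illustration: the object name `SimilaritySketch` is hypothetical, and treating two empty strings as fully similar (to avoid dividing by zero) is an assumption that the question does not state.

```scala
object SimilaritySketch {
  // Dynamic-programming edit distance: dist(j)(i) is the distance between
  // the first j characters of s2 and the first i characters of s1.
  def editDistance(s1: String, s2: String): Int = {
    val dist = Array.tabulate(s2.length + 1, s1.length + 1) { (j, i) =>
      if (j == 0) i else if (i == 0) j else 0
    }
    for (j <- 1 to s2.length; i <- 1 to s1.length) {
      val cost = if (s2(j - 1) == s1(i - 1)) 0 else 1
      dist(j)(i) = math.min(
        math.min(dist(j - 1)(i) + 1, dist(j)(i - 1) + 1),
        dist(j - 1)(i - 1) + cost)
    }
    dist(s2.length)(s1.length)
  }

  // similarity = (max length - edit distance) / max length, a value in [0, 1].
  def similarity(s1: String, s2: String): Double = {
    val maxLen = math.max(s1.length, s2.length)
    if (maxLen == 0) 1.0 // assumption: two empty strings count as identical
    else (maxLen - editDistance(s1, s2)).toDouble / maxLen
  }
}
```

A value of 1.0 means the strings are identical and 0.0 means every position differs, so a larger similarity corresponds to a smaller edit distance relative to the longer string.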
The "ideal" final result should look like this:
+--------------------+--------------------+-------------------+
|       original_name|              m_name|            created|
+--------------------+--------------------+-------------------+
|            New York|            New York|2017-08-01 09:33:40|
|            new york|            New York|2017-08-01 15:15:06|
|       New York city|            New York|2017-08-01 15:15:06|
|          california|          California|2017-09-01 09:33:40|
| California,000IU...|          California|2017-09-01 01:40:00|
|         Californiya|          California|2017-09-01 11:38:21|
+--------------------+--------------------+-------------------+
Solution
Credit to Rosetta Code (Levenshtein_distance) for the edit distance implementation.
You can do the following (commented for clarity and explanation):
// Collect the distinct m_name values into a list and broadcast it for use inside the UDF.
// collect_set already skips real nulls; the filter also drops the literal string "null".
// `sc` is assumed to be the active SparkContext (as in spark-shell).
import org.apache.spark.sql.functions._
val collectedList = df.select(collect_set("m_name")).rdd.collect().flatMap(row => row.getAs[Seq[String]](0).filterNot(_ == "null")).toList
val broadcastedList = sc.broadcast(collectedList)

// Levenshtein distance (dynamic-programming version, after Rosetta Code).
import scala.math.{min => mathmin}
def minimum(i1: Int, i2: Int, i3: Int): Int = mathmin(mathmin(i1, i2), i3)
def editDistance(s1: String, s2: String): Int = {
  val dist = Array.tabulate(s2.length + 1, s1.length + 1) { (j, i) => if (j == 0) i else if (i == 0) j else 0 }
  for (j <- 1 to s2.length; i <- 1 to s1.length)
    dist(j)(i) =
      if (s2(j - 1) == s1(i - 1)) dist(j - 1)(i - 1)
      else minimum(dist(j - 1)(i) + 1, dist(j)(i - 1) + 1, dist(j - 1)(i - 1) + 1)
  dist(s2.length)(s1.length)
}

// UDF: return the broadcast m_name closest to original_name (case-insensitive).
// When several candidates are equally close, the first one in the list wins.
val levenshteinUdf = udf((str1: String) => {
  val candidates = broadcastedList.value
  // Guard against a null input or an empty candidate list, either of which would otherwise throw.
  if (str1 == null || candidates.isEmpty) null
  else candidates.minBy(str2 => editDistance(str1.toLowerCase, str2.toLowerCase))
})

// Apply the UDF only where m_name is null (or the literal string "null").
df.withColumn("m_name", when(col("m_name").isNull || col("m_name") === "null", levenshteinUdf(col("original_name"))).otherwise(col("m_name"))).show(false)
This should give you:
+-------------------+----------+-------------------+
|original_name |m_name |created |
+-------------------+----------+-------------------+
|New York |New York |2017-08-01 09:33:40|
|new york |New York |2017-08-01 15:15:06|
|New York city |New York |2017-08-01 15:15:06|
|california |California|2017-09-01 09:33:40|
|California,000IU...|California|2017-09-01 01:40:00|
|Californiya |California|2017-09-01 11:38:21|
+-------------------+----------+-------------------+
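The matching step inside the UDF can also be checked without a Spark session. The following is a minimal sketch in plain Scala, not part of the original answer: the names `ClosestMatchSketch` and `closestMatch` are hypothetical, and returning `None` for a null name or an empty candidate list is an assumption about how such cases should be handled.

```scala
object ClosestMatchSketch {
  // Edit distance that keeps only two rows of the dynamic-programming table.
  def editDistance(a: String, b: String): Int = {
    var prev = Array.range(0, b.length + 1) // distance from "" to each prefix of b
    for (i <- 1 to a.length) {
      val curr = new Array[Int](b.length + 1)
      curr(0) = i // distance from a prefix of a to ""
      for (j <- 1 to b.length) {
        val cost = if (a(i - 1) == b(j - 1)) 0 else 1
        curr(j) = math.min(math.min(prev(j) + 1, curr(j - 1) + 1), prev(j - 1) + cost)
      }
      prev = curr
    }
    prev(b.length)
  }

  // Case-insensitive closest candidate; on a tie, minBy keeps the first candidate.
  def closestMatch(name: String, candidates: Seq[String]): Option[String] =
    if (name == null || candidates.isEmpty) None
    else Some(candidates.minBy(c => editDistance(name.toLowerCase, c.toLowerCase)))
}
```

With the candidates from the sample data, `ClosestMatchSketch.closestMatch("New York city", List("New York", "California"))` selects "New York", which matches the table above.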
Note: I did not use your similarity(s1,s2) = [max(len(s1), len(s2)) − editDistance(s1,s2)] / max(len(s1), len(s2)) logic, because it gave incorrect output.