Remove duplicate words longer than one character from a string

Problem description

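Thanks in advance for your help. I am trying to remove duplicate words from the add column of the pract_df DataFrame, but the output is not what I expect. The UDF I am using is included below.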

scala> pract_df.withColumn("add1", cleanNamePattern(col("add"))).show(false)
    +--------------------------------------------+--------------------------------------------+
    |add                                         |add1                                        |
    +--------------------------------------------+--------------------------------------------+
    |505152 SANAGONDA ODELA PEDDAPALLI 9985250499|505152 SANAGONDA ODELA PEDDAPALLI 9985250499|
    | kurnool nannuru kurnool                    | KURNOOL NANNURU                            |
    |R R district                                |R DISTRICT                                  |
    |J J nagara J J colonty                      |J NAGARA COLONTY                            |
    |sunil reddy reddy                           |SUNIL REDDY                                 |
    |ramesh reddy sunil reddy                    |RAMESH REDDY SUNIL                          |
    +--------------------------------------------+--------------------------------------------+

Below is the UDF used to remove duplicates from the add column; add1 is the result after applying the UDF.

val cleanNamePattern1 = udf((data: String) => {
      if (data == null) ""
      else if (data.isEmpty) ""
      else if(data.toLowerCase().contains("reddy","[a-zA-Z]{1} [a-zA-Z]{1}")) data
      else  {
        var result: String = ""
        for (s <- data.toLowerCase.replaceAll(" {1,}"," ").split(" ").toArray.distinct.mkString(" ")) {
          result += ""+s
        }
        result.trim
      }
      data.toLowerCase.replaceAll(" {1,}"," ").split(" ").toArray.distinct.mkString(" ")
        .toUpperCase()
    })

My expected output is

505152 SANAGONDA ODELA PEDDAPALLI 9985250499
 KURNOOL NANNURU
R R DISTRICT
J J NAGARA COLONTY
SUNIL REDDY
RAMESH REDDY SUNIL

When removing duplicates, I want to consider only words that are more than one character long.

Tags: arrays, scala, apache-spark, data-cleaning

Solution
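One thing worth noting about the UDF in the question: the value of the if/else chain is discarded, because the last expression in the lambda body is the unconditional `split(" ")...distinct` call. Every token is therefore deduplicated, including the single letters, which is why "R R district" becomes "R DISTRICT".

The following is a minimal sketch of one possible approach, not a definitive fix. It assumes the intended rule is that a run of adjacent single-character tokens (such as "J J") counts as one unit, and that duplicates are then removed across units; that reading is inferred from the expected output and is not stated in the question. The names `DedupeWords`, `dedupeWords` and `isInitials` are hypothetical.

```scala
object DedupeWords {
  // Hypothetical helper (not from the original post): removes repeated words,
  // treating a run of adjacent one-character tokens (e.g. "j j") as one unit.
  def dedupeWords(data: String): String = {
    if (data == null || data.isEmpty) ""
    else {
      // Collapse runs of spaces, then split. A leading space yields an empty
      // first token, which preserves the leading space in the result.
      val tokens = data.toLowerCase.replaceAll(" {2,}", " ").split(" ").toList

      // Merge each run of adjacent one-character tokens into a single unit.
      val units = tokens.foldLeft(List.empty[String]) { (acc, tok) =>
        acc match {
          case prev :: rest if tok.length == 1 && isInitials(prev) =>
            (prev + " " + tok) :: rest
          case _ =>
            tok :: acc
        }
      }.reverse

      // Keep the first occurrence of each unit, in order.
      units.distinct.mkString(" ").toUpperCase
    }
  }

  // True when the unit consists only of one-character tokens, e.g. "j" or "j j".
  private def isInitials(unit: String): Boolean =
    unit.nonEmpty && unit.split(" ").forall(_.length == 1)
}
```

To use it on the DataFrame, the function can be wrapped with `udf(DedupeWords.dedupeWords _)` from `org.apache.spark.sql.functions` and applied as `pract_df.withColumn("add1", cleanName(col("add")))`, where `cleanName` is a hypothetical name for the wrapped UDF. Under the assumed rule, a longer run such as "a b a b" is kept whole as a single unit, so adjust the grouping if that is not the behaviour you want.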

