Partial string matching in R to unify text into one category

Problem description

I have the following dataset:

EstablishmentName                    Freq
bahria university                    20 
bahria university islamabad          12
arid agriculture                     3
arid agriculture university          15
arid rawalpindi                      9
college of e&me, nust                20
college of e & me (nust)             15
college of eme                       30

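As you can see above, Bahria University and Bahria University Islamabad are nearly identical, and the same goes for the other strings. I want to merge each set of near-duplicates into a single category and sum their frequencies, like this:

Expected output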

EstablishmentName                   Freq
Bahria University                   32
Arid Agriculture                    27
College of EME                      65

I tried the following, but it does not seem to work:

    library(SnowballC)
    library(dplyr)

    mutate(df, word = wordStem(EstablishmentName)) %>%
      group_by(EstablishmentName) %>%
      summarise(total = sum(Freq))

Tags: r, dplyr, data.table

Solution
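One possible approach, offered as a sketch rather than a definitive answer. The attempt in the question has two likely problems: `wordStem()` stems each element as a single word, so a multi-word name such as "bahria university islamabad" is not reduced to a shared key, and the result is then grouped by the original `EstablishmentName` column instead of the new `word` column. A simple alternative is to map each name to a canonical label with hand-written patterns and then sum by that label. The column name `Canonical` and the three regular expressions below are assumptions made for this sample data, not a general rule; real data would need its own patterns or a fuzzy-matching step.

```r
library(dplyr)

# Sample data copied from the question
df <- data.frame(
  EstablishmentName = c("bahria university", "bahria university islamabad",
                        "arid agriculture", "arid agriculture university",
                        "arid rawalpindi", "college of e&me, nust",
                        "college of e & me (nust)", "college of eme"),
  Freq = c(20, 12, 3, 15, 9, 20, 15, 30),
  stringsAsFactors = FALSE
)

result <- df %>%
  # Assign a canonical label from hand-written prefix patterns (assumed)
  mutate(Canonical = case_when(
    grepl("^bahria", EstablishmentName)       ~ "Bahria University",
    grepl("^arid", EstablishmentName)         ~ "Arid Agriculture",
    grepl("^college of e", EstablishmentName) ~ "College of EME",
    TRUE                                      ~ EstablishmentName  # leave unmatched names as-is
  )) %>%
  # Group by the new label, not the original name, then sum the counts
  group_by(Canonical) %>%
  summarise(Freq = sum(Freq))

print(result)
```

With the sample data this gives 32 for Bahria University (20 + 12), 27 for Arid Agriculture (3 + 15 + 9), and 65 for College of EME (20 + 15 + 30), since all three college variants match the same pattern.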

