Identifying similar elements in strings

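Problem description

I have a large data frame of products from an online store, in which several products are recorded in different ways, as shown below: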

1:  milk 1-liter low, fat
2:  M I L K 1-liter L ow fat
3:  Milk. 1_liter LOW FAT
4:  Milk 1_liter L F A T
5:  MILK 1.5_liter Hi gh FAT

I need to split each entry into its components.

This is the result I would like to get from my data:

 V1   V2        V3    V4
milk  1-liter   low   fat
MILK  1-liter   Low   fat
Milk. 1_liter   LOW   FAT
Milk  1_liter   L     FAT
MILK  1.5_liter High  FAT

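Tags: r

Solution


I would be curious to know whether anyone has a more automated solution, because I often deal with similarly messy data.

The only way I know to do this is to write a series of regular expressions (via stringr::str_replace()) to harmonize the rows of the data frame. You can then use tidyr::separate() to split the product column into multiple columns: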

library(stringr)
library(dplyr)
library(tidyr)

dat <- tibble(product = c("milk 1-liter low, fat",
       "M I L K 1-liter L ow fat",
       "Milk. 1_liter LOW FAT",
       "Milk 1_liter L F A T",
       "MILK 1.5_liter Hi gh FAT"))

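# Normalize each component in turn. str_replace() replaces only the first match
# per string, so "milk" is handled first to keep the bare "L" pattern from hitting "MILK".
dat %>%
  mutate(product = str_replace(product, "(milk|MILK|Milk|M I L K)\\.*", "milk"),
         product = str_replace(product, "(low|LOW|L\\sow|L),*", "low"),
         product = str_replace(product, "(HIGH|Hi\\sgh|H)", "high"),
         product = str_replace(product, "(FAT|Fat|F A T)", "fat"),
         product = str_replace(product, "-liter", "_liter")) %>%
  # Split the cleaned string on single spaces into four columns
  separate(product, into = c("V1", "V2", "V3", "V4"), sep = " ", extra = "merge")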

# A tibble: 5 x 4
  V1    V2        V3    V4   
  <chr> <chr>     <chr> <chr>
1 milk  1_liter   low   fat  
2 milk  1_liter   low   fat  
3 milk  1_liter   low   fat  
4 milk  1_liter   low   fat  
5 milk  1.5_liter high  fat 
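
As a possible alternative to listing every spelling variant, the strings can be lowercased and stripped of spaces and commas first, and then parsed with a single pattern. This is only a sketch under assumptions, not the method used above: it assumes every product follows the shape name, volume, fat level, "fat", and the names `squashed`, `pattern`, `level_map`, and `result` are made up for illustration.

```r
library(stringr)

product <- c("milk 1-liter low, fat",
             "M I L K 1-liter L ow fat",
             "Milk. 1_liter LOW FAT",
             "Milk 1_liter L F A T",
             "MILK 1.5_liter Hi gh FAT")

# Lowercase, then drop all whitespace and commas, so that
# "M I L K 1-liter L ow fat" becomes "milk1-literlowfat"
squashed <- str_remove_all(str_to_lower(product), "[\\s,]")

# Assumed layout: name, optional dot, volume, "-" or "_", "liter", fat level, "fat"
pattern <- "^(milk)\\.?(\\d+(?:\\.\\d+)?)[-_]liter(low|high|l|h)(fat)$"

# str_match() returns a matrix: column 1 is the full match, columns 2-5 the groups
parts <- str_match(squashed, pattern)

# Expand the abbreviated fat levels to their full names
level_map <- c(l = "low", low = "low", h = "high", high = "high")

result <- data.frame(
  V1 = parts[, 2],
  V2 = ifelse(is.na(parts[, 3]), NA_character_, paste0(parts[, 3], "_liter")),
  V3 = unname(level_map[parts[, 4]]),
  V4 = parts[, 5],
  stringsAsFactors = FALSE
)

result
```

Rows that do not fit the assumed pattern come back as NA in every column, which makes them easy to find and inspect by hand. For a catalogue with more than one product name, the `(milk)` group would need to be extended accordingly.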
