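r - Is there a way to exactly match a string in one column against several strings in another column in R?
Question
I want to match the strings in one column against the comma-separated strings in another column in R.
I have two data frames in R: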
General_df
Main_cat gen_cat
Fruits apple
Fruits mango
Fruits strawberry
Vegetable potato
Vegetable lettuce
Vegetable onion
Liquids water
Liquids milk
Liquids juice
Tech app
Object straw
My_dataframe
Days cat
Day 1 apple, potato, milk
Day 2 onion, water
Day 3 strawberry, potato
Day 4 straw, mango
I want to get Main_cat for "My_dataframe", and so far I have managed to get this:
Days cat Match_string Main_cat
Day 1 apple, potato, milk apple Fruits
Day 1 apple, potato, milk potato Vegetable
Day 1 apple, potato, milk app Tech
Day 1 apple, potato, milk milk Liquids
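This also matches the substring "app", and several rows of my data frame have substring matches like this one.
However, I only want exact matches against the whole strings in the "cat" column, where the strings are separated by ",":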
Days cat Match_string Main_cat
Day 1 apple, potato, milk apple Fruits
Day 1 apple, potato, milk potato Vegetable
Day 1 apple, potato, milk milk Liquids
Is there a way to find exact string matches in this scenario? Thanks!
General_df <- read.table(text='
Main_cat gen_cat
Fruits apple
Fruits mango
Fruits strawberry
Vegetable potato
Vegetable lettuce
Vegetable onion
Liquids water
Liquids milk
Liquids juice
Tech app
Object straw', header=TRUE, stringsAsFactors = FALSE)
My_dataframe <- read.table(text='
Days; cat
Day 1; apple, potato, milk
Day 2; onion, water
Day 3; strawberry, potato
Day 4 ; straw, mango', sep=';', header=TRUE, stringsAsFactors = FALSE)
My_dataframe[] <- lapply(My_dataframe, trimws)
Answer
I think this is what you're after:
library(dplyr)
library(tidyr)

My_dataframe %>%
  ## Split the cat variable into individual strings, stored as a list column
  mutate(Match_string = strsplit(cat, ',\\s+')) %>%
  ## Unnest the list column into a long/tall data frame
  unnest(Match_string) %>%
  ## Join the lookup table onto the long data by the split column
  left_join(General_df, by = c('Match_string' = 'gen_cat'))
## Days cat Match_string Main_cat
## <chr> <chr> <chr> <chr>
## 1 Day 1 apple, potato, milk apple Fruits
## 2 Day 1 apple, potato, milk potato Vegetable
## 3 Day 1 apple, potato, milk milk Liquids
## 4 Day 2 onion, water onion Vegetable
## 5 Day 2 onion, water water Liquids
## 6 Day 3 strawberry, potato strawberry Fruits
## 7 Day 3 strawberry, potato potato Vegetable
## 8 Day 4 straw, mango straw Object
## 9 Day 4 straw, mango mango Fruits
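To illustrate why splitting first avoids the stray "app" row, here is a minimal base R sketch. It assumes the unwanted row came from some form of substring search (the question does not show the code that produced it), and the `keys` vector is a shortened, hypothetical stand-in for `General_df$gen_cat`.

```r
# Toy values taken from the question; `keys` is a shortened stand-in
# for General_df$gen_cat (an assumption for illustration only).
cat_value <- "apple, potato, milk"
keys <- c("apple", "app", "potato", "milk", "straw")

# Substring search: "app" is reported because it occurs inside "apple".
is_substring <- vapply(keys, grepl, logical(1), x = cat_value, fixed = TRUE)
substring_hits <- keys[is_substring]

# Whole-token comparison: split on the comma first, then compare with %in%,
# which only succeeds when the two strings are identical.
tokens <- strsplit(cat_value, ",\\s+")[[1]]
exact_hits <- keys[keys %in% tokens]

print(substring_hits)
print(exact_hits)
```

The join in the dplyr answer compares whole strings in the same way as `%in%` does here, which is why "app" no longer appears once `cat` has been split into separate rows.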
And a base R approach, to make sure I'm not getting too dependent on packages:
Match_string <- strsplit(My_dataframe$cat, ',\\s+')
data.frame(
  My_dataframe[rep(seq_len(nrow(My_dataframe)), lengths(Match_string)), ],
  Match_string = unlist(Match_string),
  Main_cat = General_df$Main_cat[match(unlist(Match_string), General_df$gen_cat)],
  stringsAsFactors = FALSE,
  row.names = NULL
)
## Days cat Match_string Main_cat
## 1 Day 1 apple, potato, milk apple Fruits
## 2 Day 1 apple, potato, milk potato Vegetable
## 3 Day 1 apple, potato, milk milk Liquids
## 4 Day 2 onion, water onion Vegetable
## 5 Day 2 onion, water water Liquids
## 6 Day 3 strawberry, potato strawberry Fruits
## 7 Day 3 strawberry, potato potato Vegetable
## 8 Day 4 straw, mango straw Object
## 9 Day 4 straw, mango mango Fruits
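The base R version relies on `match()`, which compares whole strings and returns `NA` for anything it cannot find. The sketch below wraps that idea in a hypothetical helper, `lookup_main_cat()`, which is not part of the original answer. It also swaps the split pattern for `"\\s*,\\s*"`, on the assumption that real data may contain commas without a following space; the original pattern `',\\s+'` would leave "apple,milk" unsplit.

```r
# Hypothetical helper (not from the original answer): look up the category
# for every comma-separated token in a single string.
lookup_main_cat <- function(cat_value, keys, values) {
  # "\\s*,\\s*" also splits "a,b" and "a , b"; trimws() drops outer blanks.
  tokens <- strsplit(trimws(cat_value), "\\s*,\\s*")[[1]]
  # match() compares whole strings and gives NA when a token has no key.
  data.frame(
    Match_string = tokens,
    Main_cat = values[match(tokens, keys)],
    stringsAsFactors = FALSE
  )
}

# Made-up lookup values for illustration; "kiwi" has no entry on purpose.
keys <- c("apple", "app", "milk")
values <- c("Fruits", "Tech", "Liquids")
res <- lookup_main_cat("apple,milk , kiwi", keys, values)
print(res)
```

Unmatched tokens such as "kiwi" come back with `NA` in `Main_cat`, which mirrors what the left join does in the dplyr version.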
Or data.table, if speed and memory are your thing:
library(data.table)
merge(
  data.table(My_dataframe)[, Match_string := strsplit(cat, ',\\s+')][,
    .(Match_string = unlist(Match_string)), by = c('Days', 'cat')],
  General_df, by.x = 'Match_string', by.y = 'gen_cat',
  all.x = TRUE
)[order(Days), .(Days, cat, Match_string, Main_cat)]
## Days cat Match_string Main_cat
## 1: Day 1 apple, potato, milk apple Fruits
## 2: Day 1 apple, potato, milk milk Liquids
## 3: Day 1 apple, potato, milk potato Vegetable
## 4: Day 2 onion, water onion Vegetable
## 5: Day 2 onion, water water Liquids
## 6: Day 3 strawberry, potato potato Vegetable
## 7: Day 3 strawberry, potato strawberry Fruits
## 8: Day 4 straw, mango mango Fruits
## 9: Day 4 straw, mango straw Object
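In the data.table output, the rows within each day are in alphabetical order of `Match_string` rather than in the order the strings appear in `cat`, which is consistent with the merge sorting on its key column. If the original order matters, one option is to carry a position column through the merge and sort on it afterwards. The sketch below uses base R's `merge()` rather than data.table, and the `pos` column and toy data frames are assumptions made for illustration.

```r
# Toy long-format data: one row per token, with `pos` recording the
# token's position inside the original cat string (hypothetical column).
tokens <- data.frame(
  Days = c("Day 1", "Day 1", "Day 1"),
  Match_string = c("apple", "potato", "milk"),
  pos = 1:3,
  stringsAsFactors = FALSE
)
lookup <- data.frame(
  gen_cat = c("apple", "potato", "milk"),
  Main_cat = c("Fruits", "Vegetable", "Liquids"),
  stringsAsFactors = FALSE
)

# merge() sorts the result on the key column by default (sort = TRUE).
merged <- merge(tokens, lookup,
                by.x = "Match_string", by.y = "gen_cat", all.x = TRUE)

# Reordering by day and position restores the original token order.
restored <- merged[order(merged$Days, merged$pos), ]
print(merged$Match_string)
print(restored$Match_string)
```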