r - 在两个数据集之间匹配列表中的值
问题描述
我有两个正在工作的数据集。第一个是:
data_1 <- tribble(
~shop_name, ~sub_category,
"A", "Blu-ray, DVDs, CD",
"B", "Sneakers, Make-up, Blu-ray",
"C", "Camera, Optic, DVDs",
"D", "Flower, Notebooks, Make-up",
)
第二个是:
data_2 <- tribble(
~sub_category, ~main_category,
"Blu-ray", "Electronic",
"DVDs", "Electronic",
"CD", "Electronic",
"Sneakers", "Fashion",
"Make-up", "Fashion",
"Camera", "Electronic",
"Optic", "Health",
"Flower", "Home",
)
现在,我想执行左连接以在 data_1 中添加主类别。最终数据应如下所示:
merged_data <- tribble(
~shop_name, ~sub_category, ~main_category,
"A", "Blu-ray, DVDs, CD", "Electronic, Electronic, Electronic",
"B", "Sneakers, Make-up, Blu-ray", "Fashion, Fashion, Electronic",
"C", "Camera, Optic", "Electronic, Health",
"D", "Flower", "Home"
)
我编码如下所示:
data3 <- left_join(data_1, data_2, by = "sub_category")
但不知何故,main_category 返回了 NA。有人可以帮助我吗?提前致谢。
解决方案
以下是两个data.table
解决方案,记录在案:
代码
您可以直接将 in 中subcategory
的每个字符串data_1
与其对应main_category
的 in匹配data_2
:
require(data.table); setDT(data_1); setDT(data_2)
data_1[, main_category := sapply(sub_category, function(x){
str = unlist(strsplit(x, ', '))
match = as.numeric(sapply(str, function(x) data_2[, which(x == sub_category)]))
data_2[match, paste(main_category, collapse = ', ')]
})]
或者,您转换data_1
为长格式并加入data_2
on sub_category
:
data_1 = data_1[, .(sub_category = unlist(strsplit(sub_category, ', '))), keyby = shop_name] # data_1 to long format
dt_final = merge(data_1, data_2, by = 'sub_category', all = T) # Join data_1 and data_2 on sub_category
dt_final = dt_final[, lapply(.SD, function(x) paste(x, collapse = ', ')), keyby = shop_name]
结果
> data_1
shop_name sub_category main_category
1: A Blu-ray, DVDs, CD Electronic, Electronic, Electronic
2: B Sneakers, Make-up, Blu-ray Fashion, Fashion, Electronic
3: C Camera, Optic, DVDs Electronic, Health, Electronic
4: D Flower, Notebooks, Make-up Home, NA, Fashion
> dt_final
shop_name sub_category main_category
1: A Blu-ray, CD, DVDs Electronic, Electronic, Electronic
2: B Blu-ray, Make-up, Sneakers Electronic, Fashion, Fashion
3: C Camera, DVDs, Optic Electronic, Electronic, Health
4: D Flower, Make-up, Notebooks Home, Fashion, NA
推荐阅读
- excel - 将整个列(基于特定标题)从一个工作表复制到另一个工作表(以其他顺序)在 Excel 365 上不起作用
- sql-server - SELECT INTO ##temptable 从动态@sql
- laravel - Homestead - 此站点拒绝连接
- python - 建议使用近 20 万个标签的文本分类模型
- sql - 我可以在另一个 with 中使用 With 吗?
- sql - SQL 查询以获取匹配 id 数量最多的日历周
- tcl - 如何使用 lsearch 和 regexp 从hierarchical_names 列表中过滤只有n 级层次结构的hierarchy_names
- javascript - 当不再需要输入字段时,如何让错误消息自动出现?
- python - 尝试在 python 2.7 中删除 Unicode 特殊字符时出错 TypeError:期望字符串或其他字符缓冲区对象
- regex - 哪个正则表达式会找到我的日期时间格式?