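r - grep() in a for loop does not match partially, only exact strings
Problem description
I have a data.table names_nightlight with standard region names, as shown below. Another data.table, disasters, has a column Location that holds standard and non-standard region names as well as city/municipality names. I want to replace the non-standard region names in disasters$Location with the standard region names from names_nightlight$region.
names_nightlight: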
| country | region | ISO |
|-------------------|-------------|-----|
| American Samoa | Eastern | ASM |
| American Samoa | Manu'a | ASM |
| American Samoa | Unorganized | ASM |
| American Samoa | Western | ASM |
| Antigua & Barbuda | Barbuda | ATG |
| Antigua & Barbuda | Redonda | ATG |
| Antigua & Barbuda | Saint George| ATG |
| ... | ... | ... |
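I need to use grep() to find the rows where disasters$Location contains a region name, then set disasters$Location := names_nightlight$region (the standard name) and flag the row with disasters$matched := 1. Later, I can use Google to manually look up the region for the cities/municipalities left in disasters$Location.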
for (j in names_nightlight[!region == "just one region", ISO]){
  for (i in names_nightlight[ISO == j, region]){
    disasters[ISO == j][grep(i, Location), Location := i]
    disasters[ISO == j & Location == i, matched := 1]
  }
}
}
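However, the grep() call in my loop does not seem to do partial matching; only exact strings are matched. For example, "Manu'a island" is not matched to "Manu'a", and "Saint George " (with a trailing space) is not matched to "Saint George" (without a trailing space).
The rows that were left unmatched: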
disasters[is.na(matched) == TRUE]
| Start.date | End.date | ISO | Location | Disaster.No. | matched |
|------------|------------|-----|---------------|---------------|---------|
| 2005-02-16 | 2005-02-16 | ASM | Manu'a island | 2005-0151 | NA |
| 2017-09-06 | 2017-09-06 | ATG | Saint George | 2017-0381 | NA |
| 2017-09-06 | 2017-09-06 | ATG | Crosbies | 2017-0381 | NA |
| 2017-09-06 | 2017-09-06 | ATG | Fort Road | 2017-0381 | NA |
| 2017-09-06 | 2017-09-06 | ATG | Clare Hall | 2017-0381 | NA |
| 2017-09-06 | 2017-09-06 | ATG | Grays Farm | 2017-0381 | NA |
| ... | ... | ... | ... | ... | ... |
dput(names_nightlight[1:10])
structure(list(country = c("American Samoa", "American Samoa",
"American Samoa", "American Samoa", "Antigua & Barbuda", "Antigua & Barbuda",
"Antigua & Barbuda", "Antigua & Barbuda", "Antigua & Barbuda",
"Antigua & Barbuda"), region = c("Eastern", "Manu'a", "Unorganized",
"Western", "Barbuda", "Redonda", "Saint George", "Saint John",
"Saint Mary", "Saint Paul"), ISO = c("ASM", "ASM", "ASM", "ASM",
"ATG", "ATG", "ATG", "ATG", "ATG", "ATG")), row.names = c(NA,
-10L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x7f835380dae0>)
dput(disasters[1:10])
structure(list(Start.date = structure(c(12422, 12422, 12422,
12422, 12830, 17415, 14167, 14167, 14167, 14167), class = "Date"),
End.date = structure(c(12422, 12422, 12422, 12422, 12830,
17415, 14168, 14168, 14168, 14168), class = "Date"), Country = c("American Samoa",
"American Samoa", "American Samoa", "American Samoa", "American Samoa",
"Anguilla", "Antigua and Barbuda", "Antigua and Barbuda",
"Antigua and Barbuda", "Antigua and Barbuda"), ISO = c("ASM",
"ASM", "ASM", "ASM", "ASM", "AIA", "ATG", "ATG", "ATG", "ATG"
), Location = c("Eastern", "Manu'a", "Unorganized", "Western",
"Manu'a island", "just one region", "Barbuda", "Redonda",
"Saint George", "Saint John"), Latitude = c(NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_), Longitude = c(NA_real_, NA_real_, NA_real_,
NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_,
NA_real_), Magnitude.value = c(310, 310, 310, 310, NA, NA,
NA, NA, NA, NA), Magnitude.scale = c("Kph", "Kph", "Kph",
"Kph", "Kph", "Kph", "Kph", "Kph", "Kph", "Kph"), Disaster.type = c("Storm",
"Storm", "Storm", "Storm", "Storm", "Storm", "Storm", "Storm",
"Storm", "Storm"), Disaster.subtype = c("Tropical cyclone",
"Tropical cyclone", "Tropical cyclone", "Tropical cyclone",
"Tropical cyclone", "Tropical cyclone", "Tropical cyclone",
"Tropical cyclone", "Tropical cyclone", "Tropical cyclone"
), Associated.disaster = c("--", "--", "--", "--", "--",
"--", "Flood", "Flood", "Flood", "Flood"), Associated.disaster2 = c("--",
"--", "--", "--", "--", "--", "--", "--", "--", "--"), Total.deaths = c(0L,
0L, 0L, 0L, 0L, 4L, 0L, 0L, 0L, 0L), Total.affected = c(23060L,
23060L, 23060L, 23060L, 0L, 15000L, 25800L, 25800L, 25800L,
25800L), Total.damage...000.US.. = c(150000L, 150000L, 150000L,
150000L, 0L, 200000L, 0L, 0L, 0L, 0L), Insured.losses...000.US.. = c(0,
0, 0, 0, 0, 6700, 0, 0, 0, 0), Disaster.name = c("Heta",
"Heta", "Heta", "Heta", "Olaf", "Hurricane 'Irma'", "Hurricane \"Omar\"",
"Hurricane \"Omar\"", "Hurricane \"Omar\"", "Hurricane \"Omar\""
), Disaster.No. = c("2004-0004", "2004-0004", "2004-0004",
"2004-0004", "2005-0151", "2017-0381", "2008-0604", "2008-0604",
"2008-0604", "2008-0604"), empty_region = c(0, 0, 0, 0, 0,
1, 0, 0, 0, 0), matched = c(NA, NA, NA, NA, NA, 1, NA, NA,
NA, NA)), .internal.selfref = <pointer: 0x7f835380dae0>, row.names = c(NA,
-10L), class = c("data.table", "data.frame"))
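Solution
I am not entirely sure what you want to achieve, but note that
disasters[ISO == j][grep(i, Location), Location := i]
does 'nothing'. disasters[ISO == j] returns a subset data.table, but you do not assign it to any variable; you then run [grep(i, Location), Location := i] on that unassigned object, so disasters itself is never changed. These two forms are not the same:
DT[some subsetting, new_var := ...]
DT[some subsetting][, new_var := ...]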
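To make the difference concrete, here is a minimal sketch on a toy table. The three rows are made up for illustration, loosely based on the sample data in the question, and the name after_chained is hypothetical.

```r
library(data.table)

# Toy data (assumed for illustration, not the asker's full table)
DT <- data.table(
  ISO      = c("ASM", "ASM", "ATG"),
  Location = c("Manu'a island", "Eastern", "Saint George ")
)

# Chained form: DT[ISO == "ASM"] builds a new, temporary data.table,
# so := only modifies that temporary object and DT stays as it was.
DT[ISO == "ASM"][grep("Manu'a", Location), Location := "Manu'a"]
after_chained <- copy(DT$Location)  # copy(), because := works by reference

# Single-call form: the row filter and := are in the same [ ] call,
# so DT itself is updated in place.
DT[ISO == "ASM" & grepl("Manu'a", Location), Location := "Manu'a"]

print(after_chained)
print(DT$Location)
```

After the chained call the first Location is still "Manu'a island"; only the single-call form rewrites it to "Manu'a".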
Read the Note section of ?":=". So try replacing:
disasters[ISO == j][grep(i, Location), Location := i]
disasters[ISO == j & Location == i, matched := 1]
with the following (str_detect() comes from the stringr package, so load it first):
disasters[ISO == j & str_detect(Location, i), ":="(Location = i, matched = 1)]
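For completeness, the whole loop might look like the sketch below. It is not the asker's exact code: it assumes small stand-in tables with only the relevant columns, iterates over the rows of names_nightlight rather than over ISO codes, and swaps str_detect() for base R grepl(..., fixed = TRUE) so that region names are treated as literal strings and stringr is not required. The names iso_k and region_k are hypothetical.

```r
library(data.table)

# Stand-in tables with only the columns the loop needs (assumed sample rows)
names_nightlight <- data.table(
  region = c("Eastern", "Manu'a", "Saint George", "Saint John"),
  ISO    = c("ASM", "ASM", "ATG", "ATG")
)
disasters <- data.table(
  ISO      = c("ASM", "ASM", "ATG", "ATG"),
  Location = c("Eastern", "Manu'a island", "Saint George ", "Crosbies"),
  matched  = NA_real_
)

for (k in seq_len(nrow(names_nightlight))) {
  iso_k    <- names_nightlight$ISO[k]
  region_k <- names_nightlight$region[k]
  # Filter and assignment in one [ ] call, so disasters is updated in place.
  # fixed = TRUE matches region_k as a plain substring, not as a regex.
  disasters[ISO == iso_k & grepl(region_k, Location, fixed = TRUE),
            `:=`(Location = region_k, matched = 1)]
}

print(disasters)
```

With this sample, "Manu'a island" and "Saint George " are rewritten to their standard names and flagged, while "Crosbies" keeps matched = NA and is left for manual lookup. One caveat of substring matching: a region name that is contained in another (or in a city name) would also match, so results are worth checking.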