grep() in a for loop does not match fully, only matches exact characters

Problem description

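I have a DT, names_nightlight, with standard region names as shown below. Another DT, disasters, has a column Location that contains standard and non-standard region names as well as city/municipality names. I want to replace the non-standard region names in disasters$Location with the standard region names from names_nightlight$region.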

names_nightlight:

|      country      |   region    | ISO |
|-------------------|-------------|-----|
| American Samoa    | Eastern     | ASM |
| American Samoa    | Manu'a      | ASM |
| American Samoa    | Unorganized | ASM |
| American Samoa    | Western     | ASM |
| Antigua & Barbuda | Barbuda     | ATG |
| Antigua & Barbuda | Redonda     | ATG |
| Antigua & Barbuda | Saint George| ATG |
| ...               | ...         | ... |

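I need to use grep() to find the rows where disasters$Location contains a region name, then set disasters$Location := names_nightlight$region (the standard name) and flag the row with disasters$matched := 1. Later, I can use Google to manually look up the region for the cities/municipalities left in disasters$Location.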

for (j in names_nightlight[!region == "just one region", ISO]){
    for (i in names_nightlight[ISO == j, region]){
        disasters[ISO == j][grep(i, Location), Location := i]
        disasters[ISO == j & Location == i, matched := 1]
    }
}

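However, the grep function in my loop does not seem to match fully; only exact character matches are made. For example, "Manu'a island" is not matched to "Manu'a", and "Saint George " (with a trailing space) is not matched to "Saint George" (without a trailing space).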

Among the rows left without a match:

disasters[is.na(matched) == TRUE]

| Start.date |  End.date  | ISO |   Location    |  Disaster.No. | matched |
|------------|------------|-----|---------------|---------------|---------|
| 2005-02-16 | 2005-02-16 | ASM | Manu'a island | 2005-0151     | NA      |
| 2017-09-06 | 2017-09-06 | ATG | Saint George  | 2017-0381     | NA      |
| 2017-09-06 | 2017-09-06 | ATG | Crosbies      | 2017-0381     | NA      |
| 2017-09-06 | 2017-09-06 | ATG | Fort Road     | 2017-0381     | NA      |
| 2017-09-06 | 2017-09-06 | ATG | Clare Hall    | 2017-0381     | NA      |
| 2017-09-06 | 2017-09-06 | ATG | Grays Farm    | 2017-0381     | NA      |
| ...        | ...        | ... | ...           | ...           | ...     |

dput(names_nightlight[1:10])

structure(list(country = c("American Samoa", "American Samoa", 
"American Samoa", "American Samoa", "Antigua & Barbuda", "Antigua & Barbuda", 
"Antigua & Barbuda", "Antigua & Barbuda", "Antigua & Barbuda", 
"Antigua & Barbuda"), region = c("Eastern", "Manu'a", "Unorganized", 
"Western", "Barbuda", "Redonda", "Saint George", "Saint John", 
"Saint Mary", "Saint Paul"), ISO = c("ASM", "ASM", "ASM", "ASM", 
"ATG", "ATG", "ATG", "ATG", "ATG", "ATG")), row.names = c(NA, 
-10L), class = c("data.table", "data.frame"), .internal.selfref = <pointer: 0x7f835380dae0>)

dput(disasters[1:10])

structure(list(Start.date = structure(c(12422, 12422, 12422, 
12422, 12830, 17415, 14167, 14167, 14167, 14167), class = "Date"), 
    End.date = structure(c(12422, 12422, 12422, 12422, 12830, 
    17415, 14168, 14168, 14168, 14168), class = "Date"), Country = c("American Samoa", 
    "American Samoa", "American Samoa", "American Samoa", "American Samoa", 
    "Anguilla", "Antigua and Barbuda", "Antigua and Barbuda", 
    "Antigua and Barbuda", "Antigua and Barbuda"), ISO = c("ASM", 
    "ASM", "ASM", "ASM", "ASM", "AIA", "ATG", "ATG", "ATG", "ATG"
    ), Location = c("Eastern", "Manu'a", "Unorganized", "Western", 
    "Manu'a island", "just one region", "Barbuda", "Redonda", 
    "Saint George", "Saint John"), Latitude = c(NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_), Longitude = c(NA_real_, NA_real_, NA_real_, 
    NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, NA_real_, 
    NA_real_), Magnitude.value = c(310, 310, 310, 310, NA, NA, 
    NA, NA, NA, NA), Magnitude.scale = c("Kph", "Kph", "Kph", 
    "Kph", "Kph", "Kph", "Kph", "Kph", "Kph", "Kph"), Disaster.type = c("Storm", 
    "Storm", "Storm", "Storm", "Storm", "Storm", "Storm", "Storm", 
    "Storm", "Storm"), Disaster.subtype = c("Tropical cyclone", 
    "Tropical cyclone", "Tropical cyclone", "Tropical cyclone", 
    "Tropical cyclone", "Tropical cyclone", "Tropical cyclone", 
    "Tropical cyclone", "Tropical cyclone", "Tropical cyclone"
    ), Associated.disaster = c("--", "--", "--", "--", "--", 
    "--", "Flood", "Flood", "Flood", "Flood"), Associated.disaster2 = c("--", 
    "--", "--", "--", "--", "--", "--", "--", "--", "--"), Total.deaths = c(0L, 
    0L, 0L, 0L, 0L, 4L, 0L, 0L, 0L, 0L), Total.affected = c(23060L, 
    23060L, 23060L, 23060L, 0L, 15000L, 25800L, 25800L, 25800L, 
    25800L), Total.damage...000.US.. = c(150000L, 150000L, 150000L, 
    150000L, 0L, 200000L, 0L, 0L, 0L, 0L), Insured.losses...000.US.. = c(0, 
    0, 0, 0, 0, 6700, 0, 0, 0, 0), Disaster.name = c("Heta", 
    "Heta", "Heta", "Heta", "Olaf", "Hurricane 'Irma'", "Hurricane \"Omar\"", 
    "Hurricane \"Omar\"", "Hurricane \"Omar\"", "Hurricane \"Omar\""
    ), Disaster.No. = c("2004-0004", "2004-0004", "2004-0004", 
    "2004-0004", "2005-0151", "2017-0381", "2008-0604", "2008-0604", 
    "2008-0604", "2008-0604"), empty_region = c(0, 0, 0, 0, 0, 
    1, 0, 0, 0, 0), matched = c(NA, NA, NA, NA, NA, 1, NA, NA, 
    NA, NA)), .internal.selfref = <pointer: 0x7f835380dae0>, row.names = c(NA, 
-10L), class = c("data.table", "data.frame"))

Tags: r, string, character

Solution


I am not entirely sure what you want to achieve, but note that:

disasters[ISO == j][grep(i, Location), Location := i]

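does 'nothing'. disasters[ISO == j] returns a subsetted data.table, but you do not assign it to any variable; [grep(i, Location), Location := i] is then applied to that unassigned object, so disasters itself is never modified. These two forms are not the same: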

DT[some subsetting, new_var := ...]    # updates DT by reference
DT[some subsetting][new_var := ...]    # updates a temporary subset; DT is unchanged
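The difference can be checked on a toy table. The following is a minimal sketch, assuming the data.table package is installed; the three rows are made up for illustration and are not the asker's full data.

```r
library(data.table)

# Toy data, invented for illustration only
DT <- data.table(
  ISO      = c("ASM", "ASM", "ATG"),
  Location = c("Manu'a island", "Western", "Saint George ")
)

# Chained form: DT[ISO == "ASM"] builds a new, temporary data.table,
# and := updates that temporary object, not DT
DT[ISO == "ASM"][grep("Manu'a", Location), Location := "Manu'a"]
chained_result <- DT$Location[1]

# Single-call form: the row condition and := share one [ ],
# so DT itself is updated by reference
DT[ISO == "ASM" & grepl("Manu'a", Location), Location := "Manu'a"]
single_result <- DT$Location[1]

print(chained_result)
print(single_result)
```

After the chained call the first Location is still "Manu'a island"; it only becomes "Manu'a" after the single-call form. grep() itself matches the substring in both cases, so the pattern matching is not what fails in the original loop.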

Read the Note section of ?":=". So try replacing:

disasters[ISO == j][grep(i, Location), Location := i]
disasters[ISO == j & Location == i, matched := 1]

with:

# str_detect() comes from the stringr package, so library(stringr) must be loaded first
disasters[ISO == j & str_detect(Location, i), ":="(Location = i, matched = 1)]
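As a rough end-to-end check, the single-call update can be run on a trimmed version of the question's data. This sketch makes a few assumptions that are not in the original: it swaps base R's grepl(..., fixed = TRUE) in for str_detect() so that only data.table is required and region names are treated as literal text rather than regular expressions, it loops over the rows of names_nightlight instead of nesting two loops, and the variable names iso_code and std_region are hypothetical.

```r
library(data.table)

# Trimmed sample based on the question's data
names_nightlight <- data.table(
  region = c("Eastern", "Manu'a", "Saint George", "Saint John"),
  ISO    = c("ASM", "ASM", "ATG", "ATG")
)
disasters <- data.table(
  ISO      = c("ASM", "ASM", "ATG", "ATG"),
  Location = c("Eastern", "Manu'a island", "Saint George ", "Crosbies"),
  matched  = NA_real_
)

# One pass per (ISO, region) pair; the condition and := sit in the same [ ],
# so disasters is modified by reference
for (k in seq_len(nrow(names_nightlight))) {
  iso_code   <- names_nightlight$ISO[k]
  std_region <- names_nightlight$region[k]
  disasters[ISO == iso_code & grepl(std_region, Location, fixed = TRUE),
            `:=`(Location = std_region, matched = 1)]
}

print(disasters)
```

With this sample, "Manu'a island" and "Saint George " are replaced by their standard names and flagged, while "Crosbies" keeps matched = NA for manual lookup. Substring matching can over-match when one region name is contained in another, so the flagged rows are worth a quick review.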
