Finding two-word US states in a string in R

Question

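This should be simple, but I have been struggling with it for far too long. I have several word-salad strings listing US states that I am consolidating into a tidy data frame. Part of that is identifying the US states in each string. I can find single-word states (e.g. "Florida"), but I can't find a way to identify two-word states (e.g. "New York"). Can you help me identify these? The closest code I have is below.

I need the output to be the same string as the input. The only difference is that the words of a two-word state name are joined by an underscore (e.g. "New_York").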

library(tidyverse)

search_string <- "                                        Stamps\nNevada                         61,455           82,713           12,832            95,545 $      1,670,735  $          1,461,634  $       3,132,369\nNew Hampshire                  67,586          194,207           39,225          233,432  $      2,287,792  $          1,372,421  $       3,660,213\nNew Jersey                     82,814          282,527          146,678          429,205  $      6,335,263  $          2,813,593  $       9,148,856\nNew Mexico                   111,188           379,489           81,056          460,545  $      4,653,064  $          8,789,532  $      13,442,596\nNew York                     696,679           679,458           74,731          754,189  $     13,193,942  $          5,298,613  $      18,492,555\nNorth Carolina               433,135           471,648           24,260          495,908  $      8,446,725  $          1,203,040  $       9,649,765\nNorth Dakota                 141,816           413,234          162,252          575,486  $      3,526,114  $          4,310,924  $       7,837,038\nOhio                         426,856         1,068,917           17,723        1,086,640  $     15,007,107  $          1,396,546  $      16,403,653\nOklahoma                     330,336           334,251           14,673          348,924  $      5,849,527  $          1,277,809  $       7,127,336\nOregon                       297,944         1,344,799           64,439        1,409,238  $     14,306,510  $          4,298,684  $      18,605,194\nPennsylvania               1,048,731         2,398,471          122,202        2,520,673  $     30,601,457  $          8,181,893  $      38,783,350\nRhode Island                   10,750           29,270            3,553            32,823 $         216,706 $              79,868 $         296,574\nSouth Carolina               279,203           207,379           58,527          265,906  $      3,241,468  $          4,117,974  $       
7,359,442\nSouth Dakota                 216,152           247,222          100,706          347,928  $      5,588,964  $          9,431,150  $      15,020,114\nTennessee                    725,110         1,094,149           37,301        1,131,450  $     11,555,825  $          1,855,572  $      13,411,397\nTexas                      1,027,908         1,205,905           60,198        1,266,103  $     19,675,334  $          6,764,564  $      26,439,898\nUtah                         159,678           217,128           13,025          230,153  $      7,399,301  $          2,826,440  $      10,225,741\nVermont                        92,138          168,989           23,319          192,308  $      2,461,500  $          1,250,190  $       3,711,690\nVirginia                     314,748           774,910           48,213          823,123  $      8,800,321  $          2,369,762  $      11,170,083\nWashington                   198,162           780,794           10,718          791,512  $     10,837,451  $             744,633 $      11,582,084\nWest Virginia                288,098           656,091          174,657          830,748  $      4,831,265  $          6,227,285  $      11,058,550\nWisconsin                    689,099         2,472,489          127,017        2,599,506  $     24,942,778  $          6,534,212  $      31,476,990\nWyoming                      137,608           165,464           75,434          240,898  $      4,258,947  $         15,242,063  $      19,501,010\nTotal                     14,966,406       31,340,988         2,846,854       34,187,842 $     412,251,767  $        246,742,031  $     658,993,797\nU.S. Territories & DC"


 
search_string %>% 
  str_squish() %>% 
    str_subset('\\bWest Virginia\\b')

~ Edit

One approach using dplyr

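library(glue)  # glue() and glue_collapse() are not attached by library(tidyverse)

search_string %>% 
  str_squish() %>% 
  str_split(' ') %>% 
  flatten_chr() %>% 
  as_tibble() %>% 
  mutate(lead = lead(value)) %>% 
  # Join a token to the next one when it matches one of the listed first words.
  # The second word still remains as its own element afterwards.
  mutate(alfa = case_when(
    str_detect(value, 
               glue_collapse(
                 c('South',
                   'North', 
                   'New', 
                   'West', 
                   'Rhode'),
                 sep = '|')) ~ glue('{value}_{lead}'), 
    TRUE ~ value
  )) %>% 
  pull(alfa)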

Tags: r, stringr

Solution


Cleanup

With gsub() we replace all digits, $ signs, and commas with spaces. Then we split on the newline \n and remove the extra whitespace with str_squish().

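# Inside a character class, $ is literal and | is not an "or", so neither is needed escaped or at all
a <- gsub("[0-9$,]", " ", search_string) %>% 
  strsplit("\n", fixed = TRUE) %>%   # one element per line
  .[[1]] %>% 
  str_squish()                       # trim and collapse whitespace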

Now we have a vector of all the state names (plus the header and footer labels)

a
#>  [1] "Stamps"                "Nevada"                "New Hampshire"        
#>  [4] "New Jersey"            "New Mexico"            "New York"             
#>  [7] "North Carolina"        "North Dakota"          "Ohio"                 
#> [10] "Oklahoma"              "Oregon"                "Pennsylvania"         
#> [13] "Rhode Island"          "South Carolina"        "South Dakota"         
#> [16] "Tennessee"             "Texas"                 "Utah"                 
#> [19] "Vermont"               "Virginia"              "Washington"           
#> [22] "West Virginia"         "Wisconsin"             "Wyoming"              
#> [25] "Total"                 "U.S. Territories & DC"

We can get the states with more than one word by using grep() to select the entries that contain a space

b <- a[grep(" ", a)]
b
#>  [1] "New Hampshire"         "New Jersey"            "New Mexico"           
#>  [4] "New York"              "North Carolina"        "North Dakota"         
#>  [7] "Rhode Island"          "South Carolina"        "South Dakota"         
#> [10] "West Virginia"         "U.S. Territories & DC"
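Note that selecting on a space also picks up "U.S. Territories & DC", which is not a state. If only real states should be matched, one option is to intersect the candidates with R's built-in state.name vector (from the datasets package). This is only a minimal sketch under that assumption: lines_clean is a shortened, hypothetical stand-in for the vector a above, and two_word_states is a name made up for illustration.

```r
# Hypothetical, shortened stand-in for the cleaned vector `a`
lines_clean <- c("Stamps", "Nevada", "New Hampshire", "Rhode Island",
                 "West Virginia", "Total", "U.S. Territories & DC")

# Keep entries that contain a space AND are one of the 50 names in state.name
has_space <- grepl(" ", lines_clean, fixed = TRUE)
is_state  <- lines_clean %in% state.name
two_word_states <- lines_clean[has_space & is_state]

two_word_states
```

With this filter, "Stamps", "Total", and "U.S. Territories & DC" are dropped, so the later replacement step would leave the footer line untouched.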

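Replacing the spaces in two-word states with underscores

We create a character vector containing the replacement strings and use mgsub::mgsub() to perform the substitution.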

c <- gsub(" ", "_", b)
c
#>  [1] "New_Hampshire"         "New_Jersey"            "New_Mexico"           
#>  [4] "New_York"              "North_Carolina"        "North_Dakota"         
#>  [7] "Rhode_Island"          "South_Carolina"        "South_Dakota"         
#> [10] "West_Virginia"         "U.S._Territories_&_DC"

library(mgsub)
mgsub(search_string, b, c)
#> [1] "                                        Stamps\nNevada                         61,455           82,713           12,832            95,545 $      1,670,735  $          1,461,634  $       3,132,369\nNew_Hampshire                  67,586          194,207           39,225          233,432  $      2,287,792  $          1,372,421  $       3,660,213\nNew_Jersey                     82,814          282,527          146,678          429,205  $      6,335,263  $          2,813,593  $       9,148,856\nNew_Mexico                   111,188           379,489           81,056          460,545  $      4,653,064  $          8,789,532  $      13,442,596\nNew_York                     696,679           679,458           74,731          754,189  $     13,193,942  $          5,298,613  $      18,492,555\nNorth_Carolina               433,135           471,648           24,260          495,908  $      8,446,725  $          1,203,040  $       9,649,765\nNorth_Dakota                 141,816           413,234          162,252          575,486  $      3,526,114  $          4,310,924  $       7,837,038\nOhio                         426,856         1,068,917           17,723        1,086,640  $     15,007,107  $          1,396,546  $      16,403,653\nOklahoma                     330,336           334,251           14,673          348,924  $      5,849,527  $          1,277,809  $       7,127,336\nOregon                       297,944         1,344,799           64,439        1,409,238  $     14,306,510  $          4,298,684  $      18,605,194\nPennsylvania               1,048,731         2,398,471          122,202        2,520,673  $     30,601,457  $          8,181,893  $      38,783,350\nRhode_Island                   10,750           29,270            3,553            32,823 $         216,706 $              79,868 $         296,574\nSouth_Carolina               279,203           207,379           58,527          265,906  $      3,241,468  $          4,117,974  $       
7,359,442\nSouth_Dakota                 216,152           247,222          100,706          347,928  $      5,588,964  $          9,431,150  $      15,020,114\nTennessee                    725,110         1,094,149           37,301        1,131,450  $     11,555,825  $          1,855,572  $      13,411,397\nTexas                      1,027,908         1,205,905           60,198        1,266,103  $     19,675,334  $          6,764,564  $      26,439,898\nUtah                         159,678           217,128           13,025          230,153  $      7,399,301  $          2,826,440  $      10,225,741\nVermont                        92,138          168,989           23,319          192,308  $      2,461,500  $          1,250,190  $       3,711,690\nVirginia                     314,748           774,910           48,213          823,123  $      8,800,321  $          2,369,762  $      11,170,083\nWashington                   198,162           780,794           10,718          791,512  $     10,837,451  $             744,633 $      11,582,084\nWest_Virginia                288,098           656,091          174,657          830,748  $      4,831,265  $          6,227,285  $      11,058,550\nWisconsin                    689,099         2,472,489          127,017        2,599,506  $     24,942,778  $          6,534,212  $      31,476,990\nWyoming                      137,608           165,464           75,434          240,898  $      4,258,947  $         15,242,063  $      19,501,010\nTotal                     14,966,406       31,340,988         2,846,854       34,187,842 $     412,251,767  $        246,742,031  $     658,993,797\nU.S._Territories_&_DC"

Created on 2020-11-06 by the reprex package (v0.3.0)
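If adding the mgsub dependency is not desirable, the same replacement can probably be done in base R. The following is a rough sketch, not the answer's method: it assumes the multi-word names in the built-in state.name vector are the ones to match, and underscore_states and sample_text are hypothetical names introduced only for this example.

```r
# Multi-word state names taken from the built-in state.name vector
multi_word <- state.name[grepl(" ", state.name, fixed = TRUE)]

# Replace each multi-word state name with its underscored form.
# fixed = TRUE treats the names as literal text rather than regular expressions.
underscore_states <- function(x, states = multi_word) {
  for (s in states) {
    x <- gsub(s, gsub(" ", "_", s, fixed = TRUE), x, fixed = TRUE)
  }
  x
}

# Hypothetical, shortened input in the same shape as search_string
sample_text <- "Nevada 61,455\nNew York 696,679\nWest Virginia 288,098\nVirginia 314,748"
result <- underscore_states(sample_text)
cat(result)
```

Because only names that contain a space are replaced, a single-word state such as "Virginia" is left alone even though it also appears inside "West Virginia".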

