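r - Find two-word US states in a string in R
Problem description
This should be simple, but I have been struggling with it for far too long. I have several lists of word-salad strings about US states that I am combining into a tidy data frame. Part of that is identifying the US states in each string. I can't find a way to identify two-word states (e.g. "New York"), although I am able to find single-word states (e.g. "Florida"). Can you help me identify them? The closest code I have is below.
I need to get back the same string as the output. The only difference is that the words of a two-word state are joined by an underscore (e.g. "New_York").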
library(tidyverse)
search_string <- " Stamps\nNevada 61,455 82,713 12,832 95,545 $ 1,670,735 $ 1,461,634 $ 3,132,369\nNew Hampshire 67,586 194,207 39,225 233,432 $ 2,287,792 $ 1,372,421 $ 3,660,213\nNew Jersey 82,814 282,527 146,678 429,205 $ 6,335,263 $ 2,813,593 $ 9,148,856\nNew Mexico 111,188 379,489 81,056 460,545 $ 4,653,064 $ 8,789,532 $ 13,442,596\nNew York 696,679 679,458 74,731 754,189 $ 13,193,942 $ 5,298,613 $ 18,492,555\nNorth Carolina 433,135 471,648 24,260 495,908 $ 8,446,725 $ 1,203,040 $ 9,649,765\nNorth Dakota 141,816 413,234 162,252 575,486 $ 3,526,114 $ 4,310,924 $ 7,837,038\nOhio 426,856 1,068,917 17,723 1,086,640 $ 15,007,107 $ 1,396,546 $ 16,403,653\nOklahoma 330,336 334,251 14,673 348,924 $ 5,849,527 $ 1,277,809 $ 7,127,336\nOregon 297,944 1,344,799 64,439 1,409,238 $ 14,306,510 $ 4,298,684 $ 18,605,194\nPennsylvania 1,048,731 2,398,471 122,202 2,520,673 $ 30,601,457 $ 8,181,893 $ 38,783,350\nRhode Island 10,750 29,270 3,553 32,823 $ 216,706 $ 79,868 $ 296,574\nSouth Carolina 279,203 207,379 58,527 265,906 $ 3,241,468 $ 4,117,974 $ 7,359,442\nSouth Dakota 216,152 247,222 100,706 347,928 $ 5,588,964 $ 9,431,150 $ 15,020,114\nTennessee 725,110 1,094,149 37,301 1,131,450 $ 11,555,825 $ 1,855,572 $ 13,411,397\nTexas 1,027,908 1,205,905 60,198 1,266,103 $ 19,675,334 $ 6,764,564 $ 26,439,898\nUtah 159,678 217,128 13,025 230,153 $ 7,399,301 $ 2,826,440 $ 10,225,741\nVermont 92,138 168,989 23,319 192,308 $ 2,461,500 $ 1,250,190 $ 3,711,690\nVirginia 314,748 774,910 48,213 823,123 $ 8,800,321 $ 2,369,762 $ 11,170,083\nWashington 198,162 780,794 10,718 791,512 $ 10,837,451 $ 744,633 $ 11,582,084\nWest Virginia 288,098 656,091 174,657 830,748 $ 4,831,265 $ 6,227,285 $ 11,058,550\nWisconsin 689,099 2,472,489 127,017 2,599,506 $ 24,942,778 $ 6,534,212 $ 31,476,990\nWyoming 137,608 165,464 75,434 240,898 $ 4,258,947 $ 15,242,063 $ 19,501,010\nTotal 14,966,406 31,340,988 2,846,854 34,187,842 $ 412,251,767 $ 246,742,031 $ 658,993,797\nU.S. Territories & DC"
search_string %>%
str_squish() %>%
str_subset('\\bWest Virginia\\b')
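One likely reason the attempt above appears to do nothing: str_subset() keeps or drops whole elements of its input, so on a length-one vector it returns the entire string whenever the pattern matches. The sketch below contrasts it with str_extract_all() on a made-up string (`x` and `pat` are illustrative names, not the asker's data).

```r
library(stringr)

# Toy input: one string that contains both "Virginia" and "West Virginia".
x   <- "Virginia 314,748 West Virginia 288,098"
pat <- "\\bWest Virginia\\b"

# str_subset() filters elements: the whole string comes back unchanged.
subset_result <- str_subset(x, pat)

# str_extract_all() returns only the text that matched the pattern.
extract_result <- str_extract_all(x, pat)[[1]]

subset_result
extract_result
```

So the pattern itself does find the two-word state; it is the choice of function that hides the match.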
~ Edit
One approach using dplyr
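library(glue)  # glue() and glue_collapse() are not attached by library(tidyverse)

search_string %>%
  str_squish() %>%
  str_split(' ') %>%
  flatten_chr() %>%
  as_tibble() %>%
  mutate(lead = lead(value)) %>%
  # words such as "New" are joined to the word that follows them;
  # note that the second word (e.g. "Hampshire") still remains as its own element
  mutate(alfa = case_when(
    str_detect(value,
               glue_collapse(
                 c('South', 'North', 'New', 'West', 'Rhode'),
                 sep = '|')) ~ glue('{value}_{lead}'),
    TRUE ~ value
  )) %>%
  pull(alfa)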
Solution
Cleaning up
With gsub() we remove all digits, dollar signs ($) and commas. Then we split on newlines (\n) and strip the excess whitespace with str_squish().
a <- gsub("[0-9$,]", " ", search_string) %>%  # blank out digits, "$" and ","
  strsplit("\n", fixed = TRUE) %>%            # one element per line
  .[[1]] %>%
  str_squish()                                # trim and collapse whitespace
Now we have a vector of all the states (plus the non-state rows "Stamps", "Total" and "U.S. Territories & DC")
a
#> [1] "Stamps" "Nevada" "New Hampshire"
#> [4] "New Jersey" "New Mexico" "New York"
#> [7] "North Carolina" "North Dakota" "Ohio"
#> [10] "Oklahoma" "Oregon" "Pennsylvania"
#> [13] "Rhode Island" "South Carolina" "South Dakota"
#> [16] "Tennessee" "Texas" "Utah"
#> [19] "Vermont" "Virginia" "Washington"
#> [22] "West Virginia" "Wisconsin" "Wyoming"
#> [25] "Total" "U.S. Territories & DC"
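To make the cleaning step easier to follow, here is a minimal sketch of what it does to a single row, using base R only. The variable names are illustrative, and the `gsub("\\s+", ...)` plus `trimws()` pair is a stand-in for str_squish(), not the code from the answer.

```r
# One row of the table, as it appears between two "\n" characters.
row <- "New Hampshire 67,586 194,207 $ 2,287,792"

# Replace every digit, dollar sign and comma with a space.
blanked <- gsub("[0-9$,]", " ", row)

# Collapse runs of whitespace and trim the ends (base-R stand-in for str_squish()).
state <- trimws(gsub("\\s+", " ", blanked))

state
```

Only the name survives because everything else on the row is made of the characters being blanked out.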
We can pick out the names with more than one word by using grep() to select the entries that contain a space.
b <- a[grep(" ", a)]
b
#> [1] "New Hampshire" "New Jersey" "New Mexico"
#> [4] "New York" "North Carolina" "North Dakota"
#> [7] "Rhode Island" "South Carolina" "South Dakota"
#> [10] "West Virginia" "U.S. Territories & DC"
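The space test also picks up "U.S. Territories & DC", which is not a state. If only real states should be rewritten, one option is to take the names from R's built-in `state.name` vector instead. This is a sketch under that assumption, with a shortened toy vector `a_demo` standing in for `a`; it is not part of the original answer.

```r
# Assumption: official state names come from the built-in `state.name` dataset.
two_word_states <- state.name[grepl(" ", state.name, fixed = TRUE)]

# Toy vector standing in for `a` from the cleaning step.
a_demo <- c("Stamps", "Nevada", "New York", "West Virginia",
            "Total", "U.S. Territories & DC")

# Keep only the entries that are genuine two-word states.
b_demo <- intersect(a_demo, two_word_states)

b_demo
```

`intersect()` keeps the order of its first argument, so the states come back in the order they appear in the parsed text.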
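Replace the spaces in two-word states with underscores
We create a character vector containing the replacement strings and use mgsub::mgsub() to make the replacements.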
c <- gsub(" ", "_", b)
c
#> [1] "New_Hampshire" "New_Jersey" "New_Mexico"
#> [4] "New_York" "North_Carolina" "North_Dakota"
#> [7] "Rhode_Island" "South_Carolina" "South_Dakota"
#> [10] "West_Virginia" "U.S._Territories_&_DC"
library(mgsub)
mgsub(search_string, b, c)
#> [1] " Stamps\nNevada 61,455 82,713 12,832 95,545 $ 1,670,735 $ 1,461,634 $ 3,132,369\nNew_Hampshire 67,586 194,207 39,225 233,432 $ 2,287,792 $ 1,372,421 $ 3,660,213\nNew_Jersey 82,814 282,527 146,678 429,205 $ 6,335,263 $ 2,813,593 $ 9,148,856\nNew_Mexico 111,188 379,489 81,056 460,545 $ 4,653,064 $ 8,789,532 $ 13,442,596\nNew_York 696,679 679,458 74,731 754,189 $ 13,193,942 $ 5,298,613 $ 18,492,555\nNorth_Carolina 433,135 471,648 24,260 495,908 $ 8,446,725 $ 1,203,040 $ 9,649,765\nNorth_Dakota 141,816 413,234 162,252 575,486 $ 3,526,114 $ 4,310,924 $ 7,837,038\nOhio 426,856 1,068,917 17,723 1,086,640 $ 15,007,107 $ 1,396,546 $ 16,403,653\nOklahoma 330,336 334,251 14,673 348,924 $ 5,849,527 $ 1,277,809 $ 7,127,336\nOregon 297,944 1,344,799 64,439 1,409,238 $ 14,306,510 $ 4,298,684 $ 18,605,194\nPennsylvania 1,048,731 2,398,471 122,202 2,520,673 $ 30,601,457 $ 8,181,893 $ 38,783,350\nRhode_Island 10,750 29,270 3,553 32,823 $ 216,706 $ 79,868 $ 296,574\nSouth_Carolina 279,203 207,379 58,527 265,906 $ 3,241,468 $ 4,117,974 $ 7,359,442\nSouth_Dakota 216,152 247,222 100,706 347,928 $ 5,588,964 $ 9,431,150 $ 15,020,114\nTennessee 725,110 1,094,149 37,301 1,131,450 $ 11,555,825 $ 1,855,572 $ 13,411,397\nTexas 1,027,908 1,205,905 60,198 1,266,103 $ 19,675,334 $ 6,764,564 $ 26,439,898\nUtah 159,678 217,128 13,025 230,153 $ 7,399,301 $ 2,826,440 $ 10,225,741\nVermont 92,138 168,989 23,319 192,308 $ 2,461,500 $ 1,250,190 $ 3,711,690\nVirginia 314,748 774,910 48,213 823,123 $ 8,800,321 $ 2,369,762 $ 11,170,083\nWashington 198,162 780,794 10,718 791,512 $ 10,837,451 $ 744,633 $ 11,582,084\nWest_Virginia 288,098 656,091 174,657 830,748 $ 4,831,265 $ 6,227,285 $ 11,058,550\nWisconsin 689,099 2,472,489 127,017 2,599,506 $ 24,942,778 $ 6,534,212 $ 31,476,990\nWyoming 137,608 165,464 75,434 240,898 $ 4,258,947 $ 15,242,063 $ 19,501,010\nTotal 14,966,406 31,340,988 2,846,854 34,187,842 $ 412,251,767 $ 246,742,031 $ 658,993,797\nU.S._Territories_&_DC"
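If adding the mgsub dependency is not desirable, the same replacement can be sketched in base R by looping over the two-word names. `underscore_states()` is a hypothetical helper written for this illustration, and taking the names from the built-in `state.name` vector is an assumption, not what the answer above does.

```r
# Hypothetical helper: replace each two-word state name with its
# underscored form, using only base R.
underscore_states <- function(x, states = state.name) {
  multi <- states[grepl(" ", states, fixed = TRUE)]
  for (s in multi) {
    # fixed = TRUE treats the state name as literal text, not as a regex
    x <- gsub(s, gsub(" ", "_", s, fixed = TRUE), x, fixed = TRUE)
  }
  x
}

# Toy input in the same shape as search_string.
demo <- "Nevada 61,455\nNew York 696,679\nVirginia 314,748\nWest Virginia 288,098"
result <- underscore_states(demo)

result
```

The numbers and line breaks are left untouched, and the single-word "Virginia" row is not affected by the "West Virginia" replacement. The sketch relies on the two words being separated by exactly one space.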
Created on 2020-11-06 by the reprex package (v0.3.0)