Splitting a full address column into multiple columns

Problem description

I have a data frame with the following column structure (more than 1,000 rows in total):

addressfull
POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map
POINT(3.124537680 32.179354014)||DEF_32||winkellaan 67, 5788 BG Amsterdam, Netherlands||13||map
POINT(3.124537653 32.179354012)||DEF_32||vermeerstraat 18, 0932 DC Rotterdam, Netherlands||11||map
POINT(2.915206183 24.315583523)||DEF_32||--||13||map
POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map
structure(list(addressfull = structure(c(3L, 5L, 4L, 2L, 1L), .Label = c("POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map", 
"POINT(2.915206183 24.315583523)||DEF_32||--||13||map", "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map", 
"POINT(3.124537653 32.179354012)||DEF_32||vermeerstraat 18, 0932 DC Rotterdam, Netherlands||11||map", 
"POINT(3.124537680 32.179354014)||DEF_32||winkellaan 67, 5788 BG Amsterdam, Netherlands||13||map"
), class = "factor")), class = "data.frame", row.names = c(NA, 
-5L))

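The column contains the location, street, house number, postal code, city and country. I would like to use R to split the addressfull column into multiple columns, for example: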

street        house number      zip       city      country
molengraaf    20                1689 GL   Utrecht   Netherlands
winkellaan    67                5788 BG   Amsterdam Netherlands
vermeerstraat 18                0932 DC   Rotterdam Netherlands
na            na                na        na        na
Zandhorstlaan 122               0823 GT   Ochtrup   Germany

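I have read the tidyr and stringr documentation. I can see patterns to split on: by ")", by "|" from position x, and by ",". But I can't work out the right code to split the column into multiple columns.

Can anyone help me?

Tags: r, split, tidyr, stringr

Solution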


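You could brute-force it with base R's sub(), using one call per target column. Each pattern matches the whole address and keeps only one capture group: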

df$street <- sub("^(\\S+)\\s+.*$", "\\1", df$addressfull)
df$`house number` <- sub("^\\S+\\s+(\\d+).*$", "\\1", df$addressfull)
df$zip <- sub("^\\S+\\s+\\d+,\\s*(\\d+\\s+[A-Z]+).*$", "\\1", df$addressfull)
df$city <- sub("^.*?(\\S+),\\s*\\S+$", "\\1", df$addressfull)
df$country <- sub("^.*,\\s*(\\S+)$", "\\1", df$addressfull)
df

                                  addressfull     street house number     zip
1 molengraaf 20, 1689 GL Utrecht, Netherlands molengraaf           20 1689 GL
     city     country
1 Utrecht Netherlands

Data:

df <- data.frame(addressfull=c("molengraaf 20, 1689 GL Utrecht, Netherlands"),
                 stringsAsFactors=FALSE)
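
One caveat with the sub() calls above: when a pattern does not match (for example the "--" placeholder row in the question), sub() returns the input unchanged rather than NA. As a minimal sketch of an alternative, a single pattern with five capture groups can be applied with regexec() and regmatches(), filling non-matching rows with NA. The pattern and the names `addresses`, `pattern`, `parts` and `result` are illustrative assumptions; the pattern presumes the "street number, zip city, country" layout of the sample data with a one-word street name.

```r
# Already-isolated address strings; "--" stands for a missing address.
addresses <- c(
  "molengraaf 20, 1689 GL Utrecht, Netherlands",
  "--"
)

# Five capture groups: street, house number, zip, city, country.
# Assumes a one-word street name, as the sub() patterns do.
pattern <- "^(\\S+)\\s+(\\d+),\\s*(\\d+\\s+[A-Z]+)\\s+([^,]+),\\s*(.+)$"

# regmatches() returns c(full match, group 1..5) per row,
# or character(0) when the pattern does not match.
m <- regmatches(addresses, regexec(pattern, addresses))

# Keep the five groups, or five NAs for rows that did not match.
parts <- t(vapply(
  m,
  function(x) if (length(x) == 6) x[-1] else rep(NA_character_, 5),
  character(5)
))
colnames(parts) <- c("street", "house number", "zip", "city", "country")

result <- data.frame(parts, stringsAsFactors = FALSE, check.names = FALSE)
result
```

Using one pattern keeps the assumptions about the address layout in a single place, at the cost of a longer regular expression.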

This assumes that the address text has already been isolated. To do that, consider:

text <- "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map"
# Split on the literal "||" delimiter and keep the third field (the address)
addressfull <- unlist(strsplit(text, "\\|\\|"))[3]
addressfull

[1] "molengraaf 20, 1689 GL Utrecht, Netherlands"
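
The same isolation step can be applied to the whole column at once. The following is a sketch only, using the five rows from the question. The column name `raw` and the decision to treat "--" as missing are assumptions, not part of the original answer.

```r
# The five raw values from the question.
raw <- c(
  "POINT(3.124537653 32.179354012)||DEF_32||molengraaf 20, 1689 GL Utrecht, Netherlands||15||map",
  "POINT(3.124537680 32.179354014)||DEF_32||winkellaan 67, 5788 BG Amsterdam, Netherlands||13||map",
  "POINT(3.124537653 32.179354012)||DEF_32||vermeerstraat 18, 0932 DC Rotterdam, Netherlands||11||map",
  "POINT(2.915206183 24.315583523)||DEF_32||--||13||map",
  "POINT (2.900824999999923 34.3175721)||DEF_84||Zandhorstlaan 122, 0823 GT Ochtrup, Germany||17||map"
)
df <- data.frame(raw = raw, stringsAsFactors = FALSE)

# Split each row on the literal "||" and keep the third field.
# as.character() guards against a factor column, which strsplit() rejects.
fields <- strsplit(as.character(df$raw), "||", fixed = TRUE)
df$addressfull <- vapply(fields, function(x) x[3], character(1))

# Assumption: "--" marks a missing address, so store it as NA.
df$addressfull[df$addressfull == "--"] <- NA_character_

# sub() propagates NA, so extraction patterns can be applied directly.
df$country <- sub("^.*,\\s*(\\S+)$", "\\1", df$addressfull)
df[, c("addressfull", "country")]
```

With fixed = TRUE the delimiter is matched literally, so the pipes need no escaping.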
