Match URLs containing a string pattern and save them in new columns of an R data frame

Problem description

I have a dataset in which a single column, "urls", holds several URLs per row as one space-separated string:

urls <- "https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx"
id <- 1

df <- cbind(data.frame(urls), data.frame(id))

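I now want to extract the full URL that matches "linkedin.com" and store it in a new column, df$linkedin, and do the same for the URL that matches "medium.com", storing it in a new column, df$medium. So the result should basically be: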

df$linkedin
"https://www.linkedin.com/xx/xxx-xx-xxx/"

df$medium
"https://medium.com/@xxxxx"

Somehow I'm having a bad hair day today and can't see an elegant solution. It would be great if you could help me out here :)

Tags: r, regex

Solution


I'll make it a bit more interesting by using two rows:

df2 <- structure(list(urls = c("https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx", "https://www.linkedin.com/yy/yyy-yy-yyy/ https//domain.io https://medium.com/@yyyyy"), id = c(1, 1)), row.names = c(NA, -2L), class = "data.frame")
df2
#                                                                                 urls id
# 1 https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx  1
# 2 https://www.linkedin.com/yy/yyy-yy-yyy/ https//domain.io https://medium.com/@yyyyy  1

Base R

baseurls <- c("linkedin", "medium")
newcols <- lapply(setNames(nm = baseurls), function(U) unlist(regmatches(df2$urls, gregexpr(paste0("http[^ ]*", U, "[^ ]*"), df2$urls))))
newcols
# $linkedin
# [1] "https://www.linkedin.com/xx/xxx-xx-xxx/" "https://www.linkedin.com/yy/yyy-yy-yyy/"
# $medium
# [1] "https://medium.com/@xxxxx" "https://medium.com/@yyyyy"
cbind(df2, data.frame(newcols))
#                                                                                 urls id                                linkedin                    medium
# 1 https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx  1 https://www.linkedin.com/xx/xxx-xx-xxx/ https://medium.com/@xxxxx
# 2 https://www.linkedin.com/yy/yyy-yy-yyy/ https//domain.io https://medium.com/@yyyyy  1 https://www.linkedin.com/yy/yyy-yy-yyy/ https://medium.com/@yyyyy
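
One caveat with the base R version: unlist() over gregexpr() matches only lines up with the rows when every row contains exactly one match per pattern. If a row might contain no matching URL, something like the following sketch may be safer. It keeps only the first match per row and fills in NA otherwise. The helper name extract_first and the second sample row (which has no LinkedIn URL) are made up for illustration and are not part of the original answer.

```r
# Sample data: the second row deliberately has no linkedin URL (hypothetical).
df2 <- data.frame(
  urls = c(
    "https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx",
    "https//domain.io https://medium.com/@yyyyy"
  ),
  id = c(1, 2)
)
baseurls <- c("linkedin", "medium")

# Hypothetical helper: first match per element, NA where nothing matches.
extract_first <- function(x, pattern) {
  m <- regexpr(pattern, x)          # position of the first match, -1 if none
  out <- rep(NA_character_, length(x))
  matched <- m != -1L
  out[matched] <- regmatches(x, m)  # regmatches() returns matched elements only
  out
}

newcols <- lapply(
  setNames(nm = baseurls),
  function(U) extract_first(df2$urls, paste0("http[^ ]*", U, "[^ ]*"))
)
result <- cbind(df2, data.frame(newcols))
result
```

Because each new column always has one entry per row, cbind() no longer fails or misaligns when a URL is missing.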

tidyverse

## baseurls <- ...
library(dplyr)
library(stringr) # str_extract
library(purrr)   # map_dfc
map_dfc(setNames(nm = baseurls), ~ str_extract(df2$urls, paste0("http[^ ]*", .x, "[^ ]*"))) %>%
  bind_cols(df2, .)
#                                                                                 urls id                                linkedin                    medium
# 1 https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx  1 https://www.linkedin.com/xx/xxx-xx-xxx/ https://medium.com/@xxxxx
# 2 https://www.linkedin.com/yy/yyy-yy-yyy/ https//domain.io https://medium.com/@yyyyy  1 https://www.linkedin.com/yy/yyy-yy-yyy/ https://medium.com/@yyyyy
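
If there are only a couple of fixed domains, a more explicit variant may be easier to read than mapping over baseurls. The sketch below is an alternative under that assumption, not the original answer's code: it spells out each column in mutate() and matches the full domain with the dot escaped, so that "linkedin.com" does not also match something like "linkedinXcom". str_extract() already returns NA for rows without a match.

```r
library(dplyr)
library(stringr)

# Same two-row sample data as above.
df2 <- data.frame(
  urls = c(
    "https://www.linkedin.com/xx/xxx-xx-xxx/ https//domain.io https://medium.com/@xxxxx",
    "https://www.linkedin.com/yy/yyy-yy-yyy/ https//domain.io https://medium.com/@yyyyy"
  ),
  id = c(1, 1)
)

# One str_extract() call per target domain; "\\." matches a literal dot.
result <- df2 %>%
  mutate(
    linkedin = str_extract(urls, "http[^ ]*linkedin\\.com[^ ]*"),
    medium   = str_extract(urls, "http[^ ]*medium\\.com[^ ]*")
  )
result
```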
