How do I keep all the data after using the trimws function in R?

Problem description

Sample values from the 'Referer URL' column are shown below:

https://www.google.com/ | query_string=utm_source=google&utm_medium=cpc&utm_campaign=121434112139&utm_term=&utm_content=Shirts&gclid=CXjadiOcHGGw6JEiJaf5zMhRxFk-AOtiXMOd_1szoBoCUEMQAvD_BwE | ip_address=123.21.62.57 | user_agent=Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:80.0) Gecko/20100101 Firefox/80.0
https://www.Type2online.com/ | query_string=null | ip_address=113.193.43.211 | user_agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36
https://www.google.com/ | query_string=gclid=CjwKCAjwh7H7BRBBEiwAPXjadn8fnPPR6HnqZrsK46JGDHKOo-C2jxHa1JW7V2glY_Lxi6sNo-AAdRoCDAcQAvD_BwE | ip_address=187.11.116.117 | user_agent=Mozilla/5.0 (Linux; Android 8.0.0; SM-C701F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36

Other URLs, which have no parameters, look like this:
https://m.facebook.com/
instagram.com
https://l.facebook.com
/https://www.google.com/
http://m.facebook.com


I am using the code below to split the URL parameters above and create a new column for each parameter:

Mydata$ref_url<-trimws(matrix(unlist(strsplit(as.character(Mydata$'Referer URL'),'|',fixed=TRUE)),ncol = 4, byrow = TRUE)[,1])

Mydata$query_string<-gsub("query_string=","",trimws(matrix(unlist(strsplit(as.character(Mydata$'Referer URL'),'|',fixed=TRUE)),ncol = 4, byrow = TRUE)[,2]))

Mydata$ip_address<-gsub("ip_address=","",trimws(matrix(unlist(strsplit(as.character(Mydata$'Referer URL'),'|',fixed=TRUE)),ncol = 4, byrow = TRUE)[,3]))

Mydata$user_agent<-gsub("user_agent=","",trimws(matrix(unlist(strsplit(as.character(Mydata$'Referer URL'),'|',fixed=TRUE)),ncol = 4, byrow = TRUE)[,4]))

Each of these statements gives me the following error:

    Error: Assigned data `trimws(...)` must be compatible with existing data.
    x Existing data has 2645 rows.
    x Assigned data has 1096 rows.
    i Only vectors of size 1 are recycled.
    Run `rlang::last_error()` to see where the error occurred.
    In addition: Warning message:
    In matrix(unlist(strsplit(as.character(Mydata$"Referer URL"), "|",  :
      data length [4382] is not a sub-multiple or multiple of the number of rows [1096]

How can I fix this?

Tags: r, trim, strsplit

Solution
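
The row counts in the error message suggest that not every 'Referer URL' value splits into four pieces: URLs without parameters yield a single piece, so `unlist()` returns fewer than 4 × 2645 values and `matrix(..., ncol = 4)` ends up with the wrong number of rows. For instance, 579 rows with four fields plus 2066 rows with only a URL would give exactly the 4382 pieces and 2645 rows reported, although the real breakdown may differ. A minimal sketch for checking this, using a small made-up vector `x` in place of the real column:

```r
# Hypothetical sample: one full record and two bare URLs
x <- c(
  "https://www.google.com/ | query_string=null | ip_address=1.2.3.4 | user_agent=Mozilla/5.0",
  "https://m.facebook.com/",
  "instagram.com"
)

# Split each string on the literal "|" character
parts <- strsplit(x, "|", fixed = TRUE)

# Number of pieces per row; anything other than 4 breaks matrix(..., ncol = 4)
pieces_per_row <- lengths(parts)

# Total number of pieces after unlist(); here it is not 4 * length(x)
total_pieces <- length(unlist(parts))

table(pieces_per_row)
```

Running `table(lengths(...))` on the real column should show how many rows lack the three parameters.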


If you can guarantee that all parameters appear in the same order, the following tidyverse code gives the desired output:

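library(tidyverse)
# `ref` is assumed to be a data frame whose column V1 holds the raw 'Referer URL' strings
# Split on " | " into four columns; fill = "right" pads bare URLs with NA without a warning
ref %>% separate(V1, paste0("V",2:5), sep=" \\| ", fill="right") -> separated
# Name the columns after the keys in the first row (query_string, ip_address, user_agent)
names(separated) <- c("url", gsub("=.+", "", separated[1,2:4]))
# Strip the leading "key=" from every value
separated %>% mutate_all( ~ sub(".+?=","", .))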
#>                            url                                                                                                                                          query_string     ip_address                                                                                                                    user_agent
#> 1      https://www.google.com/ utm_source=google&utm_medium=cpc&utm_campaign=121434112139&utm_term=&utm_content=Shirts&gclid=CXjadiOcHGGw6JEiJaf5zMhRxFk-AOtiXMOd_1szoBoCUEMQAvD_BwE   123.21.62.57                                            Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:80.0) Gecko/20100101 Firefox/80.0
#> 2 https://www.Type2online.com/                                                                                                                                                  null 113.193.43.211           Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.102 Safari/537.36
#> 3      https://www.google.com/                                                     gclid=CjwKCAjwh7H7BRBBEiwAPXjadn8fnPPR6HnqZrsK46JGDHKOo-C2jxHa1JW7V2glY_Lxi6sNo-AAdRoCDAcQAvD_BwE 187.11.116.117 Mozilla/5.0 (Linux; Android 8.0.0; SM-C701F) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.110 Mobile Safari/537.36
#> 4      https://m.facebook.com/                                                                                                                                                  <NA>           <NA>                                                                                                                          <NA>
#> 5                instagram.com                                                                                                                                                  <NA>           <NA>                                                                                                                          <NA>
#> 6       https://l.facebook.com                                                                                                                                                  <NA>           <NA>                                                                                                                          <NA>
#> 7     /https://www.google.com/                                                                                                                                                  <NA>           <NA>                                                                                                                          <NA>
#> 8        http://m.facebook.com                                                                                                                                                  <NA>           <NA>                                                                                                                          <NA>
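
If the order of the parameters cannot be guaranteed, one alternative is to look each field up by its key rather than by position. The following is only a sketch in base R: `parse_referer` is a hypothetical helper name, the vector `x` is made-up sample data, and it assumes that fields are separated by `|` and that each parameter starts with `key=`.

```r
# Hypothetical helper: parse one 'Referer URL' string into named fields.
# Missing parameters are returned as NA, so every row yields four values.
parse_referer <- function(x, keys = c("query_string", "ip_address", "user_agent")) {
  parts <- trimws(strsplit(x, "|", fixed = TRUE)[[1]])
  out <- setNames(rep(NA_character_, length(keys) + 1), c("ref_url", keys))
  out["ref_url"] <- parts[1]               # the URL is assumed to come first
  for (k in keys) {
    prefix <- paste0(k, "=")
    hit <- parts[startsWith(parts, prefix)]  # piece that begins with "key="
    if (length(hit) > 0) out[k] <- substring(hit[1], nchar(prefix) + 1)
  }
  out
}

# Made-up sample: one full record and one bare URL
x <- c(
  "https://www.Type2online.com/ | query_string=null | ip_address=113.193.43.211 | user_agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
  "instagram.com"
)

# One row per input string, one column per field
parsed <- as.data.frame(do.call(rbind, lapply(x, parse_referer)),
                        stringsAsFactors = FALSE)
```

Because every input produces exactly one row, the result has as many rows as the source column, so something like `Mydata[names(parsed)] <- parsed` (with `x` replaced by `as.character(Mydata$'Referer URL')`) should no longer hit the size mismatch.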

