Extract all values in a vector

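Problem description

After extensive research, I could not find a way to extract values from a data frame in R without knowing the keys.

I have a data frame named test that has a vector named common (holding some JSON values) and a vector id. The records number in the millions.

Example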

id  common
1   {ename=pageload, pgloc={from=https://m.amazon.com/gift/popular-aloe-vera-plant?gclid=CjwKCAjw8MD7BRArEiwAGZsrBZh6cWJ1-PGvFC1zMutwfjBJuGROHhW4l_ZtcH3n2ZvPSotsTO-sgxoCucAQAvD_BwE, to=https://m.amazon.com/gift/popular-aloe-vera-plant?gclid=CjwKCAjw8MD7BRArEiwAGZsrBZh6cWJ1-PGvFC1zMutwfjBJuGROHhW4l_ZtcH3n2ZvPSotsTO-sgxoCucAQAvD_BwE#/product-page, clikd=}, dev={ver=1.0, blang=en-GB, ip=27.5.192.167, dtype=Mobile, ua=Mozilla/5.0 (Linux; Android 10; HD1901) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/85.0.4183.127 Mobile Safari/537.36, did=bc0a2740-d89e-11ea-b35c-567d21f9cbe3, appos=Android, appversion=null, model=null, osver=null, brand=null, pfspec=web}, fngid=81f8036d1099bd59ff93454d1f8, dname=fnp.com, user={cur=, id=sufeb@gmail.com}, wegid=null}  

2 {ename=pageload, pgloc={from=https://www.google.com/, to=https://m.amazon.com/gift/red-velvet-fresh-cream-cake?gclid=CjwKCAjw8MD7BRArEiwAGZsrBd464AGGzOLMzzaxggCPNU-onDOZuhUqzz3tB6UOIUneNq6rcduxUxoCjXwQAvD_BwE#/product-page, clikd=}, dev={ver=1.0, blang=en-US, ip=106.217.118.179, dtype=Mobile, ua=Mozilla/5.0 (Linux; Android 8.1.0; vivo 1724) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.92 Mobile Safari/537.36, did=d43e7274-8116-11ea-96ae-b627f142e667, appos=Android, appversion=null, model=null, osver=null, brand=null, pfspec=web}, fngid=aeb8d109630f797980ac4cc4066d4c4b, dname=fnp.com, user={cur=, id=}, wegid=null}  
I want to extract all of the values into separate columns in the same data frame. For example, for id 1, user={cur=, id=sufeb@gmail.com} has a value, whereas for id 2, user={cur=, id=} has none. Hence I need all values regardless of the keys.

I tried this, but it does not work:

library(tidyverse)
library(rjson)

extract_json_column <- function(test){
  test %>%
    rowwise() %>%
    mutate(data = map(common, fromJSON)) %>%
    split(.$index) %>%
    map(~.$common[[1]]) %>%
    map(~map_if(., function(x) length(x) != 1, list)) %>%
    map(as_data_frame) %>%
    bind_rows(.id = "index")
}

df <- do.call(plyr::rbind.fill, lapply(test[test != ""], function(x) as.data.frame(t(unlist(fromJSON(x))))))

Error message

 Error in fromJSON(content, handler, default.size, depth, allowComments,  : 
  invalid JSON input 

Sample output for the keys ename and did. However, I need all possible keys, such as fngid and dtype, without having to type out every key.

id   ename      did  
1  pageload     bc0a2740-d89e-11ea-b35c-567d21f9cbe3  
2  pageload     d43e7274-8116-11ea-96ae-b627f142e667  

Link to data sample: https://docs.google.com/spreadsheets/d/17Gsrq-BCKqRA6csQdQH0duK4olc39TqmnbDYGExhXqo/edit#gid=0

Link to gdrive: https://drive.google.com/file/d/1TxSY97DRT8ih2VJENolar3MJXwtSFTGX/view?usp=sharing

Tags: r

Solution


You can write a function in base R that rewrites each key=value string as valid JSON:

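convert_to_json_string <- function(x){
  # 1. Quote each key: a word followed by "=" that sits right after "{" or ", "
  key_update <- gsub("(?<=[{]|, )(\\w+)=", '"\\1"=', x, perl = TRUE)
  # 2. Quote each non-empty value: the text after "= up to the next , " or }
  update_vals <- gsub('(?<="=)([^{}"]+)(?=(, ")|[}])', '"\\1"', key_update, perl = TRUE)
  # 3. Give empty values an explicit ""
  update_empty_vals <- gsub('(?<="=)(?=[,}])', '""', update_vals, perl = TRUE)
  # 4. Turn "= into ": and join all records into one JSON array
  sprintf("[%s]", toString(gsub('"=', '":', update_empty_vals)))
}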
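
To see what each substitution contributes, the function can be traced by hand on one short record. This is only a sketch: the input string below is a made-up, shortened record in the same key=value style (it is not taken from the data), and the names step1 to step3 are illustrative.

```r
library(jsonlite)

# Hypothetical, shortened record in the same key=value style as test$common
x <- "{ename=pageload, user={cur=, id=a@b.com}, wegid=null}"

# Step 1: quote the keys
step1 <- gsub("(?<=[{]|, )(\\w+)=", '"\\1"=', x, perl = TRUE)
cat(step1, "\n")
# {"ename"=pageload, "user"={"cur"=, "id"=a@b.com}, "wegid"=null}

# Step 2: quote the non-empty values
step2 <- gsub('(?<="=)([^{}"]+)(?=(, ")|[}])', '"\\1"', step1, perl = TRUE)
cat(step2, "\n")
# {"ename"="pageload", "user"={"cur"=, "id"="a@b.com"}, "wegid"="null"}

# Step 3: insert "" where a key has no value
step3 <- gsub('(?<="=)(?=[,}])', '""', step2, perl = TRUE)
cat(step3, "\n")
# {"ename"="pageload", "user"={"cur"="", "id"="a@b.com"}, "wegid"="null"}

# Step 4: replace "= with ": and wrap the record(s) in a JSON array
json <- sprintf("[%s]", toString(gsub('"=', '":', step3)))
cat(json, "\n")
# [{"ename":"pageload", "user":{"cur":"", "id":"a@b.com"}, "wegid":"null"}]

# The string now parses; nested objects become nested data frames
parsed <- fromJSON(json)
str(parsed)
```

Note that null ends up as the string "null" rather than a missing value, because step 2 quotes every non-empty value.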

Then run:

jsonlite::fromJSON(convert_to_json_string(test$common))

The result is too large to post here, but its structure is as follows:

str(jsonlite::fromJSON(convert_to_json_string(test$common)))
'data.frame':   10 obs. of  7 variables:
 $ ename: chr  "pageload" "pageload" "pageload" "pageload" ...
 $ pgloc:'data.frame':  10 obs. of  3 variables:
  ..$ from : chr  "https://m.amazon.com/gift/popular-aloe-vera-plant?gclid=CjwKCAjw8MD7BRArEiwAGZsrBZh6cWJ1-PGvFC1zMutwfjBJuGROHhW"| __truncated__ "https://www.google.com/" "plp~designer-cakes" "" ...
  ..$ to   : chr  "https://m.amazon.com/gift/popular-aloe-vera-plant?gclid=CjwKCAjw8MD7BRArEiwAGZsrBZh6cWJ1-PGvFC1zMutwfjBJuGROHhW"| __truncated__ "https://m.amazon.com/gift/red-velvet-fresh-cream-cake?gclid=CjwKCAjw8MD7BRArEiwAGZsrBd464AGGzOLMzzaxggCPNU-onDO"| __truncated__ "home" "https://www.amazon.com/" ...
  ..$ clikd: chr  "" "" "back" "" ...
 $ dev  :'data.frame':  10 obs. of  12 variables:
  ..$ ver       : chr  "1.0" "1.0" "1.0" "1.0" ...
  ..$ blang     : chr  "en-GB" "en-US" "en-US" "en-US" ...
  ..$ ip        : chr  "27.5.192.167" "106.217.118.179" "223.237.31.203" "103.240.195.254" ...
  ..$ dtype     : chr  "Mobile" "Mobile" "Mobile" "De
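
The structure above is nested: pgloc and dev are themselves data frames inside the result. Since the question asks for every value in its own column next to id, one possible follow-up is to flatten the nested columns and bind id back on. The sketch below assumes jsonlite::flatten is acceptable for this; the JSON literal and sample_df are small made-up stand-ins for the converted string and for test, not the real data.

```r
library(jsonlite)

# Hypothetical stand-in for test: only the id column matters here
sample_df <- data.frame(id = c(1, 2))

# Hypothetical stand-in for convert_to_json_string(test$common)
json <- '[
  {"ename":"pageload", "user":{"cur":"", "id":"a@b.com"}, "dev":{"dtype":"Mobile"}},
  {"ename":"pageload", "user":{"cur":"", "id":""},        "dev":{"dtype":"Desktop"}}
]'

# Nested objects come back as data-frame columns (user, dev)
nested <- fromJSON(json)

# Flatten them into ordinary columns named parent.child, e.g. user.id.
# The jsonlite:: prefix avoids a clash with purrr::flatten when tidyverse is loaded.
flat <- jsonlite::flatten(nested)

# Put id back alongside the extracted values
result <- cbind(id = sample_df$id, flat)
names(result)
```

This relies on the rows of the parsed result being in the same order as the input vector, so id can simply be bound column-wise.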
