首页 > 解决方案 > 基于另一个数据帧映射从不同数据帧中提取数据

问题描述

我正在尝试根据 R 中的某些类别数据进行列表拆分。

我有这个数据:

# A tibble: 5 x 2
  category to_split  
  <chr>    <chr>     
1 cat12    c(1, 5)   
2 cat22    c(2, 5, 1)
3 cat33    3         
4 cat43    4         
5 cat51    c(5, 2)

其中c(1, 5)对应于该数据中的第 1 行和第 5 行。c(2, 5, 1)对应于第 2 行、第 5 行和第 1 行。

我还有第二个更大的数据框,如下所示:

# A tibble: 100 x 4
# Groups:   station_location [5]
   category  var1  var2  var3
   <chr>    <dbl> <dbl> <dbl>
 1 cat12        7   0.4    10
 2 cat12       20   1.1   155
 3 cat12       12   0.4     3
 4 cat12        4   0.3    38
 5 cat12       13   0.4    40
 6 cat12        7   0.3    17
 7 cat12        9   0.4    45
 8 cat12        3   0.3    17
 9 cat12        8   0.5    84
10 cat12       32   2.6   378
# ... with 90 more rows

我想创建新列表,以便从大数据框中c(1, 5)提取cat12和数据。cat51c(2, 5, 1)提取cat22,cat51cat12数据并将此数据存储在数据框中(在列表中)。

我想得到列表结构如下:

list(
     c(1, 5)  - a data frame containing the two corresponding categories of data
     c(2, 5, 1) - a data frame contained the three corresponding categories of data
     3
     4
     c(5, 2)
)

(我不关心列表名称c(1, 5)...等)。我将它们命名为数据来自哪个类别(在小数据框中),即

list(
     cat12  - data frame containing the two corresponding categories
     cat22
     cat33
     cat43
     cat51
)

我试图通过较小数据帧中的映射从较大数据帧中提取相关数据。

数据:

data_join <- structure(list(category = c("cat12", "cat22", "cat33", "cat43", 
"cat51"), to_split = c("c(1, 5)", "c(2, 5, 1)", "3", "4", "c(5, 2)"
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-5L))


full_data <- structure(list(category = c("cat12", "cat12", "cat12", "cat12", 
"cat12", "cat12", "cat12", "cat12", "cat12", "cat12", "cat12", 
"cat12", "cat12", "cat12", "cat12", "cat12", "cat12", "cat12", 
"cat12", "cat12", "cat22", "cat22", "cat22", "cat22", "cat22", 
"cat22", "cat22", "cat22", "cat22", "cat22", "cat22", "cat22", 
"cat22", "cat22", "cat22", "cat22", "cat22", "cat22", "cat22", 
"cat22", "cat33", "cat33", "cat33", "cat33", "cat33", "cat33", 
"cat33", "cat33", "cat33", "cat33", "cat33", "cat33", "cat33", 
"cat33", "cat33", "cat33", "cat33", "cat33", "cat33", "cat33", 
"cat43", "cat43", "cat43", "cat43", "cat43", "cat43", "cat43", 
"cat43", "cat43", "cat43", "cat43", "cat43", "cat43", "cat43", 
"cat43", "cat43", "cat43", "cat43", "cat43", "cat43", "cat51", 
"cat51", "cat51", "cat51", "cat51", "cat51", "cat51", "cat51", 
"cat51", "cat51", "cat51", "cat51", "cat51", "cat51", "cat51", 
"cat51", "cat51", "cat51", "cat51", "cat51"), var1 = c(7, 20, 
12, 4, 13, 7, 9, 3, 8, 32, 5, 2, 14, 7, 11, 9, 25, 5, 6, 18, 
14, 12, 11, 11, 5, 7, 12, 2, 7, 7, 5, 28, 6, 8, 4, 9, 4, 11, 
6, 5, NA, NA, 24, 6, 6, 29, NA, 11, NA, NA, NA, 9, NA, 8, 7, 
NA, 17, 6, NA, 6, NA, NA, NA, NA, NA, NA, NA, NA, 13, NA, NA, 
NA, NA, 16, 7, 8, NA, NA, 10, 19, 6, 10, 3, 12, 2, 2, 7, 11, 
5, 5, 6, 3, 6, 9, 11, 11, 12, 5, 14, 5), var2 = c(0.4, 1.1, 0.4, 
0.3, 0.4, 0.3, 0.4, 0.3, 0.5, 2.6, 0.6, 0.3, 0.5, 0.4, 0.4, 0.7, 
0.5, 0.3, 0.4, 0.6, 0.5, 0.3, 0.4, 0.2, 0.4, 0.5, 0.5, 0.3, 0.4, 
0.3, 0.4, 1.1, 0.4, 0.5, 0.2, 0.5, 0.4, 0.5, 0.6, 0.6, NA, NA, 
0.7, 0.1, 0.3, 0.5, NA, 0.7, NA, NA, NA, 0.2, NA, 0.3, 0.2, NA, 
0.3, 0.3, NA, 0.1, 0.2, 0.2, 0.5, 0.4, 0.3, 0.4, 0.2, 0.4, 0.3, 
0.3, 0.2, 0.3, 0.2, 0.4, 0.2, 0.2, 0.3, 0.3, 0.5, 0.5, 0.4, 0.2, 
0.3, 0.7, 0.3, 0.1, 0.3, 0.3, 0.4, 0.6, 0.3, 0.2, 0.4, 0.6, 0.2, 
0.7, 0.6, 0.4, 0.6, 0.5), var3 = c(10, 155, 3, 38, 40, 17, 45, 
17, 84, 378, 44, 14, 36, 20, 17, 76, 25, 4, 22, 63, 42, 23, 12, 
10, 15, 29, 26, 7, 18, 5, 23, 204, 24, 56, 7, 35, 23, 55, 28, 
65, 10, 13, 54, 13, 22, 45, 29, 58, 49, 14, 2, 9, 15, 38, 41, 
63, 11, 9, 7, 20, 3, 5, 52, 7, 18, 25, 2, 30, 10, 3, 3, 13, 1, 
12, 7, 5, 5, 9, 13, 4, 14, 9, 8, 147, 5, 7, 2, 10, 6, 66, 2, 
8, 6, 3, 8, 5, 45, 6, 20, 27)), class = c("grouped_df", "tbl_df", 
"tbl", "data.frame"), row.names = c(NA, -100L), groups = structure(list(
    station_location = c("cat12", "cat22", "cat33", "cat43", 
    "cat51"), .rows = list(1:20, 21:40, 41:60, 61:80, 81:100)), row.names = c(NA, 
-5L), class = c("tbl_df", "tbl", "data.frame"), .drop = TRUE))

标签: r

解决方案


目前尚不清楚为什么类别和拆分是在同一个数据框中定义的。如果每个拆分是五个类别的子集,则至少有一个类别有 31 = 2^5 - 1 个可能的拆分。

# Give each category an id
cats <- data_join %>%
  select(category) %>%
  mutate(
    .id = row_number()
  )
head(cats)
#> # A tibble: 5 x 2
#>   category   .id
#>   <chr>    <int>
#> 1 cat12        1
#> 2 cat22        2
#> 3 cat33        3
#> 4 cat43        4
#> 5 cat51        5

# Figure out which category goes into each split/subset (Could have more/fewer than 5?)
subsets <- data_join  %>%
  select(to_split) %>%
  mutate(.id = map(to_split, ~ eval(parse(text = .)))) %>%
  unnest(.id) %>%
  inner_join(cats, by = ".id")
head(subsets)
#> # A tibble: 6 x 3
#>   to_split     .id category
#>   <chr>      <dbl> <chr>   
#> 1 c(1, 5)        1 cat12   
#> 2 c(1, 5)        5 cat51   
#> 3 c(2, 5, 1)     2 cat22   
#> 4 c(2, 5, 1)     5 cat51   
#> 5 c(2, 5, 1)     1 cat12   
#> 6 3              3 cat33

# Now you can join the subsets with the data
df <- full_data %>%
  inner_join(
    subsets, by = "category"
  )

# And split by group
splits <- split(df, df$to_split)

names(splits)
#> [1] "3"          "4"          "c(1, 5)"    "c(2, 5, 1)" "c(5, 2)"

reprex 包(v0.3.0)于 2019 年 11 月 1 日创建


推荐阅读