首页 > 解决方案 > 过滤包含向量的数据框列

问题描述

我想过滤包含单元格全部内容的向量的列。我看过R dplyr。过滤包含一列数字向量的数据框,但我的需要略有不同。

样本df(下面的完整代表)

df <- tibble::tribble(
    ~id, ~len, ~vec,
     1L,   1L,   1L,
     2L,   2L,   1:2,
     3L,   2L,   c(1L, 2L),
     4L,   3L,   c(1L, 2L, 3L),
     5L,   3L,   1:3,
     6L,   3L,   c(1L, 3L, 2L),
     7L,   3L,   c(3L, 2L, 1L),
     8L,   3L,   c(1L, 3L, 2L),
     9L,   4L,   c(1L, 2L, 4L, 3L),
    10L,   3L,   c(3L, 2L, 1L)
    )

给出(匹配的颜色编码)

df

我可以按vec列分组:

dfg <- df %>% 
    group_by(vec) %>% 
    summarise(n = n()
             ,total_len = sum(len))

分组

对于单个单元格,直接比较不起作用,但相同

df$vec[4] == df$vec[5]
#> Error in df$vec[4] == df$vec[5]: comparison of these types is not implemented

identical(df$vec[4], df$vec[5])
#> [1] TRUE

但是没有一个等价物在filter中起作用,这是我需要的:

df %>% filter(vec == c(1L, 2L, 3L))
#> Warning in vec == c(1L, 2L, 3L): longer object length is not a multiple of
#> shorter object length
#> Error: Problem with `filter()` input `..1`.
#> x 'list' object cannot be coerced to type 'integer'
#> i Input `..1` is `vec == c(1L, 2L, 3L)`.

df %>% filter(identical(vec, c(1L, 2L, 3L)))
#> # A tibble: 0 x 3
#> # ... with 3 variables: id <int>, len <int>, vec <list>

df %>% filter(identical(vec, vec[5]))
#> # A tibble: 0 x 3
#> # ... with 3 variables: id <int>, len <int>, vec <list>

我确定我缺少一个简单的解决方案。

更高级的需求是匹配单元格内容以任意顺序匹配的位置,因此上面突出显示的 6 个橙色、紫色和金色单元格将全部匹配。一个也适用于列表和向量的解决方案也很好,因为这可能是未来的需要。

完整的代表:

library(tibble)
library(dplyr)
#> 
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#> 
#>     filter, lag
#> The following objects are masked from 'package:base':
#> 
#>     intersect, setdiff, setequal, union

df <- tibble::tribble(
    ~id, ~len, ~vec,
     1L,   1L,   1L,
     2L,   2L,   1:2,
     3L,   2L,   c(1L, 2L),
     4L,   3L,   c(1L, 2L, 3L),
     5L,   3L,   1:3,
     6L,   3L,   c(1L, 3L, 2L),
     7L,   3L,   c(3L, 2L, 1L),
     8L,   3L,   c(1L, 3L, 2L),
     9L,   4L,   c(1L, 2L, 4L, 3L),
    10L,   3L,   c(3L, 2L, 1L)
    )
df
#> # A tibble: 10 x 3
#>       id   len vec      
#>    <int> <int> <list>   
#>  1     1     1 <int [1]>
#>  2     2     2 <int [2]>
#>  3     3     2 <int [2]>
#>  4     4     3 <int [3]>
#>  5     5     3 <int [3]>
#>  6     6     3 <int [3]>
#>  7     7     3 <int [3]>
#>  8     8     3 <int [3]>
#>  9     9     4 <int [4]>
#> 10    10     3 <int [3]>

dfg <- df %>% 
    group_by(vec) %>% 
    summarise(n = n()
             ,total_len = sum(len))
#> `summarise()` ungrouping output (override with `.groups` argument)
dfg           
#> # A tibble: 6 x 3
#>   vec           n total_len
#>   <list>    <int>     <int>
#> 1 <int [1]>     1         1
#> 2 <int [2]>     2         4
#> 3 <int [3]>     2         6
#> 4 <int [3]>     2         6
#> 5 <int [3]>     2         6
#> 6 <int [4]>     1         4

df$vec[4] == df$vec[5]
#> Error in df$vec[4] == df$vec[5]: comparison of these types is not implemented

identical(df$vec[4], df$vec[5])
#> [1] TRUE

df %>% filter(vec == c(1L, 2L, 3L))
#> Warning in vec == c(1L, 2L, 3L): longer object length is not a multiple of
#> shorter object length
#> Error: Problem with `filter()` input `..1`.
#> x 'list' object cannot be coerced to type 'integer'
#> i Input `..1` is `vec == c(1L, 2L, 3L)`.

df %>% filter(identical(vec, c(1L, 2L, 3L)))
#> # A tibble: 0 x 3
#> # ... with 3 variables: id <int>, len <int>, vec <list>

df %>% filter(identical(vec, vec[5]))
#> # A tibble: 0 x 3
#> # ... with 3 variables: id <int>, len <int>, vec <list>

Created on 2021-01-13 by the reprex package (v0.3.0)

标签: rvectorfilterdplyrcomparison

解决方案


投入rowwise并检查length要比较的向量以避免警告。

library(dplyr)

compare <- c(1L, 2L, 3L)

df %>% 
  rowwise() %>%
  filter(length(vec) == length(compare) && all(vec == compare))

#     id   len vec      
#  <int> <int> <list>   
#1     4     3 <int [3]>
#2     5     3 <int [3]>

我们可以filter先按长度,这在较大的数据集上可能会更快。

df %>% 
  filter(lengths(vec) == length(compare)) %>%
  rowwise() %>%
  filter(all(vec == compare)) 

基本 R 中的类似逻辑:

subset(df, sapply(vec, function(x) 
                  length(x) == length(compare) && all(x == compare)))

推荐阅读