R - Querying a pair of columns for a given pair

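Problem description

I have a data frame with two columns that together serve as the primary key for my queries. I want to extract the rows that contain a given pair of identifiers and get the associated value. For example: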

df <- t(combn(LETTERS, 2))
df <- data.frame(term1 = df[,1], term2 = df[,2], value = sample(10, nrow(df), T))

If I want to get, say, the value for the pair "C" and "Z", the only way I can think of is:

cz <- intersect(union(which(df[,1] == "C"), which(df[,2] == "C")), union(which(df[,1] == "Z"), which(df[,2] == "Z")))
df[cz,]

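Is there a more efficient way to do this? My data frame has about 50,000 rows, and I need to perform this operation at least a few million times, so I want it to be as efficient as possible.

Thanks

Tags: r, search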

Solution


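If speed is the concern, data.table should be faster. If, like me, you are not familiar with data.table syntax, dtplyr makes it easy to use. In the benchmark below, dtplyr comes out roughly 2 to 3 times faster than the base R option from the question (median 722 vs. 1755 microseconds). And, at least to me, it is easier to read.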

library(data.table)
library(dtplyr)
library(dplyr, warn.conflicts = FALSE)
library(microbenchmark)

# Creating our test table
df <- tibble(
  term1 = sample(LETTERS, 50000, replace = T),
  term2 = sample(LETTERS, 50000, replace = T),
  value = sample(10, 50000, T)    
)

# lazy version of the test table, for use with dtplyr
df_lazy <- lazy_dt(df)

# approach from the question
cz <- intersect(union(which(df[,1] == "C"), which(df[,2] == "C")), union(which(df[,1] == "Z"), which(df[,2] == "Z")))
df[cz,]

# a dtplyr answer
# (this only builds a lazy query; pipe it into as_tibble() or collect() to get the rows)
cz_dtplyr <- df_lazy %>%
  filter((term1 == "C" & term2 == "Z") | (term1 == "Z" & term2 == "C"))

# benchmarking the 2 options
# (base_union computes only the row indices, and dtplyr only builds the lazy query)
benchmarks <- microbenchmark(
  "base_union" = intersect(union(which(df[,1] == "C"), which(df[,2] == "C")), union(which(df[,1] == "Z"), which(df[,2] == "Z"))),
  "dtplyr" = df_lazy %>%
    filter((term1 == "C" & term2 == "Z") | (term1 == "Z" & term2 == "C"))
)

benchmarks

Unit: microseconds
       expr    min      lq     mean  median      uq    max neval
 base_union 1669.9 1703.15 2127.677 1755.45 2046.40 6121.8   100
     dtplyr  666.8  692.70  744.486  722.10  779.65 1042.2   100
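
The answer mentions data.table but only shows dtplyr. For reference, here is a minimal sketch of a lookup in plain data.table syntax, using a key on both columns so that each query is a binary search rather than a full scan. It assumes the table has the shape from the question, where each unordered pair appears once with term1 sorting before term2. The helper name lookup_pair is hypothetical and not part of any package. This was not included in the benchmark above, so no timing is claimed for it.

```r
library(data.table)

set.seed(1)  # only to make this sketch reproducible

# Same shape as the question's data: each unordered pair of letters appears once,
# with term1 alphabetically before term2 (this ordering is an assumption we rely on)
pairs <- t(combn(LETTERS, 2))
dt <- data.table(
  term1 = pairs[, 1],
  term2 = pairs[, 2],
  value = sample(10, nrow(pairs), replace = TRUE)
)

# Sort by (term1, term2) once; keyed lookups afterwards use binary search
setkey(dt, term1, term2)

# Hypothetical helper: look up a pair regardless of the order it is given in
lookup_pair <- function(dt, a, b) {
  lo <- min(a, b)  # put the pair in the same order as the table
  hi <- max(a, b)
  dt[.(lo, hi), nomatch = NULL]  # keyed join; returns zero rows if the pair is absent
}

lookup_pair(dt, "Z", "C")
```

If the pairs to look up are known in advance, passing them as vectors in a single keyed join, for example dt[.(c("A", "C"), c("B", "Z")), nomatch = NULL], is likely to be cheaper than millions of separate calls. For a table where a pair can appear in either column order, such as the random test table in the answer, the pairs would need to be normalized first or both orders queried.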
