I have a dataset in which 500 people each answer 5 questions drawn at random from a pool of 275 questions. Each answer is on a scale of 1-5.


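library(tidyverse)

# 500 respondents x 5 responses each; q is the question drawn, ans is the 1-5 answer
df <- tibble(id = rep(1:500, 5),
             q = sample.int(n = 275, size = max(id) * 5, replace = TRUE),
             ans = sample.int(n = 5, size = max(id) * 5, replace = TRUE))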

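My task is, for each person, to randomly pick one of their 5 responses (restricted to questions that someone else has also answered) and compare it with the response of a randomly chosen other person who answered the same question. If the two responses are the same I mark it TRUE, otherwise I mark it FALSE.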


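sampled_q <-
  df %>%
  group_by(q) %>%
  mutate(n_times_answer = n()) %>%   # how many responses each question received
  filter(n_times_answer >= 2) %>%    # keep questions with at least two responses
  group_by(id) %>%
  sample_n(1) %>%                    # pick one response per person
  transmute(id, q, assigned = TRUE)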

df %>%

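But from here I don't know how to do the check. This approach is also inefficient: once I have checked one person's response I have in effect checked two responses, so I could technically mark TRUE/FALSE for both people. Efficiency is not a high priority for me, though.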


df %>%  
  pivot_wider(id_cols = id, 
              names_from = q,
              values_from = ans) %>% 



Sample one valid question from each answerer, then join it back onto df.

df %>%
  group_by(q) %>%
  filter(n_distinct(id) > 1) %>% # Keep only questions that have more than one answerer
  group_by(id) %>%
  sample_n(1) %>% # Keep only one question from each answerer
  inner_join(df, by = "q") %>% # Join it back on the main table to identify other answers to the same question
  filter(id.x != id.y) %>% # Don't include answers from the same answerer
  group_by(id.x) %>%
  sample_n(1) %>% # Keep only one other answer
  mutate(matched = ans.x == ans.y) # Check if the answers matched
#> # A tibble: 500 x 6
#> # Groups:   id.x [500]
#>     id.x     q ans.x  id.y ans.y matched
#>    <int> <int> <int> <int> <int> <lgl>  
#>  1     1   175     3   106     3 TRUE   
#>  2     2    15     5   117     4 FALSE  
#>  3     3   256     4   366     3 FALSE  
#>  4     4   268     4   194     4 TRUE   
#>  5     5   161     3   485     5 FALSE  
#>  6     6   100     1   390     4 FALSE  
#>  7     7   248     5   307     2 FALSE  
#>  8     8   126     5   341     4 FALSE  
#>  9     9    65     2    93     2 TRUE   
#> 10    10    48     1   461     5 FALSE  
#> # … with 490 more rows
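
For anyone who wants to rerun this and get the same rows each time, here is a minimal sketch of the same pipeline with a fixed seed and a few sanity checks at the end. The seed value and the name `result` are my own choices for illustration, not part of the question. It uses `slice_sample()`, which dplyr 1.0.0 and later offer as the successor to `sample_n()`. On dplyr 1.1.0 and later the join may emit a many-to-many warning; that is expected here, because several people can draw the same question.

```r
library(dplyr)

set.seed(42)  # assumption: arbitrary seed, only to make the sketch reproducible

# Same simulated data as in the question
df <- tibble(id  = rep(1:500, 5),
             q   = sample.int(n = 275, size = 2500, replace = TRUE),
             ans = sample.int(n = 5,   size = 2500, replace = TRUE))

result <- df %>%
  group_by(q) %>%
  filter(n_distinct(id) > 1) %>%       # questions answered by more than one person
  group_by(id) %>%
  slice_sample(n = 1) %>%              # one question per answerer
  ungroup() %>%
  inner_join(df, by = "q") %>%         # all responses to that question
  filter(id.x != id.y) %>%             # drop the answerer's own responses
  group_by(id.x) %>%
  slice_sample(n = 1) %>%              # one comparison partner per answerer
  ungroup() %>%
  mutate(matched = ans.x == ans.y)

# Sanity checks: one row per answerer, never paired with themselves
stopifnot(!anyDuplicated(result$id.x),
          all(result$id.x != result$id.y))
```

Calling `ungroup()` at the end returns an ordinary tibble, so something like `mean(result$matched)` gives the overall share of matching pairs without any leftover grouping.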
