How to gather columns in several steps in R without losing the grouping

Problem description

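I need to convert a wide dataset to long format, and 16 columns must be collapsed into 4. Each group of 4 columns holds related information, and that relationship must not be "lost" in the conversion.

I have data from a ranking task with four blocks, which essentially gives me a dataset where the information is split into four groups in wide format, i.e. first_image, first_sex, first_score, second_image, second_sex, second_score ...

I have tried various combinations of group_by and gather(), but I am nowhere near a solution.

I have read "Reshaping multiple sets of measurement columns (wide format) into single columns (long format)", but I am afraid I am none the wiser.

I have made some sample data showing what the data for one participant looks like, and also an example of how I would like the data to look.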


library(tidyverse)

sample_dat <- data.frame(subject_id = rep("sj1", 4),
                         first_pick = rep(1, 4),
                         first_image_pick = (c("a", "b", "c", "d")),
                         first_pick_neuro = rep("TD", 4),
                         first_pick_sex = rep("F", 4),
                         second_pick = rep(2, 4),
                         second_image_pick = (c("e", "f", "g", "h")),
                         second_pick_neuro = rep("TD", 4),
                         second_pick_sex = rep("M", 4),
                         third_pick = rep(3, 4),
                         third_image_pick = (c("i", "j", "k", "l")),
                         third_pick_neuro = rep("DS", 4),
                         third_pick_sex = rep("F", 4),
                         fourth_pick = rep(4, 4),
                         fourth_image_pick = (c("m", "n", "o", "p")),
                         fourth_pick_neuro = rep("DS", 4),
                         fourth_pick_sex = rep("M", 4))

Expected output:


final_data <- data.frame(subject_id = rep("sj1", 16),
                         image = c("a", "b", "c", "d",
                                   "e", "f", "g", "h",
                                   "i", "j", "k", "l",
                                   "m", "n", "o", "p"),
                         rank = rep(c(1, 2, 3, 4), each = 4), # from the numbers in the first_pick, second_pick etc. 
                         neuro = rep(c("TD", "DS"), each = 8),
                         sex = rep(c("F", "M", "F", "M"), each = 4))

So far I have tried this, but it just duplicates all the other information:


sample_dat_long <- sample_dat %>%
  group_by(subject_id) %>%
  gather(Pick, Image,
         first_image_pick,
         second_image_pick,
         third_image_pick,
         fourth_image_pick)  

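So basically, I do not want to lose the information that belongs to each image (pick, sex, neuro) when gathering the data.

Any help would be great!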

Tags: r

Solution


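We can do this with melt from data.table, which can reshape several sets of columns from "wide" to "long" format in one call. Here, the columns whose names contain the substrings 'image', 'neuro' and 'sex' are selected with patterns() in the measure argument, and each set is reshaped into its own column to get the expected output.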

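library(data.table)

# setDT() converts sample_dat to a data.table by reference.
# patterns() selects one set of columns per substring; each set becomes one value column.
# 'rank' is the position of a column within its set (a factor with levels 1 to 4),
# which in this data coincides with the values of first_pick ... fourth_pick.
melt(setDT(sample_dat), measure.vars = patterns("image", "neuro", "sex"),
     value.name = c("image", "neuro", "sex"), variable.name = "rank")[,
     .(subject_id, rank, image, neuro, sex)]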
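
Since the question uses the tidyverse, the same reshape can probably also be written with tidyr::pivot_longer. This is not part of the original answer; it is a minimal sketch that assumes tidyr 1.0.0 or later and the exact column names from the question. The helper name ordinal and the regular expression are illustrative choices. The special name ".value" tells pivot_longer to turn the second part of each column name into its own output column, so the four values that belong to one pick stay on the same row.

```r
library(dplyr)
library(tidyr)

# Same example data as in the question (strings kept as character)
sample_dat <- data.frame(
  subject_id = rep("sj1", 4),
  first_pick = rep(1, 4),
  first_image_pick = c("a", "b", "c", "d"),
  first_pick_neuro = rep("TD", 4),
  first_pick_sex = rep("F", 4),
  second_pick = rep(2, 4),
  second_image_pick = c("e", "f", "g", "h"),
  second_pick_neuro = rep("TD", 4),
  second_pick_sex = rep("M", 4),
  third_pick = rep(3, 4),
  third_image_pick = c("i", "j", "k", "l"),
  third_pick_neuro = rep("DS", 4),
  third_pick_sex = rep("F", 4),
  fourth_pick = rep(4, 4),
  fourth_image_pick = c("m", "n", "o", "p"),
  fourth_pick_neuro = rep("DS", 4),
  fourth_pick_sex = rep("M", 4),
  stringsAsFactors = FALSE
)

long_dat <- sample_dat %>%
  pivot_longer(
    cols = -subject_id,
    # Split each name into the ordinal prefix and the remainder;
    # the remainder (pick, image_pick, pick_neuro, pick_sex) becomes a column.
    names_to = c("ordinal", ".value"),
    names_pattern = "^(first|second|third|fourth)_(.+)$"
  ) %>%
  # Rename to the names used in the expected output and drop the helper column
  select(subject_id, image = image_pick, rank = pick,
         neuro = pick_neuro, sex = pick_sex) %>%
  # pivot_longer emits rows per original row, so sort to match final_data
  arrange(rank, image)
```

Unlike the melt call, this takes rank from the actual first_pick ... fourth_pick values rather than from column position, so it stays numeric and does not depend on the order of the columns.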
