How to count duplicates and output which variable they came from

Problem description

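I have 12 variables of different lengths in R. Each contains a list of amino acid sequences. I combined them with cbind() into a single data frame, and I want to output how many times each sequence is repeated and in which variables the duplicates were found. I have managed to get the repeat counts with table(), but I can't find a way to determine where the duplicates occur.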

Here is a sample of the data:

S1 <- c("CVVSTNGGSGTYKYIF", "CVVSLKF_GYALNF", "CAVNTQCSPDTCSLYPNLPCPRNA*CRAGVQT_VLFRSVWRV*NDGFCNDLEHCVSNFGNEKLTF")

S2 <- c("CVVSTNGGSGTYKYIF", "CVVSLKF_GYALNF", "CAVNTQCSPDTCSLYPNLPCPRNA*CRAGVQT_VLFRSVWRV*NDGFCNDLEHCVSNFGNEKLTF", "CAAGYGKLTF", "CVVL_ALMF")

S3 <- c("CAVNTQCSPDTCSLYPNLPCPRNA*CRAGVQT_VLFRSVWRV*NDGFCNDLEHCVSNFGNEKLTF", "CAAGYGKLTF", "CVVL_ALMF")

n <- max(length(S1), length(S2), length(S3))
length(S1) <- n
length(S2) <- n
length(S3) <- n

clones <- cbind(S1, S2, S3)

freq <- as.data.frame(table(clones))
freq

This outputs:

clones Freq
CAAGYGKLTF 2
CAVNTQCSPDTCSLYPNLPCPRNA*CRAGVQT_VLFRSVWRV*NDGFCNDLEHCVSNFGNEKLTF 3
CVVL_ALMF 2
CVVSLKF_GYALNF 2
CVVSTNGGSGTYKYIF 2

But the output I want is:

clones Freq Variable
CAAGYGKLTF 2 S2,S3
CAVNTQCSPDTCSLYPNLPCPRNA*CRAGVQT_VLFRSVWRV*NDGFCNDLEHCVSNFGNEKLTF 3 S1,S2,S3
CVVL_ALMF 2 S2,S3
CVVSLKF_GYALNF 2 S1,S2
CVVSTNGGSGTYKYIF 2 S1,S2

Any help would be greatly appreciated!

Tags: r, count, duplicates

Solution


vars <- c("S1", "S2", "S3") # Create a vector with the names of the variables for each vector of amino acid sequences

# For each vector of amino acid sequences, create a dataframe where the second column is
# the vector of amino acid sequences, and the first column is the name of the variable
# where the sequence is stored
dfs <- lapply(vars , function(x) as.data.frame(cbind(x, get(x)), stringsAsFactors = FALSE))

# Combine all the data frames created in the previous step into a single data frame
combined_df <- do.call(rbind, dfs)
# Name the columns of the dataframe
names(combined_df) <- c("Variable", "Sequence")

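library(dplyr)

# Count each sequence and list the variables it occurs in
combined_df %>% 
    filter(!is.na(Sequence)) %>%  # drop NA padding left by length(S1) <- n, if present
    group_by(Sequence) %>% 
    summarise(Freq = n(), Variable = paste(Variable, collapse = ",")) %>% 
    arrange(Sequence)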
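
For readers who prefer not to depend on dplyr, the same idea can be sketched in base R: reshape the vectors into a long table with one row per (variable, sequence) pair, then aggregate per sequence with tapply(). This is only a minimal sketch under a few assumptions: the names seqs, long, and result are illustrative and not from the original answer, the vectors are the three from the question left unpadded, and the column names follow the answer above.

```r
# Example vectors from the question, left unpadded (no NA filling is needed here)
S1 <- c("CVVSTNGGSGTYKYIF", "CVVSLKF_GYALNF",
        "CAVNTQCSPDTCSLYPNLPCPRNA*CRAGVQT_VLFRSVWRV*NDGFCNDLEHCVSNFGNEKLTF")
S2 <- c("CVVSTNGGSGTYKYIF", "CVVSLKF_GYALNF",
        "CAVNTQCSPDTCSLYPNLPCPRNA*CRAGVQT_VLFRSVWRV*NDGFCNDLEHCVSNFGNEKLTF",
        "CAAGYGKLTF", "CVVL_ALMF")
S3 <- c("CAVNTQCSPDTCSLYPNLPCPRNA*CRAGVQT_VLFRSVWRV*NDGFCNDLEHCVSNFGNEKLTF",
        "CAAGYGKLTF", "CVVL_ALMF")

# Named list of the vectors; the list names become the "Variable" labels
seqs <- list(S1 = S1, S2 = S2, S3 = S3)

# Long format: one row per (variable, sequence) pair
long <- data.frame(
  Variable = rep(names(seqs), lengths(seqs)),
  Sequence = unlist(seqs, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Harmless here; drops NA padding if the vectors had been padded to equal length
long <- long[!is.na(long$Sequence), ]

# Per-sequence count and comma-separated list of source variables
freq <- tapply(long$Variable, long$Sequence, length)
vars <- tapply(long$Variable, long$Sequence, paste, collapse = ",")

# Both tapply() calls group by the same factor, so their elements line up
result <- data.frame(
  Sequence = names(freq),
  Freq = as.vector(freq),
  Variable = as.vector(vars),
  stringsAsFactors = FALSE
)
result
```

To keep only sequences that occur more than once, subset with result[result$Freq > 1, ]. If a sequence can appear twice within the same vector, its variable name is repeated in the Variable column; wrapping the values in unique() before paste() would collapse such repeats.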
