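r - How to count duplicates and output the variable they came from
Question
I have 12 variables of different lengths in R. Each one contains a list of amino acid sequences. I have combined them into a data frame using cbind(), and I want to output how many times each sequence is repeated and in which variables the duplicates were found. I have managed to output the number of times each sequence is repeated using table(), but I can't find a way to determine where the duplicates occur.
Here is a sample of the data: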
S1 <- c("CVVSTNGGSGTYKYIF", "CVVSLKF_GYALNF", "CAVNTQCSPDTCSLYPNLPCPRNA*CRAGVQT_VLFRSVWRV*NDGFCNDLEHCVSNFGNEKLTF")
S2 <- c("CVVSTNGGSGTYKYIF", "CVVSLKF_GYALNF", "CAVNTQCSPDTCSLYPNLPCPRNA*CRAGVQT_VLFRSVWRV*NDGFCNDLEHCVSNFGNEKLTF", "CAAGYGKLTF", "CVVL_ALMF")
S3 <- c("CAVNTQCSPDTCSLYPNLPCPRNA*CRAGVQT_VLFRSVWRV*NDGFCNDLEHCVSNFGNEKLTF", "CAAGYGKLTF", "CVVL_ALMF")
n <- max(length(S1), length(S2), length(S3))
length(S1) <- n
length(S2) <- n
length(S3) <- n
clones <- cbind(S1, S2, S3)
freq <- as.data.frame(table(clones))
freq
This outputs:
Clone | Frequency |
---|---|
CAAGYGKLTF | 2 |
CAVNTQCSPDTCSLYPNLPCPRNA*CRAGVQT_VLFRSVWRV*NDGFCNDLEHCVSNFGNEKLTF | 3 |
CVVL_ALMF | 2 |
CVVSLKF_GYALNF | 2 |
CVVSTNGGSGTYKYIF | 2 |
But the output I want is:
Clone | Frequency | Variable |
---|---|---|
CAAGYGKLTF | 2 | S2,S3 |
CAVNTQCSPDTCSLYPNLPCPRNA*CRAGVQT_VLFRSVWRV*NDGFCNDLEHCVSNFGNEKLTF | 3 | S1,S2,S3 |
CVVL_ALMF | 2 | S2,S3 |
CVVSLKF_GYALNF | 2 | S1,S2 |
CVVSTNGGSGTYKYIF | 2 | S1,S2 |
Any help would be greatly appreciated!
Solution
# Names of the variables that hold the vectors of amino acid sequences
vars <- c("S1", "S2", "S3")
# For each variable, build a data frame whose first column is the name of the
# variable and whose second column is the sequences stored in that variable
dfs <- lapply(vars, function(x) as.data.frame(cbind(x, get(x)), stringsAsFactors = FALSE))
# Combine all the data frames created in the previous step into a single data frame
combined_df <- do.call(rbind, dfs)
# Name the columns of the data frame
names(combined_df) <- c("Variable", "Sequence")
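library(dplyr)
# Count how often each sequence occurs and list the variables that contain it
combined_df %>%
  filter(!is.na(Sequence)) %>%  # drop the NA padding introduced by length(S1) <- n
  group_by(Sequence) %>%
  summarise(Freq = n(), Variable = paste(Variable, collapse = ",")) %>%
  arrange(Sequence)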
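If you would rather not depend on dplyr, the same idea can be sketched in base R. This is only an illustration under a few assumptions: the vectors are collected in a named list (here called `seqs`, a name not in the original post), and `unique()` is applied so that a variable is listed once even if a sequence appears in it several times. The names `long`, `by_seq`, and `result` are likewise made up for this example.

```r
# Sample vectors from the question, without the NA padding
S1 <- c("CVVSTNGGSGTYKYIF", "CVVSLKF_GYALNF",
        "CAVNTQCSPDTCSLYPNLPCPRNA*CRAGVQT_VLFRSVWRV*NDGFCNDLEHCVSNFGNEKLTF")
S2 <- c("CVVSTNGGSGTYKYIF", "CVVSLKF_GYALNF",
        "CAVNTQCSPDTCSLYPNLPCPRNA*CRAGVQT_VLFRSVWRV*NDGFCNDLEHCVSNFGNEKLTF",
        "CAAGYGKLTF", "CVVL_ALMF")
S3 <- c("CAVNTQCSPDTCSLYPNLPCPRNA*CRAGVQT_VLFRSVWRV*NDGFCNDLEHCVSNFGNEKLTF",
        "CAAGYGKLTF", "CVVL_ALMF")

# Assumption: the vectors are gathered in a named list instead of looked up with get()
seqs <- list(S1 = S1, S2 = S2, S3 = S3)

# Long format: one row per (variable, sequence) pair
long <- data.frame(
  Variable = rep(names(seqs), lengths(seqs)),
  Sequence = unlist(seqs, use.names = FALSE),
  stringsAsFactors = FALSE
)

# Drop NA entries in case the vectors were padded to equal length
long <- long[!is.na(long$Sequence), ]

# Group the variable names by sequence
by_seq <- split(long$Variable, long$Sequence)

# One row per sequence: total count and the distinct variables containing it
result <- data.frame(
  Sequence = names(by_seq),
  Freq = lengths(by_seq),
  Variable = vapply(by_seq, function(v) paste(unique(v), collapse = ","), character(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)
result
```

With 12 variables, only the `seqs` list needs to grow; the rest of the sketch stays the same.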