Find the top N elements shared between two vectors of correlation scores

Problem description

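I have two data tables, train and target, with samples in rows and chemicals in columns; the table values are the relative abundances of the chemicals in the samples. The chemicals are the same in both data sets. I have computed the absolute values of the Spearman correlations for the values in the train and target data, and now I want to find the smallest i such that the first i elements of the two arrays have n elements in common.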

Example: suppose we are looking at chemical Y1, and the correlations of Y1 with chemicals Y1 through Y10 in train and target are:

     train
       Y1   Y2   Y3   Y4   Y5   Y6   Y7   Y8   Y9  Y10
Y1:     1   -1  -.2   .5  -.9   .7   .1   .1  -.2  -.5

     target
       Y1   Y2   Y3   Y4   Y5   Y6   Y7   Y8   Y9  Y10
Y1:     1   .1   .2  -.7   .6   .4   .2   .5  -.5  -.2

The rank order by absolute value for each is:

     train
Y1:  Y1  Y2  Y5  Y6  Y4  Y10 Y9  Y3  Y7  Y8  
     target
Y1:  Y1  Y4  Y5  Y8  Y9  Y6  Y3  Y7  Y10 Y2
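
For reference, a rank order like this can be obtained from a row of correlations with `order()`. This is only a sketch using the Y1 rows from the example; the vector names `train_cor` and `target_cor` are assumptions, not part of the question. Tied values keep their original column order, so tied chemicals may come out in a different order than listed above (for example Y3 and Y9 in train, both 0.2 in absolute value).

```r
chem <- paste0("Y", 1:10)

# Correlations of Y1 with Y1..Y10, copied from the example tables
train_cor  <- setNames(c(1, -1, -.2, .5, -.9, .7, .1, .1, -.2, -.5), chem)
target_cor <- setNames(c(1, .1, .2, -.7, .6, .4, .2, .5, -.5, -.2), chem)

# Chemical names sorted by decreasing absolute correlation;
# ties stay in their original column order
train  <- names(train_cor)[order(-abs(train_cor))]
target <- names(target_cor)[order(-abs(target_cor))]

train
target
```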

The first 5 elements shared between train and target are then:

Y1:  Y1, Y5, Y4, Y6, Y9

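So for n = 5, the first 7 elements of the two arrays have Y1, Y5, Y4, Y6 and Y9 in common, and an algorithm comparing them has to reach the 7th element to find 5 that appear in both lists. In the worst case it has to reach the 10th element.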
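
As a cross-check of this example, here is a minimal brute-force sketch in R. It assumes the two rankings are stored as character vectors, and the function name `smallest_prefix` is made up for illustration. It recomputes the intersection at every step, so it is only meant for small inputs.

```r
# Rankings from the example, strongest absolute correlation first
train  <- c("Y1","Y2","Y5","Y6","Y4","Y10","Y9","Y3","Y7","Y8")
target <- c("Y1","Y4","Y5","Y8","Y9","Y6","Y3","Y7","Y10","Y2")

# Hypothetical helper: smallest i such that the first i elements of
# both rankings share at least n elements (NA if no such i exists)
smallest_prefix <- function(train, target, n) {
  for (i in seq_along(train)) {
    shared <- intersect(train[seq_len(i)], target[seq_len(i)])
    if (length(shared) >= n) return(i)
  }
  NA_integer_
}

smallest_prefix(train, target, 5)
# 7
```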

Here is what I tried:

Any ideas?

Tags: r, data-structures, statistics

Solution


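This algorithm builds the set of n common elements using the first i ranked elements of the 2 sets 'train' and 'target', which have equal size m. The loop takes at most 2*m steps; the rank() and order() calls before it sort their input, which typically costs O(m log m).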

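Basically, compute a score equal to the rank of each element in each ordered set, and compare the scores of corresponding elements. The idea is to add an element to the common list only if the same element does not have a higher rank number (a later position) in the other list.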
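
The same idea can also be written without an explicit loop. The following is an alternative sketch, not the code from this answer: it uses `match()` and `pmax()` to find, for each element, the position at which it has appeared in both lists. The variable names `pos_in_target` and `done_at` are made up for illustration, and the sketch assumes both vectors contain exactly the same elements.

```r
train  <- c("Y1","Y2","Y5","Y6","Y4","Y10","Y9","Y3","Y7","Y8")
target <- c("Y1","Y4","Y5","Y8","Y9","Y6","Y3","Y7","Y10","Y2")

# Position of each train element in target
pos_in_target <- match(train, target)

# An element is shared by both prefixes once i reaches the larger
# of its two positions
done_at <- pmax(seq_along(train), pos_in_target)

n <- 5
i <- sort(done_at)[n]                           # smallest i giving n shared elements
commonset <- train[order(done_at)][seq_len(n)]  # the n elements shared first

i
# 7
commonset
# "Y1" "Y5" "Y4" "Y6" "Y9"
```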

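When the same i is obtained by adding different elements from the 2 sets (for example when n = 7, where either Y7 or Y10 could be chosen), the 'train' set is arbitrarily favored.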
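
To see this tie, one can list which elements become shared at each position i. A small sketch with the example data (the name `done_at` is illustrative, not from the answer's code):

```r
train  <- c("Y1","Y2","Y5","Y6","Y4","Y10","Y9","Y3","Y7","Y8")
target <- c("Y1","Y4","Y5","Y8","Y9","Y6","Y3","Y7","Y10","Y2")

# Position at which each train element has appeared in both lists
done_at <- pmax(seq_along(train), match(train, target))

# Elements grouped by the position at which they become shared
split(train, done_at)

# Both Y10 and Y7 become shared at i = 9, so either one could be
# counted as the 7th common element
train[done_at == 9]
# "Y10" "Y7"
```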

# train and target are the ranked character vectors defined at the end of this answer

# Alphabetical rank of each element name within each set
trainrank <- rank(train)
targetrank <- rank(target)

# Position of each element in its ranked list, with the elements taken in
# alphabetical order so that the two vectors line up element by element
trainscores <- order(trainrank)
targetscores <- order(targetrank)

# Flag train elements whose position in train is later than or equal to
# their position in target, then put the flags back into train order
includetrain <- trainscores >= targetscores
includetrain <- includetrain[trainrank]

# Flag target elements whose position in target is strictly later than
# their position in train, then put the flags back into target order
includetarget <- targetscores > trainscores
includetarget <- includetarget[targetrank]

# To get a set containing n common elements
# from 2 sets of equal size m,
# this loop takes at most 2*m steps
commonset <- c()
m <- length(train)
n <- 5
stopifnot(n <= m) # at most m common elements exist

i <- 1
while (length(commonset) < n) {
  newelement <- NA
  while (i <= m && is.na(newelement)) {
    # If a train element and a target element both qualify
    # at the same i, the train element is taken first
    if (includetrain[i]) {
      newelement <- train[i]
      includetrain[i] <- FALSE
    } else if (includetarget[i]) {
      newelement <- target[i]
      includetarget[i] <- FALSE
    } else {
      i <- i + 1 # Next position if neither flag is set
    }
  }
  commonset <- c(commonset, newelement)
}
commonset # Common set of n elements
# "Y1" "Y5" "Y4" "Y6" "Y9"
print(i) # First i elements used to build the common set
# 7

Raw data

#Train and target data sets
train <- c("Y1","Y2","Y5","Y6","Y4","Y10","Y9","Y3","Y7","Y8")
target <- c("Y1","Y4","Y5","Y8","Y9","Y6","Y3","Y7","Y10","Y2")
