How to intersect and add the score to a column?

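Problem description

I have two datasets. I want to find the overlapping/intersecting/common regions between them and, wherever there is an overlap, extract the corresponding rows from each of the original tables:

Data A: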

   chr  start   end             
 chr1     25     35 
 chr1     50     70   
 chr1     60     85   

Data B:

chr     start   end   score               
 chr1     10     15    24
 chr1     55     75    14
 chr1     76     82    10 

Output tables:

Output 1: the common regions

 chr    start   end             
 chr1     55     70   
 chr1     70     75
 chr1     76     82   

Output 2: rows extracted from data A:

 chr    start   end             
 chr1     50     70   
 chr1     60     85  

Output 3: rows extracted from data B:

chr     start   end   score               
 chr1     55     75    14
 chr1     76     82    10 

I have tried several approaches, but I don't know which one is best:

library(GenomicRanges)
enhancer = with(dataA, GRanges(chr, IRanges(start=start, end=end)))
H3K4me1= with(dataB, GRanges(chr, IRanges(start=start, end=end)))

Approach 1:

hits <- findOverlaps(dataA, dataB)
ranges(dataA)[queryHits(hits)] = ranges(dataB)[subjectHits(hits)]
dataA
dataB

Approach 2:

over<- subsetByOverlaps(dataA, dataB)

Approach 3:

inter = intersect(dataA, dataB)

Approach 4:

groupA <- data.table(dataA)
setkey(groupA, chr, start, end)

groupB <- data.table(dataB)
setkey(groupB, chr, start, end)

over <- foverlaps(groupA, groupB, nomatch = 0)
over2 <- data.table(
  chr = over$chr,
  start = over[, ifelse(start > i.start, start, i.start)],
  end = over[, ifelse(end < i.end, end, i.end)])

Tags: r, intersection, overlap, bioconductor

Solution

I'm not sure whether this is what you are after. Would you mind providing a reproducible example?

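DataA <- data.frame(chr = c("chr1", "chr1", "chr1"), start = c(25,50,60), end = c(35,70,85))
DataB <- data.frame(chr = c("chr1", "chr1", "chr1"), start = c(10,55,76), end = c(15,75,82), score = c(24,14,10))

# Expand every interval in DataA into its integer positions,
# remembering which row each position came from
luA <- Map(`:`, DataA$start, DataA$end)
luA <- data.frame(value = unlist(luA),
                  index = rep(seq_along(luA), lengths(luA)))

# For each row of DataB: the first DataA row whose interval contains
# its start position (NA when there is none)
idx <- luA$index[match(DataB$start, luA$value)]

# Note: this only tests whether a DataB start lies inside a DataA interval
DataA[idx[!is.na(idx)], ]   # matching rows of DataA
DataB[!is.na(idx), ]        # matching rows of DataB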
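
The lookup above only checks whether the start of a DataB row falls inside a DataA interval. A more general overlap test can be sketched in base R as follows. This is only a sketch under assumptions: intervals are treated as closed (both endpoints included), and the helper columns `rowA` and `rowB` and the result names `pairs`, `common`, `hitsA` and `hitsB` are introduced here for illustration and do not come from the question.

```r
DataA <- data.frame(chr = c("chr1", "chr1", "chr1"),
                    start = c(25, 50, 60), end = c(35, 70, 85))
DataB <- data.frame(chr = c("chr1", "chr1", "chr1"),
                    start = c(10, 55, 76), end = c(15, 75, 82),
                    score = c(24, 14, 10))

# Hypothetical helper columns: remember the original row numbers
DataA$rowA <- seq_len(nrow(DataA))
DataB$rowB <- seq_len(nrow(DataB))

# Every A/B pair on the same chromosome
pairs <- merge(DataA, DataB, by = "chr", suffixes = c(".A", ".B"))

# Two closed intervals overlap when each one starts
# no later than the other one ends
pairs <- pairs[pairs$start.A <= pairs$end.B &
               pairs$start.B <= pairs$end.A, ]

# Common region of each overlapping pair (one row per pair)
common <- data.frame(chr   = pairs$chr,
                     start = pmax(pairs$start.A, pairs$start.B),
                     end   = pmin(pairs$end.A, pairs$end.B))

# Rows of each input table that take part in at least one overlap
hitsA <- DataA[sort(unique(pairs$rowA)), c("chr", "start", "end")]
hitsB <- DataB[sort(unique(pairs$rowB)), c("chr", "start", "end", "score")]
```

For the example data, `hitsA` and `hitsB` correspond to Output 2 and Output 3 of the question. `common` holds one region per overlapping pair (55-70, 60-75 and 76-82), so its second region differs from the 70-75 listed in Output 1. Because `merge` builds all pairs per chromosome first, this approach suits small tables; for genome-scale data the `GenomicRanges` or `data.table::foverlaps` routes from the question are likely a better fit.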
