If a number in one data frame meets a condition defined by another data frame, print information from both datasets

Problem description

I have two large data frames. Their head() output is shown below:

Data frame one:

  family_name st_pos
  <chr>        <dbl>
1 AluSp        26791
2 AluJo        31436
3 AluSx        39624
4 AluSz6       40738
5 AluYe5       51585
6 AluSc        62160  

Data frame two:

  external_gene_name start_position end_position
1             ATP1A2      160115759    160143591
2               GCLM       93885199     93909456
3                TPR      186311652    186375693
4             VPS13D       12230030     12512047
5              SZRD1       16352575     16398145
6             ATP2B4      203626561    203744081

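What I want to do: if a number in st_pos of data frame one is greater than start_position and less than end_position in data frame two, I want to print a new table with the column names shown below.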

external_gene_name    family_name     st_pos

I am really new to R and don't even know where to start. Thank you very much for "exponentiating" my learning curve.

Tags: r

Solution


The GenomicRanges package is designed specifically for this kind of problem.

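As you may have noticed, none of the Alus in your sample overlap the genes you provided, so I made some up (see the sample data at the end).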

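library(GenomicRanges)
# Each Alu becomes a width-1 range at st_pos.
# Everything is placed on "chr1" here; the sample data has no chromosome column.
Alus <- GRanges(seqnames = "chr1",
                ranges = IRanges(start = df1$st_pos, width = 1),
                names = df1$family_name)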
Alus
#GRanges object with 6 ranges and 1 metadata column:
#      seqnames    ranges strand |    names
#         <Rle> <IRanges>  <Rle> | <factor>
#  [1]     chr1 160115859      * |    AluSp
#  [2]     chr1  93885299      * |    AluJo
#  [3]     chr1 186312452      * |    AluSx
#  [4]     chr1  12230230      * |   AluSz6
#  [5]     chr1 203627561      * |   AluYe5
#  [6]     chr1     62160      * |    AluSc

# Each gene becomes a range from start_position to end_position.
Genes <- GRanges(seqnames = "chr1",
                 ranges = IRanges(start = df2$start_position, end = df2$end_position),
                 names = df2$external_gene_name)
Genes
#GRanges object with 6 ranges and 1 metadata column:
#      seqnames              ranges strand |    names
#         <Rle>           <IRanges>  <Rle> | <factor>
#  [1]     chr1 160115759-160143591      * |   ATP1A2
#  [2]     chr1   93885199-93909456      * |     GCLM
#  [3]     chr1 186311652-186375693      * |      TPR
#  [4]     chr1   12230030-12512047      * |   VPS13D
#  [5]     chr1   16352575-16398145      * |    SZRD1
#  [6]     chr1 203626561-203744081      * |   ATP2B4

You can then use findOverlaps to find the overlaps between the two sets of ranges:

# queryHits() indexes into Genes, subjectHits() indexes into Alus
Overlaps <- findOverlaps(Genes, Alus)
data.frame(Genes[queryHits(Overlaps), ], Alus[subjectHits(Overlaps), ])
#  seqnames     start       end  width strand  names seqnames.1   start.1     end.1 width.1 strand.1 names.1
#1     chr1 160115759 160143591  27833      * ATP1A2       chr1 160115859 160115859       1        *   AluSp
#2     chr1  93885199  93909456  24258      *   GCLM       chr1  93885299  93885299       1        *   AluJo
#3     chr1 186311652 186375693  64042      *    TPR       chr1 186312452 186312452       1        *   AluSx
#4     chr1  12230030  12512047 282018      * VPS13D       chr1  12230230  12230230       1        *  AluSz6
#5     chr1 203626561 203744081 117521      * ATP2B4       chr1 203627561 203627561       1        *  AluYe5

If a gene has more than one overlap, it will appear in multiple rows.
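
The question asked for only three columns (external_gene_name, family_name, st_pos). A minimal sketch of pulling just those out of the hits might look like the following. It rebuilds the made-up sample data with data.frame() so that it runs on its own, and, as an assumption of this sketch rather than something from the answer above, it stores the labels in metadata columns called family_name and external_gene_name instead of names.

```r
library(GenomicRanges)

# Made-up sample data from this answer, rebuilt compactly
df1 <- data.frame(
  family_name = c("AluSp", "AluJo", "AluSx", "AluSz6", "AluYe5", "AluSc"),
  st_pos = c(160115859L, 93885299L, 186312452L, 12230230L, 203627561L, 62160L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  external_gene_name = c("ATP1A2", "GCLM", "TPR", "VPS13D", "SZRD1", "ATP2B4"),
  start_position = c(160115759L, 93885199L, 186311652L, 12230030L, 16352575L, 203626561L),
  end_position = c(160143591L, 93909456L, 186375693L, 12512047L, 16398145L, 203744081L),
  stringsAsFactors = FALSE
)

# Metadata column names here are this sketch's choice, not the answer's
Alus <- GRanges(seqnames = "chr1",
                ranges = IRanges(start = df1$st_pos, width = 1),
                family_name = df1$family_name)
Genes <- GRanges(seqnames = "chr1",
                 ranges = IRanges(start = df2$start_position, end = df2$end_position),
                 external_gene_name = df2$external_gene_name)

Overlaps <- findOverlaps(Genes, Alus)

# Keep only the three requested columns, one row per gene/Alu hit
result <- data.frame(
  external_gene_name = mcols(Genes)$external_gene_name[queryHits(Overlaps)],
  family_name = mcols(Alus)$family_name[subjectHits(Overlaps)],
  st_pos = start(Alus)[subjectHits(Overlaps)]
)
result
```

Note that findOverlaps treats the endpoints of a range as part of the range, so an st_pos exactly equal to start_position or end_position is reported as a hit. If the comparison has to be strict, those rows would need to be filtered out afterwards.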

样本数据

df1 <- structure(list(family_name = structure(c(3L, 1L, 4L, 5L, 6L, 
2L), .Label = c("AluJo", "AluSc", "AluSp", "AluSx", "AluSz6", 
"AluYe5"), class = "factor"), st_pos = c(160115859L, 93885299L, 
186312452L, 12230230L, 203627561L, 62160L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))

df2 <- structure(list(external_gene_name = structure(c(1L, 3L, 5L, 6L, 
4L, 2L), .Label = c("ATP1A2", "ATP2B4", "GCLM", "SZRD1", "TPR", 
"VPS13D"), class = "factor"), start_position = c(160115759L, 
93885199L, 186311652L, 12230030L, 16352575L, 203626561L), end_position = c(160143591L, 
93909456L, 186375693L, 12512047L, 16398145L, 203744081L)), class = "data.frame", row.names = c("1", 
"2", "3", "4", "5", "6"))
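
For comparison, the same condition can be checked without GenomicRanges. The sketch below is a different technique from the one in this answer: a plain base R cross join followed by a filter, using the strict inequalities from the question. It pairs every row of df1 with every row of df2, so the intermediate table has nrow(df1) * nrow(df2) rows; that is fine for small inputs like the sample data but is likely to be too slow or too memory-hungry for the large data frames the question mentions, and it ignores chromosomes entirely.

```r
# Made-up sample data from this answer, rebuilt compactly
df1 <- data.frame(
  family_name = c("AluSp", "AluJo", "AluSx", "AluSz6", "AluYe5", "AluSc"),
  st_pos = c(160115859L, 93885299L, 186312452L, 12230230L, 203627561L, 62160L),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  external_gene_name = c("ATP1A2", "GCLM", "TPR", "VPS13D", "SZRD1", "ATP2B4"),
  start_position = c(160115759L, 93885199L, 186311652L, 12230030L, 16352575L, 203626561L),
  end_position = c(160143591L, 93909456L, 186375693L, 12512047L, 16398145L, 203744081L),
  stringsAsFactors = FALSE
)

# by = NULL asks merge() for the Cartesian product of the two data frames
pairs <- merge(df1, df2, by = NULL)

# Keep pairs where st_pos lies strictly inside the gene, then the three columns
inside <- pairs$st_pos > pairs$start_position & pairs$st_pos < pairs$end_position
hits <- pairs[inside, c("external_gene_name", "family_name", "st_pos")]
rownames(hits) <- NULL
hits
```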
