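r - If a number in one data frame meets a condition defined by another data frame, print information from both data sets
Question
I have two large data frames. Here is head() of each:
Data frame one: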
family_name st_pos
<chr> <dbl>
1 AluSp 26791
2 AluJo 31436
3 AluSx 39624
4 AluSz6 40738
5 AluYe5 51585
6 AluSc 62160
Data frame two:
external_gene_name start_position end_position
1 ATP1A2 160115759 160143591
2 GCLM 93885199 93909456
3 TPR 186311652 186375693
4 VPS13D 12230030 12512047
5 SZRD1 16352575 16398145
6 ATP2B4 203626561 203744081
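What I want to do: if a number in st_pos of data frame one is greater than start_position and less than end_position in data frame two, I want to print a new table with the column names shown below.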
external_gene_name family_name st_pos
I am really new to R and don't even know where to start. Thank you very much for making my learning curve "exponential".
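Answer
The GenomicRanges package is designed specifically for this kind of problem.
As you may have noticed, none of the Alus in your sample overlap the genes you provided, so I made some up.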
library(GenomicRanges)
# Each Alu becomes a one-base range at st_pos; "chr1" is hardcoded for this example
Alus <- GRanges(seqnames = "chr1",
                ranges = IRanges(start = df1$st_pos, width = 1),
                names = df1$family_name)
Alus
#GRanges object with 6 ranges and 1 metadata column:
# seqnames ranges strand | names
# <Rle> <IRanges> <Rle> | <factor>
# [1] chr1 160115859 * | AluSp
# [2] chr1 93885299 * | AluJo
# [3] chr1 186312452 * | AluSx
# [4] chr1 12230230 * | AluSz6
# [5] chr1 203627561 * | AluYe5
# [6] chr1 62160 * | AluSc
# Each gene becomes a range from start_position to end_position
Genes <- GRanges(seqnames = "chr1",
                 ranges = IRanges(start = df2$start_position, end = df2$end_position),
                 names = df2$external_gene_name)
Genes
#GRanges object with 6 ranges and 1 metadata column:
# seqnames ranges strand | names
# <Rle> <IRanges> <Rle> | <factor>
# [1] chr1 160115759-160143591 * | ATP1A2
# [2] chr1 93885199-93909456 * | GCLM
# [3] chr1 186311652-186375693 * | TPR
# [4] chr1 12230030-12512047 * | VPS13D
# [5] chr1 16352575-16398145 * | SZRD1
# [6] chr1 203626561-203744081 * | ATP2B4
You can then use findOverlaps to find the overlaps between the two sets of ranges:
# queryHits() indexes into Genes, subjectHits() indexes into Alus
Overlaps <- findOverlaps(Genes, Alus)
data.frame(Genes[queryHits(Overlaps),], Alus[subjectHits(Overlaps),])
# seqnames start end width strand names seqnames.1 start.1 end.1 width.1 strand.1 names.1
#1 chr1 160115759 160143591 27833 * ATP1A2 chr1 160115859 160115859 1 * AluSp
#2 chr1 93885199 93909456 24258 * GCLM chr1 93885299 93885299 1 * AluJo
#3 chr1 186311652 186375693 64042 * TPR chr1 186312452 186312452 1 * AluSx
#4 chr1 12230030 12512047 282018 * VPS13D chr1 12230230 12230230 1 * AluSz6
#5 chr1 203626561 203744081 117521 * ATP2B4 chr1 203627561 203627561 1 * AluYe5
If a gene has more than one overlap, it will appear in more than one row.
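The question asks for a table with only external_gene_name, family_name and st_pos. A minimal sketch of one way to get there is to use the hit indices to subset the original data frames directly. The names `hits` and `result` are introduced here for illustration, the sample data is retyped as plain character and numeric columns rather than factors, and everything is assumed to sit on chr1 as in the answer. Note that ranges are treated as closed intervals, so an Alu exactly at start_position or end_position would also be reported, which is slightly looser than the strict comparison in the question.

```r
library(GenomicRanges)

# Sample data (made-up Alu positions, as in the answer)
df1 <- data.frame(
  family_name = c("AluSp", "AluJo", "AluSx", "AluSz6", "AluYe5", "AluSc"),
  st_pos = c(160115859, 93885299, 186312452, 12230230, 203627561, 62160),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  external_gene_name = c("ATP1A2", "GCLM", "TPR", "VPS13D", "SZRD1", "ATP2B4"),
  start_position = c(160115759, 93885199, 186311652, 12230030, 16352575, 203626561),
  end_position = c(160143591, 93909456, 186375693, 12512047, 16398145, 203744081),
  stringsAsFactors = FALSE
)

# Assumption: all features are on chr1
Alus <- GRanges(seqnames = "chr1",
                ranges = IRanges(start = df1$st_pos, width = 1))
Genes <- GRanges(seqnames = "chr1",
                 ranges = IRanges(start = df2$start_position,
                                  end = df2$end_position))

hits <- findOverlaps(Genes, Alus)

# Row i of Genes/Alus corresponds to row i of df2/df1,
# so the hit indices can be applied to the data frames directly
result <- data.frame(
  external_gene_name = df2$external_gene_name[queryHits(hits)],
  family_name = df1$family_name[subjectHits(hits)],
  st_pos = df1$st_pos[subjectHits(hits)],
  stringsAsFactors = FALSE
)
result
```

With real data spread over several chromosomes, seqnames would need to come from a chromosome column in each data frame instead of a constant.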
Sample data
df1 <- structure(list(family_name = structure(c(3L, 1L, 4L, 5L, 6L,
2L), .Label = c("AluJo", "AluSc", "AluSp", "AluSx", "AluSz6",
"AluYe5"), class = "factor"), st_pos = c(160115859L, 93885299L,
186312452L, 12230230L, 203627561L, 62160L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
df2 <- structure(list(external_gene_name = structure(c(1L, 3L, 5L, 6L,
4L, 2L), .Label = c("ATP1A2", "ATP2B4", "GCLM", "SZRD1", "TPR",
"VPS13D"), class = "factor"), start_position = c(160115759L,
93885199L, 186311652L, 12230030L, 16352575L, 203626561L), end_position = c(160143591L,
93909456L, 186375693L, 12512047L, 16398145L, 203744081L)), class = "data.frame", row.names = c("1",
"2", "3", "4", "5", "6"))
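As a cross-check that does not need Bioconductor, the strict condition from the question (st_pos greater than start_position and less than end_position) can be sketched in base R with a cross join. This is only an illustration on the small sample: the names `pairs`, `inside` and `result` are hypothetical, the data is retyped without factors, chromosomes are ignored, and the cross join builds every gene/Alu combination, so it is unlikely to scale to large data frames the way GenomicRanges does.

```r
# Sample data (made-up Alu positions, as in the answer)
df1 <- data.frame(
  family_name = c("AluSp", "AluJo", "AluSx", "AluSz6", "AluYe5", "AluSc"),
  st_pos = c(160115859, 93885299, 186312452, 12230230, 203627561, 62160),
  stringsAsFactors = FALSE
)
df2 <- data.frame(
  external_gene_name = c("ATP1A2", "GCLM", "TPR", "VPS13D", "SZRD1", "ATP2B4"),
  start_position = c(160115759, 93885199, 186311652, 12230030, 16352575, 203626561),
  end_position = c(160143591, 93909456, 186375693, 12512047, 16398145, 203744081),
  stringsAsFactors = FALSE
)

# by = NULL makes merge() return the Cartesian product (every gene with every Alu)
pairs <- merge(df2, df1, by = NULL)

# Strict comparison, exactly as worded in the question
inside <- pairs$st_pos > pairs$start_position & pairs$st_pos < pairs$end_position

result <- pairs[inside, c("external_gene_name", "family_name", "st_pos")]
rownames(result) <- NULL
result
```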