How to add a condition in a pipeline in R

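Problem description

My data is 450K (DNA methylation data). The result below comes from a regional analysis. It contains three columns: chromosome number, start position, and end position: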

region <- structure(list(chr = c(2L, 2L, 2L, 3L, 4L, 5L, 5L, 5L, 6L, 6L, 7L, 8L, 10L, 11L, 12L, 15L, 16L, 18L, 18L, 21L, 22L), start = c(95663987L, 80531500L, 154334651L, 24536765L, 187476837L, 16179633L, 2751822L, 63461803L, 133562246L, 29521568L, 49813031L, 24772270L, 128593922L, 30038286L, 6649733L, 65913660L, 51184152L, 6414602L, 5543801L, 22370347L, 24890330L), end = c(95664360L, 80531899L, 154334652L, 24537302L, 187476838L, 16180267L, 2752602L, 63461931L, 133562777L, 29521715L, 49813487L, 24772351L, 128594418L, 30038311L, 6649995L, 65913661L, 51184887L, 6415253L, 5543946L, 22370759L, 24891142L)), class = "data.frame", row.names = c(4L, 12L, 15L, 14L, 20L,8L, 10L, 18L, 1L, 16L, 5L, 6L, 2L, 21L, 9L, 17L, 13L, 7L, 19L, 11L, 3L))

The distribution of my regions across chromosomes is:

table(region$chr)

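The first chromosome is chr2, which contains three regions here.

I also have a probe file that lists each probe with its chromosome and position. What I want to do is extract the probes that lie within my target regions. Here is the probe file: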

probe <- structure(list(chr = c(6L, 12L, 16L, 1L, 13L, 17L, 16L, 13L, 3L, 17L, 20L, 8L, 12L, 17L, 8L, 6L, 15L, 16L, 16L, 16L, 6L, 1L, 7L, 18L, 2L, 8L, 16L, 10L, 11L, 12L, 1L, 15L, 1L, 11L, 13L, 13L, 6L, 6L, 9L, 12L, 1L, 12L, 13L, 13L, 6L, 1L, 2L, 3L, 11L, 22L, 15L, 11L, 19L, 19L, 1L, 6L, 10L, 3L, 4L, 17L, 10L, 8L, 6L, 2L, 8L, 16L, 1L, 2L, 16L, 9L, 6L, 19L, 10L, 4L, 4L, 17L, 11L, 4L, 1L, 1L, 5L, 3L, 12L, 16L, 7L, 11L, 4L, 6L, 19L, 14L, 17L, 1L, 4L, 7L, 11L, 5L, 5L, 2L, 2L, 8L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), pos = c(159064992L, 114367005L, 28835671L, 200003800L, 42692969L, 73780663L, 65236094L, 114057675L, 23713773L, 56326765L, 44142512L, 103668081L, 111806472L, 4437077L, 8871457L, 143771621L, 29993498L, 696801L, 79623625L, 69385761L, 30685686L, 76190435L, 14031049L, 3732002L, 32853151L, 146233339L, 71757240L, 131844944L, 128424176L, 89749142L, 27693242L, 57138252L, 43123399L, 57407842L, 29067224L, 53191387L, 30921630L, 107971593L, 125133314L, 109915400L, 46668882L, 14720858L, 67804654L, 23500367L, 170398571L, 150241781L, 85843232L, 15106710L, 33758223L, 44350860L, 83726483L, 76814245L, 3789435L, 55013663L, 166846008L, 150289488L, 3187835L, 169684620L, 1340602L, 35297146L, 61569177L, 122954569L, 71276472L, 9563665L, 9952926L, 81040735L, 15392793L, 55183957L, 27228679L, 139334396L, 44090748L, 3979938L, 125425262L, 10687769L, 503198L, 55191642L, 19735701L, 184244831L, 10738664L, 17446073L, 140739501L, 49384054L, 56618196L, 71324066L, 27221689L, 8041137L, 149033953L, 169224907L, 3933591L, 76450658L, 46152449L, 93250590L, 1025591L, 37024552L, 1360335L, 156277860L, 157098423L, 85980756L, 2575755L, 142138643L, 80531898L, 80531597L, 80531656L, 95664233L, 95664359L, 95664243L, 80531645L, 80531599L, 80531500L, 80531842L, 95663987L, 80531751L, 154334651L, 80531633L)), row.names = c("cg13598865", "cg02666265", "cg16662787", "cg10513702", "cg10970751", "cg08536977", "cg09084496", "cg08794696", "cg18648917", "cg20272962", "cg03013946", "cg07028608", 
"cg10361696", "cg06618629", "cg25307778", "cg00888489", "cg21092551", "cg07760369", "cg04317962", "cg08627125", "cg18512512", "cg13901901", "cg13524180", "cg18761756", "cg23633993", "cg07013148", "cg06190759", "cg14070745", "cg11552868", "cg26635451", "cg03201274", "cg25063425", "cg04482817", "cg05082527", "cg24850711", "cg25194273", "cg18964706", "cg01485362", "cg14154487", "cg22511293", "cg01431908", "cg20219035", "cg18855836", "cg06743703", "cg07489447", "cg16269716", "cg12737876", "cg00001245", "cg24871046", "cg07065008", "cg02104456", "cg13466901", "cg17880816", "cg23352067", "cg26870903", "cg12489846", "cg04144333", "cg02399652", "cg24269412", "cg03146993", "cg17307051", "cg20129534", "cg07968224", "cg07814910", "cg02192555", "cg07629951", "cg13322252", "cg18456312", "cg02871891", "cg07874283", "cg26371345", "cg07663404", "cg07036530", "cg17677988", "cg16619777", "cg25182165", "cg20686479", "cg04184793", "cg22513691", "cg17183414", "cg04246144", "cg05383531", "cg25245322", "cg02244933", "cg05516617", "cg11111132", "cg07760722", "cg05357093", "cg08248181", "cg00780666", "cg26932693", "cg14681854", "cg23853026", "cg08044454", "cg22317004", "cg05907764", "cg05482973", "cg03128635", "cg01968492", "cg03460049", "cg00465284", "cg00549910", "cg02856109", "cg03445516", "cg06816651", "cg09409539", "cg09482777", "cg11231249", "cg12078605", "cg21621248", "cg24871414", "cg26355577", "cg26649384", "cg27629977"), class = "data.frame")

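Here is what I tried: extracting probes chromosome by chromosome and region by region. Take chr2 as an example.

library(magrittr)  # provides the %>% pipe
chr2 <- probe %>% subset(chr == 2) %>% subset((pos >= 95663987 & pos <= 95664360) | (pos >= 80531500 & pos <= 80531899) | (pos >= 154334651 & pos <= 154334652))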

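This works well and returns the 14 probes located in those regions. However, my real region file has many more regions on each chromosome, so typing every start and end number into the code is too time-consuming. I would like simpler code that extracts the probes, at least one chromosome at a time.

Here is what I tried: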

chr2.df <- probe %>% subset(chr==2) %>% subset(pos >= region$start & pos <= region$end) 

It did not show any regions...

Can anyone help me extract the probes without typing out the individual start and end numbers from the region file?

Thanks a lot.

Tags: r, conditional-statements, pipeline

Solution


If your goal is to identify the probes that lie within each chromosome's regions, I think this code is sufficient:

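library(magrittr)  # provides the %>% pipe

# Keep the probe IDs (the row names) as a regular column called probe
pdf <- tibble::as_tibble(probe) %>% dplyr::mutate(probe = rownames(probe))

region %>%
  tibble::as_tibble() %>%
  # Pair every region with every probe on the same chromosome
  dplyr::left_join(pdf, by = "chr") %>%
  # Keep probes inside the region; bounds are inclusive, as in the question
  dplyr::filter(pos >= start, pos <= end)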

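I first load the magrittr package, which lets me use the pipe operator, %>%. Then I create a tibble (a kind of data frame) that stores the probe IDs as a new column. This reflects my preference for not keeping row names on data frames.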

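Next, I convert region to a tibble (a kind of data frame) and pass it to the left_join function from the dplyr package. This function merges, or joins, the two data frames on their common "chr" values. Because both region and pdf contain repeated "chr" values, we get multiple rows for, e.g., the "chr" value 2.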
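To make the row multiplication concrete, here is a minimal sketch on toy data. The data frames toy_region and toy_probe, the probe IDs p1 to p4, and the positions are all made up for illustration and are not taken from the question. Recent dplyr versions may warn about a many-to-many relationship on this join; that pairing is exactly what is wanted here.

```r
library(dplyr)

# Hypothetical toy data: two regions on chromosome 2,
# three probes on chromosome 2 and one on chromosome 3.
toy_region <- data.frame(
  chr   = c(2L, 2L),
  start = c(100L, 500L),
  end   = c(200L, 600L)
)
toy_probe <- data.frame(
  probe = c("p1", "p2", "p3", "p4"),
  chr   = c(2L, 2L, 2L, 3L),
  pos   = c(150L, 550L, 300L, 150L)
)

# Each chr2 region is paired with each chr2 probe: 2 regions x 3 probes = 6 rows.
# The chr3 probe matches no region, so a left join from toy_region drops it.
joined <- left_join(toy_region, toy_probe, by = "chr")
nrow(joined)
```

The join deliberately produces more rows than either input; the filtering step afterwards is what narrows them down to real hits.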

Finally, I use the filter function from dplyr to select only those rows where pos lies between start and end.

I hope this works for you.
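If you would rather stay close to the subset() style used in the question, the same join-then-filter idea can probably be written with base R alone. The sketch below swaps in merge() for left_join and subset() for filter; the toy data frames and probe IDs are hypothetical. It also hints at why comparing pos directly against region$start did not work: that comparison recycles the two vectors element by element instead of checking each probe against every region.

```r
# Hypothetical toy data; probe IDs are stored as row names, as in the question.
toy_region <- data.frame(
  chr   = c(2L, 2L, 3L),
  start = c(100L, 500L, 100L),
  end   = c(200L, 600L, 200L)
)
toy_probe <- data.frame(
  chr = c(2L, 2L, 2L, 3L),
  pos = c(100L, 550L, 300L, 250L),
  row.names = c("p1", "p2", "p3", "p4")
)

# Move the row names into a regular column so they survive the merge
toy_probe$probe <- rownames(toy_probe)

# merge() pairs every region with every probe on the same chromosome
merged <- merge(toy_region, toy_probe, by = "chr")

# Keep only pairs where the probe lies inside the region (inclusive bounds)
hits <- subset(merged, pos >= start & pos <= end)
sort(hits$probe)
```

Here p1 sits exactly on a region start and is kept because the bounds are inclusive; p3 and p4 fall outside every region on their chromosome.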

