r - finding each pattern ina set of sequences
问题描述
I am trying to find a pattern in a set of sequences. What i am trying to do is that first take the 1st sequence of the pattern and match it with all subjects one by one and give an output (p1 vs s1,p1 vs s2, p1 vs s3, p1 vs s4), then take the 2nd pattern and match with all subjects (p2 vs s1, p2 vs s2, p2 vs s3,p2 vs s4) and so on i.e. in an iterative way. The input (pattern and subjects) are DNAstringSet instance (Biostrings).
I have used the function
mat=nucleotideSubstitutionMatrix(match=2,mismatch = -3,baseOnly = TRUE)
localAlign=pairwiseAlignment(pattern,subject,type="local",
substitutionMatrix=mat,
gapOpening=-5, gapExtension=-2)
But this way it actually matches p1 vs s1, p2 vs s2, p3 vs s3 nd p4 vs s4
Example:
input:pattern
A DNAStringSet instance of length 734
width seq names
[1] 1000 GGTAAGAGTTTCTTAACAGATCTCAACATTTGCTATATAC...AGATTATTTGTCCTTTGAGATAAAATTACCAC P1
[2] 1000 TGTAAGTAATACTTAATGGTAATTTTTGTTTTCTCTTTCA...AGAAGCAAGGAGACCCGTTAGAGGAAGCATCC P2
[3] 1000 GGTGAGTGTATGATTGATAACTAATCTCTTAGATTAACCA...CATGATATGAAATGGTTCCTAAAGATCCAGAC P3
[4] 1000 GGTGAGCAAAATCAAGCAATGCATTGTTTGTTTTGGAGGG...CTATTTATGTACTACCTTTTTTTTTTAGAAAA P4
input: subject
A DNAStringSet instance of length 1000
width seq names
[1] 1000 GTAGGTACCTGGGAATTCACAAATTAAGACTTTTGAATA...TTCTTATTCAACCGTAGTAACATTAGATGAATA S1
[2] 1000 GTGAGCGCTGCTGCCCAAGCCGCCTGGCTATGCTCGATT...AGATGGCCTTTTCTCTCAGCCCACTGTGACCTA S2
[3] 1000 GTAAGTACAGGCTGAAAGTTACATGCTCTCCAAGGGTGA...ACATAGTAATGAATAGACTTTCAGACACAGCAT S3
[4] 1000 GTAAGTTGCTTGTTTCTTAAATGTTAGGATCTATTACTT...AACAATATAGGTAAGTCTAGCCCTCAAGGCGCT S4
解决方案
如果我正确理解了你的问题,那么你可以试试这个。
library(dplyr)
df <- pattern_df %>%
left_join(subject_df, "seq") %>%
group_by(width.x, seq, names.x) %>%
summarise(subject_names = paste(names.y, collapse=",")) %>%
`colnames<-`(c("width", "seq", "names", "subject_names"))
输出是:
> df
width seq names subject_names
1 1000 GGTAAGAGTTTCTTAACAGATCTCAACATTTGCTATATAC...AGATTATTTGTCCTTTGAGATAAAATTACCAC P1 NA
2 1000 GGTGAGCAAAATCAAGCAATGCATTGTTTGTTTTGGAGGG...CTATTTATGTACTACCTTTTTTTTTTAGAAAA P4 S5,S6
3 1000 GGTGAGTGTATGATTGATAACTAATCTCTTAGATTAACCA...CATGATATGAAATGGTTCCTAAAGATCCAGAC P3 NA
4 1000 TGTAAGTAATACTTAATGGTAATTTTTGTTTTCTCTTTCA...AGAAGCAAGGAGACCCGTTAGAGGAAGCATCC P2 NA
示例数据:(
请注意,我在末尾添加了两行subject_df
)
pattern_df <- structure(list(width = c(1000L, 1000L, 1000L, 1000L), seq = c("GGTAAGAGTTTCTTAACAGATCTCAACATTTGCTATATAC...AGATTATTTGTCCTTTGAGATAAAATTACCAC",
"TGTAAGTAATACTTAATGGTAATTTTTGTTTTCTCTTTCA...AGAAGCAAGGAGACCCGTTAGAGGAAGCATCC",
"GGTGAGTGTATGATTGATAACTAATCTCTTAGATTAACCA...CATGATATGAAATGGTTCCTAAAGATCCAGAC",
"GGTGAGCAAAATCAAGCAATGCATTGTTTGTTTTGGAGGG...CTATTTATGTACTACCTTTTTTTTTTAGAAAA"
), names = c("P1", "P2", "P3", "P4")), .Names = c("width", "seq",
"names"), class = "data.frame", row.names = c(NA, -4L))
# width seq names
#1 1000 GGTAAGAGTTTCTTAACAGATCTCAACATTTGCTATATAC...AGATTATTTGTCCTTTGAGATAAAATTACCAC P1
#2 1000 TGTAAGTAATACTTAATGGTAATTTTTGTTTTCTCTTTCA...AGAAGCAAGGAGACCCGTTAGAGGAAGCATCC P2
#3 1000 GGTGAGTGTATGATTGATAACTAATCTCTTAGATTAACCA...CATGATATGAAATGGTTCCTAAAGATCCAGAC P3
#4 1000 GGTGAGCAAAATCAAGCAATGCATTGTTTGTTTTGGAGGG...CTATTTATGTACTACCTTTTTTTTTTAGAAAA P4
subject_df <- structure(list(width = c(1000L, 1000L, 1000L, 1000L, 1000L, 1000L
), seq = c("GTAGGTACCTGGGAATTCACAAATTAAGACTTTTGAATA...TTCTTATTCAACCGTAGTAACATTAGATGAATA",
"GTGAGCGCTGCTGCCCAAGCCGCCTGGCTATGCTCGATT...AGATGGCCTTTTCTCTCAGCCCACTGTGACCTA",
"GTAAGTACAGGCTGAAAGTTACATGCTCTCCAAGGGTGA...ACATAGTAATGAATAGACTTTCAGACACAGCAT",
"GTAAGTTGCTTGTTTCTTAAATGTTAGGATCTATTACTT...AACAATATAGGTAAGTCTAGCCCTCAAGGCGCT",
"GGTGAGCAAAATCAAGCAATGCATTGTTTGTTTTGGAGGG...CTATTTATGTACTACCTTTTTTTTTTAGAAAA",
"GGTGAGCAAAATCAAGCAATGCATTGTTTGTTTTGGAGGG...CTATTTATGTACTACCTTTTTTTTTTAGAAAA"
), names = c("S1", "S2", "S3", "S4", "S5", "S6")), .Names = c("width",
"seq", "names"), class = "data.frame", row.names = c(NA, -6L))
# width seq names
#1 1000 GTAGGTACCTGGGAATTCACAAATTAAGACTTTTGAATA...TTCTTATTCAACCGTAGTAACATTAGATGAATA S1
#2 1000 GTGAGCGCTGCTGCCCAAGCCGCCTGGCTATGCTCGATT...AGATGGCCTTTTCTCTCAGCCCACTGTGACCTA S2
#3 1000 GTAAGTACAGGCTGAAAGTTACATGCTCTCCAAGGGTGA...ACATAGTAATGAATAGACTTTCAGACACAGCAT S3
#4 1000 GTAAGTTGCTTGTTTCTTAAATGTTAGGATCTATTACTT...AACAATATAGGTAAGTCTAGCCCTCAAGGCGCT S4
#5 1000 GGTGAGCAAAATCAAGCAATGCATTGTTTGTTTTGGAGGG...CTATTTATGTACTACCTTTTTTTTTTAGAAAA S5
#6 1000 GGTGAGCAAAATCAAGCAATGCATTGTTTGTTTTGGAGGG...CTATTTATGTACTACCTTTTTTTTTTAGAAAA S6
推荐阅读
- ajax - 如何使用 WWW::Mechanize (ajax) 发布表单
- azure - 为什么我的 Azure Web App Bot 在某些频道中可以正常工作,但在 Facebook 或“网络聊天测试”中却不行
- r - R中的分类模型
- bluetooth - 需要帮助尝试使 BLE 设备被发现
- google-visualization - 获取 Google Charts 直方图条的数据项或条的范围
- javascript - 如何对以特定格式返回的日期进行单元测试
- angular - ThreeJS,Ionic,Angular2+ - 如何修改位置 BufferGeometry / BufferAttribute 值 - 它说 Index Signature ArrayLike
是只读的 - angular - Angular 应用程序的 Docker 映像构建失败
- java - Spring boot 使用 PropertySource 为每个环境加载配置
- flutter - Firestore 集合快照不会转换为 csv