PostgreSQL query: counting each base in a DNA sequence

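Problem description

I have 5 random DNA sequences, each 20 bases long, and I want to count the bases in each one.

In the first part, I prepared a dna_length function that generates a random DNA sequence; the second query generates 5 sequences of 20 bases each. What I still need is the base count per sequence: how many "A", how many "C", how many "G", and how many "T" each sequence contains.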

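-- Prepared statement: builds one random DNA string of $1 bases.
-- Run it with, for example: execute dna_length(20);
prepare dna_length(int) as
with t1 as (select chr(65) as s union select chr(67) union select chr(71) union select chr(84) )  -- the letters A, C, G, T
, t2 as ( select s, row_number() over() as rn from t1)  -- number the four bases 1..4
, t3 as ( select generate_series(1,$1) as i,round(random() * 4 + 0.5) as rn )  -- one random number per position
, t4 as ( select t2.s from t2 join t3 on (t2.rn=t3.rn))  -- map each random number to its base
select array_to_string(array(select s from t4),'') as dna;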

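-- Second part: generate 5 samples of 20 random bases each.
with t1 as (
    -- lookup table: number 1..4 -> base letter
    select 1 as rn, 'A' as s
    union select 2, 'C'
    union select 3, 'T'
    union select 4, 'G'
), t2 as (
    -- one row per sample
    select generate_series(1, 5) as sample
), t3 as (
    -- 20 positions per sample, each with a random number
    select t2.sample, generate_series(1,20) as i,
           round(random() * 4 + 0.5) as rn
      from t2
), t4 as (
    -- translate each random number into its base
    select t3.sample, t3.i, t3.rn, t1.s
      from t3
      join t1 on t1.rn = t3.rn
)
-- concatenate the bases of each sample in position order
select sample, string_agg(s, '' order by i)
  from t4
 group by sample
 order by sample;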

Right now the output looks like this:

id          DNA          
1   ACTGCTGCAGTCGTACGTAC 
2   TGCAGTCGTAGCTGACGTAG 
3   GCAGTGACCAACGTGTGACA 
4   TGACGTGTCGAGACGAAGAG 
5   CGTGTGAGAGTCGTAGAGTG

The result should look like this:

id          DNA            A   C   G   T
1   ACTGCTGCAGTCGTACGTAC   4   6   5   5
2   TGCAGTCGTAGCTGACGTAG   4   4   7   5
3   GCAGTGACCAACGTGTGACA   6   5   6   3
4   TGACGTGTCGAGACGAAGAG   6   3   8   3
5   CGTGTGAGAGTCGTAGAGTG   4   2   9   5

Tags: sql, database, postgresql

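Solution

You can do conditional counting in the final query, using the aggregate FILTER clause so that each count(*) only sees the rows for one base: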

with ...
select 
    sample, 
    string_agg(s, '' order by i) dna,
    count(*) filter(where s = 'A') a,
    count(*) filter(where s = 'C') c,
    count(*) filter(where s = 'G') g,
    count(*) filter(where s = 'T') t
from t4
group by sample
order by sample;
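
For reference, here is a minimal self-contained sketch that combines the CTEs from the question with the conditional counts, so it can be pasted and run as a single statement. It makes a few assumptions that are not part of the original answer: the `FILTER` clause requires PostgreSQL 9.4 or later, `union all` replaces `union` in the lookup table, the positions come from a `cross join` on `generate_series` (with the hypothetical alias `g(i)`), and `floor(random() * 4)::int + 1` is swapped in for `round(random() * 4 + 0.5)` to pick a number from 1 to 4.

```sql
-- Sketch only: question's CTEs plus the FILTER-based counts.
with t1 as (
    -- lookup table: number 1..4 -> base letter
    select 1 as rn, 'A' as s
    union all select 2, 'C'
    union all select 3, 'T'
    union all select 4, 'G'
), t2 as (
    -- one row per sample
    select generate_series(1, 5) as sample
), t3 as (
    -- 20 positions per sample, each with a random number in 1..4
    -- (assumption: floor-based draw instead of the original round-based one)
    select t2.sample,
           g.i,
           floor(random() * 4)::int + 1 as rn
      from t2
     cross join generate_series(1, 20) as g(i)
), t4 as (
    -- translate each random number into its base
    select t3.sample, t3.i, t1.s
      from t3
      join t1 on t1.rn = t3.rn
)
select sample,
       string_agg(s, '' order by i)    as dna,
       count(*) filter (where s = 'A') as a,
       count(*) filter (where s = 'C') as c,
       count(*) filter (where s = 'G') as g,
       count(*) filter (where s = 'T') as t
  from t4
 group by sample
 order by sample;
```

Because the sequences are random, the output differs on every run, but in each row the four counts should add up to 20, the length of the sequence.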
