Get a per-label word-frequency ranking in PostgreSQL

Problem description

I have a PostgreSQL database with the following table:

CREATE TABLE dummy (
created_at TIMESTAMPTZ,
tweet TEXT,
label INT);

It is populated with the following data:

INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo squared', 1);
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo fox', 2);
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo bar', 3);
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'raven bar', 2);
INSERT INTO dummy VALUES ('2020-12-17 00:00:00+00', 'bar standard', 1);
INSERT INTO dummy VALUES ('2020-12-17 00:00:00+00', 'bar none', 2);
INSERT INTO dummy VALUES ('2020-12-17 00:00:00+00', 'bar none', 3);

I have the following query, which counts how often each word occurs in the table and ranks the words by frequency, grouped by day:

select *
from (
    select date_trunc('day', created_at) as created_day, word, count(*) as cnt,
        rank() over(partition by date_trunc('day', created_at) order by count(*) desc) rn
    from dummy d
    cross join lateral regexp_split_to_table(
        trim(regexp_replace(tweet, '\y(rt|co|https|amp|none)\y', '', 'g')),
        '\s+'
    ) w(word)
    group by created_day, word
) t
where (created_day > CURRENT_DATE - INTERVAL '10 days') 
    and word IS NOT NULL
order by created_day DESC, rn;

Result:

      created_day       |   word   | cnt | rn
------------------------+----------+-----+----
 2020-12-18 00:00:00+00 | foo      |   3 |  1
 2020-12-18 00:00:00+00 | bar      |   2 |  2
 2020-12-18 00:00:00+00 | squared  |   1 |  4
 2020-12-18 00:00:00+00 | raven    |   1 |  4
 2020-12-18 00:00:00+00 | fox      |   1 |  4
 2020-12-17 00:00:00+00 | bar      |   3 |  1
 2020-12-17 00:00:00+00 | standard |   1 |  2

I would like to group by label as well, so that the new query shows the word ranking filtered by label, like this:

      created_day       |   word   | cnt | rn | label
------------------------+----------+-----+----+-------
 2020-12-18 00:00:00+00 | foo      |   3 |  1 |     1
 2020-12-18 00:00:00+00 | bar      |   2 |  2 |     1
 2020-12-18 00:00:00+00 | squared  |   1 |  3 |     1
 2020-12-17 00:00:00+00 | standard |   3 |  1 |     1
 2020-12-17 00:00:00+00 | raven    |   2 |  2 |     1
 2020-12-17 00:00:00+00 | fox      |   1 |  3 |     1

I tried adding a condition to the WHERE clause to filter by prediction:

where (created_day > CURRENT_DATE - INTERVAL '10 days')
    and word IS NOT NULL
    and prediction = 1
order by created_day DESC, rn;

But I get the error: column "prediction" does not exist

How can I fix this?

Tags: sql, postgresql

Solution


I think you need an aggregation to do what you want, because the same label occurs more than once on the same day. So:

select d.*,
       -- rank the words within each day by their total count
       rank() over (partition by created_day order by cnt desc) as rn
from (select d.created_at::date as created_day, d.label, w.word,
             -- count(*) is per (day, label, word); the window sum totals it per (day, word)
             sum(count(*)) over (partition by d.created_at::date, w.word) as cnt
      from dummy d cross join lateral
           regexp_split_to_table(trim(regexp_replace(tweet, '\y(rt|co|https|amp|none)\y', '', 'g')),
                                 '\s+'
                                ) w(word)
      where d.created_at >= current_date - interval '10 days'
      group by created_day, d.label, w.word
     ) d;
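
The error in the question has two likely causes: the table has no column named prediction (the column is called label), and the original subquery does not select label at all, so the outer WHERE cannot see it. If the goal is a separate ranking for each label, one possible variant is sketched below. It is not part of the answer above. It assumes that the counts should be taken per day and per label, and that the ranking should restart for every (day, label) pair. The filter on empty strings is an added assumption, covering tweets that consist only of removed words.

-- Sketch: count and rank words within each (day, label) pair, then filter on label.
select *
from (
    select d.created_at::date as created_day,
           d.label,
           w.word,
           count(*) as cnt,
           rank() over (partition by d.created_at::date, d.label
                        order by count(*) desc) as rn
    from dummy d
    cross join lateral regexp_split_to_table(
        trim(regexp_replace(d.tweet, '\y(rt|co|https|amp|none)\y', '', 'g')),
        '\s+'
    ) w(word)
    where d.created_at >= current_date - interval '10 days'
      and w.word <> ''   -- assumption: drop empty tokens left after removing stop words
    group by d.created_at::date, d.label, w.word
) t
where label = 1          -- the column is named label, not prediction
order by created_day desc, rn;

Because label is part of both the GROUP BY and the window partition, the label filter can stay in the outer query without changing the ranks. Note that the sample rows are dated December 2020, so the 10-day filter has to be widened or removed to see any output when testing against them.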
