sql - Get the word-frequency rank per label in PostgreSQL
Question
I have a PostgreSQL database with the following table:
CREATE TABLE dummy (
created_at TIMESTAMPTZ,
tweet TEXT,
label INT);
It contains the following data:
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo squared', 1);
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo fox', 2);
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'foo bar', 3);
INSERT INTO dummy VALUES ('2020-12-18 00:00:00+00', 'raven bar', 2);
INSERT INTO dummy VALUES ('2020-12-17 00:00:00+00', 'bar standard', 1);
INSERT INTO dummy VALUES ('2020-12-17 00:00:00+00', 'bar none', 2);
INSERT INTO dummy VALUES ('2020-12-17 00:00:00+00', 'bar none', 3);
I have the following query, which counts how often each word occurs in the table and ranks the words by frequency within each day:
select *
from (
select date_trunc('day', created_at) as created_day, word, count(*) as cnt,
rank() over(partition by date_trunc('day', created_at) order by count(*) desc) rn
from dummy d
cross join lateral regexp_split_to_table(
trim(regexp_replace(tweet, '\y(rt|co|https|amp|none)\y', '', 'g')),
'\s+'
) w(word)
group by created_day, word
) t
where (created_day > CURRENT_DATE - INTERVAL '10 days')
and word IS NOT NULL
order by created_day DESC, rn;
Result:
created_day | word | cnt | rn
------------------------+----------+-----+----
2020-12-18 00:00:00+00 | foo | 3 | 1
2020-12-18 00:00:00+00 | bar | 2 | 2
2020-12-18 00:00:00+00 | squared | 1 | 3
2020-12-18 00:00:00+00 | raven | 1 | 3
2020-12-18 00:00:00+00 | fox | 1 | 3
2020-12-17 00:00:00+00 | bar | 3 | 1
2020-12-17 00:00:00+00 | standard | 1 | 2
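For reference, the word-splitting step of the query above can be tried on its own. This is a minimal sketch that feeds a literal string (taken from the sample data) through the same two regular-expression calls; it needs no table.

```sql
-- Remove the stop words, trim the leftover whitespace,
-- then split what remains into one row per word.
select word
from regexp_split_to_table(
    trim(regexp_replace('bar none', '\y(rt|co|https|amp|none)\y', '', 'g')),
    '\s+'
) as w(word);
```

Here `\y` matches a word boundary, so only the whole word `none` is removed and the statement returns the single row `bar`.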
I want to group by label as well, so that the new query shows the word ranking filtered by label, like this:
created_day | word | cnt | rn | label
------------------------+----------+-----+----+-------
2020-12-18 00:00:00+00 | foo | 3 | 1 | 1
2020-12-18 00:00:00+00 | bar | 2 | 2 | 1
2020-12-18 00:00:00+00 | squared | 1 | 3 | 1
2020-12-17 00:00:00+00 | standard | 3 | 1 | 1
2020-12-17 00:00:00+00 | raven | 2 | 2 | 1
2020-12-17 00:00:00+00 | fox | 1 | 3 | 1
I tried adding a condition to the WHERE clause to filter by prediction:
where (created_day > CURRENT_DATE - INTERVAL '10 days')
and word IS NOT NULL
and prediction = 1
order by created_day DESC, rn;
But I get the error: column "prediction" does not exist.
How can I fix this?
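Solution
I think you need an extra aggregation step to do what you want, because the same word can appear under several labels on the same day. So: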
select d.*,
       -- rank every (day, label, word) row by the word's total for that day
       rank() over (partition by created_day order by cnt desc) as rn
from (select d.created_at::date as created_day, d.label, w.word,
             -- add up the per-label counts into one total per day and word
             sum(count(*)) over (partition by d.created_at::date, w.word) as cnt
      from dummy d cross join lateral
           regexp_split_to_table(
               trim(regexp_replace(tweet, '\y(rt|co|https|amp|none)\y', '', 'g')),
               '\s+'
           ) w(word)
      where d.created_at >= current_date - interval '10 days'
      group by created_day, d.label, w.word
     ) d;
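If the ranking should instead restart within each label, one possible variant is sketched below. It is not part of the original answer; it assumes the `dummy` table from the question, counts words per day and label, and adds the label to the window partition. The filter value `1` is only an example.

```sql
-- Sketch: rank words by frequency within each (day, label) pair.
select t.*
from (
    select d.created_at::date as created_day,
           d.label,
           w.word,
           count(*) as cnt,
           -- restart the ranking for every day and label
           rank() over (partition by d.created_at::date, d.label
                        order by count(*) desc) as rn
    from dummy d
    cross join lateral regexp_split_to_table(
        trim(regexp_replace(d.tweet, '\y(rt|co|https|amp|none)\y', '', 'g')),
        '\s+'
    ) w(word)
    where d.created_at >= current_date - interval '10 days'
    group by d.created_at::date, d.label, w.word
) t
where t.label = 1   -- example filter; the column is named label, not prediction
order by t.created_day desc, t.rn;
```

The error in the question has two causes: the table has no column called `prediction`, and an outer query can only reference columns that the subquery exposes, so `label` has to be selected (and grouped) inside the subquery before it can be filtered on.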