mysql - MySQL/Clickhouse 组合/RANK 选择的结果与一组标签并保持其余的唯一
问题描述
我在数天/数月内收集了许多推文,并正在监视单词簇(主题标签)。推文的主数据库每天收集应用程序 500 万条推文,并将主题标签提取到单独的表中。计算这些主题标签以显示在一段时间(天/月)内发展的热图。
收集数据库是 MYSQL,其中主推文表是应用程序 500 M 记录,主题标签表是应用程序 175 M 记录。然后复制到 Clickhouse 进行分析。
以冠状病毒为例,下面的列表显示可以更好地将几个主题标签组合在一起,以提高统计数据的可见性。
问题:
- 如何在一个定义的“标签”或别名中添加用于选择相似词/主题标签的过滤器?完毕
- 如何使用多个别名,所有别名都与每组过滤器/选择标准一起使用? 完毕
- 如何使用 RANK 或类似方法不使用累积计数列出,而是使用 RANK?
SELECT (match(hashtag, '[Cc]orona.*|COVID.*|[Cc]ovid.*') ? 'COVID19' : hashtag) as Hashtag,
SUM(CASE when datetime between now() - interval 1 day AND now() then 1 END) "Today",
SUM(CASE when datetime between now() - interval 2 day AND now() - interval 1 day then 1 END) "Today -1",
SUM(CASE when datetime between now() - interval 3 day AND now() - interval 2 day then 1 END) "Today -2",
SUM(CASE when datetime between now() - interval 4 day AND now() - interval 3 day then 1 END) "Today -3",
SUM(CASE when datetime between now() - interval 5 day AND now() - interval 4 day then 1 END) "Today -4",
SUM(CASE when datetime between now() - interval 6 day AND now() - interval 5 day then 1 END) "Today -5",
SUM(CASE when datetime between now() - interval 7 day AND now() - interval 6 day then 1 END) "Today -6",
SUM(CASE when datetime between now() - interval 8 day AND now() - interval 7 day then 1 END) "Today -7"
FROM twitterDBhashtags
group by Hashtag
order by "Today" DESC limit 20;
在twitterDBhashtags表上:
id BIGINT(20) PK
hashtag VARCHAR(75)
datetime DATETIME
产生这个结果:
Hashtag Today Today -1 Today -2 Today -3 Today -4 Today -5 Today -6 Today -7
------------------------------------------------------------------------------------------------------------------
COVID19 245 799 253 088 241 731 226 515 249 281 84 088 149 789 117 015
BhulaDungaFirstLook 36 379 34
StPatricksDay 12 622 410 251 233 307 72 194 176
BhulaDungaWithSid 12 595 47
QuarantineLife 10 742 2 339 59 1 1
UPDATE 9 432 534 1 063 340 884 215 336 242
BREAKING 7 038 11 737 10 434 6 985 10 726 4 345 6 748 5 091
SidNaaz 6 012 2 247 4 115 1 692 2 065 241 1 502 1 236
China 5 840 4 803 4 887 5 472 7 039 2 086 3 392 3 748
FamiliesFirst 4 578 420 902 6 480 5 952 1 326
iHeartAwards 4 540 5 274 6 846 5 412 6 747 2 500 6 559 4 767
HomeOfSoul_Satlok 4 341
TrumpVirus 4 094 750 752 1 381 1 935 624 590 1 176
100WAYS 4 055 106 125 22
TEAMWANG 4 014 101 107 78 34 21 160 127
ChineseVirus 3 919 1 3 4 69 32 15 2
ShipsGoingDown 3 755 71
在@vladimir 非常好的输入之后,使用
SELECT case when match(hashtag, '[Cc]orona.*|COVID.*|[Cc]ovid.*') then 'COVID19'
when match(hashtag, 'Bhula.*') then 'Bhula'
else hashtag END
as Hashtag,
SUM(CASE when datetime between now() - interval 1 day AND now() then 1 END) "Today",
SUM(CASE when datetime between now() - interval 2 day AND now() - interval 1 day then 1 END) "Today -1",
SUM(CASE when datetime between now() - interval 3 day AND now() - interval 2 day then 1 END) "Today -2",
SUM(CASE when datetime between now() - interval 4 day AND now() - interval 3 day then 1 END) "Today -3",
SUM(CASE when datetime between now() - interval 5 day AND now() - interval 4 day then 1 END) "Today -4",
SUM(CASE when datetime between now() - interval 6 day AND now() - interval 5 day then 1 END) "Today -5",
SUM(CASE when datetime between now() - interval 7 day AND now() - interval 6 day then 1 END) "Today -6",
SUM(CASE when datetime between now() - interval 8 day AND now() - interval 7 day then 1 END) "Today -7"
FROM twitterDBhashtags
group by Hashtag
order by "Today" DESC limit 10;
我明白了(请注意这是实时数据,因此上述结果的总和将不准确)
Hashtag Today Today -1 Today -2 Today -3 Today -4 Today -5 Today -6 Today -7
---------------------------------------------------------------------------------------------------------
COVID19 241825 260486 237838 236318 222989 129159 161506 122959
Bhula 35267 22372 856 1 13 4 1 12
StPatricksDay 14776 1147 254 239 271 130 198 167
QuarantineLife 10442 5140 169 1 2
AsiManshiDebut 8900
LuzonLockdown 6764 9
FamiliesFirst 6563 382 439 3285 8854 1307 927
Italy 6516 2617 4590 4493 2710 1725 3287 8885
BREAKING 6391 9878 10726 8603 9830 4305 8464 4992
China 5469 5745 4417 5279 5753 4290 3556 3408
现在,如何将其作为 RANK,而不是计数,在 RANK 上进行排序。
任何关于如何进步的想法都将不胜感激。
解决方案
我会在 WHERE 子句中定义日期期间,而不是在 SELECT 中列出它们:
SELECT toStartOfDay(datetime) day, match(hashtag, '[Cc]orona.*|COVID.*') ? 'COVID19' : hashtag as hashtag, count() tweets_count
FROM (
/* test data */
SELECT toDateTime(data.1) datetime, data.2 hashtag
FROM (
SELECT arrayJoin([
('2020-03-01 10:10:10', 'coronavirus'),
('2020-03-01 12:12:12', 'COVID'),
('2020-03-05 10:10:10', 'StPatricksDay'),
('2020-03-15 01:01:01', 'Coronavirus')]) data)
)
WHERE datetime >= '2020-03-01 00:00:00' AND datetime < '2020-04-01 00:00:00'
GROUP BY day, hashtag;
/* result
┌─────────────────day─┬─hashtag───────┬─tweets_count─┐
│ 2020-03-01 00:00:00 │ COVID19 │ 2 │
│ 2020-03-15 00:00:00 │ COVID19 │ 1 │
│ 2020-03-05 00:00:00 │ StPatricksDay │ 1 │
└─────────────────────┴───────────────┴──────────────┘
*/
SELECT (match(hashtag, '[Cc]orona.*|COVID.*') ? 'COVID19' : hashtag) as hashtag, sum(day1) day1, sum(day2) day2, sum(day3) day3, sum(day4) day4, sum(day5) day5, sum(day6) day6, sum(day7) day7
FROM (
/* test data */
SELECT data.1 AS hashtag, data.2 AS day1, data.3 AS day2, data.4 AS day3, data.5 AS day4, data.6 AS day5, data.7 AS day6, data.8 AS day7
FROM
(
SELECT arrayJoin([
('coronavirus', 67299, 60633, 53780, 55375, 59866, 27150, 47824),
('COVID', 62502, 50998, 50365, 51554, 50062, 23140, 40908),
('BhulaDungaFirstLook', 35524, 34, 0, 0, 0, 0, 0),
('Coronavirus', 14076, 15297, 12321, 16496, 16263, 7028, 9975),
('CoronavirusOutbreak', 13020, 9410, 2597, 1044, 1853, 950, 2436),
('BhulaDungaWithSid', 12190, 47, 0, 0, 0, 0, 0),
('StPatricksDay', 10426, 374, 244, 233, 282, 79, 213),
('QuarantineLife', 10110, 1477, 56, 0, 1, 0, 0),
('COVID2019', 9892, 2085, 1417, 2009, 2929, 1568, 4918)]) AS data
))
GROUP BY hashtag;
/* result
┌─hashtag─────────────┬───day1─┬───day2─┬───day3─┬───day4─┬───day5─┬──day6─┬───day7─┐
│ COVID19 │ 166789 │ 138423 │ 120480 │ 126478 │ 130973 │ 59836 │ 106061 │
│ StPatricksDay │ 10426 │ 374 │ 244 │ 233 │ 282 │ 79 │ 213 │
│ QuarantineLife │ 10110 │ 1477 │ 56 │ 0 │ 1 │ 0 │ 0 │
│ BhulaDungaFirstLook │ 35524 │ 34 │ 0 │ 0 │ 0 │ 0 │ 0 │
│ BhulaDungaWithSid │ 12190 │ 47 │ 0 │ 0 │ 0 │ 0 │ 0 │
└─────────────────────┴────────┴────────┴────────┴────────┴────────┴───────┴────────┘
*/
推荐阅读
- powershell - PowerShell 从文本文件中提取值并导出到 csv
- wordpress - 每 10 分钟更新一次用户元数据
- makefile - 为生成的目标设置不同的变量值
- java - 在数组中查找 3 个连续重复的字符?
- php - 如何在 localhost(xampp) 上运行 laravel 和 reactjs 应用程序
- sql - 具有所有字段名称和名称不存在的空值的 UNION
- python-3.x - 使用 PyPDF2 从 PDF 文件中提取文本
- c# - C# 从 RegistryKey 获取 BaseKey
- python - For循环在Python中查找匹配值和总计
- reactjs - React Native 函数 bind() [函数绑定]