PostgreSQL: sum(case when count > 0) over a partition by week and hour of year

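Problem description

Using PostgreSQL version 9.4.18.

Below is a query that returns unexpected results for nonzero_year_count and percent_years_count_not_zero:

Table data: the full table covers 1988-2018, but the sqlfiddle test database only contains the 2016-2018 rows shown in the table below. http://sqlfiddle.com/#!17/34289a/19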

CREATE TABLE ltg_data
("intensity" int, "time" timestamp with time zone, "lon" numeric, "lat" numeric);

INSERT INTO ltg_data ("intensity", "time", "lon", "lat")
VALUES
(200, '2018-06-23 07:19:00', -122.109, 42.9446),
(200, '2018-06-24 07:19:00', -122.109, 42.9446),
(200, '2018-06-25 07:19:00', -122.109, 42.9446),
(200, '2018-06-26 07:19:00', -122.109, 42.9446),
(200, '2018-06-26 07:19:00', -122.109, 42.9446),
(200, '2018-06-24 07:19:00', -122.109, 42.9446),
(200, '2018-06-25 07:19:00', -122.109, 42.9446),
(200, '2018-06-26 07:19:00', -122.109, 42.9446),
(200, '2018-06-26 07:19:00', -122.109, 42.9446),
(200, '2018-06-24 07:19:00', -122.109, 42.9446),
(200, '2018-06-25 07:19:00', -122.109, 42.9446),
(200, '2018-06-26 07:19:00', -122.109, 42.9446),
(200, '2018-06-26 07:19:00', -122.109, 42.9446),
(200, '2018-06-24 07:19:00', -122.109, 42.9446),
(200, '2018-06-25 07:19:00', -122.109, 42.9446),
(200, '2018-06-26 07:19:00', -122.109, 42.9446),
(200, '2018-06-25 17:19:00', -122.109, 42.9446),
(200, '2018-06-25 17:19:00', -122.109, 42.9446),
(200, '2017-06-25 19:19:00', -122.109, 42.9446),
(200, '2017-06-25 20:19:00', -122.109, 42.9446),
(200, '2017-06-26 07:19:00', -122.109, 42.9446),
(200, '2017-06-26 07:19:00', -122.109, 42.9446),
(200, '2017-06-24 07:19:00', -122.109, 42.9446),
(200, '2017-06-24 07:19:00', -122.109, 42.9446),
(200, '2017-06-23 21:19:00', -122.109, 42.9446),
(200, '2017-06-23 21:19:00', -122.109, 42.9446),
(200, '2017-06-24 07:19:00', -122.109, 42.9446),
(200, '2017-06-24 07:19:00', -122.109, 42.9446),
(200, '2017-06-26 07:19:00', -122.109, 42.9446),
(200, '2017-06-26 07:19:00', -122.109, 42.9446),
(200, '2016-06-26 07:19:00', -122.109, 42.9446),
(200, '2016-06-25 07:19:00', -122.109, 42.9446),
(200, '2016-06-25 07:19:00', -122.109, 42.9446),
(200, '2016-06-27 07:19:00', -122.109, 42.9446),
(200, '2016-06-26 07:19:00', -122.109, 42.9446),
(200, '2016-06-26 07:19:00', -122.109, 42.9446);

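The query below is supposed to return some basic statistics about the table data. The challenge, I think, is partitioning by week and hour of the year while still accounting for the year in some way. The incorrect data comes from the part of the query that tries to determine the number of years in which a given week and hour of year (woyhh) has a count > 0. Here are the query and the functions it uses (the woy function normalizes the week of year across leap years). I am using generate_series because I want every woyhh of the full year, even when a woyhh has no counts.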
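As a minimal sketch of what generate_series contributes here (the four-hour range is an arbitrary choice for illustration and is not part of the real query), the call below yields one row per hour whether or not ltg_data contains anything for that hour:

```sql
-- One row per hour in the range, independent of the contents of ltg_data.
-- The start and end values are illustrative assumptions.
SELECT d AS hour_start,
       extract(year FROM d) AS dt_year
FROM generate_series(timestamp '2018-06-25 05:00',
                     timestamp '2018-06-25 08:00',
                     interval '1 hour') AS d;
```

This should return four rows (05:00 through 08:00). Left-joining the real data onto such a series is what keeps the empty woyhh slots in the output.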

Functions:

create or replace function IsLeapYear(int)
returns boolean as $$
select $1 % 4 = 0 and ($1 % 100 <> 0 or $1 % 400 = 0)
$$ LANGUAGE sql IMMUTABLE STRICT; 

create or replace function f_woyhh(timestamp with time zone)
returns int language plpgsql as $$
declare
    currentYear int := extract(year from $1);
    -- Shift by one extra day after Feb 28 in leap years so weeks line up across years.
    LeapYearShift int := 1 + (IsLeapYear(currentYear) and $1 > make_date(currentYear, 2, 28))::int;
begin
    -- Week of year (1-based) followed by the two-digit hour, e.g. 2607.
    return concat((extract(doy from $1)::int - LeapYearShift) / 7 + 1, to_char($1, 'HH24'))::int;
end;
$$;
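A quick sanity check of f_woyhh, assuming both functions above have been created. The three timestamps are taken from the sample rows; the expected values in the comments were worked out by hand from the day-of-year arithmetic, so treat them as a sketch rather than captured output:

```sql
-- 2018-06-25 is day 176 of a non-leap year: (176 - 1) / 7 + 1 = 26
SELECT f_woyhh(timestamptz '2018-06-25 07:19:00');  -- expected: 2607

-- 2016-06-25 is day 177 of a leap year: (177 - 2) / 7 + 1 = 26
SELECT f_woyhh(timestamptz '2016-06-25 07:19:00');  -- expected: 2607

-- 2018-06-24 is day 175: (175 - 1) / 7 + 1 = 25, so it falls in week 25
SELECT f_woyhh(timestamptz '2018-06-24 07:19:00');  -- expected: 2507
```

Note that to_char and extract both work in the session time zone, so the hour part follows whatever time zone the session uses.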

Query:

WITH CTE_Dates AS (
    -- full range of possible dates, one row per hour
    SELECT
        f_woyhh(d) AS dt
        ,EXTRACT(YEAR FROM d::timestamp) AS dtYear
    FROM generate_series(timestamp '2016-01-01', timestamp '2018-12-31', interval '1 hour') AS d
)
,CTE_WeeklyHourlyCounts AS (
    SELECT
        f_woyhh(time) AS dt
        ,time
        ,count(*) AS ct
    FROM ltg_data
    GROUP BY ltg_data.time
)
,CTE_FullStats AS (
    SELECT
        CTE_Dates.dt AS woyhh
        ,COUNT(DISTINCT CTE_Dates.dtYear) AS years_count
        ,SUM(CASE WHEN CTE_WeeklyHourlyCounts.ct > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY CTE_Dates.dt) AS nonzero_year_count
        ,100.0 * SUM(CASE WHEN CTE_WeeklyHourlyCounts.ct > 0 THEN 1 ELSE 0 END) OVER (PARTITION BY CTE_Dates.dt)
            / COUNT(DISTINCT CTE_Dates.dtYear) AS percent_years_count_not_zero
    FROM CTE_Dates
    LEFT JOIN CTE_WeeklyHourlyCounts ON CTE_WeeklyHourlyCounts.dt = CTE_Dates.dt
    GROUP BY CTE_Dates.dt, CTE_WeeklyHourlyCounts.ct, CTE_WeeklyHourlyCounts.dt
)
SELECT
    woyhh
    ,nonzero_year_count
    ,years_count
    ,percent_years_count_not_zero
FROM CTE_FullStats
WHERE woyhh::text LIKE '26%'
GROUP BY woyhh, years_count, nonzero_year_count, percent_years_count_not_zero
ORDER BY woyhh

Unexpected result:

woyhh | nonzero_year_count | years_count | percent_years_count_not_zero
2605  | 0                  | 3          | 0
2606  | 0                  | 3          | 0
2607  | 5                  | 3          | 200
2608  | 0                  | 3          | 0
2609  | 0                  | 3          | 0

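The part of the result that is wrong for woyhh 2607 is nonzero_year_count: it should be 3, because there are only 3 years of data and each of those years has a count in week 26, hour 07 (any day after the 24th of the month falls in week 26). Likewise, percent_years_count_not_zero should be 100%, not 200%; 100% is the maximum expected value.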
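One way to see what feeds the inflated number is to list the rows that CTE_WeeklyHourlyCounts produces for this slot. The diagnostic below is a sketch, assuming the sample rows and the f_woyhh function above:

```sql
-- Each distinct timestamp in slot 2607 with its row count,
-- i.e. the (time, ct) pairs that CTE_WeeklyHourlyCounts passes to the join.
SELECT "time",
       count(*) AS ct
FROM ltg_data
WHERE f_woyhh("time") = 2607
GROUP BY "time"
ORDER BY "time";
```

With the sample rows listed above, this appears to return six timestamps spread over the three years, carrying five distinct ct values (1, 2, 3, 4 and 7). Because the outer GROUP BY includes ct, the window SUM adds one per distinct ct value instead of one per year, which may be where the 5 comes from.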

Desired result:

woyhh | nonzero_year_count | years_count | percent_years_count_not_zero
2605  | 0                  | 3          | 0
2606  | 0                  | 3          | 0
2607  | 3                  | 3          | 100
2608  | 0                  | 3          | 0
2609  | 0                  | 3          | 0

So I think the main problem is in this part of the query:

,SUM(CASE WHEN CTE_WeeklyHourlyCounts.ct > 0 THEN 1 ELSE 0 END) OVER  (PARTITION BY CTE_Dates.dt) AS nonzero_year_count

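Partitioning on woyhh alone is not enough, because I need to take the years into account. It is as if I need to group by year somehow, determine whether a woyhh occurred in a given year, and then count that year once rather than counting every occurrence within it. I have tried incorporating the year, but got even stranger results.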
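One possible approach, sketched here under the assumption that "a year counts once" is the goal: replace the window SUM with count(DISTINCT year) per woyhh, computed separately for the calendar and for the data, then join the two. The CTE names slots and hits are made up for this illustration; the table, function and date range are the ones from the question.

```sql
WITH slots AS (
    -- Every woyhh in the calendar and the number of years it appears in.
    SELECT f_woyhh(d) AS woyhh,
           count(DISTINCT extract(year FROM d)) AS years_count
    FROM generate_series(timestamp '2016-01-01',
                         timestamp '2018-12-31',
                         interval '1 hour') AS d
    GROUP BY 1
),
hits AS (
    -- For each woyhh, the number of distinct years with at least one row.
    SELECT f_woyhh("time") AS woyhh,
           count(DISTINCT extract(year FROM "time")) AS nonzero_year_count
    FROM ltg_data
    GROUP BY 1
)
SELECT s.woyhh,
       coalesce(h.nonzero_year_count, 0) AS nonzero_year_count,
       s.years_count,
       100.0 * coalesce(h.nonzero_year_count, 0) / s.years_count
           AS percent_years_count_not_zero
FROM slots AS s
LEFT JOIN hits AS h USING (woyhh)   -- keep slots that have no data
WHERE s.woyhh::text LIKE '26%'
ORDER BY s.woyhh;
```

Since a year can contribute at most once to count(DISTINCT ...), nonzero_year_count cannot exceed years_count, so the percentage stays at or below 100. For the sample rows, slot 2607 should come out as 3 of 3 years.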

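I hope this clarifies my question somewhat. I have included an updated sqlfiddle below that replicates the data and query I am using on the test table. Thanks for your help!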

http://sqlfiddle.com/#!17/34289a/19

Tags: sql, postgresql
