首页 > 解决方案 > 加快 SQLite 中的 COUNT DISTINCT 查询

问题描述

更新:添加背景信息和更多解释

给定下表中的约 30 万条记录

CREATE TABLE t (
    id INTEGER PRIMARY KEY, 
    a TEXT, 
    b TEXT, 
    c TEXT, 
    deleted INTEGER DEFAULT 0
);

CREATE INDEX ix_t ON t (deleted) WHERE deleted = 0;

1) SELECT Count(id) FROM t WHERE deleted = 0;
2) SELECT Count(DISTINCT a) FROM t WHERE deleted = 0;
3) SELECT Count(DISTINCT b) FROM t WHERE deleted = 0;
4) SELECT Count(DISTINCT c) FROM t WHERE deleted = 0;

查询 1 需要 4-5 毫秒。其他三个查询COUNT (DISTINCT <col>)需要 600-900 毫秒。如何将这些查询加快到与第一个查询相同的顺序?我创建了以下索引,但没有帮助

CREATE INDEX ix_t_a ON t (a, deleted) WHERE deleted = 0;
-- (and so on the columns b and c as well)

EXPLAIN QUERY PLAN SELECT Count(DISTINCT a) FROM t WHERE deleted = 0;

-- output: SEARCH TABLE t USING INDEX ix_t (deleted=?)

正如我们在上面看到的,新的索引没有被使用。

deleted列是一个标志,用于跟踪要从所有查询中排除的行。在大多数情况下,这样的行将很少,但重要的是它们不会在计数和选择中使用。

最终行数可能会增长到原来的 3 倍,比如说 1M。即使那样,deleted行数也将是最少的。

计数是COUNT (DISTINCT column)创建方面。假设有人搜索所有行WHERE a = 'foo'。我需要返回匹配的行,我还需要返回这些行中有多少DISTINCT bc存在。所以,我会做类似的事情

-- number of rows in the result set
SELECT COUNT(id) FROM t WHERE deleted = 0 AND a = 'foo';

-- the actual result set
SELECT * FROM t WHERE deleted = 0 AND a = 'foo' LIMIT 30 OFFSET 0;

-- facet counts
SELECT COUNT (DISTINCT b) FROM t WHERE deleted = 0 AND a = 'foo';
SELECT COUNT (DISTINCT c) FROM t WHERE deleted = 0 AND a = 'foo';

在第一个实例中,由于没有WHERE子句,上述查询将是

-- number of rows in the result set
SELECT COUNT(id) FROM t WHERE deleted = 0;

-- the actual result set
SELECT * FROM t WHERE deleted = 0 LIMIT 30 OFFSET 0;

-- facet counts
SELECT COUNT (DISTINCT a) FROM t WHERE deleted = 0;
SELECT COUNT (DISTINCT b) FROM t WHERE deleted = 0;
SELECT COUNT (DISTINCT c) FROM t WHERE deleted = 0;

表中构成方面的实际列大约是 8 或 9。所以,我必须做大约 8 或 9 SELECT COUNT (DISTINCT col)。每个大约需要 600-900 毫秒,也就是将近 6-10 秒。从用户的角度来看,查询太慢了。将选择减少一半或 ⅔ 会产生巨大的影响。

实际上,我也有一个查询缓存,所以任何一次查询的结果都会被缓存,第二次会很快,只要是完全相同的查询。在这里,通过查询,我的意思是来自用户的 pov 的查询。当然,来自用户 pov 的每个查询都会导致 9-10 个数据库查询。然而,速度对于创建高性能应用程序非常重要。

标签: sqlite

解决方案


如何将这些查询加快到与第一个查询相同的顺序?

第1部分

我不确定您是否可以做到这一点,如果可以,这很可能取决于数据本身,例如,如果分析做了一些惊人的事情。

DISTINCT 有相当大的影响。所以这可能是造成差异的唯一原因

如果您只对没有 DISTINCT 的第一个查询使用 EXPLAIN,则结果是:-

在此处输入图像描述

对于其他查询,EXPLAIN 结果是:-

在此处输入图像描述

如何将这些查询加快到与第一个查询相同的顺序?

第2部分

也许考虑修改以下曾经有一点作用的代码,并得出结论 ANALYZE 可能不会改善问题,也没有(一点点)玩弄索引似乎有很大的不同:-

DROP TABLE IF EXISTS t;
CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT, b TEXT, c TEXT, deleted INTEGER DEFAULT 0);

WITH 
    alphabet(letters) AS (SELECT 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
    cte1(counter,deleted) AS (SELECT 1, abs(random()) % 2  UNION ALL SELECT counter+1, abs(random()) % 2 FROM cte1 LIMIT 300000)
INSERT INTO t (deleted,a,b,c) 
    SELECT deleted,
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1)
    FROM cte1;

SELECT * FROM t; /* COMMENT/UNCOMMENT WITH/WITHOUT -- To HIDE/SHOW TABLE */ 


CREATE INDEX ix_t ON t (deleted) WHERE deleted = 0;

-- EXPLAIN  /* UNCOMMENT BY REMOVING -- ON THIS LINE TO JUST DO EXPLAIN */
-- QUERY PLAN /* TO DO EXPLAIN QUERY PLAN UNCOMMNET LINE ABOVE AS ABOVE and UNCOMMENT THIS SAME WAY */
SELECT Count(id) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM t WHERE deleted = 0;
ANALYZE;
SELECT Count(id) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM t WHERE deleted = 0;

/*<<<<<<<<<< ATTEMPT 2 >>>>>>>>>>*/
DROP TABLE IF EXISTS t;
CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT, b TEXT, c TEXT, deleted INTEGER DEFAULT 0);

WITH 
    alphabet(letters) AS (SELECT 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
    cte1(counter,deleted) AS (SELECT 1, abs(random()) % 2  UNION ALL SELECT counter+1, abs(random()) % 2 FROM cte1 LIMIT 300000)
INSERT INTO t (deleted,a,b,c) 
    SELECT deleted,
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1)
    FROM cte1;

-- SELECT * FROM t;


CREATE INDEX ix_t ON t (deleted,a,b,c) WHERE deleted = 0;

-- EXPLAIN 
-- QUERY PLAN
SELECT Count(id) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM t WHERE deleted = 0;
ANALYZE;
SELECT Count(id) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM t WHERE deleted = 0;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM t WHERE deleted = 0;

这在运行时(假设使用的工具(在 Navicat 中可以)允许一系列查询)。将要 :-

  1. 删除并创建表 t
  2. 用随机数据填充 t(rowid 除外)
  3. 创建索引(参见代码,因为第二部分不同)
  4. 运行 4 个查询(请参阅打开或关闭解释或解释查询计划的注释)
  5. 进行分析。
  6. 在分析后运行相同的 4 个查询。
  7. 重复 1-6(使用不同的索引)。

示例日志:-

DROP TABLE IF EXISTS t
> OK
> Time: 1.229s


CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT, b TEXT, c TEXT, deleted INTEGER DEFAULT 0)
> OK
> Time: 0.132s


WITH 
    alphabet(letters) AS (SELECT 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
    cte1(counter,deleted) AS (SELECT 1, abs(random()) % 2  UNION ALL SELECT counter+1, abs(random()) % 2 FROM cte1 LIMIT 300000)
INSERT INTO t (deleted,a,b,c) 
    SELECT deleted,
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1)
    FROM cte1
> Affected rows: 300000
> Time: 0.75s


-- SELECT * FROM t;


CREATE INDEX ix_t ON t (deleted) WHERE deleted = 0
> OK
> Time: 0.195s


EXPLAIN 
-- QUERY PLAN
SELECT Count(id) FROM t WHERE deleted = 0
> OK
> Time: 0s


EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM t WHERE deleted = 0
> OK
> Time: 0s


EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM t WHERE deleted = 0
> OK
> Time: 0s


EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM t WHERE deleted = 0
> OK
> Time: 0s


ANALYZE
> OK
> Time: 0.137s


SELECT Count(id) FROM t WHERE deleted = 0
> OK
> Time: 0.031s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM t WHERE deleted = 0
> OK
> Time: 0.057s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM t WHERE deleted = 0
> OK
> Time: 0.054s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM t WHERE deleted = 0
> OK
> Time: 0.055s


/*<<<<<<<<<< ATTEMPT 2 >>>>>>>>>>*/
DROP TABLE IF EXISTS t
> OK
> Time: 0.891s


CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT, b TEXT, c TEXT, deleted INTEGER DEFAULT 0)
> OK
> Time: 0.153s


WITH 
    alphabet(letters) AS (SELECT 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
    cte1(counter,deleted) AS (SELECT 1, abs(random()) % 2  UNION ALL SELECT counter+1, abs(random()) % 2 FROM cte1 LIMIT 300000)
INSERT INTO t (deleted,a,b,c) 
    SELECT deleted,
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1)
    FROM cte1
> Affected rows: 300000
> Time: 0.643s


-- SELECT * FROM t;


CREATE INDEX ix_t ON t (deleted,a,b,c) WHERE deleted = 0
> OK
> Time: 0.583s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(id) FROM t WHERE deleted = 0
> OK
> Time: 0.029s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM t WHERE deleted = 0
> OK
> Time: 0.041s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM t WHERE deleted = 0
> OK
> Time: 0.029s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM t WHERE deleted = 0
> OK
> Time: 0.031s


ANALYZE
> OK
> Time: 0.121s


SELECT Count(id) FROM t WHERE deleted = 0
> OK
> Time: 0.038s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM t WHERE deleted = 0
> OK
> Time: 0.046s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM t WHERE deleted = 0
> OK
> Time: 0.029s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM t WHERE deleted = 0
> OK
> Time: 0.031s

第 3 部分

我没有做太多的时间安排,但也许考虑一下(使用仅删除 = 0 行的诱惑表):-

DROP TABLE IF EXISTS t;
CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT, b TEXT, c TEXT, deleted INTEGER DEFAULT 0);

WITH 
    alphabet(letters) AS (SELECT 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
    cte1(counter,deleted) AS (SELECT 1, abs(random()) % 2  UNION ALL SELECT counter+1, abs(random()) % 2 FROM cte1 LIMIT 300000)
INSERT INTO t (deleted,a,b,c) 
    SELECT deleted,
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1)
    FROM cte1;

-- SELECT * FROM t; /* COMMENT/UNCOMMENT WITH/WITHOUT -- To HIDE/SHOW TABLE */ 

CREATE TEMP TABLE trimmedt AS SELECT * FROM t WHERE deleted = 0;


-- EXPLAIN  /* UNCOMMENT BY REMOVING -- ON THIS LINE TO JUST DO EXPLAIN */
-- QUERY PLAN /* TO DO EXPLAIN QUERY PLAN UNCOMMNET LINE ABOVE AS ABOVE and UNCOMMENT THIS SAME WAY */
SELECT Count(id) FROM trimmedt;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM trimmedt;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM trimmedt;
-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM trimmedt;

消息日志:-

DROP TABLE IF EXISTS t
> OK
> Time: 1.242s


CREATE TABLE t (id INTEGER PRIMARY KEY, a TEXT, b TEXT, c TEXT, deleted INTEGER DEFAULT 0)
> OK
> Time: 0.103s


WITH 
    alphabet(letters) AS (SELECT 'ABCDEFGHIJKLMNOPQRSTUVWXYZ'),
    cte1(counter,deleted) AS (SELECT 1, abs(random()) % 2  UNION ALL SELECT counter+1, abs(random()) % 2 FROM cte1 LIMIT 300000)
INSERT INTO t (deleted,a,b,c) 
    SELECT deleted,
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1),
        substr((SELECT letters FROM alphabet),(abs(random()) % ((SELECT max(length(letters)) FROM alphabet)-10)) + 1,(abs(random()) % 10) + 1)
    FROM cte1
> Affected rows: 300000
> Time: 0.722s


-- SELECT * FROM t; /* COMMENT/UNCOMMENT WITH/WITHOUT -- To HIDE/SHOW TABLE */ 

CREATE TEMP TABLE trimmedt AS SELECT * FROM t WHERE deleted = 0
> OK
> Time: 0.091s


-- EXPLAIN  /* UNCOMMENT BY REMOVING -- ON THIS LINE TO JUST DO EXPLAIN */
-- QUERY PLAN /* TO DO EXPLAIN QUERY PLAN UNCOMMNET LINE ABOVE AS ABOVE and UNCOMMENT THIS SAME WAY */
SELECT Count(id) FROM trimmedt
> OK
> Time: 0.009s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT a) FROM trimmedt
> OK
> Time: 0.03s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT b) FROM trimmedt
> OK
> Time: 0.03s


-- EXPLAIN 
-- QUERY PLAN
SELECT Count(DISTINCT c) FROM trimmedt
> OK
> Time: 0.031s

推荐阅读