MySQL query with GROUP BY is very slow

Question

I have a database with the following schema:

CREATE TABLE IF NOT EXISTS `sessions` (
  `starttime` datetime NOT NULL,
  `ip` varchar(15) NOT NULL default '',
  `country_name` varchar(45) default '',
  `country_iso_code` varchar(2) default '',
  `org` varchar(128) default '',
  KEY (`ip`),
  KEY (`starttime`),
  KEY (`country_name`)
);

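(The actual table contains more columns; I have included only the ones I query.) The engine is InnoDB.

As you can see, there are 3 indexes: on ip, starttime, and country_name.

The table is quite large; it contains about 1.5 million rows. I am running various queries on it, trying to extract one month of information (August 2018 in the examples below).

A query like this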

SELECT
  UNIX_TIMESTAMP(starttime) as time_sec,
  country_iso_code AS metric,
  COUNT(country_iso_code) AS value
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY metric;

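is rather slow but bearable (a few tens of seconds), even though there is no index on country_iso_code.

(Ignore the first item in the SELECT; I know it seems pointless, but it is required by the tool that consumes the query results. Likewise, ignore the use of FROM_UNIXTIME() instead of a date string; that part of the query is generated automatically and I have no control over it.)

However, a query like this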

SELECT
  country_name AS Country,
  COUNT(country_name) AS Attacks
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY Country;

is unbearably slow: I let it run for about half an hour and then gave up without getting any result.

Result from EXPLAIN:

+----+-------------+----------+------------+-------+------------------------------------+--------------+---------+------+----------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys                      | key          | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+----------+------------+-------+------------------------------------+--------------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | sessions | NULL       | index | starttime,starttime_2,country_name | country_name | 138     | NULL | 14771687 |    35.81 | Using where |
+----+-------------+----------+------------+-------+------------------------------------+--------------+---------+------+----------+----------+-------------+

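What exactly is the problem? Should I index something else? Perhaps a composite index on (starttime, country_name)? I have read this guide, but perhaps I have misunderstood it?

Here are some other queries that are similarly slow and probably suffer from the same problem:

Query #2: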

SELECT
  ip AS IP,
  COUNT(ip) AS Attacks
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY ip;

Result from EXPLAIN:

+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys            | key  | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | sessions | NULL       | index | starttime,ip,starttime_2 | ip   | 47      | NULL | 14771780 |    35.81 | Using where |
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+

Query #3:

SELECT
  org AS Organization,
  COUNT(org) AS Attacks
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY Organization;

Result from EXPLAIN:

+----+-------------+----------+------------+-------+---------------------------+------+---------+------+----------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys             | key  | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+----------+------------+-------+---------------------------+------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | sessions | NULL       | index | starttime,starttime_2,org | org  | 387     | NULL | 14771800 |    35.81 | Using where |
+----+-------------+----------+------------+-------+---------------------------+------+---------+------+----------+----------+-------------+

Query #4:

SELECT
  ip AS IP,
  country_name AS Country,
  city_name AS City,
  org AS Organization,
  COUNT(ip) AS Attacks
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY ip;

Result from EXPLAIN:

+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
| id | select_type | table    | partitions | type  | possible_keys            | key  | key_len | ref  | rows     | filtered | Extra       |
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+
|  1 | SIMPLE      | sessions | NULL       | index | starttime,ip,starttime_2 | ip   | 47      | NULL | 14771914 |    35.81 | Using where |
+----+-------------+----------+------------+-------+--------------------------+------+---------+------+----------+----------+-------------+

Tags: mysql, aggregate-functions, query-performance

Answer


In general, queries of the form

  SELECT column, COUNT(column)
    FROM tbl
   WHERE datestamp >= a AND datestamp <= b
   GROUP BY column

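perform best when the table has a composite index on (datestamp, column). Why? They can be satisfied by an index scan instead of reading every row of the table.

In other words, the first relevant row for the query can be located by random access into the index (to the first qualifying value of datestamp). MySQL can then read the index sequentially, counting the various values of column as it goes, until it reaches the last relevant row. There is no need to read the actual table; the query can be satisfied from the index alone. That makes it faster.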

ALTER TABLE tbl ADD INDEX date_col (datestamp, column);

creates the index for you.
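Applied to the sessions table from the question, the pattern above might look like the following sketch. The index name idx_starttime_country is made up for illustration; the table, columns, and query are taken from the question. Whether the optimizer actually chooses the new index depends on the data and the server version, so it is worth re-running EXPLAIN afterwards rather than assuming.

```sql
-- Composite index: range column first, grouped/counted column second.
-- The index name is a hypothetical choice, not something from the question.
ALTER TABLE sessions
  ADD INDEX idx_starttime_country (starttime, country_name);

-- Re-check the plan for the slow query from the question.
EXPLAIN
SELECT
  country_name AS Country,
  COUNT(country_name) AS Attacks
FROM
  sessions
WHERE
  starttime >= FROM_UNIXTIME(1533070800) AND
  starttime <= FROM_UNIXTIME(1535749199)
GROUP BY Country;
```

If the new index is used, one would expect the key column to show idx_starttime_country, the type column to show range instead of index, and the Extra column to include "Using index", which indicates that the query is answered from the index without reading the table rows.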

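Beware of two things. One: single-column indexes do not necessarily help aggregate query performance.

Two: it is hard to guess the right index for an index scan without seeing the entire query. Simplified queries often lead to oversimplified indexes.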
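Query #2 and Query #4 from the question illustrate the second point: both filter on starttime and group by ip, but Query #4 selects additional columns. The sketch below assumes hypothetical index names, and assumes that city_name (used in Query #4 but not shown in the schema) is an ordinary indexable column.

```sql
-- Enough to cover Query #2, which reads only starttime and ip.
ALTER TABLE sessions
  ADD INDEX idx_starttime_ip (starttime, ip);

-- Query #4 also reads country_name, city_name and org, so an index-only
-- scan would need those columns in the index as well.
ALTER TABLE sessions
  ADD INDEX idx_starttime_ip_details
    (starttime, ip, country_name, city_name, org);
```

An index this wide costs extra storage and slows down writes, so whether it is worthwhile depends on how often the query runs and how the table is updated.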

