Cassandra table design for a query with ORDER BY, LIMIT, and an IN predicate

Question

My data looks like this:

select * from test;

 department | employee | batch_number | hash
------------+----------+--------------+-------
 dep1       | Bart     |            1 | hash1
 dep1       | Bart     |            1 | hash2
 dep1       | Lisa     |            3 | hash3
 dep1       | Lisa     |            4 | hash4
 dep1       | John     |            5 | hash5
 dep1       | Lucy     |            6 | hash6
 dep1       | Bart     |            7 | hash7
 dep1       | Bart     |            7 | hash8

I want to query the data with a where clause on batch_number, an ordering on batch_number, and an IN predicate on employee.

In a relational database, this would look like:

select * from test 
  where department='dep1' 
  and employee in ('Bart','Lucy','John') 
  and batch_number >= 2 
  order by batch_number desc 
  limit 3;

 department | employee | batch_number | hash
------------+----------+--------------+-------
 dep1       | Bart     |            7 | hash7
 dep1       | Bart     |            7 | hash8
 dep1       | Lucy     |            6 | hash6

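I'm having trouble modeling a table for this query in Cassandra. department will be my partition key, and hash needs to be part of the primary key. But I'm struggling with the clustering keys and/or (SSTable-attached) secondary indexes.

Because I want to order by batch_number, I tried including it as a clustering key: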

CREATE TABLE keyspace.test(
    department      TEXT,
    batch_number    INT,
    hash            TEXT,
    employee        TEXT,
    PRIMARY KEY ((department), batch_number, hash)
) WITH CLUSTERING ORDER BY (batch_number DESC);
CREATE INDEX tst_emp ON keyspace.test (employee);

But this does not allow querying with an IN predicate on the indexed column:

select * from keyspace.test where department='dep1' and employee in ('Bart','Lucy','John');
InvalidRequest: Error from server: code=2200 [Invalid query] message="IN predicates on non-primary-key columns (employee) is not yet supported"

So I also tried adding the employee column as a clustering key:

CREATE TABLE keyspace.test(
    department      TEXT,
    batch_number    INT,
    hash            TEXT,
    employee        TEXT,
    PRIMARY KEY ((department), batch_number, hash, employee)
) WITH CLUSTERING ORDER BY (batch_number DESC);

But this fails, because employee cannot be restricted once batch_number, which precedes it, carries a non-EQ relation:

select * from keyspace.test where department='dep1' and batch_number > 1 and employee in ('Bart','Lucy','John');
InvalidRequest: Error from server: code=2200 [Invalid query] message="Clustering column "employee" cannot be restricted (preceding column "batch_number" is restricted by a non-EQ relation)"

But whenever I put employee before batch_number, I lose the ability to order by batch_number:

CREATE TABLE keyspace.test(
    department      TEXT,
    employee        TEXT,
    batch_number    INT,
    hash            TEXT,
    PRIMARY KEY ((department), employee, batch_number, hash)
);

select * from keyspace.test where department='dep1' and employee in ('Bart','Lucy','John') ORDER BY batch_number DESC;
InvalidRequest: Error from server: code=2200 [Invalid query] message="Order by currently only support the ordering of columns following their declared order in the PRIMARY KEY"

So what table design would allow such a query? Can this be done in Cassandra at all?

Edit:

Other queries I would like to be able to run on this table are:

select * from keyspace.test where department='X' and batch_number=Y 

delete from keyspace.test where department='X'

Tags: sql, database-design, cassandra, cql

Answer

Use a materialized view, which lets you rearrange the data:

CREATE MATERIALIZED VIEW mv_test AS 
SELECT
   department,
   batch_number,
   employee,
   hash 
FROM
   test 
WHERE
   department IS NOT NULL 
   AND batch_number IS NOT NULL 
   AND employee IS NOT NULL 
   AND hash IS NOT NULL 
PRIMARY KEY (department, employee, batch_number, hash) 
WITH CLUSTERING ORDER BY (employee ASC, batch_number DESC, hash ASC);

I can then run the following query:

SELECT * FROM mv_test 
WHERE
   department = 'dep1' 
   AND employee IN 
   (
      'Bart',
      'Lisa'
   )
   AND batch_number > 3;

The results come back sorted in clustering order:

 department | employee | batch_number | hash
------------+----------+--------------+-------
       dep1 |     Bart |            7 | hash7
       dep1 |     Bart |            7 | hash8
       dep1 |     Lisa |            4 | hash4

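While the > clause is a non-equality restriction, IN, although it takes multiple values, is still deterministic, which is why I believe you can filter on the key without any problem. Since batch_number is the last column you filter on, any type of where clause is allowed on it. I am assuming you always have department.

Note that materialized views come with a performance cost, more specifically on writes. However, that cost is generally preferable to relying on ALLOW FILTERING.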

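Update:

The order specified at the end of the materialized view applies to batch_number. However, rows are sorted first by department, then by employee, and only then by batch_number, so an overall ordering by batch_number across employees is not guaranteed. As far as I know, there is no way around this within the view. A different database solution may be preferable.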
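If the ordering has to be restored on the client, one possible workaround is to merge the per-employee result sets yourself. The following is only a sketch in Python under stated assumptions: the rows are hard-coded stand-ins for what the view would return for each employee (already sorted by batch_number descending within one employee, and already restricted to batch_number >= 2), and the names `rows_by_employee` and `top_n_by_batch` are hypothetical, not part of any driver API.

```python
import heapq
from itertools import islice

# Hypothetical stand-in for view results, grouped by employee.
# Each list is already sorted by batch_number (index 2) descending,
# which is what the view's clustering order gives within one employee.
rows_by_employee = {
    "Bart": [("dep1", "Bart", 7, "hash7"), ("dep1", "Bart", 7, "hash8")],
    "Lucy": [("dep1", "Lucy", 6, "hash6")],
    "John": [("dep1", "John", 5, "hash5")],
}


def top_n_by_batch(per_employee_rows, n):
    """Merge descending-sorted row lists and keep the first n rows overall."""
    merged = heapq.merge(
        *per_employee_rows,
        key=lambda row: row[2],  # compare on batch_number only
        reverse=True,            # inputs are sorted largest to smallest
    )
    return list(islice(merged, n))


result = top_n_by_batch(rows_by_employee.values(), 3)
for row in result:
    print(row)
```

Because `heapq.merge` is lazy, only as many rows as the limit requires are pulled from each list. With a real client you would still have to decide how many rows to fetch per employee; fetching at most the limit per employee is sufficient for a top-n result.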

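Update 2:

As noted in the Apache mailing list thread (see the comments below), materialized views are not considered production-ready. However, DataStax considers them usable, provided they are handled according to the best practices mentioned there. Personally, I have not had any problems with materialized views. Admittedly, that was on a simple single-datacenter cluster; given that the best practices refer to more complex setups, views may well break down in those situations.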

