首页 > 解决方案 > No-SQL (Cassandra) data modelling of user data

问题描述

How do you model user data in Cassandra?

  1. A single table for user data, partitioned by user-ID, with different components reading/writing to different columns?
  2. Multiple tables (one per component) with the same key structure, that occasionally need to be "joined" together on partition key?

We have various data and metadata associated with a customer, that we currently hold in separate tables with the same partitioning & clustering keys.

This leads to fething bits of information for a user from different tables (e.g. to analytics), effectively "joining" two or more Cassandra tables on their partition keys.

On the positive side, inserting to tables is done independently.

Is there a race condition when concurrently updating data under the same partition key but different columns? Or the deltas are gracefully merged on the SSTables?

Is having multiple tables with the same partition (and clustering) keys usual or an anti-pattern?

To make this more concrete, let's say we have:

CREATE TABLE example (
  pk text PRIMARY KEY
  col_a text
  col_b text
)

Assume that for a given partition key (pk), initially both col_a, and col_b have some value (i.e. not null). and two concurrent inserts update each of them. Is there any race condition possible there? Losing one of the two updates, despite writing into separate columns?

标签: cassandranosql

解决方案


概括

写冲突是你不应该担心的事情。所有 INSERTS/UPDATES/DELETES 在 Cassandra 中都是 Upserts。Cassandra 中的所有内容都是基于列的。

Cassandra 使用最后写入获胜策略来管理冲突。正如您在下面的示例中所看到的,每当您更改值时,与该列关联的时间戳都会更新。由于您正在运行并发更新,因此一个线程将更新col_a,另一个线程将更新col_b


例子

初始插入

cqlsh:test_keyspace> insert into race_condition_test (pk, col_a, col_b ) VALUES ( '1', 'deckard', 'Blade Runner');
cqlsh:test_keyspace> select * from race_condition_test ;

 pk | col_a   | col_b
----+---------+--------------
  1 | deckard | Blade Runner

(1 rows)

初始插入中的时间戳相同

cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;

 pk | col_a   | writetime(col_a) | col_b        | writetime(col_b)
----+---------+------------------+--------------+------------------
  1 | Deckard | 1526916970412357 | Blade Runner | 1526916970412357

(1 rows)

一旦col_b更新,它的时间戳就会更改以反映更改。

cqlsh:test_keyspace> insert into race_condition_test (pk, col_b ) VALUES ( '1', 'Rick');
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;

 pk | col_a   | writetime(col_a) | col_b | writetime(col_b)
----+---------+------------------+-------+------------------
  1 | Deckard | 1526916970412357 |  Rick | 1526917272641682

(1 rows)

col_a更新后,它的时间戳也更新为新值

cqlsh:test_keyspace> insert into race_condition_test (pk, col_a) VALUES ( '1', 'bounty hunter');
cqlsh:test_keyspace> select pk, col_a, writetime(col_a), col_b, writetime(col_b) from race_condition_test ;

 pk | col_a         | writetime(col_a) | col_b | writetime(col_b)
----+---------------+------------------+-------+------------------
  1 | bounty hunter | 1526917323082217 |  Rick | 1526917272641682

(1 rows)

推荐

我的建议是您使用一个满足您查询需求的表。如果您需要通过pk查询,则创建一个包含您需要的所有列的单个表。这样,您将拥有一个可以有效回读的宽行,作为单个查询的一部分。

您在选项 2 中描述的数据模型有点关系,并且对于 Cassandra 来说不是最佳的。您不能在 cassandra 中本地执行连接,您应该避免在客户端执行连接。

数据模式规则:

规则 1:在集群中均匀分布数据 您将需要创建一个分区键,以确保数据在集群中均匀分布并且您没有任何热点。

规则 2:最小化分区数量读取每个分区可能驻留在不同的节点中,因此您应该尝试创建一个场景,为了性能起见,您的查询理想地只在一个节点上进行。

规则 3:围绕您的查询建模

  1. 确定要支持的查询
  2. 创建一个满足您的查询的表(这意味着您应该为每个查询模式使用一个表)。
  3. 如果您需要支持更多查询模式,请将您的数据非规范化到为这些查询提供服务的其他表中。避免使用二级索引和物化视图,因为它们目前不稳定,当您开始增加集群时,第一个可能会产生重大的性能问题。

如果您想阅读更多有关此内容的信息,我建议您使用此 datastax 页面: Cassandra 数据建模的基本规则


推荐阅读