Is it possible to stream to a BigQuery partitioned table while preserving caching?

Problem description

I have a single table in BigQuery, time-partitioned by day. A Dataflow job uses the streaming API to insert new records continuously, but only into the newest partitions (two in the corner case, when data arrives slightly out of order around the day boundary).

On the other side, I query the table heavily, aggregating historical months without touching the most recent days, and therefore without touching the streaming buffer.

I would like to leverage result caching for such queries. Unfortunately, streaming to the table disables the cache, even though in theory the cached results are not affected by the streamed rows.

How can I use caching on historical partitions while still being able to stream to the newest partitions?

If it is impossible out of the box, is it a good design to:

If yes, how would I define such a view that will use caching if only "historical" data is queried? Or would I need to have my own query rewrite tool?

Maybe you have other ideas?

Tags: google-bigquery

Solution


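You cannot use caching and streaming on the same table at the same time. There is already a feature request asking for exactly this functionality.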

As you said, the workaround is to keep two separate tables and accept some data redundancy. I agree with the approach you posted:

  1. Maintain two tables: "recent" (streamed to) and "historical" (queried).
  2. Periodically merge "recent" into "historical", then clear "recent".
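With two tables, each query has to be pointed at the right source. The sketch below is one possible way to do that and is not part of the original answer: it sends a query to the "historical" table alone when its date range ends on or before the last merged day, and otherwise falls back to a union of both tables. The table names, the `pick_source` helper and the `historical_through` cutoff are all hypothetical. It also assumes that any query referencing the streamed "recent" table, directly or through a view, will not be served from the cache.

```python
from datetime import date

# Hypothetical table names; replace with your own dataset and tables.
HISTORICAL_TABLE = "mydataset.events_historical"
RECENT_TABLE = "mydataset.events_recent"


def pick_source(query_end: date, historical_through: date) -> str:
    """Return the FROM source for a query whose newest requested day is query_end.

    historical_through is the last day already merged into the historical table.
    """
    if query_end <= historical_through:
        # Only the non-streamed table is referenced, so the result can be cached.
        return f"`{HISTORICAL_TABLE}`"
    # The range reaches into streamed data: read both tables (assumed uncached).
    return (
        f"(SELECT * FROM `{HISTORICAL_TABLE}` "
        f"UNION ALL SELECT * FROM `{RECENT_TABLE}`)"
    )
```

A caller would splice the returned string into its query text, for example `f"SELECT COUNT(*) FROM {pick_source(end_day, cutoff)} AS t WHERE ..."`.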

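See the "Managing partitioned tables" documentation. It lists use cases based on the bq cp command that can help you merge the "recent" table into the "historical" table.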
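As a rough illustration of the merge step, the sketch below builds the bq command lines for a single day: one that appends a partition of "recent" to the same partition of "historical" using the `$YYYYMMDD` partition decorator, and one that then removes that partition from "recent". The table names and helper functions are hypothetical. Treat the exact flags (`-a` to append, `-f -t` to delete a table or partition without a prompt) as assumptions and check them against your version of the bq tool. The sketch also does not check whether the partition still has rows in the streaming buffer.

```python
from datetime import date
from typing import List


def partition_copy_command(source_table: str, dest_table: str, day: date) -> List[str]:
    """Build a bq cp command that appends one daily partition to the destination."""
    suffix = day.strftime("%Y%m%d")
    return ["bq", "cp", "-a", f"{source_table}${suffix}", f"{dest_table}${suffix}"]


def partition_delete_command(table: str, day: date) -> List[str]:
    """Build a bq rm command that deletes one daily partition without prompting."""
    suffix = day.strftime("%Y%m%d")
    return ["bq", "rm", "-f", "-t", f"{table}${suffix}"]
```

Passing these lists to `subprocess.run(cmd, check=True)` rather than to a shell string avoids the shell interpreting the `$` in the partition decorator.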

