c# - Fetching large loads of data using paging
Problem description
Let's say I have a Cloud environment and a Client environment, and I want to sync a large amount of data from the cloud to the client. Say I have a db table in the cloud named Files, and I want an identical table to exist in the client environment.
Now let's assume a few things:
- The files table is very big.
- Each row in files can be updated at any time and has a last-update column.
- I want to fetch the deltas and make sure both environments are identical.
My solution:
- I make a full sync first, returning all the entries to the client.
- I keep the LastSync time in the client environment and keep syncing deltas from the LastSync time.
- I do the full sync and the delta syncs using paging: the client fires a first request to get the Count of results for the delta, and then as many further requests as the Page Size of each request requires.
For example, the count:
SELECT COUNT(*) FROM files WHERE last_update > @LastSyncTime
The page fetching:
SELECT col1, col2..
FROM files
WHERE last_update > @LastSyncTime
ORDER BY files.id
LIMIT @LIMIT
OFFSET @OFFSET
My problem:
What if the first fetch (the Count fetch) takes some time (a few minutes, for example), and in that time more entries are updated and so added to the last-update result set?
For example:
- The Count fetch gave 100 entries for last-update 1000 seconds.
- 1 entry is updated while fetching the Count.
- Now last-update 1000 seconds will give 101 entries.
- The page fetch will only get 100 of the 101 entries, ordered by id.
- 1 entry is missed and not synced to the client.
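The scenario above can be reproduced in miniature. The following is a minimal sketch, assuming Python with an in-memory SQLite table standing in for the cloud database; the row values, the page size of 2 and the `last_sync` value of 1000 are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, last_update INTEGER)")
# ids 2..5 changed after the last sync (t = 1000); ids 1 and 6 did not.
conn.executemany(
    "INSERT INTO files VALUES (?, ?)",
    [(1, 900), (2, 1100), (3, 1200), (4, 1300), (5, 1400), (6, 950)],
)

last_sync, page_size = 1000, 2

# Step 1: the Count request.
(count,) = conn.execute(
    "SELECT COUNT(*) FROM files WHERE last_update > ?", (last_sync,)
).fetchone()

# A row is updated after the count was taken but before paging starts.
conn.execute("UPDATE files SET last_update = 1500 WHERE id = 6")

# Step 2: fetch exactly `count` rows, page by page, ordered by id.
fetched = []
for offset in range(0, count, page_size):
    rows = conn.execute(
        "SELECT id FROM files WHERE last_update > ? "
        "ORDER BY id LIMIT ? OFFSET ?",
        (last_sync, page_size, offset),
    ).fetchall()
    fetched.extend(r[0] for r in rows)

# Rows that match the delta filter now but were never transferred.
missed = [
    r[0]
    for r in conn.execute(
        "SELECT id FROM files WHERE last_update > ? ORDER BY id", (last_sync,)
    )
    if r[0] not in fetched
]
print(count, fetched, missed)
```

Here the count is taken before the update, so the paging loop stops after the counted rows and the late-updated row is never requested.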
I have tried 2 other options:
- Syncing with a from-to date limit for last-update.
- Ordering by last-update instead of the id column.
I see issues in both options.
Solution
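Do not use OFFSET and LIMIT; it goes from OK to slow to slower. Instead, keep track of "where you left off" with last_update so that it can be more efficient.
Since there can be duplicate datetimes, be flexible about how many rows to do at once.
Run this continually. Don't use cron except as a "keep-alive".
There is no need for the initial copy; this code does it for you.
It is vital to have INDEX(last_update).
Here is the code: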
-- Initialize. Note: This subtract is consistent with the later compare.
SELECT @left_off := MIN(last_update) - INTERVAL 1 DAY
FROM tbl;
-- Loop (driven by the application):
-- Get the ending timestamp:
SELECT @cutoff := last_update FROM tbl
WHERE last_update > @left_off
ORDER BY last_update
LIMIT 1 OFFSET 100; -- assuming you decide to do 100 at a time
-- if no result, sleep for a while, then restart
-- Get all the rows through that timestamp
-- This might be more than 100 rows
SELECT * FROM tbl
WHERE last_update > @left_off
AND last_update <= @cutoff
ORDER BY last_update;
-- and transfer them
-- prep for next iteration
SET @left_off := @cutoff;
-- Go back to Loop
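SELECT @cutoff will be fast: it is a short scan of 100 consecutive rows in the index.
SELECT * does the heavy lifting and takes time proportional to the number of rows, with no extra overhead from OFFSET. Reading 100 rows should take about 1 second (assuming spinning disk, non-cached data).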
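The loop above can be sketched as a small runnable program. This is only an illustration under stated assumptions: it uses Python with SQLite rather than MySQL, integer timestamps, and a hypothetical helper named `sync_batches`. It also departs from the answer in one place, which is labeled in the code: when fewer rows remain than one full batch, it drains the tail and stops instead of sleeping and retrying, so that the example terminates.

```python
import sqlite3


def sync_batches(conn, batch=100):
    """Yield row batches in last_update order, never splitting a timestamp."""
    (oldest,) = conn.execute("SELECT MIN(last_update) FROM tbl").fetchone()
    if oldest is None:
        return  # empty table
    # Start just below the oldest timestamp so the first batch includes it.
    left_off = oldest - 1
    while True:
        # Timestamp of the row `batch` positions past where we left off.
        row = conn.execute(
            "SELECT last_update FROM tbl WHERE last_update > ? "
            "ORDER BY last_update LIMIT 1 OFFSET ?",
            (left_off, batch),
        ).fetchone()
        if row is not None:
            cutoff = row[0]
        else:
            # ASSUMPTION (differs from the answer, which sleeps and retries):
            # take whatever is left, then stop.
            (cutoff,) = conn.execute(
                "SELECT MAX(last_update) FROM tbl WHERE last_update > ?",
                (left_off,),
            ).fetchone()
            if cutoff is None:
                return  # nothing newer than left_off
        # All rows through the cutoff; may be more than `batch` rows
        # when several rows share the cutoff timestamp.
        yield conn.execute(
            "SELECT id, last_update FROM tbl "
            "WHERE last_update > ? AND last_update <= ? ORDER BY last_update",
            (left_off, cutoff),
        ).fetchall()
        left_off = cutoff


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, last_update INTEGER)")
conn.execute("CREATE INDEX idx_last_update ON tbl (last_update)")
stamps = [10, 10, 10, 20, 20, 30, 40, 40, 40, 50]  # made-up data with duplicates
conn.executemany(
    "INSERT INTO tbl VALUES (?, ?)", list(enumerate(stamps, start=1))
)

batches = list(sync_batches(conn, batch=3))
sizes = [len(b) for b in batches]
synced_ids = sorted(r[0] for b in batches for r in b)
print(sizes, synced_ids)
```

With a batch size of 3 the batches come out larger than 3 whenever the cutoff timestamp is shared by several rows, which is the flexibility about batch size that the answer asks for.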
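Instead of initially getting COUNT(*), I would start by getting MAX(last_update), since the rest of the code is based on last_update. This query is "instantaneous" because it only has to probe the end of the index. But I claim you don't even need that!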
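One possible reading of the MAX(last_update) suggestion is to pin an upper bound before fetching, so that a row updated mid-sync moves past the bound and is picked up by the next run rather than lost. The sketch below assumes that reading; it again uses Python with an in-memory SQLite table, and the data and the `last_sync` value are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE files (id INTEGER PRIMARY KEY, last_update INTEGER)")
conn.execute("CREATE INDEX idx_last_update ON files (last_update)")
conn.executemany(
    "INSERT INTO files VALUES (?, ?)", [(1, 1100), (2, 1200), (3, 1300)]
)

last_sync = 1000

# Pin the upper bound first instead of counting rows.
(upper,) = conn.execute("SELECT MAX(last_update) FROM files").fetchone()

# A row is updated while the sync is in progress.
conn.execute("UPDATE files SET last_update = 1400 WHERE id = 1")

# This run covers the window (last_sync, upper].
this_run = [
    r[0]
    for r in conn.execute(
        "SELECT id FROM files WHERE last_update > ? AND last_update <= ? "
        "ORDER BY last_update",
        (last_sync, upper),
    )
]

# The next run starts from `upper` and sees the row that moved.
next_run = [
    r[0]
    for r in conn.execute(
        "SELECT id FROM files WHERE last_update > ? ORDER BY last_update",
        (upper,),
    )
]
print(upper, this_run, next_run)
```

The updated row drops out of the current window but reappears in the following one with its latest value, so the client would store `upper` as its new LastSync.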
A possible bug: if rows in the "source" can be deleted, how do you recognize that?