Fetching large loads of data using paging

Problem description

Say I have a cloud environment and a client environment, and I want to sync a large amount of data from the cloud to the client. Specifically, I have a database table in the cloud named Files, and I want an identical copy of that table to exist in the client environment.

Now let's assume a few things:

  1. The Files table is very big.
  2. Any row in Files can be updated at any time, and each row has a last-update column.
  3. I want to fetch the deltas and make sure both environments are identical.

My solution:

  1. I do a full sync first, returning all the entries to the client.
  2. I keep the LastSync time in the client environment and keep syncing deltas from the LastSync time.
  3. I do the full sync and the delta syncs using paging: the client first sends a request to get the count of results for the delta, then as many further requests as the page size requires.

For example, the count:

SELECT COUNT(*) FROM files WHERE last_update > @LastSyncTime

The page fetching:

SELECT col1, col2..
FROM files 
WHERE last_update > @LastSyncTime
ORDER BY files.id
LIMIT @LIMIT 
OFFSET @OFFSET

My problem:

What if the first fetch (the count fetch) takes some time (a few minutes, for example), and in that time more entries are updated and therefore added to the set matched by the last-update filter?

For example:

I have tried 2 other options:

I see issues in both options.

Tags: c#, mysql, paging, data-paging

Solution


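  • Do not use OFFSET and LIMIT; it goes from OK to slow to slower, because the server has to step over every skipped row on each request. Instead, keep track of "where you left off" using last_update so that each fetch stays efficient.

  • Since there can be duplicate datetimes, be flexible about how many rows you process at once.

  • Run this continually. Do not use cron except as a "keep-alive".

  • There is no need for the initial full copy; this code does it for you.

  • It is vital to have INDEX(last_update).

Here is the code: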

-- Initialize.  Note: This subtract is consistent with the later compare. 
SELECT @left_off := MIN(last_update) - INTERVAL 1 DAY
    FROM tbl;

Loop:

    -- Get the ending timestamp:
    SELECT @cutoff := last_update FROM tbl
         WHERE last_update > @left_off
         ORDER BY last_update
         LIMIT 1  OFFSET 100;   -- assuming you decide to do 100 at a time
    -- if no result, sleep for a while, then restart

    -- Get all the rows through that timestamp
    -- This might be more than 100 rows (duplicate timestamps)
    SELECT * FROM tbl
        WHERE last_update > @left_off
          AND last_update <= @cutoff
        ORDER BY last_update;
    -- and transfer them

    -- prep for next iteration
    SET @left_off := @cutoff;

Goto Loop

The SELECT @cutoff will be fast: it is a brief scan of 100 consecutive rows in the index.

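The SELECT * does the heavy lifting and takes time proportional to the number of rows returned, with no extra overhead from OFFSET. Reading 100 rows should take about 1 second (assuming a spinning disk and non-cached data).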
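The loop above can be sketched as runnable code. This is a minimal illustration, not the answer's own implementation: it uses Python's built-in sqlite3 module in place of MySQL, integer timestamps in place of datetimes, and the hypothetical helper names make_demo_db and sync_in_batches. It also adds one assumption the pseudocode does not make: when fewer rows than one batch remain, it drains them by using MAX(last_update) as the cutoff instead of sleeping.

```python
import sqlite3


def make_demo_db():
    """Build an in-memory stand-in for the source table (assumed schema)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, last_update INTEGER)")
    conn.execute("CREATE INDEX idx_last_update ON tbl (last_update)")
    # 250 rows; each timestamp is shared by 5 rows to exercise duplicates.
    conn.executemany(
        "INSERT INTO tbl VALUES (?, ?)",
        [(i, 1000 + i // 5) for i in range(250)],
    )
    return conn


def sync_in_batches(conn, batch_size=100):
    """Yield batches ordered by last_update, tracking where we left off."""
    cur = conn.cursor()
    (min_ts,) = cur.execute("SELECT MIN(last_update) FROM tbl").fetchone()
    if min_ts is None:
        return  # empty table, nothing to transfer
    # Start just below the minimum so the strict '>' compare includes it.
    left_off = min_ts - 1
    while True:
        # Short index scan: find the timestamp batch_size rows ahead.
        row = cur.execute(
            "SELECT last_update FROM tbl WHERE last_update > ? "
            "ORDER BY last_update LIMIT 1 OFFSET ?",
            (left_off, batch_size),
        ).fetchone()
        if row is not None:
            cutoff = row[0]
        else:
            # Assumption (not in the original): drain the short tail.
            (cutoff,) = cur.execute("SELECT MAX(last_update) FROM tbl").fetchone()
            if cutoff <= left_off:
                return  # caught up; a real client would sleep and retry
        # Fetch every row through the cutoff; may exceed batch_size.
        batch = cur.execute(
            "SELECT id, last_update FROM tbl "
            "WHERE last_update > ? AND last_update <= ? "
            "ORDER BY last_update",
            (left_off, cutoff),
        ).fetchall()
        yield batch
        left_off = cutoff
```

Iterating over sync_in_batches(make_demo_db()) returns every row exactly once. Because the range always ends on a timestamp boundary, rows sharing a timestamp are never split across two batches, which is why a batch can be larger than the requested size.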

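Instead of initially getting COUNT(*), I would start by getting MAX(last_update), since the rest of the code is based on last_update. This query is "instantaneous" because it only has to probe the end of the index. But I claim you don't even need that!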
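To illustrate why an upper bound on last_update is a more useful starting point than a count, here is a small sketch under the same assumptions as before (sqlite3 standing in for MySQL, integer timestamps, and the hypothetical helpers max_last_update and changed_between). A row that is updated while a pass is running simply falls outside that pass's range and is picked up by the next one, so the target never moves mid-pass.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tbl (id INTEGER PRIMARY KEY, last_update INTEGER)")
conn.execute("CREATE INDEX idx_last_update ON tbl (last_update)")
conn.executemany("INSERT INTO tbl VALUES (?, ?)", [(i, 100 + i) for i in range(10)])


def max_last_update(conn):
    """Probe the end of the last_update index."""
    return conn.execute("SELECT MAX(last_update) FROM tbl").fetchone()[0]


def changed_between(conn, low, high):
    """Rows whose last_update lies in the half-open range (low, high]."""
    return conn.execute(
        "SELECT id, last_update FROM tbl "
        "WHERE last_update > ? AND last_update <= ? "
        "ORDER BY last_update",
        (low, high),
    ).fetchall()


# Fix the upper bound before the pass starts.
upper = max_last_update(conn)

# Simulate a concurrent update that lands while the pass is running.
conn.execute("UPDATE tbl SET last_update = 500 WHERE id = 3")

# The first pass ignores the moved row; the next pass picks it up.
first_pass = changed_between(conn, 0, upper)
second_pass = changed_between(conn, upper, max_last_update(conn))
```

Here first_pass holds the nine untouched rows and second_pass holds only the row that changed in the meantime.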

A possible bug: if rows in the "source" can be deleted, how do you recognize that?

