首页 > 解决方案 > 递归更新开始结束日期以避免重叠

问题描述

我正在处理 MySQL 中有关意大利工作合同的庞大数据库(行数约为 2000 万)。我的核心表的每一行代表与特定雇主的工人签订的特定合同。为了重建每个工人的工作历史,我在导入过程中对表进行索引时,按照工人的识别码和每份合同的开始日期对工人进行了排序。然后,每一行都有自己的渐进式 ID,但同时,我为每一行添加了两个字段,一个引用前一个 ID,另一个引用下一个。仅当前一个或后一个 ID 引用同一个工作人员时,这两个字段才有效地不为空。

我已经做了一个小例子来说明我的数据在这里的样子(或者,在下面的脚本中,我创建了一个可重现的小例子)。

工人的历史可能是什么样子

在此处输入图像描述

最后应该如何改变

在此处输入图像描述

我目前的任务是计算我桌子上每个人的有效工作天数。尽管如此,数据无疑具有巨大重叠的特点。毕竟,每个人可能有几个重叠的合同。例如,一份于 2010 年 1 月 1 日开始并于 2012 年 1 月 1 日结束的合同可能会紧随其后在 2012 年 1 月 1 日之前结束的其他几份较短的合同。因此,如果我计算这个人有效工作的天数,我可能会重复计算。出于这个原因,我想通过更改合同的结束日期来重新安排合同,以获得后续的不重叠合同。唯一可能的重叠可能是一天。我制作了一个图形示例,说明个人的工作历史可能是什么样子,以及我想如何在以下两张图片中重新排列它。

由于我无法修改每个合同/行的开始日期,我想通过根据之前的合同进行修改来处理每个合同的结束日期。

我按照以下步骤工作:

在下面,您可以找到我用于此过程的代码。

-- My example table (data_example.csv on GitHub)

drop table if exists mytable;
create table mytable
( 
id INT, 
WORKER_ID INT not null,
EMPLOYER_ID INT not null,
dt_start date not null, -- Contract start date
dt_end date, -- Contract end date
id_prev INT, -- ID of previous contract
dt_start_prev date, -- Start date of previous contract
dt_end_prev date, -- End date of previous contract
id_next INT, -- ID of next contract
dt_start_next date, -- Start date of next contract
dt_end_next date, -- End date of next contract
primary key(id)
);

insert into mytable 
(id, WORKER_ID, EMPLOYER_ID, dt_start, dt_end, 
id_prev, dt_start_prev, dt_end_prev, 
id_next, dt_start_next, dt_end_next)
values
    (1, 5157, 3384722, '2012-01-01', '2012-01-03', NULL, NULL, NULL, 2, '2012-01-02', '2012-01-04'),
    (2, 5157, 3384722, '2012-01-02', '2012-01-04', 1, '2012-01-01', '2012-01-03', 3, '2012-01-04', '2012-01-12'),
    (3, 5157, 96120, '2012-01-04', '2012-01-12', 2, '2012-01-02', '2012-01-04', 4, '2012-01-07', '2012-01-08'),
    (4, 5157, 3384722, '2012-01-07', '2012-01-08', 3, '2012-01-04', '2012-01-12', 5, '2012-01-08', '2012-01-10'),
    (5, 5157, 3384722, '2012-01-08', '2012-01-10', 4, '2012-01-07', '2012-01-08', 6, '2012-01-10', '2012-01-11'),
    (6, 5157, 3954093, '2012-01-10', '2012-01-11', 5, '2012-01-08', '2012-01-10', 7, '2012-01-12', '2012-01-15'),
    (7, 5157, 3384722, '2012-01-12', '2012-01-15', 6, '2012-01-10', '2012-01-11', 8, '2012-01-14', '2012-01-16'),
    (8, 5157, 3954093, '2012-01-14', '2012-01-16', 7, '2012-01-12', '2012-01-15', 9, '2012-01-14', '2012-01-14'),
    (9, 5157, 3384722, '2012-01-14', '2012-01-14', 8, '2012-01-14', '2012-01-16', 10, '2012-01-14', '2012-01-20'),
    (10, 5157, 96120, '2012-01-14', '2012-01-20', 9, '2012-01-14', '2012-01-14', NULL, NULL, NULL),
    (11, 5990, 1940957, '2012-01-01', '2012-01-30', NULL, NULL, NULL, 12, '2012-02-01', '2012-02-15'),
    (12, 5990, 4822105, '2012-02-01', '2012-02-15', 11, '2012-01-01', '2012-01-30', 13, '2012-02-10', '2012-02-10'),
    (13, 5990, 1940957, '2012-02-10', '2012-02-10', 12, '2012-02-01', '2012-02-15', 14, '2012-02-16', '2012-02-20'),
    (14, 5990, 1940957, '2012-02-16', '2012-02-20', 13, '2012-02-10', '2012-02-10', 15, '2012-02-17', '2012-02-28'),
    (15, 5990, 4822105, '2012-02-17', '2012-02-28', 14, '2012-02-16', '2012-02-20', NULL, NULL, NULL);

-- The following table counts the number of contracts for each individual
-- I will use it the determine the maximum number of contract per worker

drop table if exists max_act;
create table max_act 
as select WORKER_ID, count(*) n 
from mytable 
group by WORKER_ID;

set SQL_SAFE_UPDATES = 0;

-- Here I create the procedure

drop procedure if exists doiterate;
delimiter //

create procedure doiterate()
begin
  declare total INT unsigned DEFAULT 0;

  -- The number of iterations is equal to the maximum value in the table 'max_act'

  while total <= (select MAX(n) from max_act) do

  -- If the end date of the previous contract is greater than the end of the current contract
  -- the procedure sets the end date equal to the end date of the previous contract

  update mytable a
    set a.dt_end =
        case 
        when a.dt_end is NOT null and a.dt_end_prev > a.dt_end then a.dt_end_prev
        else a.dt_end end
    ;
    
  -- Here I update in each row the end date of the previous contract

  update mytable a
  left outer join mytable p on a.id_prev = p.id
    set a.dt_end_prev =
        case 
        when a.dt_end_prev is NOT null and a.dt_end_prev != p.dt_end then p.dt_end
        else a.dt_end_prev end
    ;
    
    set total = total + 1;
  end while;
end//  

delimiter ;

CALL doiterate(); 

-- Here I set the end date of each contract equal to the beginning of the next one if there is overlapping

update mytable a
    set a.dt_end =
        case 
        when a.dt_end is NOT null and a.dt_start_next < a.dt_end then a.dt_start_next
        else a.dt_end end
    ;

set SQL_SAFE_UPDATES = 1;

但是,我认为这个程序几乎是最佳的。我估计要花几天时间才能结束。如果有人能给我一些关于如何处理这个问题的提示,我将不胜感激。先感谢您。

标签: mysqloptimizationprocedure

解决方案


正如在一条评论中已经说过的,我尝试使用LAG()LEAD()函数按时间顺序连接个人的所有合同。然而,这个过程——也许是我的错——结果变得更慢了。因此,我只是决定只对那些实际上有至少两个重叠合同的工人运行该程序,这可能不是最好的解决方案(肯定不是在编码方面),但至少我能够执行该程序(我或多或少花了 1 天半的时间)。

-- Here I am identifying contracts with an overlapping previous contract  
alter table mytable add column flag_overlap INT default 0;
update mytable set flag_overlap = 1 where dt_end is NOT null and dt_end_prev > dt_end;

-- Creating a table with only those workers with at least two overlapping contracts
drop table if exists mytable_id;
create table mytable_id as select WORKER_ID 
from mytable where flag_overlap = 1 
group by WORKER_ID;

-- This is my table of interests with all the contracts for those workers identified in the previous step
drop table if exists mytable_mod;
create table mytable_mod
as select *
from mytable a
inner join mytable_id b on a.WORKER_ID = b.WORKER_ID 
order by WORKER_ID , dt_start;
alter table mytable_mod add unique index idx_ord_id(id);

-- The rest of the code is the same as the one posted in this question, 
-- simply I referred to the table 'mytable_mod' and no longer to 'mytable'. 

-- [...]

-- At the end I updated the 'revised' end date of my original table 'mytable'

UPDATE mytable a
left outer join mytable_mod b on a.ord_all = b.ord_all
set
    a.dt_end = b.dt_end ,
    a.dt_end_next = b.dt_end_next ,
    a.dt_end_prev = b.dt_end_prev 
    ;

推荐阅读