首页 > 解决方案 > 使用滞后功能删除基于微小差异的重复数据行

问题描述

电信公司根据冲动收集电话数据,例如手机信号塔或网络变化。因此,如果唯一的不同点可能是轻微的时间差异或存在手机信号塔,则构建了不需要的重复项,并且任务是将它们取出。这里给出一个例子:

这是收集的手机数据示例,日期、电话 1 和电话 2 相同。不同的是有时是时间,有时是“from_Tower”数据的存在。

SELECT '4/04/2020'  as [Date],  '7:03:46'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2, '1' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:03:46'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2, '' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:03:48'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2, '5' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:03:48'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2, ''as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:03:48'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2, '.'as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:34:33'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2,'' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:34:34'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2,'' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:37:34'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2,'' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:37:34'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2 ,'-' as From_Tower 

我怎样才能删除数据,只剩下 4 行:

SELECT '4/04/2020'  as [Date],  '7:03:46'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2, '1' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:03:48'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2, '5' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:34:34'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2,'' as From_Tower UNION
SELECT '4/04/2020'  as [Date],  '7:37:34'   as Time,    '6123456789'    as Phone1,  '6987654321'as Phone2 ,'-' as From_Tower 

当有数百个重复的 Phone1 号码时,我可以使用max(time)... where from_Tower is not null; 如何使用 max 函数。当时使用数组是否有用,如何做到这一点?有没有其他更好的方法可以干净地做到这一点?

下面的部分有启发性的想法,这让我想到,解决这个问题的最好方法可能是时间的 LAG 函数,只要时间超过 2 秒,就会跳过数据。所以我创建了这段代码:

       [CTE_dedup] as(

 SELECT [Date],[Time], [Phone1],  [Phone2], [From_Tower], 
  LAG([To_A])
      OVER (PARTITION BY [Time] ORDER BY [Date]) AS Previous_count, 
      [Time] - LAG([Time])
      OVER (PARTITION BY [Time]ORDER BY [Date] ) AS Timedifference
   
    FROM [CTE_set])


    select * from [CTE_dedup]
    WHERE Timedifference >2

但我得到这个错误:

Operand data type time is invalid for subtract operator.

时间导入为:

time (time(7), null

为什么我不能相互减去时间字段?

标签: sqlsql-servertsql

解决方案


这是答案:如何在 SQL Server 中将时间转换为整数

      WITH [CTE_set_sceen] AS  
    (
   SELECT [Date],[Time], [Phone1],  [Phone2], [From_Tower] 
    ,CONVERT(INT, REPLACE(CONVERT(VARCHAR(8),  [Time_n] , 108), ':','')) AS [IntVersion]
    FROM [data] 
   ),

   [CTE_dedup] as(

 SELECT[Date],[Time], [Phone1], [Phone2], [From_Tower], 
     [IntVersion] - LAG([IntVersion])
      OVER (PARTITION BY [Date] ORDER BY [Time] ) AS Timedifference

    FROM [CTE_set_sceen]
)


    select * from [CTE_dedup]
    WHERE Timedifference >2

推荐阅读