sql - 使用滞后功能删除基于微小差异的重复数据行
问题描述
电信公司根据冲动收集电话数据,例如手机信号塔或网络变化。因此,如果唯一的不同点可能是轻微的时间差异或存在手机信号塔,则构建了不需要的重复项,并且任务是将它们取出。这里给出一个例子:
这是收集的手机数据示例,日期、电话 1 和电话 2 相同。不同的是有时是时间,有时是“from_Tower”数据的存在。
SELECT '4/04/2020' as [Date], '7:03:46' as Time, '6123456789' as Phone1, '6987654321'as Phone2, '1' as From_Tower UNION
SELECT '4/04/2020' as [Date], '7:03:46' as Time, '6123456789' as Phone1, '6987654321'as Phone2, '' as From_Tower UNION
SELECT '4/04/2020' as [Date], '7:03:48' as Time, '6123456789' as Phone1, '6987654321'as Phone2, '5' as From_Tower UNION
SELECT '4/04/2020' as [Date], '7:03:48' as Time, '6123456789' as Phone1, '6987654321'as Phone2, ''as From_Tower UNION
SELECT '4/04/2020' as [Date], '7:03:48' as Time, '6123456789' as Phone1, '6987654321'as Phone2, '.'as From_Tower UNION
SELECT '4/04/2020' as [Date], '7:34:33' as Time, '6123456789' as Phone1, '6987654321'as Phone2,'' as From_Tower UNION
SELECT '4/04/2020' as [Date], '7:34:34' as Time, '6123456789' as Phone1, '6987654321'as Phone2,'' as From_Tower UNION
SELECT '4/04/2020' as [Date], '7:37:34' as Time, '6123456789' as Phone1, '6987654321'as Phone2,'' as From_Tower UNION
SELECT '4/04/2020' as [Date], '7:37:34' as Time, '6123456789' as Phone1, '6987654321'as Phone2 ,'-' as From_Tower
我怎样才能删除数据,只剩下 4 行:
SELECT '4/04/2020' as [Date], '7:03:46' as Time, '6123456789' as Phone1, '6987654321'as Phone2, '1' as From_Tower UNION
SELECT '4/04/2020' as [Date], '7:03:48' as Time, '6123456789' as Phone1, '6987654321'as Phone2, '5' as From_Tower UNION
SELECT '4/04/2020' as [Date], '7:34:34' as Time, '6123456789' as Phone1, '6987654321'as Phone2,'' as From_Tower UNION
SELECT '4/04/2020' as [Date], '7:37:34' as Time, '6123456789' as Phone1, '6987654321'as Phone2 ,'-' as From_Tower
当有数百个重复的 Phone1 号码时,我可以使用max(time)... where from_Tower is not null;
如何使用 max 函数。当时使用数组是否有用,如何做到这一点?有没有其他更好的方法可以干净地做到这一点?
下面的部分有启发性的想法,这让我想到,解决这个问题的最好方法可能是时间的 LAG 函数,只要时间超过 2 秒,就会跳过数据。所以我创建了这段代码:
[CTE_dedup] as(
SELECT [Date],[Time], [Phone1], [Phone2], [From_Tower],
LAG([To_A])
OVER (PARTITION BY [Time] ORDER BY [Date]) AS Previous_count,
[Time] - LAG([Time])
OVER (PARTITION BY [Time]ORDER BY [Date] ) AS Timedifference
FROM [CTE_set])
select * from [CTE_dedup]
WHERE Timedifference >2
但我得到这个错误:
Operand data type time is invalid for subtract operator.
时间导入为:
time (time(7), null
为什么我不能相互减去时间字段?
解决方案
WITH [CTE_set_sceen] AS
(
SELECT [Date],[Time], [Phone1], [Phone2], [From_Tower]
,CONVERT(INT, REPLACE(CONVERT(VARCHAR(8), [Time_n] , 108), ':','')) AS [IntVersion]
FROM [data]
),
[CTE_dedup] as(
SELECT[Date],[Time], [Phone1], [Phone2], [From_Tower],
[IntVersion] - LAG([IntVersion])
OVER (PARTITION BY [Date] ORDER BY [Time] ) AS Timedifference
FROM [CTE_set_sceen]
)
select * from [CTE_dedup]
WHERE Timedifference >2
推荐阅读
- selenium - AWS Lambda、Selenium、Headless Chrome 与 Python 3.8 不兼容?
- string - 在 SQL Server 2008 中转换为德语日期字符串
- python - marshal_with 用于 Flask RESTful 中的 SQLAlchemy 对象列表
- c++ - 在对象未更新时传递对常量对象的引用
- javascript - 更改事件在两个元素中不起作用
- python - 我想用这段代码生成随机数,但它会产生错误
- python - file_version_info.txt中的filevers/prodvers这四项是什么意思?
- android - 我如何实现这个设计?
- for-loop - 在创建活动之前检查日历是否有重复项
- sass - 当页面重新加载时,Gulp watch with Livereload 重新加载,或者自己