r - Identify interrupted observations
Problem description
I would like to identify missing observations that suggest cleaning/data errors.
My dataframe consists of many accounts over many years. Here are the rules it follows:
- Accounts may be created or terminated. In these cases, the amount is either $0 or NA. Such observations are (probably) not the result of bad data.
- Accounts interrupted by an NA or $0 are probably the result of bad data or cleaning errors.
In the data below, Accounts A-E show an amount for years 2001-2004.
df <- tribble(
~account, ~"2001", ~"2002", ~"2003", ~"2004",
"Account.A", 100, 90, 87, 80, #<Good
"Account.B", 0, 20, 30, 33, #<Good
"Account.C", 50, 55, 0, 0, #<Good
"Account.D", 200, 210, NA, 210, #<Bad
"Account.E", 150, 0, 212, 211) #<Bad
Accounts A, B, and C show good data:
- Account A shows uninterrupted data
- Account B shows an account that began in 2002.
- Account C shows an account that expired in 2003 and remained $0 ever after.
Accounts D and E show bad data:
- Account D shows an account interrupted in 2003
- Account E shows an account interrupted in 2002
My goal is to identify interrupted lines (D,E) and tag them.
I would like a solution that could be generalized across many years and thousands of accounts.
Solution
Here is a tidyverse option. It may not be the prettiest, but it should do the job:
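library(tidyverse)
df %>%
  # Reshape wide to long: one row per account-year
  gather(year, value, `2001`:`2004`) %>%
  group_by(account) %>%
  # Label each year as the first, last, or a middle entry of the account's history
  mutate(order = if_else(year == min(year), 'first',
                 if_else(year == max(year), 'last', 'mid'))) %>%
  # Treat NA the same as $0
  mutate(value = replace(value, is.na(value), 0)) %>%
  # start0/end0: row is at or after the first, and at or before the last, non-zero amount
  mutate(start0 = row_number() >= min(row_number()[value != 0]),
         end0 = row_number() <= max(row_number()[value != 0])) %>%
  # Flag zeros that fall between non-zero amounts
  mutate(check = order == 'mid' & value == 0 & start0 & end0) %>%
  filter(check)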
# A tibble: 2 x 7
# Groups: account [2]
account year value order start0 end0 check
<chr> <chr> <dbl> <chr> <lgl> <lgl> <lgl>
1 Account.E 2002 0 mid TRUE TRUE TRUE
2 Account.D 2003 0 mid TRUE TRUE TRUE
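Here is an explanation:
- Convert the data from wide to long.
- Within each account, determine whether an entry is the first, a middle, or the last entry in its history.
- Because zeros and NAs are treated the same way, replace NA with zero to make them easier to work with. They could instead be left as-is, with the code updated to handle them.
- Add TRUE/FALSE columns that tell whether a run of 0 values extends from the beginning or the end of the account history.
- If an entry is 0, is not the first or last entry, and is not part of a run of 0s extending from the beginning or end of the account history, flag it as TRUE for checking.
- Finally, a filter keeps only the entries that need checking.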
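The steps above boil down to one rule: an account is interrupted when a zero or NA sits strictly between its first and last non-zero amounts. As a rough sketch of that rule applied directly to the wide data, here is a base R version that tags whole accounts rather than individual years. The helper `is_interrupted()` and the `interrupted` column are hypothetical names, not part of the original answer, and the sample data is rebuilt with `data.frame()` only to keep the block self-contained. Accounts that are never active are assumed to count as not interrupted.

```r
# Hypothetical helper (not from the original answer): TRUE when a zero/NA
# sits strictly between the first and last non-zero amounts.
is_interrupted <- function(x) {
  x[is.na(x)] <- 0                        # treat NA like $0, as the answer does
  active <- which(x != 0)                 # positions holding a real amount
  if (length(active) == 0) return(FALSE)  # assumption: never-active accounts are fine
  any(x[min(active):max(active)] == 0)    # any gap inside the active span?
}

# Same sample data as in the question, rebuilt in base R
df <- data.frame(
  account = c("Account.A", "Account.B", "Account.C", "Account.D", "Account.E"),
  `2001` = c(100, 0, 50, 200, 150),
  `2002` = c(90, 20, 55, 210, 0),
  `2003` = c(87, 30, 0, NA, 212),
  `2004` = c(80, 33, 0, 210, 211),
  check.names = FALSE
)

# Every column except `account` is assumed to be a year column
year_cols <- setdiff(names(df), "account")
df$interrupted <- unname(apply(df[year_cols], 1, is_interrupted))

df$account[df$interrupted]
```

Because the year columns are picked up by name rather than hard-coded, the same call should carry over to more years and more accounts, provided the columns are in chronological order.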