首页 > 解决方案 > Identify interrupted observations

问题描述

I would like to identify missing observations that suggest cleaning/data errors.

My dataframe consists of many accounts over many years. Here are the rules it follows:

In the data below, Accounts A-E show an amount for years 2001-2004.

df <- tribble(
       ~account,    ~"2001",     ~"2002",    ~"2003",  ~"2004",
       "Account.A",     100,          90,         87,       80,  #<Good   
       "Account.B",       0,          20,         30,       33,  #<Good
       "Account.C",      50,          55,          0,       0,   #<Good
       "Account.D",     200,         210,         NA,       210, #<Bad
       "Account.E",     150,           0,        212,       211) #<Bad

Account A,B,C show good data:

Account D and E show bad data:

My goal is to identify interrupted lines (D,E) and tag them.

I would like a solution that could be generalized across many years and thousands of accounts.

标签: rdplyr

解决方案


这是一个tidyverse可能不是最漂亮的选项,但应该可以解决问题:

library(tidyverse)
df %>% 
  gather(year, value, `2001`:`2004`) %>% 
  group_by(account) %>% 
  mutate(order = if_else(year == min(year), 'first', 
                         if_else(year == max(year), 'last', 'mid'))) %>% 
  mutate(value = replace(value, is.na(value), 0)) %>% 
  mutate(start0 = row_number() >= min(row_number()[value != 0]),
         end0 = row_number() <= max(row_number()[value != 0])) %>% 
  mutate(check = if_else(order == 'mid' & value == 0 & start0 == TRUE & end0 == TRUE, TRUE, FALSE)) %>% 
  filter(check == TRUE)

# A tibble: 2 x 7
# Groups:   account [2]
  account   year  value order start0 end0  check
  <chr>     <chr> <dbl> <chr> <lgl>  <lgl> <lgl>
1 Account.E 2002      0 mid   TRUE   TRUE  TRUE 
2 Account.D 2003      0 mid   TRUE   TRUE  TRUE 

这是一个解释:

  • 将数据从宽转换为长。
  • 按组确定帐户条目是其历史记录中的第一个、中间还是最后一个条目。
  • 因为零和 NA 的处理方式相同,所以将 NA 替换为零以使其更易于使用,但它们可以保持原样并更新代码以处理它们。
  • 添加 TRUE/FALSE 列以判断 0 值序列是从帐户历史记录的开头还是结尾运行。
  • 如果帐户为 0,不是第一个或最后一个条目,也不是从帐户历史的开头或结尾运行的 0 序列的一部分,则将该帐户标记为 TRUE 以进行检查。
  • 最后,有一个过滤器只筛选需要检查的帐户。

推荐阅读