Check whether a value exists within specific groups of rows in a data frame

Problem description

I have this data frame (df):

structure(list(from = c("(192) 242-2345", NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "(832) 345-3168", 
NA, NA), to = c("(900) 301-3451", NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, "(900) 234-1231", 
NA, NA), action_result = c("Voicemail", "No Answer", "No Answer", 
"No Answer", "No Answer", "No Answer", "No Answer", "No Answer", 
"No Answer", "IP Phone Offline", "No Answer", "No Answer", "Busy", 
"Busy", "No Answer", "No Answer", "No Answer", "No Answer", "No Answer", 
"No Answer", "No Answer", "Busy", "IP Phone Offline", "Busy", 
"No Answer", "No Answer", "No Answer", "No Answer", "No Answer", 
"IP Phone Offline", "IP Phone Offline", "No Answer", "No Answer", 
"IP Phone Offline", "No Answer", "No Answer", "Busy", "Missed", 
"Hang Up", "Hang Up")), class = c("spec_tbl_df", "tbl_df", "tbl", 
"data.frame"), row.names = c(NA, -40L))

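The first row contains the phone numbers that placed and received the call. The rows after it have NA in the from and to columns. So rows 1 to 37 are considered one group, and rows 38 to 40 are considered a second group. I want to check whether each group contains Call Connected in the action_result column.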

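I have tried group_by on the from and to values, but the full dataset I am working with has repeated from and to value pairs, so that does not work. I would like a dplyr solution that checks whether the first 37 rows contain Call Connected and outputs a data frame with the columns: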

from, to, CallConnected, where CallConnected is 1 for yes and 0 for no.

So, looking at df, the resulting dataset would have 2 rows:

from          | to           | CallConnected
--------------------------------------------------
(192) 242-2345|(900) 301-3451| 0
(832) 345-3168|(900) 234-1231| 0

Tags: r, dataframe, dplyr

Solution


Here is a solution using the tidyverse package. Alternatively, you can load just the dplyr and tidyr packages.

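The idea is to fill the NA values in the from and to columns with the nearest preceding non-NA value. After that, use action_result == "CallConnected" to flag the rows that match "CallConnected", group by from and to, and use summarize with sum to count the matching records. Note that the comparison is an exact string match and the result is a count rather than a strict 0/1 flag. The question writes the value as Call Connected (with a space), so adjust the literal to whatever actually appears in your data.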

library(tidyverse)

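df2 <- df %>%
  # Carry the last non-NA phone numbers down into the NA rows
  fill(from) %>%
  fill(to) %>%
  # TRUE where the row's result matches the target string exactly
  mutate(CallConnected = action_result == "CallConnected") %>%
  # One output row per from/to pair, counting the matching rows
  group_by(from, to) %>%
  summarize(CallConnected = sum(CallConnected)) %>%
  ungroup()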
df2
# # A tibble: 2 x 3
#   from           to             CallConnected
#   <chr>          <chr>                  <int>
# 1 (192) 242-2345 (900) 301-3451             0
# 2 (832) 345-3168 (900) 234-1231             0
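
The question asks for a 1/0 flag, while sum returns the number of matching rows. As a minimal sketch of one way to get a strict flag, any() can be wrapped in as.integer(). The toy data frame and the "Call Connected" literal below are assumptions made for illustration; they are not taken from the real dataset.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data in the same shape as df: the number pair appears on
# the first row of each call, and the follow-up rows hold NA.
toy <- tibble(
  from = c("A", NA, NA, "B", NA),
  to   = c("X", NA, NA, "Y", NA),
  action_result = c("No Answer", "Call Connected", "Call Connected",
                    "Busy", "Hang Up")
)

flagged <- toy %>%
  # Fill both columns downward in one call
  fill(from, to) %>%
  group_by(from, to) %>%
  # 1 if at least one row in the group matches, otherwise 0
  summarize(
    CallConnected = as.integer(any(action_result == "Call Connected")),
    .groups = "drop"
  )
flagged
```

Here the first call has two matching rows but is still reported as 1, which is the difference from the sum-based version.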

Update

If duplicated from and to pairs are a problem, we can use rleid from the data.table package to create an ID after the fill step. Here is an example.

library(tidyverse)
library(data.table)

# Create an example with duplication
df_dup <- bind_rows(df, df %>% slice(1:5))

df_dup2 <- df_dup %>%
  fill(from) %>%
  fill(to) %>%
  mutate(ID = rleid(from, to)) %>%
  mutate(CallConnected = action_result == "CallConnected") %>%
  group_by(ID, from, to) %>%
  summarize(CallConnected = sum(CallConnected)) %>%
  ungroup()
df_dup2
# # A tibble: 3 x 4
#      ID from           to             CallConnected
#   <int> <chr>          <chr>                  <int>
# 1     1 (192) 242-2345 (900) 301-3451             0
# 2     2 (832) 345-3168 (900) 234-1231             0
# 3     3 (192) 242-2345 (900) 301-3451             0
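
One caveat with rleid on the filled values: if two consecutive calls share the same from and to pair, they fall into a single run and are merged. A possible alternative, sketched here with dplyr and tidyr only, is to build the ID before filling by counting the non-NA from values with cumsum. This assumes every call starts with a row whose from is not NA; the toy data and the "Call Connected" literal are hypothetical.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: two back-to-back calls with the same number pair,
# followed by a call with a different pair.
toy_dup <- tibble(
  from = c("A", NA, "A", NA, NA, "B"),
  to   = c("X", NA, "X", NA, NA, "Y"),
  action_result = c("Busy", "No Answer", "Call Connected",
                    "No Answer", "Hang Up", "Missed")
)

result <- toy_dup %>%
  # Each non-NA `from` marks the start of a new call, so the running count
  # of non-NA values serves as a call ID. This must happen before fill().
  mutate(ID = cumsum(!is.na(from))) %>%
  fill(from, to) %>%
  group_by(ID, from, to) %>%
  summarize(
    CallConnected = as.integer(any(action_result == "Call Connected")),
    .groups = "drop"
  )
result
```

With this toy input the two adjacent A/X calls stay separate (IDs 1 and 2), whereas rleid(from, to) applied after filling would treat them as one run.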
