Why do I get different frequencies depending on when I apply group_by() and distinct() in R?

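Problem description

I am new to R and the tidyverse, and I cannot make sense of the following:

Why do I get different frequencies depending on when I apply group_by() and distinct() to my data?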

[Plot: distinct-user frequencies depending on when distinct() and group_by() are applied]

output_df_1 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  group_by(created_at) %>%
  count(created_at)

output_df_2 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  group_by(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  count(created_at)

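library(reshape2)  # provides melt()
library(ggplot2)

# Join both results on the hour, reshape to long format, and plot one line per variant
full_join(output_df_1, output_df_2, by = "created_at") %>%
  rename(output_df_1 = n.x,
         output_df_2 = n.y) %>%
  melt(id = "created_at") %>%
  ggplot() +
  geom_line(aes(x = created_at, y = value, colour = variable),
            linetype = "solid",
            size = 0.75) +
  scale_colour_manual(values = c("#005293", "#E37222"))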

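Context

input_df is a data frame containing observations of tweets, each with a timestamp and an author_id. I want to produce a plot in which variable1 is the number of tweets per hour (this works fine) and variable2 is the number of distinct users per hour. I am not sure which of the two lines in the plot above correctly shows the distinct users per hour.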

Tags: r, dataframe, dplyr, tidyverse

Solution


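  1. This is because in the first snippet you call distinct() before group_by() and count(). There, distinct() runs on ungrouped data and keeps each author_id only once across the whole data frame. In the second snippet distinct() runs after group_by(), so it keeps each author_id once per created_at hour.

  2. Moreover, the explicit group_by() before count() is not needed, because count() groups automatically: count(cyl) is the same as group_by(cyl) %>% summarise(freq = n()), except that count() names the result column n by default.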
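Both points can be checked on a small made-up data frame. The sketch below is only an illustration: `toy`, `hour`, and `user` are hypothetical names, not part of the question's data.

```r
library(dplyr)

# Hypothetical toy data: user "a" posts in both hours.
toy <- tibble(
  hour = c(1, 1, 1, 2, 2),
  user = c("a", "a", "b", "a", "c")
)

# distinct() on ungrouped data: each user is kept once overall,
# at the first row where they appear.
ungrouped <- toy %>%
  distinct(user, .keep_all = TRUE) %>%
  count(hour)

# distinct() on grouped data: each user is kept once per hour,
# because the grouping column takes part in the comparison.
grouped <- toy %>%
  group_by(hour) %>%
  distinct(user, .keep_all = TRUE) %>%
  count(hour)

# count(hour) written out by hand as group_by() + summarise().
by_hand <- toy %>%
  distinct(user, .keep_all = TRUE) %>%
  group_by(hour) %>%
  summarise(n = n())

ungrouped$n  # 2 1
grouped$n    # 2 2
by_hand$n    # 2 1, same as ungrouped
```

User "a" is counted only in hour 1 by the ungrouped variant, but in both hours by the grouped variant, which is where the two curves diverge.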

Here is an example:

mtcars %>% 
  distinct(am, .keep_all=TRUE) %>%
  count(cyl)

mtcars %>% 
  distinct(am, .keep_all=TRUE) %>% 
  count(cyl)

This gives:

> mtcars %>% 
+   distinct(am, .keep_all=TRUE) %>%
+   count(cyl)
  cyl n
1   6 2
> mtcars %>% 
+   distinct(am, .keep_all=TRUE) %>% 
+   count(cyl)
  cyl n
1   6 2

If you change the order of distinct() as follows:

mtcars %>% 
  distinct(am, .keep_all=TRUE) %>% 
  count(cyl)

mtcars %>% 
  count(cyl) %>% 
  distinct(am, .keep_all=TRUE)

You get:

> mtcars %>% 
+   distinct(am, .keep_all=TRUE) %>% 
+   count(cyl)
  cyl n
1   6 2
> 
> mtcars %>% 
+   count(cyl) %>% 
+   distinct(am, .keep_all=TRUE)
Error: `distinct()` must use existing variables.
x `am` not found in `.data`.
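The error arises because count() returns only the counted column(s) plus n, so am no longer exists by the time distinct() runs. A minimal check, using the same built-in mtcars data (`counted` is just an illustrative name):

```r
library(dplyr)

# count() collapses the data to one row per cyl value.
counted <- mtcars %>% count(cyl)

names(counted)            # "cyl" "n"
"am" %in% names(counted)  # FALSE: the column distinct() asks for is gone
```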

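In your example, the following code should give the same result for output_df_1 and output_df_2, since neither pipeline groups the data before distinct():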

output_df_1 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  count(created_at)



output_df_2 <- input_df %>%
  mutate(created_at = lubridate::floor_date(created_at, unit = "hours")) %>%
  select(created_at, author_id) %>%
  arrange(created_at) %>%
  distinct(author_id, .keep_all = T) %>%
  count(created_at)
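Note that with distinct() applied to ungrouped data, each author is counted only in the first hour in which they appear. If the goal is the number of distinct users in every hour, as described in the question's context, one possible sketch is to summarise per hour with n() and n_distinct(). The `input_df` below is a hypothetical stand-in with made-up timestamps and author IDs, and the column names `tweets` and `distinct_users` are illustrative.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for input_df: five tweets by three authors.
input_df <- tibble(
  created_at = as.POSIXct(
    c("2021-01-01 10:05:00", "2021-01-01 10:20:00", "2021-01-01 10:40:00",
      "2021-01-01 11:10:00", "2021-01-01 11:30:00"),
    tz = "UTC"
  ),
  author_id = c("a", "a", "b", "a", "c")
)

per_hour <- input_df %>%
  mutate(created_at = floor_date(created_at, unit = "hour")) %>%
  group_by(created_at) %>%
  summarise(
    tweets = n(),                           # every tweet in the hour
    distinct_users = n_distinct(author_id)  # each author once per hour
  )

per_hour$tweets          # 3 2
per_hour$distinct_users  # 2 2
```

Author "a" tweets in both hours and is counted in both, which matches the grouped variant (output_df_2 in the question) rather than the ungrouped one.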
