Group_by and summarize behave strangely and do not provide expected results

Problem description

Although I have used dplyr before, I've run into a problem that I don't fully understand.

The part of a research data set I am working with has more than 2,500 rows. Each row is a respondent, and the respondents come from 515 houses in a study.

I want to summarize the number of years each respondent has spent in school (column [, 7]), grouped by house id (column [, 26]). The overall average of school years is 3.65 (the sample was taken in Uganda).

Now, when I run the following code:

library(dplyr)
df_house %>%
  dplyr::group_by(House = df_house[, 26]) %>%
  dplyr::summarise(Avg_school = mean(df_house[,7], na.rm = TRUE))

I get the following result:

# A tibble: 510 x 2
   House Avg_school
   <dbl>      <dbl>
 1     1       3.65
 2     2       3.65
 3     3       3.65
 4     4       3.65
 5     5       3.65
 6     6       3.65
 7     7       3.65
 8     8       3.65
 9     9       3.65
10    10       3.65
# ... with 500 more rows

I have two issues with this. First, summarize obviously does not compute the mean within each house_id; every house gets the overall mean. Second, I only get 510 groups instead of the expected 515 different houses.

I have checked both columns with class() and typeof(): both are numeric and double.

Does anybody have any idea why group_by and summarize behave that way?

Tags: r, dplyr, group-by, summarize

Solution


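The right answer was provided by @Ronak Shah. It was indeed the use of column numbers instead of column names that prevented it from working properly. Inside summarise(), the expression df_house[, 7] refers to the whole column of the original data frame rather than to the rows of the current group, so every house gets the overall mean of 3.65. Referring to the columns by their bare names lets dplyr evaluate the mean per group.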
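For illustration, here is a minimal sketch of the difference on a toy data frame. The names df_toy, house_id and school_years are made up for this example, since the real column names of df_house are not shown in the question.

```r
library(dplyr)

# Toy stand-in for df_house; all names here are hypothetical.
df_toy <- data.frame(
  house_id     = c(1, 1, 2, 2, 3),
  school_years = c(2, 4, 6, NA, 5)
)

# Same pattern as in the question: df_toy[, 2] is the full column of the
# original data frame, so each group receives the overall mean.
by_position <- df_toy %>%
  group_by(House = df_toy[, 1]) %>%
  summarise(Avg_school = mean(df_toy[, 2], na.rm = TRUE))

# Bare column names are looked up inside the current group,
# so the mean is computed separately for each house.
by_name <- df_toy %>%
  group_by(house_id) %>%
  summarise(Avg_school = mean(school_years, na.rm = TRUE))

print(by_position)  # Avg_school is 4.25 in every row
print(by_name)      # Avg_school is 3, 6 and 5
```

If only the positions are known, names(df_house)[c(7, 26)] should show the corresponding column names to use in group_by() and summarise().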

