dplyr: how to group observations by a code, count them into a summary variable, and add a new variable based on a name within each group

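Problem description

I have multiple addresses that I want to group together and count. However, they are formatted inconsistently. I have already geocoded the addresses and plan to group them by their coordinates, but when grouping I want to create a new variable that keeps at least one version of the address (or several variables in wide format, one per address in the group, from which I would then pick one variable and keep a single address per group).

Here is some sample data.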

# Note: address has 12 entries, while long and lat below have 14 each.
address = c("big fake plaza, 12 this street,district, city",
            "Green mansion, district, city",
            "Block 7 of orange building  district, city",
            "98 main street block a blue plaza, city",
            "blue red mansion, 46 pearl street, city",
            "12 this street, big fake plaza, district, city",
            "Green mansion, district, city",
            "orange building Block 7 district, city",
            "block a 98 main street blue plaza, city",
            "blue red mansion, 46 pearl street, city",  # this comma was missing
            "big fake plaza, district, city",
            "Green mansion,city")

long =c("112.8838",  "111.9154", "114.9318",  "116.9318", "112.9320","111.9324",
"112.8838",  "111.9154", "114.9318",  "116.9318", "112.9320","111.9324",
"112.8838",  "111.9154")

lat = c("21.22177", "12.22177", "26.27743", "23.17651", "23.24769", "23.24771",
"21.22177", "12.22177", "26.27743", "23.17651", "23.24769", "23.24771",
"21.22177", "12.22177")

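# cbind() alone returns a character matrix; dplyr verbs need a data frame.
# The shorter address vector is recycled (with a warning) to 14 rows.
df <- as.data.frame(cbind(address, lat, long))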

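I have the grouping and counting working, but I don't know how to mutate and create a name variable from just one of the addresses in each group.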

df_agg<- df %>% 
  group_by(long,lat) %>%
  summarise(count = n()) %>%
  mutate(bldg = ifelse(address[address==1],address, NA )) ???????

I would like the result to look like this:

  long  lat  count    bldg
   <dbl> <dbl> <int>   <chr>
 1  112.  21.2     3    "big fake plaza, 12 this street,district, city"
 2  114.  12.2     3    "Green mansion, district, city"
 3  116.  26.3     2    "98 main street block a blue plaza, city"
 4  112.  23.5     2    "Block 7 of orange building  district, city"
 5  111.  23.5     2    "blue red mansion, 46 pearl street, city"

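Obviously, we cannot group by the address itself, because the strings differ. If there is a better approach, I would be happy to hear any other suggestions. Creating new variables bldg1, bldg2, and so on, one for each distinct building name in each group, would be useful, but it is not a priority.

Thanks in advance.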

Tags: r, group-by, dplyr, summarize

Solution

You can take the first address for each location.

library(dplyr)
library(tidyr)

df %>% 
  group_by(long,lat) %>%
  summarise(count = n(), 
            address = first(address)) %>%
  ungroup

#  long     lat      count address                                       
#  <chr>    <chr>    <int> <chr>                                         
#1 111.9154 12.22177     3 Green mansion, district, city                 
#2 111.9324 23.24771     2 12 this street, big fake plaza, district, city
#3 112.8838 21.22177     3 big fake plaza, 12 this street,district, city 
#4 112.9320 23.24769     2 blue red mansion, 46 pearl street, city       
#5 114.9318 26.27743     2 Block 7 of orange building  district, city    
#6 116.9318 23.17651     2 98 main street block a blue plaza, city      
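
As a variation on the summary above, the sketch below keeps the first address and also collapses every distinct spelling in the group into a single string, so no variant is lost. The `toy` data frame and the `bldg` and `all_bldg` column names are assumptions made up for illustration; they are not from the question's data. `.groups = "drop"` (available in dplyr 1.0.0 and later) replaces the separate `ungroup` step.

```r
library(dplyr)

# Hypothetical toy data: two locations, three address spellings
toy <- tibble(
  address = c("Green mansion, district, city",
              "Green mansion,city",
              "blue red mansion, 46 pearl street, city"),
  long = c("111.9154", "111.9154", "112.9320"),
  lat  = c("12.22177", "12.22177", "23.24769")
)

toy_agg <- toy %>%
  group_by(long, lat) %>%
  summarise(
    count    = n(),                                     # rows per location
    bldg     = first(address),                          # one representative address
    all_bldg = paste(unique(address), collapse = " | "),  # every distinct spelling
    .groups  = "drop"                                   # return an ungrouped tibble
  )

print(toy_agg)
```

Collapsing into one string keeps the result at one row per location without having to know in advance how many spellings a group contains.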

If you want separate columns such as bldg1, bldg2, and so on, reshape the data to wide format.

df %>% 
  group_by(long,lat) %>%
  mutate(row = paste0('bldg', row_number()), 
         count = n()) %>%
  ungroup %>%
  pivot_wider(names_from = row, values_from = address)
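
The question asks for one column per distinct building name, whereas the pipeline above creates one column per row, so repeated spellings fill several columns. A possible adjustment, sketched here on made-up data (the `toy` data frame is hypothetical), is to record the group size first with `add_count()` and then drop duplicate addresses with `distinct()` before numbering the rows.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the first location has three rows but two spellings
toy <- tibble(
  address = c("Green mansion, district, city",
              "Green mansion,city",
              "Green mansion, district, city",
              "blue red mansion, 46 pearl street, city"),
  long = c("111.9154", "111.9154", "111.9154", "112.9320"),
  lat  = c("12.22177", "12.22177", "12.22177", "23.24769")
)

toy_wide <- toy %>%
  add_count(long, lat, name = "count") %>%             # group size before de-duplicating
  distinct(long, lat, address, .keep_all = TRUE) %>%   # keep each spelling once per location
  group_by(long, lat) %>%
  mutate(row = paste0("bldg", row_number())) %>%       # bldg1, bldg2, ... within each location
  ungroup() %>%
  pivot_wider(names_from = row, values_from = address)

print(toy_wide)
```

Locations with fewer distinct spellings than the widest group get `NA` in the unused `bldg` columns.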
