Duplicates in R with different measures

Problem description

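I'm working with a messy census dataset in which the variable column contains duplicate entries (high school, university), but these duplicates actually measure slightly different things. For each duplicated entry, the higher number in the count column is the total for people aged 15 and over (highest_educ_15_over). The lower number is always the highest education for people aged 24-65 (highest_educ_24_65). Here is a sample of the data.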

library(tibble)

data <- tribble(
  ~town, ~variable, ~count,
  "A", "highest_educ_15_over", 100,
  "A", "high school", 80,
  "A", "university", 20,
  "A", "highest_educ_24_65", 50,
  "A", "high school", 40,
  "A", "university", 10,
  "B", "highest_educ_15_over", 1000,
  "B", "high school", 800,
  "B", "university", 200,
  "B", "highest_educ_24_65", 500,
  "B", "high school", 400,
  "B", "university", 100)

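I can't simply filter on high school or university, because that returns two values per town. Ultimately I would like the dataset to look like this: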

tribble(
  ~town, ~highest_educ_15_over, ~highschool, ~university,
  "A", "100","80","20",
  "B", "1000","800","200"
  )

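That is, for each town I automatically take the highest high school and university values together with the corresponding total denominator.

Any ideas on how to solve this?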

Tags: r, dplyr, tidyr, data-cleaning

Solution


We can number the repeated entries within each group and then reshape the data to "wide" format with pivot_wider

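library(dplyr)
library(tidyr)

data %>%
  # number the repeated variable entries within each town (1 = first, 2 = second)
  group_by(town, variable) %>%
  mutate(rn = row_number()) %>%
  # one column per variable, one row per town and occurrence number
  pivot_wider(names_from = variable, values_from = count) %>%
  # keep only the rows in which every variable column is filled in
  filter_at(3:ncol(.), all_vars(!is.na(.))) %>%
  select(-rn)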
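
The pipeline above depends on row order and keeps the highest_educ_24_65 column in its result. As an alternative, here is a minimal sketch that follows the rule stated in the question directly: take the highest count per town and variable. It is not the answerer's method. It assumes, as the question says, that the 15-and-over count is never lower than the 24-65 count, and it drops the highest_educ_24_65 denominator because the desired output does not include it. The name result is only illustrative.

```r
library(tibble)
library(dplyr)
library(tidyr)

# Same example data as in the question
data <- tribble(
  ~town, ~variable, ~count,
  "A", "highest_educ_15_over", 100,
  "A", "high school", 80,
  "A", "university", 20,
  "A", "highest_educ_24_65", 50,
  "A", "high school", 40,
  "A", "university", 10,
  "B", "highest_educ_15_over", 1000,
  "B", "high school", 800,
  "B", "university", 200,
  "B", "highest_educ_24_65", 500,
  "B", "high school", 400,
  "B", "university", 100)

result <- data %>%
  # drop the 24-65 denominator, which the desired output does not include
  filter(variable != "highest_educ_24_65") %>%
  # keep the largest count for each town/variable pair
  # (assumed to be the 15-and-over figure)
  group_by(town, variable) %>%
  summarise(count = max(count), .groups = "drop") %>%
  # one column per variable, one row per town
  pivot_wider(names_from = variable, values_from = count) %>%
  # put the total denominator right after town
  relocate(highest_educ_15_over, .after = town)

result
```

Because this version selects by value rather than by position, it should not matter in which order the two sets of counts appear within a town. It would give the wrong result, however, if a 24-65 count ever exceeded the matching 15-and-over count.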
