R ggplot geom_area: data may be overlapping unintentionally

Problem description

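I'm working with this week's Tidy Tuesday data and ran into my geom_area doing what I think is overlapping the data. If I facet_wrap the data, there are no missing values in any year, but as soon as I make an area plot and fill it by category, the health/education data seems to disappear.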

Below are example plots showing what I mean.

library(tidyverse)

chain_investment <- readr::read_csv('https://raw.githubusercontent.com/rfordatascience/tidytuesday/master/data/2021/2021-08-10/chain_investment.csv')

plottable_investment <- chain_investment %>% 
  filter(group_num == c(12,17)) %>% 
  mutate(small_cat = case_when(
    group_num == 12 ~ "Transportation",
    group_num == 17 ~ "Education/Health"
  )) %>% 
  group_by(small_cat, year, category) %>% 
  summarise(sum(gross_inv_chain)) %>% 
  ungroup %>% 
  rename(gross_inv_chain = 4)

# This plot shows that there is NO missing education, health, or highway data
# Goal is to combine the data on one plot and fill based on the category
plottable_investment %>% 
  ggplot(aes(year, gross_inv_chain)) +
    geom_area() +
    facet_wrap(~category)

# Some of the data in the health category gets lost? disappears? unknown
plottable_investment %>% 
  ggplot(aes(year, gross_inv_chain, fill = category)) +
    geom_area()

# Something is going wrong here?
plottable_investment %>% 
  filter(category %in% c("Education","Health")) %>% 
  ggplot(aes(year, gross_inv_chain, fill = category)) +
    geom_area(position = "identity")
  
# The data is definitely there
plottable_investment %>% 
  filter(category %in% c("Education","Health")) %>% 
  ggplot(aes(year, gross_inv_chain)) +
    geom_area() +
    facet_wrap(~category)

Tags: r, ggplot2, geom-area

Solution


The problem is that you filter the data with == instead of %in%.

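With ==, R recycles the vector c(12,17) along the rows, so each row is compared against either 12 or 17 in alternation rather than against both values. In your case this has a subtle side effect: for some categories (e.g. Health) the filtered data contains only the observations for even years, while for others (e.g. Education) we end up with only the observations for odd years. As a result, you end up with "two" area charts that overlap each other.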
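As a minimal sketch of the difference, the base R snippet below compares the two operators on a toy vector. The values in `group_num` here are hypothetical stand-ins for the real column, chosen only to make the recycling visible.

```r
# Toy vector standing in for the group_num column (hypothetical values)
group_num <- c(12, 12, 17, 17, 12, 17)

# `==` recycles c(12, 17) along the vector: element 1 is compared with 12,
# element 2 with 17, element 3 with 12, and so on
eq_result <- group_num == c(12, 17)

# `%in%` tests every element for membership in the set {12, 17}
in_result <- group_num %in% c(12, 17)

print(eq_result)
#> [1]  TRUE FALSE FALSE  TRUE  TRUE  TRUE
print(in_result)
#> [1] TRUE TRUE TRUE TRUE TRUE TRUE
```

Whether a given row survives the == filter depends on its position, not only on its value, which is why valid rows are dropped silently.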

This is easy to see by switching to geom_col, which gives you what looks like a "dodged" bar chart, because there is only one category per year.

plottable_investment %>% 
  filter(category %in% c("Education","Health")) %>% 
  ggplot(aes(year, gross_inv_chain, fill = category)) +
  geom_col()
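The one-category-per-year pattern can also be reproduced without plotting. The sketch below uses a toy data frame, not the real Tidy Tuesday data: the column values and the row ordering (all years of one category, then all years of the next) are assumptions made for illustration.

```r
# Toy data: two categories in the same group, three years each
# (hypothetical values; rows ordered by category, then year)
toy <- data.frame(
  category  = rep(c("Education", "Health"), each = 3),
  year      = rep(2001:2003, times = 2),
  group_num = 17
)

# Filtering with `==` keeps only rows whose position lines up with 17
kept_eq <- toy[toy$group_num == c(12, 17), ]

# Filtering with `%in%` keeps every row whose value is 12 or 17
kept_in <- toy[toy$group_num %in% c(12, 17), ]

# Cross-tabulate year by category to see which combinations survive
print(table(kept_eq$year, kept_eq$category))
print(table(kept_in$year, kept_in$category))
```

In this toy case the == filter leaves exactly one category per year, while %in% keeps both categories in every year. Because each category has an odd number of years, the alternation flips from one category to the next.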

By contrast, using %in% gives the desired stacked area chart, containing all observations for each category:

plottable_investment1 <- chain_investment %>% 
  filter(group_num %in% c(12,17)) %>% 
  mutate(small_cat = case_when(
    group_num == 12 ~ "Transportation",
    group_num == 17 ~ "Education/Health"
  )) %>% 
  group_by(small_cat, year, category) %>% 
  summarise(gross_inv_chain = sum(gross_inv_chain)) %>% 
  ungroup()
#> `summarise()` has grouped output by 'small_cat', 'year'. You can override using the `.groups` argument.

plottable_investment1 %>% 
  filter(category %in% c("Education","Health")) %>% 
  ggplot(aes(year, gross_inv_chain, fill = category)) +
  geom_area()

