How to compare datasets with ggplot2 geom_density()

Problem description

This is an extension of a question I asked earlier:
How to extract the density value from ggplot in r

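This fruit dataset is actually the data for country A, and I now have another dataset for country B. I want to compare their values. However, the density plots (y-axis) for the fruit Apple differ between country A and country B: the highest density for country A is about 0.8, while the highest density for country B is about 0.4.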

Example for country A: [image]

Q. Country B has similar curves, but the highest density value on its y-axis is only 0.4. How can I compare them?

Code for a minimal example:

library(ggplot2) 
set.seed(1234) 
df = data.frame(
    fruits = factor(rep(c("Orange", "Apple", "Pears", "Banana"), each = 200)),
    weight = round(c(rnorm(200, mean = 55, sd=5),
                     rnorm(200, mean=65, sd=5),
                     rnorm(200, mean=70, sd=5),
                     rnorm(200, mean=75, sd=5)))
) 

dim(df) #[1] 800   2
    
ggplot(df, aes(x = weight)) + 
  geom_density() + 
  facet_grid(fruits ~ ., scales = "free", space = "free")
    
g = ggplot(df, aes(x = weight)) + 
  geom_density() + 
  facet_grid(fruits ~ ., scales = "free", space = "free")
    
p = ggplot_build(g)
    
sp = split(p$data[[1]][c("x", "density")], p$data[[1]]$PANEL)
apple_df = sp[[1]]
    
sum(apple_df$density) # this equals 10.43877, but I want it to be 1

Tags: r, ggplot2, probability-density

Solution


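Suppose you have two data frames for two different countries, df_c1 and df_c2. The idea is to combine the two data frames and add a column that distinguishes the countries.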

library(dplyr)
library(ggplot2)

df_c1 = data.frame(
  fruits = factor(rep(c("Orange", "Apple", "Pears", "Banana"), each = 200)),   
  weight = round(c(rnorm(200, mean = 55, sd=5),
                   rnorm(200, mean=65, sd=5), 
                   rnorm(200, mean=70, sd=5), 
                   rnorm(200, mean=75, sd=5)))
)

df_c2 = data.frame(
  fruits = factor(rep(c("Orange", "Apple", "Pears", "Banana"), each = 200)),   
  weight = round(c(rnorm(200, mean = 20, sd=3),
                   rnorm(200, mean=35, sd=6), 
                   rnorm(200, mean=40, sd=2), 
                   rnorm(200, mean=15, sd=4)))
)


df <- rbind(
  df_c1 %>% mutate(country = "country 1"), 
  df_c2 %>% mutate(country = "country 2")
)


df %>% 
  ggplot() + 
  geom_density(aes(x = weight, color = country)) +
  facet_grid(fruits ~ ., scales = "free", space = "free")
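If the differing peak heights make the overlaid curves hard to read, one option is to plot the `scaled` variable that `stat_density` computes, which rescales each curve so that its maximum is 1. The sketch below is only an illustration: `df_demo` and `p_scaled` are hypothetical names, the data is simulated, and `after_stat()` assumes ggplot2 3.3.0 or later. The scaled curves are convenient for comparing shape and location, but they no longer integrate to 1.

```r
library(ggplot2)

set.seed(1234)  # arbitrary seed, used only for reproducibility

# Hypothetical stand-in data: one fruit measured in two countries
df_demo <- data.frame(
  country = rep(c("country 1", "country 2"), each = 200),
  weight  = c(rnorm(200, mean = 65, sd = 5),
              rnorm(200, mean = 35, sd = 6))
)

# after_stat(scaled) is the density estimate divided by its own maximum,
# so every curve peaks at exactly 1
p_scaled <- ggplot(df_demo) +
  geom_density(aes(x = weight, y = after_stat(scaled), color = country)) +
  labs(y = "density scaled to a maximum of 1")

p_scaled
```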

Area under the curve

Another way to work with the distributions is to compute them first with the density function and then plot the resulting values.

dens1 <- df_c1 %>% 
  group_by(fruits) %>% 
  # reframe() (dplyr >= 1.1.0) allows several rows per group;
  # on older dplyr versions, summarise() does the same job here
  reframe(x = density(weight)$x, y = density(weight)$y) %>% 
  mutate(country = "country 1")

dens2 <- df_c2 %>% 
  group_by(fruits) %>% 
  reframe(x = density(weight)$x, y = density(weight)$y) %>% 
  mutate(country = "country 2")

df_dens <- rbind(dens1, dens2)

Now we plot with ggplot, using geom_line:

df_dens %>% 
  ggplot() +
  geom_line(aes(x, y, color = country)) + 
  facet_grid(fruits ~ ., scales = "free", space = "free")
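When reading these curves, it may help to remember that the peak height of a density depends on how spread out the data is, not on how much data there is: a narrower distribution has a taller peak, because the total area stays at 1. The following sketch illustrates this with simulated values; the names `wide` and `narrow` and the chosen means and standard deviations are assumptions made for the illustration, not values from the question.

```r
# For a normal distribution the peak height is 1 / (sd * sqrt(2 * pi)),
# so it is determined by the spread alone
peak_sd5 <- dnorm(0, mean = 0, sd = 5)
peak_sd2 <- dnorm(0, mean = 0, sd = 2)

set.seed(1234)  # arbitrary seed, used only for reproducibility
wide   <- rnorm(200, mean = 65, sd = 5)  # hypothetical wide sample
narrow <- rnorm(200, mean = 40, sd = 2)  # hypothetical narrow sample

# Kernel density estimates show the same pattern:
# the narrow sample has the higher peak
max_wide   <- max(density(wide)$y)
max_narrow <- max(density(narrow)$y)

c(wide = max_wide, narrow = max_narrow)
```

A lower maximum for one country therefore suggests a wider spread of weights rather than a problem with the data.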

If you want to measure the area under a curve, define the differential.

We select just one curve, for example country == "country 1" and fruits == "Apple".

df_single_curve <- df_dens %>% 
  filter(country == "country 1" & fruits == "Apple")

# differential
xx <- df_single_curve$x
dx <- xx[2L] - xx[1L]
yy <- df_single_curve$y

# integral
I <- sum(yy) * dx
I
# [1] 1.000965
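The sum above is a Riemann approximation of the integral: the density values only add up to about 1 once they are multiplied by the grid spacing. Summing them without that factor gives roughly 1 / dx, which is why a plain sum of density values, as in the question, is far from 1. The sketch below checks this on simulated data and adds a trapezoidal approximation for comparison; the variable names and the simulated `weight` vector are assumptions made for the illustration.

```r
set.seed(1234)  # arbitrary seed, used only for reproducibility

# Hypothetical stand-in for the Apple weights of one country
weight <- round(rnorm(200, mean = 65, sd = 5))

d  <- density(weight)        # evaluated on 512 equally spaced points by default
dx <- d$x[2L] - d$x[1L]      # grid spacing (the differential)

raw_sum   <- sum(d$y)        # roughly 1 / dx, not 1
riemann   <- sum(d$y) * dx   # the approximation used above
# Trapezoidal rule: average of neighbouring heights times the step width
trapezoid <- sum((head(d$y, -1) + tail(d$y, -1)) / 2 * diff(d$x))

c(raw_sum = raw_sum, one_over_dx = 1 / dx,
  riemann = riemann, trapezoid = trapezoid)
```

Both approximations should come out close to 1, while the raw sum tracks the inverse of the grid spacing.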
