Trying to get counts by quantile for the diamonds dataset

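Question

I'm trying to do something that I thought would be fairly simple, but I'm having trouble implementing it. I'm using the diamonds dataset; its first rows are shown below via dput: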

 dput(head(diamonds))
    structure(list(carat = c(0.23, 0.21, 0.23, 0.29, 0.31, 0.24), 
        cut = structure(c(5L, 4L, 2L, 4L, 2L, 3L), .Label = c("Fair", 
        "Good", "Very Good", "Premium", "Ideal"), class = c("ordered", 
        "factor")), color = structure(c(2L, 2L, 2L, 6L, 7L, 7L), .Label = c("D", 
        "E", "F", "G", "H", "I", "J"), class = c("ordered", "factor"
        )), clarity = structure(c(2L, 3L, 5L, 4L, 2L, 6L), .Label = c("I1", 
        "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"), class = c("ordered", 
        "factor")), depth = c(61.5, 59.8, 56.9, 62.4, 63.3, 62.8), 
        table = c(55, 61, 65, 58, 58, 57), price = c(326L, 326L, 
        327L, 334L, 335L, 336L), x = c(3.95, 3.89, 4.05, 4.2, 4.34, 
        3.94), y = c(3.98, 3.84, 4.07, 4.23, 4.35, 3.96), z = c(2.43, 
        2.31, 2.31, 2.63, 2.75, 2.48)), row.names = c(NA, -6L), class = c("tbl_df", 
    "tbl", "data.frame"))

I created an lvplot like this:

library(ggplot2)  # ggplot(), aes(), labs() and the diamonds dataset
library(lvplot)   # geom_lv()

ggplot(diamonds,
       aes(x = cut,
           y = price)) +
  geom_lv() +
  labs(title = "Cut of Diamonds by Price")

It looks like this:

[Figure: letter-value plot of price by cut]

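This gives a general sense of the price distribution, but I'd like to know whether there is a way to get the count for each quantile. Basically, I want a specific count of how many diamonds sold at different prices (by quantile, if possible).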

Tags: r, ggplot2, dplyr

Answer


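If I understand what you're after, you can create a variable that assigns each diamond to a quantile bin of the overall price distribution, and then cross-tabulate that variable against cut. Here is an example using quartiles:

Code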

library(tidyverse)

diamonds %>% 
   #Create a variable with the general quartile, but you can change that
   mutate(price_quartile = cut(price,
                               quantile(price,seq(0,1,.25)),
                               include.lowest = TRUE,
                               labels = paste0("Q",1:4))) %>% 
   #Count observations by type of cut and price_quartile
   count(cut,price_quartile) %>% 
   #Calculate the percentage by cut
   group_by(cut) %>% 
   mutate(
      N = sum(n),
      p = 100*n/N
   )

Output

# A tibble: 20 x 5
# Groups:   cut [5]
   cut       price_quartile     n     N     p
   <ord>     <fct>          <int> <int> <dbl>
 1 Fair      Q1                88  1610  5.47
 2 Fair      Q2               462  1610 28.7 
 3 Fair      Q3               667  1610 41.4 
 4 Fair      Q4               393  1610 24.4 
 5 Good      Q1              1057  4906 21.5 
 6 Good      Q2              1087  4906 22.2 
 7 Good      Q3              1634  4906 33.3 
 8 Good      Q4              1128  4906 23.0 
 9 Very Good Q1              3129 12082 25.9 
10 Very Good Q2              2533 12082 21.0 
11 Very Good Q3              3369 12082 27.9 
12 Very Good Q4              3051 12082 25.3 
13 Premium   Q1              2907 13791 21.1 
14 Premium   Q2              3069 13791 22.3 
15 Premium   Q3              3504 13791 25.4 
16 Premium   Q4              4311 13791 31.3 
17 Ideal     Q1              6309 21551 29.3 
18 Ideal     Q2              6344 21551 29.4 
19 Ideal     Q3              4296 21551 19.9 
20 Ideal     Q4              4602 21551 21.4
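
The same approach extends to other quantiles, such as deciles. The following is only a minimal sketch under stated assumptions: count_by_price_quantile, its n_bins argument, and the price_bin column are hypothetical names introduced here for illustration and are not part of the original answer. It also assumes the quantile breaks are all distinct, which holds for diamonds$price but may not for other data, in which case cut() would raise an error.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Hypothetical helper: count rows per cut and per overall price quantile bin.
# n_bins = 4 gives quartiles, n_bins = 10 gives deciles, and so on.
count_by_price_quantile <- function(data, n_bins = 4) {
  # Break points of the overall price distribution (assumed to be distinct)
  breaks <- quantile(data$price, probs = seq(0, 1, length.out = n_bins + 1))

  data %>%
    # base::cut is spelled out to avoid confusion with the `cut` column
    mutate(price_bin = base::cut(price,
                                 breaks = breaks,
                                 include.lowest = TRUE,
                                 labels = paste0("Q", seq_len(n_bins)))) %>%
    # Count observations by type of cut and price bin
    count(cut, price_bin) %>%
    # Percentage of each bin within its cut
    group_by(cut) %>%
    mutate(N = sum(n),
           p = 100 * n / N) %>%
    ungroup()
}

quartiles <- count_by_price_quantile(diamonds, n_bins = 4)
deciles   <- count_by_price_quantile(diamonds, n_bins = 10)
```

With n_bins = 4 this should give the same counts as the quartile table above. Because the bins come from the overall price distribution, each bin holds roughly the same number of diamonds in total, and the p column shows how each cut departs from that even split.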
