How do I make a density histogram in ggplot2 divide by a second value?

Problem description

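I have a problem with a density histogram in ggplot2. I'm working in RStudio and trying to create a density histogram of income by occupation. My problem is that when I use my code: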

library(ggplot2)

data = read.table("http://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data",
        sep=",",header=F,col.names=c("age", "type_employer", "fnlwgt", "education", 
                "education_num","marital", "occupation", "relationship", "race","sex",
                "capital_gain", "capital_loss", "hr_per_week","country", "income"),
        fill=FALSE,strip.white=T)

# y is each count divided by the grand total over all bars
ggplot(data=data, aes(x=income)) + 
  geom_histogram(stat='count', 
                 aes(x= income, y=stat(count)/sum(stat(count)), 
                     col=occupation, fill=occupation),
                 position='dodge')

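As a result, I get a histogram in which each count is divided by the total count across all categories. What I want instead is, for example, the number of people with income >50K whose occupation is "Craft-repair" divided by the total number of people whose occupation is Craft-repair, the same for <=50K within that occupation, and likewise for every other occupation.

My second question: once the density histogram is correct, how do I sort the bars in descending order?

Tags: r, ggplot2, histogram, density-plot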

Solution


This is a situation where it makes sense to re-aggregate your data before plotting. Aggregating within the ggplot call works fine for simple aggregations, but it doesn't work so well when you need to aggregate and then normalize within one group for a second calculation. Also, note that because your x axis is discrete, a histogram isn't the right geom here; we'll use geom_bar() instead.

First we aggregate by count, then calculate percent of total using occupation as the group.

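library(dplyr)
library(ggplot2)

d2 <- data %>% group_by(income, occupation) %>% 
  summarize(count= n()) %>%            # rows per income/occupation pair
  group_by(occupation) %>%             # regroup by occupation only
  mutate(percent = count/sum(count))   # share within each occupation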
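
To see what the second group_by() does, here is a minimal sketch on a small made-up data frame (the `toy` data and its values are invented for illustration and are not taken from the adult dataset). It assumes a dplyr version that has count() with a `name` argument. After the normalization, the shares within each occupation should add up to 1.

```r
library(dplyr)

# Toy data, made up for illustration only
toy <- data.frame(
  income     = c(">50K", "<=50K", "<=50K", ">50K", "<=50K", "<=50K", "<=50K"),
  occupation = c("Sales", "Sales", "Sales", "Tech-support", "Tech-support",
                 "Tech-support", "Tech-support"),
  stringsAsFactors = FALSE
)

toy_pct <- toy %>%
  count(occupation, income, name = "count") %>%  # rows per occupation/income pair
  group_by(occupation) %>%
  mutate(percent = count / sum(count)) %>%       # share within each occupation
  ungroup()

# Sanity check: one row per occupation, each total should be 1
totals <- toy_pct %>%
  group_by(occupation) %>%
  summarize(total = sum(percent))
print(totals)
```

If the totals came out below 1, the division would still be using the grand total rather than the per-occupation total, which is the behaviour described in the question.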

Then simply plot a bar chart using geom_bar and position = 'dodge' so the bars are side by side, rather than stacked.

 d2 %>% ggplot(aes(income, percent, fill = occupation)) + 
   geom_bar(stat = 'identity', position='dodge')
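
For the second question (sorting the bars in descending order), one possible approach is to turn occupation into a factor whose levels are ordered by the value you want to sort on, since ggplot2 draws dodged bars in factor-level order. The sketch below is an assumption about what "descending" should mean: it sorts by each occupation's ">50K" share. The `d2_toy` data frame is hypothetical, with the same columns as d2 but made-up values, so the block can run on its own.

```r
library(dplyr)
library(ggplot2)

# Hypothetical aggregated data with the same columns as d2 (values made up)
d2_toy <- data.frame(
  income     = rep(c("<=50K", ">50K"), times = 3),
  occupation = rep(c("Sales", "Tech-support", "Craft-repair"), each = 2),
  percent    = c(0.73, 0.27, 0.70, 0.30, 0.77, 0.23),
  stringsAsFactors = FALSE
)

# Occupations sorted by their ">50K" share, largest first (assumed sort key)
occ_order <- d2_toy %>%
  filter(income == ">50K") %>%
  arrange(desc(percent)) %>%
  pull(occupation)

# Factor levels control the order of the dodged bars and of the legend
d2_sorted <- d2_toy %>%
  mutate(occupation = factor(occupation, levels = occ_order))

p <- ggplot(d2_sorted, aes(income, percent, fill = occupation)) +
  geom_col(position = "dodge")  # geom_col() is shorthand for geom_bar(stat = "identity")
p
```

With only two income levels, sorting the ">50K" bars in descending order leaves the "<=50K" bars in ascending order, because the two shares of each occupation sum to 1. On the real d2, an occupation with no ">50K" rows would be missing from `occ_order` and would need to be appended to the levels by hand.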
