Subway-style word-frequency plot for three datasets in ggplot2

Problem description

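This is a follow-up to https://stackoverflow.com/a/64991805?noredirect=1. My data frame dictfreq has the following columns (dict plus three frequency columns such as dictfreq$freq_media):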

dict: apple, pear, pineapple
freq_gov: 12, 13, 10
freq_indiv: 11, 20, 1
freq_media: 13, 21, 9

The desired output looks like the plot at https://blog.revolutionanalytics.com/2015/12/r-is-the-fastest-growth-language-on-stackoverflow.html, where the y-axis has:

- rank going from 1-3
- list of the words from dict (apple, pear, pineapple)

and the x-axis has:

- categories of freq_gov, freq_indiv, freq_media

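Basically, I want to visualize how the frequency of each word in dict compares across gov, indiv and media.

This is the code template I have been trying to adapt so far: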

p <- ggplot(mapping = aes(dictfreq, y = rank, group = tag, color = tag)) +
  geom_line(size = 1.7, alpha = 0.25, data = dictfreq) +
  geom_line(size = 2.5, data = dictfreq %>% filter(tag %in% names(colors)[colors != "gray"])) +
  geom_point(size = 4, alpha = 0.25, data = dictfreq) +
  geom_point(size = 4, data = dftags4 %>% filter(tag %in% names(colors)[colors != "gray"])) +
  geom_point(size = 1.75, color = "white", data = dictfreq) +
  geom_text(data = dftags5, aes(label = tag), hjust = -0, size = 4.5) +
  geom_text(data = dftags6, aes(label = tag), hjust = 1, size = 4.5) +
  scale_color_manual(values = colors) +
  ggtitle("The subway-style-rank-year-tag plot:\nPast and the Future") +
  xlab("Top Tags by Year in Stackoverflow") +
  scale_x_continuous(breaks = seq(min(dftags4$creationyear) - 2,
                                 max(dftags4$creationyear) + 2),
                     limits = c(min(dftags4$creationyear) - 1.0,
                                max(dftags4$creationyear) + 0.5))
p

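But I can't shape it to fit my data. Specifically, my x-axis would be three categorical sections (media, government, individual), which is not a single variable in my data. What should I do?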

--

Edit: including the actual data here, via dput() as suggested:

structure(list(word = c("apple", "apple", "apple", 
"mandarin", "mandarin", "mandarin", "orange", "orange", "orange", "pear"), 
    name = c("freq_ongov", "freq_onindiv", "freq_onmedia", "freq_ongov", 
    "freq_onindiv", "freq_onmedia", "freq_ongov", "freq_onindiv", 
    "freq_onmedia", "freq_ongov"), value = c(0, 87, 63, 0, 44, 
    20, 3, 27, 25, 0), rank = c(26, 85, 70, 26, 61, 42.5, 86, 
    47, 48, 26)), row.names = c(NA, -10L), groups = structure(list(
    name = c("freq_ongov", "freq_onindiv", "freq_onmedia"), .rows = structure(list(
        c(1L, 4L, 7L, 10L), c(2L, 5L, 8L), c(3L, 6L, 9L)), ptype = integer(0), class = c("vctrs_list_of", 
    "vctrs_vctr", "list"))), row.names = c(NA, 3L), class = c("tbl_df", 
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df", 
"tbl_df", "tbl", "data.frame"))

Note that the actual data has 160 unique dict words!

--

Update: I followed Allan's suggestion and the pivot_longer() function worked, but I get an error when I try to produce the actual ggplot. Here is my code:

ggplot(mergedicts, aes(name, rank, color = word, group = word)) +
  geom_line(size = 200) +
  geom_point(shape = 21, fill = "white", size =200) +
  scale_y_continuous(breaks = seq(max(mergedicts$rank)), labels = leftlabels,
                     sec.axis = sec_axis(~., breaks = seq(max(mergedicts$rank)), 
                                         labels = rightlabels)) +
  scale_x_discrete(expand = c(0.01, 0)) +
  guides(color = guide_none()) +
  coord_cartesian(clip = "off") +
  theme(axis.ticks.length.y = unit(0, "points"))

This gives the error:

Error: `breaks` and `labels` must have the same length Run `rlang::last_error()` to see where the error occurred.
6.
stop(fallback)
5.
signal_abort(cnd)
4.
abort("`breaks` and `labels` must have the same length")
3.
check_breaks_labels(breaks, labels)
2.
continuous_scale(c("y", "ymin", "ymax", "yend", "yintercept", "ymin_final", "ymax_final", "lower", "middle", "upper", "y0"), "position_c", identity, name = name, breaks = breaks, n.breaks = n.breaks, minor_breaks = minor_breaks, labels = labels, limits = limits, ...
1.
scale_y_continuous(breaks = seq(max(mergedicts$rank)), labels = leftlabels, sec.axis = sec_axis(~., breaks = seq(max(mergedicts$rank)), labels = rightlabels))

Any suggestions?

Tags: r, ggplot2

Solution


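It is hard to follow your example because your data is not presented in a standard way. I think you mean that you have a data frame with four columns, like this: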

dictfreq <- data.frame(dict = c("apple", "pear", "pineapple"),
                       freq_gov =  c(12, 13, 10),
                       freq_indiv =  c(11, 20, 1),
                       freq_media = c(13, 21, 9))

dictfreq
#>        dict freq_gov freq_indiv freq_media
#> 1     apple       12         11         13
#> 2      pear       13         20         21
#> 3 pineapple       10          1          9

If that is the case, your first task is to reshape the data into long format and compute the rank of each word within each of the three categories:

library(ggplot2)
library(dplyr)
library(tidyr)

df <- pivot_longer(dictfreq, -1) %>% group_by(name) %>% mutate(rank = rank(value))
df
#> # A tibble: 9 x 4
#> # Groups:   name [3]
#>   dict      name       value  rank
#>   <fct>     <chr>      <dbl> <dbl>
#> 1 apple     freq_gov      12     2
#> 2 apple     freq_indiv    11     2
#> 3 apple     freq_media    13     2
#> 4 pear      freq_gov      13     3
#> 5 pear      freq_indiv    20     3
#> 6 pear      freq_media    21     3
#> 7 pineapple freq_gov      10     1
#> 8 pineapple freq_indiv     1     1
#> 9 pineapple freq_media     9     1
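One caveat for real data: rank() averages tied values by default, so ranks can be fractional and need not run from 1 to n within a group. The rank column in the question's dput() output (26, 42.5, ...) shows exactly this. The sketch below uses base R and made-up frequencies (the vector value is hypothetical, not taken from the question) to compare the default with ties.method = "first". Code that assumes one break per word, such as breaks = seq(max(rank)), can then end up with fewer breaks than labels, which is one plausible cause of the "`breaks` and `labels` must have the same length" error in the question's update.

```r
# Hypothetical frequencies for five words, with ties at both ends
value <- c(0, 0, 3, 27, 27)

# Default ties.method = "average": tied values share the mean of their positions
avg_rank <- rank(value)
avg_rank
#> [1] 1.5 1.5 3.0 4.5 4.5

# seq(max(rank)) now yields only 4 breaks for 5 words
length(seq(max(avg_rank)))
#> [1] 4

# ties.method = "first" breaks ties by order of appearance,
# so the ranks are always the integers 1..n
first_rank <- rank(value, ties.method = "first")
first_rank
#> [1] 1 2 3 4 5
```

In the dplyr pipeline above, this would correspond to mutate(rank = rank(value, ties.method = "first")). Whether an arbitrary tie-break is acceptable for the plot is a judgment call.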

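Note that for your example, the ranks of the three dict items do not change across categories: pear is always highest, followed by apple, then pineapple. That does not make for a very interesting plot, but let's go with it for now. You need to define the labels for the left-hand and right-hand axes according to the order in which the fruits should appear. You can do that like this: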

leftlabels <- df$dict[df$name == "freq_gov"]
leftlabels <- leftlabels[order(df$rank[df$name == "freq_gov"])]

rightlabels <- df$dict[df$name == "freq_media"]
rightlabels <- rightlabels[order(df$rank[df$name == "freq_media"])]

Now you can draw the plot. You will need to include a secondary axis:

ggplot(df, aes(name, rank, color = dict, group = dict)) +
  geom_line(size = 4) +
  geom_point(shape = 21, fill = "white", size = 4) +
  scale_y_continuous(breaks = seq(max(df$rank)), labels = leftlabels,
                     sec.axis = sec_axis(~., breaks = seq(max(df$rank)), 
                                         labels = rightlabels)) +
  scale_x_discrete(expand = c(0.01, 0)) +
  guides(color = guide_none()) +
  coord_cartesian(clip = "off") +
  theme(axis.ticks.length.y = unit(0, "points"))

As I said, this is not a very interesting plot, but that is exactly what the data shows. However, if we try more interesting data:

dictfreq <- data.frame(dict = c("apple", "pear", "pineapple", "banana", "kiwi"),
                       freq_gov =  c(10, 13, 9, 14, 11),
                       freq_indiv =  c(11, 22, 1, 6, 16),
                       freq_media = c(13, 21, 9, 10, 8))

Now if we run exactly the same code, we get something much closer to what you are looking for:

[Image: the resulting plot]
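As a quick check on the five-fruit data, the sketch below reruns the reshaping and label steps from the answer and prints the axis labels from bottom (rank 1, lowest frequency) to top. It only reuses calls that already appear in the answer. The as.character() wrapper is an addition so that the printout is the same whether dict is stored as a factor or as a character column.

```r
library(dplyr)
library(tidyr)

dictfreq <- data.frame(dict = c("apple", "pear", "pineapple", "banana", "kiwi"),
                       freq_gov =  c(10, 13, 9, 14, 11),
                       freq_indiv =  c(11, 22, 1, 6, 16),
                       freq_media = c(13, 21, 9, 10, 8))

# Long format, ranked within each category (no ties in this data)
df <- pivot_longer(dictfreq, -1) %>% group_by(name) %>% mutate(rank = rank(value))

# Words ordered by their rank in the first and the last category
leftlabels <- df$dict[df$name == "freq_gov"]
leftlabels <- leftlabels[order(df$rank[df$name == "freq_gov"])]

rightlabels <- df$dict[df$name == "freq_media"]
rightlabels <- rightlabels[order(df$rank[df$name == "freq_media"])]

as.character(leftlabels)
#> [1] "pineapple" "apple"     "kiwi"      "pear"      "banana"
as.character(rightlabels)
#> [1] "kiwi"      "pineapple" "banana"    "apple"     "pear"
```

The two label vectors differ in order, which is why the lines in the plot cross between freq_gov and freq_media.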

