r - Subway-style plot of word frequencies across three datasets in ggplot2
Problem description
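This question is a follow-up to https://stackoverflow.com/a/64991805?noredirect=1. My data frame dictfreq has the columns listed below (accessed as, for example, dictfreq$freq_media).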
dict: apple, pear, pineapple
freq_gov: 12, 13, 10
freq_indiv: 11, 20, 1
freq_media: 13, 21, 9
The desired output looks like the plot at https://blog.revolutionanalytics.com/2015/12/r-is-the-fastest-growth-language-on-stackoverflow.html where the y-axis has:
- rank going from 1-3
- list of the words from dict (apple, pear, pineapple), and
the x-axis has:
- categories of freq_gov, freq_indiv, freq_media
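Basically, I want to visualize how the frequency of each word in dict compares across gov, indiv, and media.
This is the code template I have been trying to adapt so far: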
p <- ggplot(mapping = aes(dictfreq, y = rank, group = tag, color = tag)) +
geom_line(size = 1.7, alpha = 0.25, data = dictfreq) +
geom_line(size = 2.5, data = dictfreq %>% filter(tag %in% names(colors)[colors != "gray"])) +
geom_point(size = 4, alpha = 0.25, data = dictfreq) +
geom_point(size = 4, data = dftags4 %>% filter(tag %in% names(colors)[colors != "gray"])) +
geom_point(size = 1.75, color = "white", data = dictfreq) +
geom_text(data = dftags5, aes(label = tag), hjust = -0, size = 4.5) +
geom_text(data = dftags6, aes(label = tag), hjust = 1, size = 4.5) +
scale_color_manual(values = colors) +
ggtitle("The subway-style-rank-year-tag plot:\nPast and the Future") +
xlab("Top Tags by Year in Stackoverflow") +
scale_x_continuous(breaks = seq(min(dftags4$creationyear) - 2,
max(dftags4$creationyear) + 2),
limits = c(min(dftags4$creationyear) - 1.0,
max(dftags4$creationyear) + 0.5))
p
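But I can't shape it to fit my data. Specifically, my x-axis would consist of three categorical sections (media, gov, indiv), which do not exist as a separate variable in my data. What should I do?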
--
Edit: including the actual data here, produced with dput() as suggested:
structure(list(word = c("apple", "apple", "apple",
"mandarin", "mandarin", "mandarin", "orange", "orange", "orange", "pear"),
name = c("freq_ongov", "freq_onindiv", "freq_onmedia", "freq_ongov",
"freq_onindiv", "freq_onmedia", "freq_ongov", "freq_onindiv",
"freq_onmedia", "freq_ongov"), value = c(0, 87, 63, 0, 44,
20, 3, 27, 25, 0), rank = c(26, 85, 70, 26, 61, 42.5, 86,
47, 48, 26)), row.names = c(NA, -10L), groups = structure(list(
name = c("freq_ongov", "freq_onindiv", "freq_onmedia"), .rows = structure(list(
c(1L, 4L, 7L, 10L), c(2L, 5L, 8L), c(3L, 6L, 9L)), ptype = integer(0), class = c("vctrs_list_of",
"vctrs_vctr", "list"))), row.names = c(NA, 3L), class = c("tbl_df",
"tbl", "data.frame"), .drop = TRUE), class = c("grouped_df",
"tbl_df", "tbl", "data.frame"))
Note that the actual data has 160 unique dict words!
--
Update: I followed Allan's suggestion and the pivot_longer() function works, but I get an error when I try to generate the actual ggplot. Here is my code:
ggplot(mergedicts, aes(name, rank, color = word, group = word)) +
geom_line(size = 200) +
geom_point(shape = 21, fill = "white", size =200) +
scale_y_continuous(breaks = seq(max(mergedicts$rank)), labels = leftlabels,
sec.axis = sec_axis(~., breaks = seq(max(mergedicts$rank)),
labels = rightlabels)) +
scale_x_discrete(expand = c(0.01, 0)) +
guides(color = guide_none()) +
coord_cartesian(clip = "off") +
theme(axis.ticks.length.y = unit(0, "points"))
This gives the error:
Error: `breaks` and `labels` must have the same length
Run `rlang::last_error()` to see where the error occurred.
6. stop(fallback)
5. signal_abort(cnd)
4. abort("`breaks` and `labels` must have the same length")
3. check_breaks_labels(breaks, labels)
2. continuous_scale(c("y", "ymin", "ymax", "yend", "yintercept", "ymin_final", "ymax_final", "lower", "middle", "upper", "y0"), "position_c", identity, name = name, breaks = breaks, n.breaks = n.breaks, minor_breaks = minor_breaks, labels = labels, limits = limits, ...
1. scale_y_continuous(breaks = seq(max(mergedicts$rank)), labels = leftlabels, sec.axis = sec_axis(~., breaks = seq(max(mergedicts$rank)), labels = rightlabels))
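Any suggestions?
Solution
It's hard to follow your example because your data isn't presented in a standard way. I think you mean that you have a data frame with four columns, like this: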
dictfreq <- data.frame(dict = c("apple", "pear", "pineapple"),
freq_gov = c(12, 13, 10),
freq_indiv = c(11, 20, 1),
freq_media = c(13, 21, 9))
dictfreq
#> dict freq_gov freq_indiv freq_media
#> 1 apple 12 11 13
#> 2 pear 13 20 21
#> 3 pineapple 10 1 9
If that is the case, your first task is to reshape the data into long format and compute each word's rank within each of the three categories:
library(ggplot2)
library(dplyr)
library(tidyr)
df <- pivot_longer(dictfreq, -1) %>% group_by(name) %>% mutate(rank = rank(value))
df
#> # A tibble: 9 x 4
#> # Groups: name [3]
#> dict name value rank
#> <fct> <chr> <dbl> <dbl>
#> 1 apple freq_gov 12 2
#> 2 apple freq_indiv 11 2
#> 3 apple freq_media 13 2
#> 4 pear freq_gov 13 3
#> 5 pear freq_indiv 20 3
#> 6 pear freq_media 21 3
#> 7 pineapple freq_gov 10 1
#> 8 pineapple freq_indiv 1 1
#> 9 pineapple freq_media 9 1
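One detail worth checking before moving to real data is how rank() treats ties, since the dput() output in the question contains a fractional rank (42.5). The sketch below uses made-up frequencies (the vector `value` is hypothetical, not taken from the question) to show that the default ties.method = "average" can produce fractional ranks, while ties.method = "first" yields distinct whole numbers.

```r
# Hypothetical frequencies containing a tie (not the question's data)
value <- c(apple = 20, mandarin = 20, orange = 3, pear = 0)

# Default behaviour: tied values share the average of their ranks
avg_rank <- rank(value)
print(avg_rank)

# Alternative: ties are broken by order of appearance, giving distinct integers
first_rank <- rank(value, ties.method = "first")
print(first_rank)

# With averaged ties, seq(max(rank)) can be shorter than the number of words
print(length(seq(max(avg_rank))))
print(length(value))
```

Whether breaking ties by order of appearance is acceptable depends on the data; it is shown here only as one option that keeps the ranks whole and unique.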
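Note that for your example, the ranks of the three dict items don't change across your categories: pear is always the highest, followed by apple, then pineapple. This doesn't make for a very interesting plot, but let's go with it for now. You need to define the labels for the left-hand and right-hand axes according to the order in which the fruits should appear. You can do it like this: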
leftlabels <- df$dict[df$name == "freq_gov"]
leftlabels <- leftlabels[order(df$rank[df$name == "freq_gov"])]
rightlabels <- df$dict[df$name == "freq_media"]
rightlabels <- rightlabels[order(df$rank[df$name == "freq_media"])]
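These label vectors are later paired with breaks built from seq(max(df$rank)), and ggplot2 requires `breaks` and `labels` to have the same length. A minimal sketch of a check that could be run before plotting follows; the data frame is a hand-built stand-in for the long-format `df` (same columns and ranks as the output shown earlier), and the error message text is my own.

```r
# Hand-built stand-in for the long-format `df` (same columns, same ranks)
df <- data.frame(
  dict = rep(c("apple", "pear", "pineapple"), each = 3),
  name = rep(c("freq_gov", "freq_indiv", "freq_media"), times = 3),
  rank = rep(c(2, 3, 1), each = 3),
  stringsAsFactors = FALSE
)

# Left-hand labels: words ordered by their rank in the first category
leftlabels <- df$dict[df$name == "freq_gov"]
leftlabels <- leftlabels[order(df$rank[df$name == "freq_gov"])]

# Breaks as they will be passed to scale_y_continuous()
breaks <- seq(max(df$rank))

# Stop early with a readable message if the two vectors disagree
if (length(breaks) != length(leftlabels)) {
  stop("breaks has ", length(breaks), " values but labels has ",
       length(leftlabels))
}
print(leftlabels)
```

The lengths can diverge when ranks are tied or when a word is missing from one category, so a check of this kind may help locate a "`breaks` and `labels` must have the same length" error on larger data.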
Now you can draw the plot. You will need to include a secondary axis:
ggplot(df, aes(name, rank, color = dict, group = dict)) +
geom_line(size = 4) +
geom_point(shape = 21, fill = "white", size = 4) +
scale_y_continuous(breaks = seq(max(df$rank)), labels = leftlabels,
sec.axis = sec_axis(~., breaks = seq(max(df$rank)),
labels = rightlabels)) +
scale_x_discrete(expand = c(0.01, 0)) +
guides(color = guide_none()) +
coord_cartesian(clip = "off") +
theme(axis.ticks.length.y = unit(0, "points"))
As I said, this isn't a very interesting plot, but that is exactly what the data show. However, if we try it with more interesting data:
dictfreq <- data.frame(dict = c("apple", "pear", "pineapple", "banana", "kiwi"),
freq_gov = c(10, 13, 9, 14, 11),
freq_indiv = c(11, 22, 1, 6, 16),
freq_media = c(13, 21, 9, 10, 8))
Now we run exactly the same code, and we can see that this is closer to what you are looking for:
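For reference, the steps can be rerun end to end on the five-fruit data as one script. This is a sketch assembled from the code already shown rather than anything new; the only additions are the comments and storing the plot in a variable `p`, a name chosen here for illustration.

```r
library(ggplot2)
library(dplyr)
library(tidyr)

dictfreq <- data.frame(dict = c("apple", "pear", "pineapple", "banana", "kiwi"),
                       freq_gov   = c(10, 13, 9, 14, 11),
                       freq_indiv = c(11, 22, 1, 6, 16),
                       freq_media = c(13, 21, 9, 10, 8))

# Long format: one row per word and category, ranked within each category
df <- pivot_longer(dictfreq, -1) %>%
  group_by(name) %>%
  mutate(rank = rank(value))

# Axis labels: words ordered by rank in the first and last categories
leftlabels  <- df$dict[df$name == "freq_gov"]
leftlabels  <- leftlabels[order(df$rank[df$name == "freq_gov"])]
rightlabels <- df$dict[df$name == "freq_media"]
rightlabels <- rightlabels[order(df$rank[df$name == "freq_media"])]

# Same layers as before; recent ggplot2 versions (3.4.0 and later) prefer
# `linewidth` over `size` for lines and warn about `size`
p <- ggplot(df, aes(name, rank, color = dict, group = dict)) +
  geom_line(size = 4) +
  geom_point(shape = 21, fill = "white", size = 4) +
  scale_y_continuous(breaks = seq(max(df$rank)), labels = leftlabels,
                     sec.axis = sec_axis(~., breaks = seq(max(df$rank)),
                                         labels = rightlabels)) +
  scale_x_discrete(expand = c(0.01, 0)) +
  guides(color = guide_none()) +
  coord_cartesian(clip = "off") +
  theme(axis.ticks.length.y = unit(0, "points"))

print(p)
```

In this version the lines cross between categories because the ranks differ from one column to the next, which is what gives the plot its subway-map look.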