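r - Aggregate a dataset by day or by month
Question
I have a raw dataset. The sample below contains a timestamp and three sentiment scores (negative, neutral, positive).
The raw dataset: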
created_time neg_sentiment neu_sentiment pos_sentiment
2015-01-12T23:27:53+0000 0 0 1
2015-01-13T00:36:15+0000 0 0 1
2015-01-13T00:39:37+0000 0.02 0 0.98
2015-01-13T01:26:05+0000 0.41 0.59 0
2015-01-15T16:10:46+0000 0.14 0.02 0.84
2015-02-13T02:38:59+0000 0.86 0.14 0
2015-01-13T21:00:15+0000 1 0 0
2015-01-14T04:47:47+0000 0.96 0.04 0
2015-02-14T06:09:17+0000 1 0 0
2015-02-14T06:10:05+0000 1 0 0
2015-01-14T06:44:47+0000 0.65 0.35 0
2015-03-14T06:47:13+0000 0.07 0.93 0
2015-01-14T10:16:09+0000 0 0 1
2015-01-14T10:17:38+0000 0.08 0.85 0.07
2015-01-14T17:30:03+0000 1 0 0
2015-01-14T20:17:43+0000 0.11 0 0.89
2015-01-16T02:49:13+0000 0.5 0.5 0
2015-03-26T13:20:06+0000 1 0 0
2015-01-21T04:26:45+0000 0.39 0.01 0.6
2015-03-21T04:38:49+0000 0.01 0 0.99
From this dataset, I want to produce two outputs.
The negative proportion is computed as neg_sentiment / (neg_sentiment + neu_sentiment + pos_sentiment).
The first output is by month:
created_time negative_proportion
2015-01 10
2015-02 20
2015-03 5
The second output is by day:
created_time negative_proportion
2015-01-12 10
2015-01-13 20
2015-01-14 3
2015-01-15 3
2015-01-16 3
2015-02-13 3
2015-02-14 3
2015-03-14 3
2015-03-21 3
2015-03-26 5
How can I produce the desired outputs? Could you help or suggest some code?
The dput() output for the raw dataset is as follows:
structure(list(created_time = structure(c(1L, 2L, 3L, 4L, 12L,
15L, 5L, 6L, 16L, 17L, 7L, 18L, 8L, 9L, 10L, 11L, 13L, 20L, 14L,
19L), .Label = c("2015-01-12T23:27:53+0000", "2015-01-13T00:36:15+0000",
"2015-01-13T00:39:37+0000", "2015-01-13T01:26:05+0000", "2015-01-13T21:00:15+0000",
"2015-01-14T04:47:47+0000", "2015-01-14T06:44:47+0000", "2015-01-14T10:16:09+0000",
"2015-01-14T10:17:38+0000", "2015-01-14T17:30:03+0000", "2015-01-14T20:17:43+0000",
"2015-01-15T16:10:46+0000", "2015-01-16T02:49:13+0000", "2015-01-21T04:26:45+0000",
"2015-02-13T02:38:59+0000", "2015-02-14T06:09:17+0000", "2015-02-14T06:10:05+0000",
"2015-03-14T06:47:13+0000", "2015-03-21T04:38:49+0000", "2015-03-26T13:20:06+0000"
), class = "factor"), neg_sentiment = c(0, 0, 0.02, 0.41, 0.14,
0.86, 1, 0.96, 1, 1, 0.65, 0.07, 0, 0.08, 1, 0.11, 0.5, 1, 0.39,
0.01), neu_sentiment = c(0, 0, 0, 0.59, 0.02, 0.14, 0, 0.04,
0, 0, 0.35, 0.93, 0, 0.85, 0, 0, 0.5, 0, 0.01, 0), pos_sentiment = c(1,
1, 0.98, 0, 0.84, 0, 0, 0, 0, 0, 0, 0, 1, 0.07, 0, 0.89, 0, 0,
0.6, 0.99)), class = "data.frame", row.names = c(NA, -20L))
Solution
You can use lubridate to pull the year, month, or date out of created_time and group on those values.
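As a quick illustration of the lubridate accessors the answer relies on, here is a minimal sketch applied to a single timestamp from the question. Parsing with ymd_hms() first is an assumption added here for clarity; the answer's own code passes the column directly to year(), month(), and as_date().

```r
library(lubridate)

# Parse one timestamp from the question; ymd_hms() returns a POSIXct in UTC.
ts <- ymd_hms("2015-01-12T23:27:53+0000")

year(ts)     # calendar year as a number
month(ts)    # month of the year as a number (1 = January)
as_date(ts)  # the date part only, as a Date
```

Grouping on year and month together gives one row per calendar month, and grouping on the date gives one row per day.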
library(lubridate)
#>
#> Attaching package: 'lubridate'
#> The following object is masked from 'package:base':
#>
#> date
library(tidyverse)
df_example <- structure(list(created_time = structure(c(1L, 2L, 3L, 4L, 12L,
15L, 5L, 6L, 16L, 17L, 7L, 18L, 8L, 9L, 10L, 11L, 13L, 20L, 14L,
19L), .Label = c("2015-01-12T23:27:53+0000", "2015-01-13T00:36:15+0000",
"2015-01-13T00:39:37+0000", "2015-01-13T01:26:05+0000", "2015-01-13T21:00:15+0000",
"2015-01-14T04:47:47+0000", "2015-01-14T06:44:47+0000", "2015-01-14T10:16:09+0000",
"2015-01-14T10:17:38+0000", "2015-01-14T17:30:03+0000", "2015-01-14T20:17:43+0000",
"2015-01-15T16:10:46+0000", "2015-01-16T02:49:13+0000", "2015-01-21T04:26:45+0000",
"2015-02-13T02:38:59+0000", "2015-02-14T06:09:17+0000", "2015-02-14T06:10:05+0000",
"2015-03-14T06:47:13+0000", "2015-03-21T04:38:49+0000", "2015-03-26T13:20:06+0000"
), class = "factor"), neg_sentiment = c(0, 0, 0.02, 0.41, 0.14,
0.86, 1, 0.96, 1, 1, 0.65, 0.07, 0, 0.08, 1, 0.11, 0.5, 1, 0.39,
0.01), neu_sentiment = c(0, 0, 0, 0.59, 0.02, 0.14, 0, 0.04,
0, 0, 0.35, 0.93, 0, 0.85, 0, 0, 0.5, 0, 0.01, 0), pos_sentiment = c(1,
1, 0.98, 0, 0.84, 0, 0, 0, 0, 0, 0, 0, 1, 0.07, 0, 0.89, 0, 0,
0.6, 0.99)), class = "data.frame", row.names = c(NA, -20L))
# Monthly: sum each sentiment column per year and month, then take the negative share.
df_example %>%
  group_by(year(created_time), month(created_time)) %>%
  summarise_if(is.numeric, ~ sum(., na.rm = TRUE)) %>%
  mutate(prop = neg_sentiment / (neg_sentiment + neu_sentiment + pos_sentiment))
#> # A tibble: 3 x 6
#> # Groups: year(created_time) [1]
#> `year(created_t… `month(created_… neg_sentiment neu_sentiment pos_sentiment
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2015 1 5.26 2.36 6.38
#> 2 2015 2 2.86 0.14 0
#> 3 2015 3 1.08 0.93 0.99
#> # … with 1 more variable: prop <dbl>
# Daily: same aggregation, grouped by calendar date.
df_example %>%
  group_by(as_date(created_time)) %>%
  summarise_if(is.numeric, ~ sum(., na.rm = TRUE)) %>%
  mutate(prop = neg_sentiment / (neg_sentiment + neu_sentiment + pos_sentiment))
#> # A tibble: 11 x 5
#> `as_date(created_time)` neg_sentiment neu_sentiment pos_sentiment prop
#> <date> <dbl> <dbl> <dbl> <dbl>
#> 1 2015-01-12 0 0 1 0
#> 2 2015-01-13 1.43 0.59 1.98 0.358
#> 3 2015-01-14 2.8 1.24 1.96 0.467
#> 4 2015-01-15 0.14 0.02 0.84 0.14
#> 5 2015-01-16 0.5 0.5 0 0.5
#> 6 2015-01-21 0.39 0.01 0.6 0.39
#> 7 2015-02-13 0.86 0.14 0 0.86
#> 8 2015-02-14 2 0 0 1
#> 9 2015-03-14 0.07 0.93 0 0.07
#> 10 2015-03-21 0.01 0 0.99 0.01
#> 11 2015-03-26 1 0 0 1
Created on 2020-01-08 by the reprex package (v0.3.0)
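The monthly tibble printed above hides the prop column, and its grouping columns carry auto-generated names. The following is a minimal sketch of one way to get both outputs in the two-column shape the question asks for. The helper name negative_proportion_by, its unit argument, and the small df_small stand-in (five rows copied from the question's data) are assumptions made for illustration; they are not part of the original answer.

```r
library(dplyr)
library(lubridate)

# Hypothetical helper, not part of the original answer.
# `unit` selects the label: "day" gives YYYY-MM-DD, "month" gives YYYY-MM.
negative_proportion_by <- function(df, unit = c("day", "month")) {
  unit <- match.arg(unit)
  label_format <- if (unit == "month") "%Y-%m" else "%Y-%m-%d"
  df %>%
    # as.character() guards against created_time being a factor, as in the dput
    mutate(created_time = format(ymd_hms(as.character(created_time)),
                                 label_format)) %>%
    group_by(created_time) %>%
    summarise(
      negative_proportion =
        sum(neg_sentiment) / sum(neg_sentiment + neu_sentiment + pos_sentiment)
    )
}

# Stand-in data: five rows copied from the question's dataset.
df_small <- data.frame(
  created_time  = c("2015-01-13T00:39:37+0000", "2015-01-13T01:26:05+0000",
                    "2015-01-15T16:10:46+0000", "2015-02-14T06:09:17+0000",
                    "2015-02-14T06:10:05+0000"),
  neg_sentiment = c(0.02, 0.41, 0.14, 1, 1),
  neu_sentiment = c(0, 0.59, 0.02, 0, 0),
  pos_sentiment = c(0.98, 0, 0.84, 0, 0)
)

by_day   <- negative_proportion_by(df_small, "day")
by_month <- negative_proportion_by(df_small, "month")

print(by_day)
print(by_month)
```

Passing df_example instead of df_small should give the full tables. Multiply negative_proportion by 100 if a percentage is wanted.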