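r - Using dplyr::lag to tidy a data frame and populate variables
Question
I am trying to clean my data so that every row directly below a row containing "gamecentre-playbyplay-event" is labelled as the goal, every row directly below a "goal" row is labelled as the primary assist, and every row directly below a "primary assist" row is labelled as the secondary assist.
The data looks like this: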
mydata
# A tibble: 15 x 1
value
<chr>
1 "<div class=\"gamecentre-playbyplay-event team-border--lhjmq-bat gamecentre-playby"
2 "<a href=\"/players/14695\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
3 "<a href=\"/players/16639\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
4 "<a href=\"/players/17027\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
5 "<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby"
6 "<a href=\"/players/17453\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
7 "<a href=\"/players/14639\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
8 "<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby"
9 "<a href=\"/players/18061\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
10 "<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
11 "<a href=\"/players/17522\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
12 "<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby"
13 "<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
14 "<a href=\"/players/14639\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
15 "<a href=\"/players/14757\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
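There are a few catches, though.
- I need to set conditions so that the rows are labelled correctly.
- If there is no "secondary assist" row, that value should be NA.
- If there is no "primary assist" row, that value should also be NA.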
I am trying to use dplyr::lag() for this, but the NAs I want when there is no primary or secondary assist are confusing me.
Here is the base of what I have so far:
goals <- mydata %>%
filter(dplyr::lag(str_detect(value, "gamecentre-playbyplay-event team-border"), 1))
goals
# A tibble: 4 x 1
value
<chr>
1 "<a href=\"/players/14695\" class=\"gamecentre__link gamecentre__link--goal\" data-re
2 "<a href=\"/players/17453\" class=\"gamecentre__link gamecentre__link--goal\" data-re
3 "<a href=\"/players/18061\" class=\"gamecentre__link gamecentre__link--goal\" data-re
4 "<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re
This is what I would like my data to look like at the end of all this. I think dplyr::lag() is the way to go, but I am not sure.
# A tibble: 4 x 3
goal primary_assist secondary_assist
<chr> <chr> <chr>
1 "<a href=\"/players/14695\" class=\"gam~ "<a href=\"/players/16639\" class=\"gamecent~ "<a href=\"/players/17027\" class=\"gamecentr~
2 "<a href=\"/players/17453\" class=\"gam~ "<a href=\"/players/14639\" class=\"gamecent~ NA
3 "<a href=\"/players/18061\" class=\"gam~ "<a href=\"/players/14752\" class=\"gamecent~ "<a href=\"/players/17522\" class=\"gamecentr~
4 "<a href=\"/players/14752\" class=\"gam~ "<a href=\"/players/14639\" class=\"gamecent~ "<a href=\"/players/14757\" class=\"gamecentr~
Any ideas?
Input:
mydata <- structure(list(value = c("<div class=\"gamecentre-playbyplay-event team-border--lhjmq-bat gamecentre-playby",
"<a href=\"/players/14695\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<a href=\"/players/16639\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<a href=\"/players/17027\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby",
"<a href=\"/players/17453\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<a href=\"/players/14639\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby",
"<a href=\"/players/18061\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<a href=\"/players/17522\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<div class=\"gamecentre-playbyplay-event team-border--lhjmq-mon gamecentre-playby",
"<a href=\"/players/14752\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<a href=\"/players/14639\" class=\"gamecentre__link gamecentre__link--goal\" data-re",
"<a href=\"/players/14757\" class=\"gamecentre__link gamecentre__link--goal\" data-re"
)), .Names = "value", class = c("tbl_df", "tbl", "data.frame"
), row.names = c(NA, -15L))
Answer
One option is to create a grouping variable and then spread the labelled rows from long to wide:
library(tidyverse)
mydata %>%
# create a group id that increments at each row containing 'playby'
group_by(grp = cumsum(str_detect(value, 'playby'))) %>%
# drop the first row of each group (the 'playby' event row itself)
filter(row_number() > 1) %>%
# label the remaining rows by their position within the group
mutate(categ = c("goal", "primary_assist", "secondary_assist")[row_number()]) %>%
# spread from long to wide; missing categories become NA
spread(categ, value) %>%
# remove the grouping column as part of clean up
ungroup() %>%
select(-grp)
# A tibble: 4 x 3
# goal primary_assist secondary_assist
# <chr> <chr> <chr>
#1 "<a href=\"/players/14695\" class=\"g… "<a href=\"/players/16639\" class=\"gamece… "<a href=\"/players/17027\" class=\"gamece…
#2 "<a href=\"/players/17453\" class=\"g… "<a href=\"/players/14639\" class=\"gamece… <NA>
#3 "<a href=\"/players/18061\" class=\"g… "<a href=\"/players/14752\" class=\"gamece… "<a href=\"/players/17522\" class=\"gamece…
#4 "<a href=\"/players/14752\" class=\"g… "<a href=\"/players/14639\" class=\"gamece… "<a href=\"/players/14757\" class=\"gamece…
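In recent versions of tidyr, spread() is superseded by pivot_wider(). The same grouping idea can be sketched with pivot_wider() as below. This is only a minimal sketch, not part of the original answer: it assumes tidyr 1.0.0 or later, and the toy tibble and the roles vector are hypothetical stand-ins for mydata that keep only the row pattern of the first two events.

```r
library(dplyr)
library(stringr)
library(tidyr)

# Hypothetical, shortened stand-in for `mydata`:
# event 1 has a goal and two assists, event 2 has a goal and one assist.
toy <- tibble(value = c(
  "<div class=\"gamecentre-playbyplay-event team-border--a",
  "<a href=\"/players/1\"",
  "<a href=\"/players/2\"",
  "<a href=\"/players/3\"",
  "<div class=\"gamecentre-playbyplay-event team-border--b",
  "<a href=\"/players/4\"",
  "<a href=\"/players/5\""
))

# Labels assigned by position within each event (illustrative name)
roles <- c("goal", "primary_assist", "secondary_assist")

result <- toy %>%
  # every event row starts a new group
  mutate(grp = cumsum(str_detect(value, "gamecentre-playbyplay-event"))) %>%
  group_by(grp) %>%
  # drop the event row itself, keeping only the player rows
  filter(row_number() > 1) %>%
  # label the remaining rows by their position within the event
  mutate(categ = roles[row_number()]) %>%
  ungroup() %>%
  # long to wide; a role that is missing from an event is filled with NA
  pivot_wider(names_from = categ, values_from = value) %>%
  select(-grp)

result
```

With the toy input, result has one row per event, and the second event's secondary_assist is NA because that event has only two player rows. Note that a column appears only if at least one event contains that role, so data with no secondary assists at all would yield no secondary_assist column.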