首页 > 解决方案 > Concatenate and count the number of elements in each row?

问题描述

df <- structure(list(ID = c("1", "2", "3", "4", "5", "6"), `ID without mask` = c(NA_character_, 
 NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), `Other Years` = c("2011", "2015", "2015", "2006, 2006, 2005, 2005, 2007", 
 "2014, 2011", "2007"), `Cut off Year` = c("2011", "2015", "2015", "2005", 
 "2011", "2007"), `2005` = c(NA, NA, NA, "30", "18", NA), `2006` = c(NA_character_, 
 NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
 ), `2007` = c("15", NA, "18", NA, "30, 18", NA), `2008` = c("16", 
 NA, NA, "30, 27", "18, 30", NA), `2009` = c("15", NA, NA, "20", 
 "30, 18", NA), `2010` = c(NA, NA, NA, "30, 20", NA, NA), `2011` = c(NA_character_, 
 NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
 ), `2012` = c(NA, NA, NA, "20, 30", NA, "26"), `2013` = c("15", 
 NA, "19", NA, NA, NA), `2014` = c(NA, NA, "18", NA, NA, NA), 
`2015` = c(NA, NA, "18", NA, "18", NA), `2016` = c(NA_character_, 
 NA_character_, NA_character_, NA_character_, NA_character_, 
 NA_character_)), .Names = c("ID", "ID without mask", 
 "Other Years", "Cut off Year", "2005", "2006", "2007", "2008", 
 "2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016"
 ), row.names = c(NA, 6L), class = "data.frame")

Given the above data frame. I would like R to concatenate and count every element (for the years) in a row together and then output it in a new column.

Based on the "Cut off Year" column, I want R to concatenate the elements before the Cut off Year together in one column and the elements after the cut off year together in a second column (cut off year included).

So for the first row with a cut off year of 2011, the years 2007, 2008, and 2009 has 15, 16, 15 respectively, so that is 3 in total, so R should output the number 3 in the new column. After 2011, only 2013 has an element with a entry, hence the "after" column will only have the number 1.

Elements such as "30, 27" count as two, etc.

Here's the desired output:

structure(list(ID = c("1", "2", "3", "4", "5", "6"), `ID without mask` = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), `Other Years` = c("2011", "2015", "2015", "2006, 2006, 2005, 2005, 2007", 
"2014, 2011", "2007"), `Cut off Year` = c("2011", "2015", "2015", "2005", 
"2011", "2007"), `2005` = c(NA, NA, NA, "30", "18", NA), `2006` = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), `2007` = c("15", NA, "18", NA, "30, 18", NA), `2008` = c("16", 
NA, NA, "30, 27", "18, 30", NA), `2009` = c("15", NA, NA, "20", 
"30, 18", NA), `2010` = c(NA, NA, NA, "30, 20", NA, NA), `2011` = c(NA_character_, 
NA_character_, NA_character_, NA_character_, NA_character_, NA_character_
), `2012` = c(NA, NA, NA, "20, 30", NA, "26"), `2013` = c("15", 
NA, "19", NA, NA, NA), `2014` = c(NA, NA, "18", NA, NA, NA), 
    `2015` = c(NA, NA, "18", NA, "18", NA), `2016` = c(NA_character_, 
    NA_character_, NA_character_, NA_character_, NA_character_, 
    NA_character_), `Before cut` = c("3", "0", "3", "0", "7", "0"), `After cut` = c("1", "0", "1", "8", "1", "1")), .Names = c("ID", "Collab Years Patents", 
"Collab Years Publications", "Cut off Year", "2005", "2006", "2007", "2008", 
"2009", "2010", "2011", "2012", "2013", "2014", "2015", "2016", "Before cut", "After cut"
), row.names = c(NA, 6L), class = "data.frame")

标签: r

解决方案


我们转换为一个不错的长格式,计算值,然后加入原始格式:

library(tidyr)
library(dplyr)
library(stringr)

gather(df, key = "Year", value = "value", `2005`:`2016`) %>%
    mutate(val_count = str_count(value, pattern = ",") + 1) %>%
    group_by(ID) %>%
    summarize(Before = sum(val_count * (Year < `Cut off Year`), na.rm = TRUE),
              After = sum(val_count * (Year >= `Cut off Year`), na.rm = TRUE)) %>%
    right_join(df) %>%
    select(1:3)
# Joining, by = "ID"
# A tibble: 6 x 3
     ID Before After
  <chr>  <dbl> <dbl>
1     1      3     1
2     2      0     0
3     3      3     1
4     4      0     8
5     5      7     1
6     6      0     1

select(1:3)过去只显示结果的相关部分 - 省略该行以获取所有其他列。为了获得值的数量,我们将逗号的数量加 1。


推荐阅读