首页 > 解决方案 > 使用其上方行中字符串的第一个单词重命名该行中的字符串

问题描述

我有一个财务报表数据框。数据框为长格式,我需要将其转换为宽数据框以计算新值。列中的某些值Gross Margin before Incentives和是重复的,但它们都属于由其各自值上方的字符串中的第一个单词表示的组。这是我的数据的示例:Cash IncentivesitemGross Margin before Incentives

item                           balance
  <chr>                            <dbl>
1 Happy Sales                    538246.
2 Happy COGS                     691013.
3 Gross Margin before Incentives 732979.
4 Cash Incentives                795271.
5 Sad Sales                      777362.
6 Sad COGS                       641143.

我需要用正上方值中的第一个单词重命名列中的Gross Margin before IncentivesCash Incentives值。例如, 和 中的第 3 行和第 4行应该是and 。有些情况下在组中未观察到。itemGross Margin before IncentivesitemGross Margin before IncentivesCash IncentivesHappy Gross Margin before IncentivesHappy Cash IncentivesCash Incentives

如何重命名实例Gross Margin before Incentives以及Cash Incentives何时发生?我首先对数据进行子集化,以仅包含item我需要修复的值。

到目前为止我的代码:

# subset the data 

strings_to_clean <- c("Sales", "COGS", "Gross Margin before Incentives", "Cash Incentives")
subset_data <- subset(example, grepl(paste(strings_to_clean, collapse="|"), item))

数据:

structure(list(item = c("Happy Sales", "Happy COGS", "Gross Margin before Incentives", 
"Cash Incentives", "Sad Sales", "Sad COGS", "Gross Margin before Incentives", 
"Cash Incentives", "Moody Sales", "Moody COGS", "Gross Margin before Incentives", 
"Cash Incentives", "Quiet Sales", "Quiet COGS", "Gross Margin before Incentives", 
"Loud Sales", "Loud COGS", "Gross Margin before Incentives"), 
    balance = c(538245.742201671, 691013.302590931, 732979.295329896, 
    795270.513380734, 777362.375851466, 641142.897993899, 631125.453588359, 
    719267.992685639, 691097.434737858, 787648.5870127, 773735.724156151, 
    839982.810591622, 594245.171673631, 817496.544758698, 696496.677601699, 
    640577.505041572, 715166.623728619, 646765.952258766)), row.names = c(NA, 
-18L), class = c("tbl_df", "tbl", "data.frame"))

标签: rstringdataframe

解决方案


使用dplyr包你可以做到这一点,通过利用数据的规则结构

library(dplyr)
# define vector of strings to use 
strings_to_clean <- 
    c("Sales", "COGS", "Gross Margin before Incentives", "Cash Incentives")
# code
df %>% 
    mutate(item = 
               ifelse(grepl(paste0(strings_to_clean[3:4], collapse = "|"), item), 
                      paste(gsub(paste0(c(strings_to_clean, "\\s+$"), collapse = "|"), "", lag(item, 2)), item), 
                      item))

这基本上是在说

(1) 如果item有“激励前毛利率”或“现金激励”

(2)item从两行向上连接到公司名称 ( lag(item, 2)) :您通过获取lag(item, 2)和删除所有字符串来找到公司名称strings_to_clean


推荐阅读