Is there a more efficient way to handle facts that are repeated in an R data frame?

Problem description

I have a data frame that looks like this:

ID <- c(1,1,1,2,2,2,2,3,3,3,3)
Fact <- c(233,233,233,50,50,50,50,15,15,15,15)
Overall_Category <- c("Purchaser","Purchaser","Purchaser","Car","Car","Car","Car","Car","Car","Car","Car")
Descriptor <- c("Country", "Gender", "Eyes", "Color", "Financed", "Type", "Transmission", "Color", "Financed", "Type", "Transmission")
Members <- c("America", "Male", "Brown", "Red", "Yes", "Sedan", "Manual", "Blue","No", "Van", "Automatic")

df <- data.frame(ID, Fact, Overall_Category, Descriptor, Members)

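The dimensions in the data frame work as follows:

The problem is that a single fact is repeated for a given ID, once for every dimension that applies to it. What I want is a way to show each fact only once, keyed by its ID, with the applicable dimensions stored against that single ID.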

I achieved this as follows:

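library(dplyr)
library(tidyr)
library(stringr)

# One column per Category-Descriptor-Member combination, holding the Fact
df1 <- pivot_wider(df,
                   id_cols = ID,
                   names_from = c(Overall_Category, Descriptor, Members),
                   names_prefix = "zzzz",
                   values_from = Fact,
                   names_sep = "-",
                   names_repair = "unique")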

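# Names of the columns created by the pivot
ColumnNames <- df1 %>% select(matches("zzzz")) %>% colnames()

# Recover the fact as the row mean of the non-NA pivoted columns
df2 <- df1 %>% mutate(mean_sel = rowMeans(select(., all_of(ColumnNames)), na.rm = TRUE))
# Replace each non-NA value with the name of its column
df3 <- df2 %>% mutate_at(ColumnNames, function(x) ifelse(!is.na(x), deparse(substitute(x)), NA))
# Join the column names into a single Descriptor string, then strip the prefix
df3 <- df3 %>% unite('Descriptor', all_of(ColumnNames), na.rm = TRUE, sep = "_")
df3 <- df3 %>% mutate_at("Descriptor", str_replace_all, "zzzz", "")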

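However, because of pivot_wider, this does not seem to work well for facts with many dimensions, and it does not look like a very efficient approach in general.

Is there a better way?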

Tags: r, dataframe, duplicates, hierarchical-data

Solution


I think you simply want paste() with its sep and collapse arguments:
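For readers unsure how the two arguments differ, here is a minimal sketch on toy vectors. The names category, descriptor, member, per_row and one_string are made up for illustration and are not part of the question's data.

```r
# Toy vectors (hypothetical, for illustration only)
category   <- c("Car", "Car")
descriptor <- c("Color", "Type")
member     <- c("Red", "Sedan")

# sep joins the arguments element-wise, giving one string per position
per_row <- paste(category, descriptor, member, sep = "-")
per_row

# collapse then joins that whole vector into a single string
one_string <- paste(per_row, collapse = "_")
one_string
```

The inner call builds one label per row and the outer call folds all labels of a group into one value, which is the pattern applied per ID in the dplyr code that follows.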

library(dplyr, warn.conflicts = FALSE)

df %>%
  group_by(ID, Fact) %>%
  summarise(
    # Inner paste builds one label per row; outer paste collapses them per group
    Descriptor = paste(paste(Overall_Category, Descriptor, Members, sep = '-'), collapse = '_'),
    .groups = 'drop'
  )

# A tibble: 3 x 3
     ID  Fact Descriptor                                                            
  <dbl> <dbl> <chr>                                                                 
1     1   233 Purchaser-Country-America_Purchaser-Gender-Male_Purchaser-Eyes-Brown  
2     2    50 Car-Color-Red_Car-Financed-Yes_Car-Type-Sedan_Car-Transmission-Manual 
3     3    15 Car-Color-Blue_Car-Financed-No_Car-Type-Van_Car-Transmission-Automatic
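If you would rather not depend on dplyr, the same idea can probably be expressed in base R with aggregate(). This is only a sketch under the assumption that each ID carries exactly one Fact value, as in the sample data; the Label column and the res object are names introduced here for illustration.

```r
# Sample data from the question
df <- data.frame(
  ID = c(1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3),
  Fact = c(233, 233, 233, 50, 50, 50, 50, 15, 15, 15, 15),
  Overall_Category = c("Purchaser", "Purchaser", "Purchaser",
                       "Car", "Car", "Car", "Car", "Car", "Car", "Car", "Car"),
  Descriptor = c("Country", "Gender", "Eyes", "Color", "Financed", "Type",
                 "Transmission", "Color", "Financed", "Type", "Transmission"),
  Members = c("America", "Male", "Brown", "Red", "Yes", "Sedan", "Manual",
              "Blue", "No", "Van", "Automatic")
)

# Assumption check: every ID should map to a single Fact value
facts_per_id <- tapply(df$Fact, df$ID, function(x) length(unique(x)))
stopifnot(all(facts_per_id == 1))

# Build one label per row (Label is a hypothetical helper column)
df$Label <- paste(df$Overall_Category, df$Descriptor, df$Members, sep = "-")

# Collapse the labels into one string per ID/Fact pair
res <- aggregate(Label ~ ID + Fact, data = df,
                 FUN = function(x) paste(x, collapse = "_"))
res
```

The rows of res may come back in a different order than in the dplyr output, so look rows up by ID rather than by position.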
