r - Applying multiple functions in R with case_when (R vectorization) when there are many categories/types
Problem description
Suppose I have a dataset of the following form:
City=c(1,2,2,1)
Business=c(2,1,1,2)
ExpectedRevenue=c(35,20,15,19)
zz=data.frame(City,Business,ExpectedRevenue)
zz_new=do.call("rbind", replicate(zz, n=30, simplify = FALSE))
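My actual dataset contains about 200K rows. It also contains information on more than 100 cities. Suppose that, for each city (which I also call a "type"), I have the following functions that need to be applied: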
#Writing the custom functions for the categories here
Type1=function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
return(BusinessMax)
}
Type2=function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)-100*rnorm(1)
return(BusinessMax)
}
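Again, the two functions above are very simple ones that I use for illustration. The idea is that, for each city (or "type"), I need to run a different function for every row in the dataset. In the two functions above, I use rnorm to check and make sure that a different value is drawn for each row.
Now, for the whole dataset, I first want to split the observations by city (or "type"). I can do this with (zz_new[["City"]]==1) [see also below], and then run the respective function for each class. However, when I run the code below, I get -Inf.
Can someone help me understand why this happens?
For the sample data, I expect to get 20 plus 10 times a random value (for Type = 1) and 35 minus 100 times a random value (for Type = 2). The value should also differ from row to row, since I draw them from a random normal distribution.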
library(dplyr) #I use dplyr here
zz_new[,"AdjustedRevenue"] = case_when(
zz_new[["City"]]==1~Type1(full_data=zz_new,observation=zz_new[,]),
zz_new[["City"]]==2~Type2(full_data=zz_new,observation=zz_new[,])
)
Thank you very much in advance.
Solution
Let's take a look at your code. I would rewrite your code
library(dplyr)
zz_new[,"AdjustedRevenue"] = case_when(
zz_new[["City"]]==1~Type1(full_data=zz_new,observation=zz_new[,]),
zz_new[["City"]]==2~Type2(full_data=zz_new,observation=zz_new[,])
)
to
zz_new %>%
mutate(AdjustedRevenue = case_when(City == 1 ~ Type1(zz_new,zz_new),
City == 2 ~ Type2(zz_new,zz_new)))
since you are using dplyr but are not using the powerful tools this package provides. Besides the use of mutate, one key change is that I replaced zz_new[,] with zz_new. Now we can see that both arguments of your Type functions are the same data frame.
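A quick way to confirm this is to compare the two objects directly. The sketch below rebuilds the sample data from the question and uses only base R; the names same_object and same_city are introduced here purely for illustration and are not part of the original code.

```r
# Rebuild the sample data from the question
zz <- data.frame(City = c(1, 2, 2, 1),
                 Business = c(2, 1, 1, 2),
                 ExpectedRevenue = c(35, 20, 15, 19))
zz_new <- do.call("rbind", replicate(30, zz, simplify = FALSE))

# Indexing with empty row and column selectors keeps every row and column
same_object <- isTRUE(all.equal(zz_new[, ], zz_new))

# So observation$City is the whole City column, not a single row's city
same_city <- zz_new$City == zz_new[, ]$City

print(same_object)
print(all(same_city))
```

In other words, passing zz_new[,] as the observation hands the function every row at once, not one row at a time.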
Next step: Take a look at your function
Type1 <- function(full_data,observation){
NewSet=full_data[which(!full_data$City==observation$City),]
BusinessMax = max(NewSet$ExpectedRevenue)+10*rnorm(1)
return(BusinessMax)
}
which is called as Type1(zz_new,zz_new). So the definition of NewSet gives us
NewSet=full_data[which(!full_data$City==observation$City),]
# replace the arguments
NewSet <- zz_new[which(!zz_new$City==zz_new$City),]
Thus NewSet is always a data frame with zero rows. Applying max to an empty column of a data.frame yields -Inf (with a warning), and adding 10*rnorm(1) to -Inf is still -Inf.
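The explanation above can be reproduced in a few lines, and it also points to one possible way around the problem. The sketch below is not the original answer's code. It uses base R only and assumes that each city's function differs only in a scale factor applied to the random draw, as in the two sample functions; real per-city functions may need a different structure. The names max_other_city, scale_by_city, base and noise are hypothetical.

```r
set.seed(1)

# Rebuild the sample data from the question
zz <- data.frame(City = c(1, 2, 2, 1),
                 Business = c(2, 1, 1, 2),
                 ExpectedRevenue = c(35, 20, 15, 19))
zz_new <- do.call("rbind", replicate(30, zz, simplify = FALSE))

# 1. Reproduce the problem: comparing the City column with itself is
#    TRUE everywhere, so the negated filter keeps no rows at all.
empty <- zz_new[which(!zz_new$City == zz_new$City), ]
empty_max <- suppressWarnings(max(empty$ExpectedRevenue))
print(nrow(empty))
print(empty_max)

# 2. One possible fix: compute the "max over other cities" once per city,
#    then draw one random value per row.
max_other_city <- function(data, city) {
  max(data$ExpectedRevenue[data$City != city])
}
cities <- sort(unique(zz_new$City))
other_max <- vapply(cities,
                    function(ct) max_other_city(zz_new, ct),
                    numeric(1))

# Map each row to the value computed for its own city
base <- other_max[match(zz_new$City, cities)]

# Assumed per-city scale factors: +10 for city 1, -100 for city 2
scale_by_city <- c("1" = 10, "2" = -100)
noise <- rnorm(nrow(zz_new))  # one draw per row, not one draw per call
zz_new$AdjustedRevenue <-
  base + unname(scale_by_city[as.character(zz_new$City)]) * noise

head(zz_new)
```

Drawing rnorm(nrow(zz_new)) matters here: a function that returns a single value on the right-hand side of case_when is recycled across all matching rows, so rnorm(1) would give every row of a city the same number even after the filter is fixed. A lookup table keyed by city also avoids writing 100+ case_when branches by hand.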