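r - How to calculate various probabilities from per-person counts
Problem description
I have a data frame df that contains each user's responses, as follows: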
userID pred_task obs1_task obs2_task exp1_task exp2_task postPOE_task
3108 H E E E M M
3207 H H E NA H M
3350 M H H NA H H
3961 E E E E E M
4021 H H E M H E
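With some preprocessing, I have been able to add features containing the following counts:
E = number of times the participant reported E,
M = number of times the participant reported M,
H = number of times the participant reported H.
In addition, I computed the bigram transitions (counts of two consecutive responses), where:
EE --> transition from E to E,
EM --> transition from E to M,
EH --> transition from E to H,
ME --> transition from M to E,
MM --> transition from M to M,
MH --> transition from M to H,
HE --> transition from H to E,
HM --> transition from H to M,
HH --> transition from H to H.
The updated df looks as follows: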
userID pred_tsk obs1_tsk obs2_tsk exp1_tsk exp2_tsk postPOE_tsk E M H EE EM EH MM ME MH HH HE HM
3108 H E E E M M 3 2 1 2 1 0 1 0 0 0 0 0
3207 H H E NA H M 1 1 3 0 0 0 0 0 0 1 1 1
3350 M H H NA H H 0 1 4 0 0 0 0 0 1 2 0 0
3961 E E E E E M 5 1 0 4 1 0 0 0 0 0 0 0
4021 H H E M H E 2 1 3 0 1 0 0 0 1 1 2 0
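Please note:
- Users can report on 6 different tasks.
- E, M, and H are counts of the number of times the user reported a task as Easy, Mixed, or Hard, respectively. These counts can sum to at most 6.
- EE, EM, EH, MM, ME, MH, HH, HE, and HM are transitions between reported responses. For example, if a user reports E on one task and then M on the next task, that is the transition EM. The other transitions are defined the same way.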
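The transition counting described in the notes can be sketched in R roughly as follows. This is only an illustration: the helper name count_transitions is hypothetical, and it assumes that a pair is skipped whenever either of its two responses is NA, which the question does not state explicitly.

```r
# Hypothetical helper: count bigram transitions in one user's response sequence.
# Assumption: a pair is ignored if either response in it is NA.
count_transitions <- function(responses, states = c("E", "M", "H")) {
  prev <- head(responses, -1)   # every response except the last
  nxt  <- tail(responses, -1)   # every response except the first
  keep <- !is.na(prev) & !is.na(nxt)
  table(prev = factor(prev[keep], levels = states),
        nxt  = factor(nxt[keep],  levels = states))
}

# Responses of user 3207 from the example table
tr <- count_transitions(c("H", "H", "E", NA, "H", "M"))
tr["H", "H"]  # count of H -> H
tr["H", "E"]  # count of H -> E
tr["H", "M"]  # count of H -> M
```

Rows of the resulting table are the previous state and columns are the next state, so tr["E", "M"] corresponds to the EM column of the data frame.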
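Question:
I am interested in computing the frequencies P(next), P(prev), and P(next|prev) for each reported state, i.e.,
P(next_E), P(next_M), P(next_H)
P(prev_E), P(prev_M), P(prev_H)
where the probabilities are defined as follows:
P(next) = proportion of times a state appears as the next state
P(prev) = proportion of times a state appears as the previous state
P(next|prev) = count(prev -> next) / count(prev)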
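One possible way to write these definitions down, assuming the transition counts are pooled over all users and that count(prev) means the number of transitions that start in prev. The symbols n_{rs} (number of transitions from state r to state s) and N (total number of transitions) are notation introduced here for illustration and do not appear in the question.

```latex
N = \sum_{r}\sum_{s} n_{rs}, \qquad
P(\text{next} = s) = \frac{\sum_{r} n_{rs}}{N}, \qquad
P(\text{prev} = r) = \frac{\sum_{s} n_{rs}}{N}, \qquad
P(\text{next} = s \mid \text{prev} = r) = \frac{n_{rs}}{\sum_{s'} n_{rs'}}
```

Under this reading, P(next) and P(prev) each sum to 1 over the three states, and for any previous state r that occurs at least once, P(next | prev = r) sums to 1 over the next states.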
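I know this question is a bit long. Thank you for reading it, and thanks for any hints.
The following post is somewhat related: How to calculate probability from frequencies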
Input (df)
structure(list(userID = structure(c(2L, 2L, 3L, 1L, 2L), .Label = c("E","H","M"), class = "factor"),
pred_task = structure(c(1L, 2L, 2L, 1L, 2L), .Label = c(" E", " H"), class = "factor"),
obs1_task = structure(c(1L, 1L, 2L, 1L, 1L), .Label = c(" E", " H"), class = "factor"),
obs2_task = structure(c(1L, 3L, 3L, 1L, 2L), .Label = c(" E", " M", " NA"), class = "factor"),
exp1_task = structure(c(3L, 2L, 2L, 1L, 2L), .Label = c("E", "H", "M"), class = "factor"),
exp2_task = structure(c(4L, 4L, 3L, 1L, 2L), .Label = c("", "E", "H", "M"), class = "factor"),
postPOE_task = structure(c(4L, 2L, 1L, 5L, 3L), .Label = c("0", "1", "2", "3", "M"), class = "factor"),
E = c(2L, 1L, 1L, 5L, 1L),
M = c(1L, 3L, 4L, 1L, 3L),
H = c(2L, 0L, 0L, 0L, 0L),
EE = c(1L, 0L, 0L, 4L, 1L),
EM = c(0L, 0L, 0L, 1L, 0L),
EH = c(1L, 0L, 0L, 0L, 0L),
MM = c(0L, 0L, 0L, 0L, 0L),
ME = c(0L, 0L, 1L, 0L, 1L),
MH = c(0L, 1L, 2L, 0L, 1L),
HH = c(0L, 1L, 0L, 0L, 2L),
HE = c(0L, 1L, 0L, 0L, 0L),
HM = c(NA, NA, NA, 0L, NA)),
class = "data.frame", row.names = c("3108", "3207", "3350", "3961", "4021"))
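Solution
Here is a function that computes the proportions you need, and it also works with a larger data frame. Just make sure the input data frame has exactly the same structure as your example df.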
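Because the function selects the transition columns by position (columns 11 to 19), a quick layout check before calling it may help. The sketch below is an assumption-laden illustration: has_expected_layout is a hypothetical helper, and the toy data frame exists only to demonstrate the check.

```r
# Hypothetical guard: confirm the nine transition columns sit in positions 11 to 19.
expected <- c("EE", "EM", "EH", "MM", "ME", "MH", "HH", "HE", "HM")

has_expected_layout <- function(df) {
  ncol(df) >= 19 && identical(colnames(df)[11:19], expected)
}

# Toy one-row data frame with the same column layout as the example df
toy <- as.data.frame(matrix(0L, nrow = 1, ncol = 19))
colnames(toy) <- c("userID", paste0("task", 1:6), "E", "M", "H", expected)

has_expected_layout(toy)                      # layout matches
has_expected_layout(toy[, c(1:10, 19:11)])    # transition columns reversed
```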
I also think I found discrepancies between your dput(df) and your updated df. I fixed the dput(df) so that it reflects the values in your example df:
# "fixed" df to reflect example
df <- structure(list(userID = structure(c(2L, 2L, 3L, 1L, 2L), .Label = c("E","H","M"), class = "factor"),
pred_task = structure(c(1L, 2L, 2L, 1L, 2L), .Label = c(" E", " H"), class = "factor"),
obs1_task = structure(c(1L, 1L, 2L, 1L, 1L), .Label = c(" E", " H"), class = "factor"),
obs2_task = structure(c(1L, 3L, 3L, 1L, 2L), .Label = c(" E", " M", " NA"), class = "factor"),
exp1_task = structure(c(3L, 2L, 2L, 1L, 2L), .Label = c("E", "H", "M"), class = "factor"),
exp2_task = structure(c(4L, 4L, 3L, 1L, 2L), .Label = c("", "E", "H", "M"), class = "factor"),
postPOE_task = structure(c(4L, 2L, 1L, 5L, 3L), .Label = c("0", "1", "2", "3", "M"), class = "factor"),
E = c(3L, 1L, 0L, 5L, 2L),
M = c(2L, 1L, 1L, 1L, 1L),
H = c(1L, 3L, 4L, 0L, 3L),
EE = c(2L, 0L, 0L, 4L, 0L),
EM = c(1L, 0L, 0L, 1L, 1L),
EH = c(0L, 0L, 0L, 0L, 0L),
MM = c(1L, 0L, 0L, 0L, 0L),
ME = c(0L, 0L, 0L, 0L, 0L),
MH = c(0L, 0L, 1L, 0L, 1L),
HH = c(0L, 1L, 2L, 0L, 1L),
HE = c(0L, 1L, 0L, 0L, 2L),
HM = c(0L, 1L, 0L, 0L, 0L)),
class = "data.frame", row.names = c("3108", "3207", "3350", "3961", "4021"))
Function:
transition.probs <- function(df) {
  require(dplyr)  # provides the %>% pipe
  # Columns 11 to 19 hold the nine transition counts EE ... HM
  df.states <- df[, c(11:19)]
  state.factors <- colnames(df.states)
  state.total <- sum(df.states, na.rm = TRUE)     # all transitions, all users
  state.sums <- colSums(df.states, na.rm = TRUE)  # pooled count per transition
  state.df <- data.frame(sums = state.sums, id = as.character(state.factors))
  #-----------------------------------------------------------------------------
  # Marginal counts: "E$" matches transitions ending in E (E is the next state),
  # "^E" matches transitions starting with E (E is the previous state)
  state.df[grepl("E$", state.df$id), ] %>%
    .[, 1] %>%
    sum() -> next.E.count
  state.df[grepl("^E", state.df$id), ] %>%
    .[, 1] %>%
    sum() -> prev.E.count
  state.df[grepl("M$", state.df$id), ] %>%
    .[, 1] %>%
    sum() -> next.M.count
  state.df[grepl("^M", state.df$id), ] %>%
    .[, 1] %>%
    sum() -> prev.M.count
  state.df[grepl("H$", state.df$id), ] %>%
    .[, 1] %>%
    sum() -> next.H.count
  state.df[grepl("^H", state.df$id), ] %>%
    .[, 1] %>%
    sum() -> prev.H.count
  #-----------------------------------------------------------------------------
  # P(next) and P(prev): share of all transitions
  next.E.p <- next.E.count / state.total
  prev.E.p <- prev.E.count / state.total
  next.M.p <- next.M.count / state.total
  prev.M.p <- prev.M.count / state.total
  next.H.p <- next.H.count / state.total
  prev.H.p <- prev.H.count / state.total
  #-----------------------------------------------------------------------------
  # Pooled count of each individual transition
  state.df[grepl("EE", state.df$id), ] %>%
    .[, 1] -> EE.count
  state.df[grepl("EM", state.df$id), ] %>%
    .[, 1] -> EM.count
  state.df[grepl("EH", state.df$id), ] %>%
    .[, 1] -> EH.count
  state.df[grepl("MM", state.df$id), ] %>%
    .[, 1] -> MM.count
  state.df[grepl("ME", state.df$id), ] %>%
    .[, 1] -> ME.count
  state.df[grepl("MH", state.df$id), ] %>%
    .[, 1] -> MH.count
  state.df[grepl("HH", state.df$id), ] %>%
    .[, 1] -> HH.count
  state.df[grepl("HE", state.df$id), ] %>%
    .[, 1] -> HE.count
  state.df[grepl("HM", state.df$id), ] %>%
    .[, 1] -> HM.count
  #-----------------------------------------------------------------------------
  # P(next | prev) = count(prev -> next) / count(prev); XY.X reads "X to Y, given X"
  EE.E <- EE.count / prev.E.count
  EM.E <- EM.count / prev.E.count
  EH.E <- EH.count / prev.E.count
  MM.M <- MM.count / prev.M.count
  ME.M <- ME.count / prev.M.count
  MH.M <- MH.count / prev.M.count
  HH.H <- HH.count / prev.H.count
  HE.H <- HE.count / prev.H.count
  HM.H <- HM.count / prev.H.count
  #-----------------------------------------------------------------------------
  state.summary <- data.frame(trans.state = as.factor(c(
    "next.E",
    "prev.E",
    "next.M",
    "prev.M",
    "next.H",
    "prev.H",
    "EE.E",
    "EM.E",
    "EH.E",
    "MM.M",
    "ME.M",
    "MH.M",
    "HH.H",
    "HE.H",
    "HM.H")),
    p = as.numeric(c(
      next.E.p,
      prev.E.p,
      next.M.p,
      prev.M.p,
      next.H.p,
      prev.H.p,
      EE.E,
      EM.E,
      EH.E,
      MM.M,
      ME.M,
      MH.M,
      HH.H,
      HE.H,
      HM.H)))
  state.summary
}
Output data frame:
transition.probs(df)
   trans.state        p
1       next.E 0.450000
2       prev.E 0.450000
3       next.M 0.250000
4       prev.M 0.150000
5       next.H 0.300000
6       prev.H 0.400000
7         EE.E 0.666667
8         EM.E 0.333333
9         EH.E 0.000000
10        MM.M 0.333333
11        ME.M 0.000000
12        MH.M 0.666667
13        HH.H 0.500000
14        HE.H 0.375000
15        HM.H 0.125000
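As a more compact cross-check, the same quantities can be derived from a 3 x 3 matrix of pooled transition counts. This is a sketch, not part of the original answer: the matrix entries are the column sums of the nine transition columns in the corrected df above, and the object names (counts, p_prev, p_next, p_next_given_prev) are made up for illustration.

```r
# Pooled transition counts from the corrected df
# (rows = previous state, columns = next state).
states <- c("E", "M", "H")
counts <- matrix(c(6, 3, 0,    # E -> E, M, H
                   0, 1, 2,    # M -> E, M, H
                   3, 1, 4),   # H -> E, M, H
                 nrow = 3, byrow = TRUE,
                 dimnames = list(prev = states, nxt = states))

p_prev <- rowSums(counts) / sum(counts)   # P(prev): share of transitions starting in each state
p_next <- colSums(counts) / sum(counts)   # P(next): share of transitions ending in each state
p_next_given_prev <- prop.table(counts, margin = 1)  # each row divided by its row sum

p_prev
p_next
round(p_next_given_prev, 3)
```

With margin = 1, prop.table normalises each row, so every row of p_next_given_prev sums to 1, and p_prev and p_next each sum to 1 across the three states. A previous state with no outgoing transitions would produce NaN in its row, so that case may need separate handling.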