How to calculate various probabilities from each person's counts

Problem description

I have a data frame df that contains each user's responses, as shown below:

userID pred_task obs1_task obs2_task exp1_task exp2_task postPOE_task

3108       H        E        E        E         M           M
3207       H        H        E        NA        H           M
3350       M        H        H        NA        H           H
3961       E        E        E        E         E           M
4021       H        H        E        M         H           E

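With some preprocessing, I have been able to add features containing the following counts:

E = number of times the participant reported E,

M = number of times the participant reported M,

H = number of times the participant reported H.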
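
As a minimal sketch of how such per-user counts could be produced: the data frame name `responses` below is an assumption, typed in by hand from the example table above, and the counting approach is only one of several possible ones.

```r
# Hypothetical reconstruction of the response columns from the example table
responses <- data.frame(
  pred_task    = c("H", "H", "M", "E", "H"),
  obs1_task    = c("E", "H", "H", "E", "H"),
  obs2_task    = c("E", "E", "H", "E", "E"),
  exp1_task    = c("E", NA,  NA,  "E", "M"),
  exp2_task    = c("M", "H", "H", "E", "H"),
  postPOE_task = c("M", "M", "H", "M", "E"),
  row.names    = c("3108", "3207", "3350", "3961", "4021")
)

# For each state, count per row how many of the six tasks carry that state.
# NA responses are ignored by na.rm = TRUE.
resp_mat <- as.matrix(responses)
state_counts <- sapply(c("E", "M", "H"),
                       function(s) rowSums(resp_mat == s, na.rm = TRUE))
state_counts
```

The result is a matrix with one row per user and one column per state; a row sums to less than 6 when a user has NA responses.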

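In addition, I have computed bigram transitions (counts of two consecutive responses), where:

EE --> transition from E to E,

EM --> transition from E to M,

EH --> transition from E to H,

ME --> transition from M to E,

MM --> transition from M to M,

MH --> transition from M to H,

HE --> transition from H to E,

HM --> transition from H to M,

HH --> transition from H to H.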
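
The transition counts defined above can be sketched for a single user as follows. The helper name `count_transitions` is hypothetical, and the sketch assumes that a pair is skipped when either of its two responses is NA, which is how users 3207 and 3350 appear to be counted in the example data.

```r
# Hypothetical helper: count bigram transitions in one user's response sequence.
count_transitions <- function(responses, states = c("E", "M", "H")) {
  prev <- head(responses, -1)   # every response except the last
  nxt  <- tail(responses, -1)   # every response except the first
  keep <- !is.na(prev) & !is.na(nxt)   # assumption: drop pairs touching an NA
  pairs <- paste0(prev[keep], nxt[keep])
  # All nine labels in the order EE, EM, EH, ME, MM, MH, HE, HM, HH
  labels <- as.vector(t(outer(states, states, paste0)))
  counts <- table(factor(pairs, levels = labels))  # absent transitions become 0
  setNames(as.vector(counts), names(counts))
}

# Responses of user 3207 from the example table
res <- count_transitions(c("H", "H", "E", NA, "H", "M"))
res
```

For this sequence only HH, HE, and HM are counted once each; the E to H step across the NA is not counted.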

The updated df looks as follows:

userID pred_tsk obs1_tsk obs2_tsk exp1_tsk exp2_tsk postPOE_tsk  E   M   H   EE  EM  EH  MM  ME  MH  HH  HE  HM
3108       H        E        E        E         M           M    3   2   1   2   1    0   1   0   0   0   0   0
3207       H        H        E        NA        H           M    1   1   3    0   0   0   0   0   0   1   1   1
3350       M        H        H        NA        H           H    0   1   4    0   0   0   0   0   1   2   0   0
3961       E        E        E        E         E           M    5   1   0    4   1   0   0   0   0   0   0   0
4021       H        H        E        M         H           E    2   1   3    0   1   0   0   0   1   1   2   0

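Please note:

. Users can report on 6 different tasks.

. E, M, and H count how many times a user reported a task as Easy, Mixed, or Hard, respectively. The sum of these counts can be (at most) 6.

. EE, EM, EH, MM, ME, MH, HH, HE, and HM are transitions between reported responses. For example, if a user reports E on one task and then M on the next task, that is the transition EM. The same applies to the other transitions.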

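Question:

I am interested in computing the frequencies P(next), P(prev), and P(next|prev) for each reported state, i.e.,

P(next_E), P(next_M), P(next_H)

P(prev_E), P(prev_M), P(prev_H)

where the probabilities are defined as follows:

P(next) = proportion of times a state occurs as the next state

P(prev) = proportion of times a state occurs as the previous state

P(next|prev) = count(prev -> next) / count(prev)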
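
To make the three formulas concrete, here is a small worked sketch for one user. It assumes the transition counts are arranged as a matrix with the previous state in rows and the next state in columns; the matrix `m` is filled by hand with the counts of user 4021 from the updated table (EM = 1, MH = 1, HH = 1, HE = 2), and all variable names are illustrative.

```r
states <- c("E", "M", "H")

# Transition counts for user 4021: rows = previous state, columns = next state
m <- matrix(c(0, 1, 0,    # prev = E: EE, EM, EH
              0, 0, 1,    # prev = M: ME, MM, MH
              2, 0, 1),   # prev = H: HE, HM, HH
            nrow = 3, byrow = TRUE,
            dimnames = list(prev = states, nxt = states))

p_prev <- rowSums(m) / sum(m)   # P(prev): share of transitions starting in each state
p_next <- colSums(m) / sum(m)   # P(next): share of transitions ending in each state

# P(next | prev): each row divided by its row total, i.e. count(prev -> next) / count(prev)
p_next_given_prev <- prop.table(m, margin = 1)

p_prev
p_next
p_next_given_prev
```

Both P(prev) and P(next) sum to 1 across the three states, and each row of the conditional matrix sums to 1 as long as that state occurs at least once as a previous state.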

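I know this question is a bit long. Thank you for reading it, and thanks for any hints.

The following post is somewhat related: How to calculate probabilities from frequencies

Input (df)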

structure(list(userID = structure(c(2L, 2L, 3L, 1L, 2L), .Label = c("E","H","M"), class = "factor"), 
pred_task = structure(c(1L, 2L, 2L, 1L, 2L), .Label = c(" E", " H"), class = "factor"),
obs1_task = structure(c(1L, 1L, 2L, 1L, 1L), .Label = c(" E", " H"), class = "factor"),
obs2_task = structure(c(1L, 3L, 3L, 1L, 2L), .Label = c(" E", " M", " NA"), class = "factor"), 
exp1_task = structure(c(3L, 2L, 2L, 1L, 2L), .Label = c("E", "H", "M"), class = "factor"),
exp2_task = structure(c(4L, 4L, 3L, 1L, 2L), .Label = c("", "E", "H", "M"), class = "factor"), 
postPOE_task = structure(c(4L, 2L, 1L, 5L, 3L), .Label = c("0", "1", "2", "3", "M"), class = "factor"), 
E = c(2L, 1L, 1L, 5L, 1L),
M = c(1L, 3L, 4L, 1L, 3L), 
H = c(2L, 0L, 0L, 0L, 0L), 
EE = c(1L, 0L, 0L, 4L, 1L), 
EM = c(0L, 0L, 0L, 1L, 0L), 
EH = c(1L, 0L, 0L, 0L, 0L), 
MM = c(0L, 0L, 0L, 0L, 0L), 
ME = c(0L, 0L, 1L, 0L, 1L), 
MH = c(0L, 1L, 2L, 0L, 1L), 
HH = c(0L, 1L, 0L, 0L, 2L), 
HE = c(0L, 1L, 0L, 0L, 0L), 
HM = c(NA, NA, NA, 0L, NA)), 
class = "data.frame", row.names = c("3108", "3207", "3350", "3961", "4021"))

Tags: r, conditional-statements, sequence, probability, prop

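Solution


Here is a function that computes the proportions you need, even for a larger data frame. Just make sure the input data frame has exactly the same structure as your example df.

Also, I think I found a discrepancy between your dput(df) and your updated df. I fixed the dput(df) to reflect the values in your example df.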

# "fixed" df to reflect example    
df <- structure(list(userID = structure(c(2L, 2L, 3L, 1L, 2L), .Label = c("E","H","M"), class = "factor"), 
                 pred_task = structure(c(1L, 2L, 2L, 1L, 2L), .Label = c(" E", " H"), class = "factor"),
                 obs1_task = structure(c(1L, 1L, 2L, 1L, 1L), .Label = c(" E", " H"), class = "factor"),
                 obs2_task = structure(c(1L, 3L, 3L, 1L, 2L), .Label = c(" E", " M", " NA"), class = "factor"), 
                 exp1_task = structure(c(3L, 2L, 2L, 1L, 2L), .Label = c("E", "H", "M"), class = "factor"),
                 exp2_task = structure(c(4L, 4L, 3L, 1L, 2L), .Label = c("", "E", "H", "M"), class = "factor"), 
                 postPOE_task = structure(c(4L, 2L, 1L, 5L, 3L), .Label = c("0", "1", "2", "3", "M"), class = "factor"), 
                 E = c(3L, 1L, 0L, 5L, 2L),
                 M = c(2L, 1L, 1L, 1L, 1L), 
                 H = c(1L, 3L, 4L, 0L, 3L), 
                 EE = c(2L, 0L, 0L, 4L, 0L), 
                 EM = c(1L, 0L, 0L, 1L, 1L), 
                 EH = c(0L, 0L, 0L, 0L, 0L), 
                 MM = c(1L, 0L, 0L, 0L, 0L), 
                 ME = c(0L, 0L, 0L, 0L, 0L), 
                 MH = c(0L, 0L, 1L, 0L, 1L), 
                 HH = c(0L, 1L, 2L, 0L, 1L), 
                 HE = c(0L, 1L, 0L, 0L, 2L), 
                 HM = c(0L, 1L, 0L, 0L, 0L)), 
            class = "data.frame", row.names = c("3108", "3207", "3350", "3961", "4021"))

Function:

transition.probs <- function(df) {
  library(dplyr)  # only needed for the %>% pipe used below
  # Columns 11:19 must hold the transition counts EE, EM, EH, MM, ME, MH, HH, HE, HM
  df.states <- df[, c(11:19)]
  state.factors <- colnames(df.states)
  state.total <- sum(df.states, na.rm = TRUE)     # all transitions, pooled over users
  state.sums <- colSums(df.states, na.rm = TRUE)  # pooled count per transition
  state.df <-data.frame(sums = state.sums, id = as.character(state.factors))
  #-------------------------------------------------------------------------------
  # Pooled counts of each state as the next state (second letter of the id)
  # and as the previous state (first letter of the id)
  state.df[grepl("E$", state.df$id), ] %>%
    .[, 1] %>%
    sum() -> next.E.count
  state.df[grepl("^E", state.df$id), ] %>%
    .[, 1] %>%
    sum() -> prev.E.count
  state.df[grepl("M$", state.df$id), ] %>%
    .[, 1] %>%
    sum() -> next.M.count
  state.df[grepl("^M", state.df$id), ] %>%
    .[, 1] %>%
    sum() -> prev.M.count
  state.df[grepl("H$", state.df$id), ] %>%
    .[, 1] %>%
    sum() -> next.H.count
  state.df[grepl("^H", state.df$id), ] %>%
    .[, 1] %>%
    sum() -> prev.H.count
  #-------------------------------------------------------------------------------
  # P(next) and P(prev): share of all pooled transitions
  next.E.p <- next.E.count / state.total
  prev.E.p <- prev.E.count / state.total
  next.M.p <- next.M.count / state.total
  prev.M.p <- prev.M.count / state.total
  next.H.p <- next.H.count / state.total
  prev.H.p <- prev.H.count / state.total
  #-------------------------------------------------------------------------------
  # Pooled count of each individual transition
  state.df[grepl("EE", state.df$id), ] %>%
    .[, 1] -> EE.count
  state.df[grepl("EM", state.df$id), ] %>%
    .[, 1] -> EM.count
  state.df[grepl("EH", state.df$id), ] %>%
    .[, 1] -> EH.count
  state.df[grepl("MM", state.df$id), ] %>%
    .[, 1] -> MM.count
  state.df[grepl("ME", state.df$id), ] %>%
    .[, 1] -> ME.count
  state.df[grepl("MH", state.df$id), ] %>%
    .[, 1] -> MH.count
  state.df[grepl("HH", state.df$id), ] %>%
    .[, 1] -> HH.count
  state.df[grepl("HE", state.df$id), ] %>%
    .[, 1] -> HE.count
  state.df[grepl("HM", state.df$id), ] %>%
    .[, 1] -> HM.count
  #-------------------------------------------------------------------------------
  # P(next | prev) = count(prev -> next) / count(prev); NaN if count(prev) is 0
  EE.E <- EE.count / prev.E.count
  EM.E <- EM.count / prev.E.count
  EH.E <- EH.count / prev.E.count
  MM.M <- MM.count / prev.M.count
  ME.M <- ME.count / prev.M.count
  MH.M <- MH.count / prev.M.count
  HH.H <- HH.count / prev.H.count
  HE.H <- HE.count / prev.H.count
  HM.H <- HM.count / prev.H.count
  #-------------------------------------------------------------------------------
  # Labels read "transition.prev", e.g. EM.E is P(next = M | prev = E)
  state.summary <- data.frame(trans.state = as.factor(c(
    "next.E",
    "prev.E",
    "next.M",
    "prev.M",
    "next.H",
    "prev.H",
    "EE.E",
    "EM.E",
    "EH.E",
    "MM.M",
    "ME.M",
    "MH.M",
    "HH.H",
    "HE.H",
    "HM.H")),
    p = as.numeric(c(
      next.E.p,
      prev.E.p,
      next.M.p,
      prev.M.p,
      next.H.p,
      prev.H.p,
      EE.E,
      EM.E,
      EH.E,
      MM.M,
      ME.M,
      MH.M,
      HH.H,
      HE.H,
      HM.H)))
  state.summary
}

Output data frame:

transition.probs(df)
trans.state        p
1       next.E 0.450000
2       prev.E 0.450000
3       next.M 0.250000
4       prev.M 0.150000
5       next.H 0.300000
6       prev.H 0.400000
7         EE.E 0.666667
8         EM.E 0.333333
9         EH.E 0.000000
10        MM.M 0.333333
11        ME.M 0.000000
12        MH.M 0.666667
13        HH.H 0.500000
14        HE.H 0.375000
15        HM.H 0.125000
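
The function above pools the counts over all users. If per-user conditional probabilities are wanted instead, one possible sketch is the following. It is not part of the original answer: the data frame `trans` is an assumption that holds only the nine transition columns of the corrected df, and `cond` is an illustrative name.

```r
# Transition counts per user, copied from the corrected df (assumed name: trans)
trans <- data.frame(
  EE = c(2, 0, 0, 4, 0), EM = c(1, 0, 0, 1, 1), EH = c(0, 0, 0, 0, 0),
  MM = c(1, 0, 0, 0, 0), ME = c(0, 0, 0, 0, 0), MH = c(0, 0, 1, 0, 1),
  HH = c(0, 1, 2, 0, 1), HE = c(0, 1, 0, 0, 2), HM = c(0, 1, 0, 0, 0),
  row.names = c("3108", "3207", "3350", "3961", "4021")
)

states <- c("E", "M", "H")

# For each previous state, divide its outgoing transition counts by their
# per-user total, giving P(next | prev) separately for every user.
cond <- do.call(cbind, lapply(states, function(p) {
  cols <- paste0(p, states)        # e.g. "EE", "EM", "EH" for p = "E"
  out  <- as.matrix(trans[, cols])
  out / rowSums(out)               # NaN where the user never reported p as previous state
}))
round(cond, 3)
```

Each row holds one user's conditional probabilities. Entries are NaN when the user has no transitions out of that state, for example all M and H columns for user 3961.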
