Complex code for many fisher.test p-values in R

Problem description

I am a beginner in R, so the following is quite complicated for me.

I have the following data.frame with data from the 5 boroughs of New York City for the years 2012-2015. For each year there are two categories: P and Q.

Data

 input_df = data.frame(
      Manhattan=c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0), 
      Brooklyn=c(0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0), 
      Queens=c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0), 
      The_Bronx=c(1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0), 
      Staten_Island=c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0), 
      "2012"=c("P", "P", "P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q"), 
      "2013"=c("P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q"), 
      "2014"=c("P", "P", "P", "Q", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "P", "Q", "P", "P", "P", "Q", "Q"), 
      "2015"=c("P", "P", "P", "P", "P", "Q", "Q", "Q", "P", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q"), 
 check.names=FALSE)

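I would like to use fisher.test to determine systematically whether, for any two boroughs, incidents ("1") are more frequent in category P than in category Q (or vice versa).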

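So, for example: in 2012, do incidents in Manhattan and Brooklyn coincide (both "1" in the same row) more often in category P than in category Q? That is 4 out of 10 for P and 0 out of 9 for Q, so fisher.test(matrix(c(4,6,0,9), nrow=2))$p.value equals 0.08668731.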
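To make the arithmetic concrete, the 2x2 table for this example can be built directly from the two borough columns and the 2012 labels. This is only a minimal base-R sketch; the names `manhattan`, `brooklyn`, `cat_2012`, `both`, and `counts` are illustrative and not part of the original question.

```r
# Manhattan and Brooklyn flags and the 2012 categories, copied from input_df
manhattan <- c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0)
brooklyn  <- c(0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0)
cat_2012  <- rep(c("P", "Q"), times = c(10, 9))

# TRUE where both boroughs have an incident in the same row;
# explicit levels keep both rows of the table even if one is empty
both <- factor(manhattan == 1 & brooklyn == 1, levels = c(TRUE, FALSE))

# Rows: both / not both; columns: P / Q
counts <- table(both, category = cat_2012)
counts

p_value <- fisher.test(counts)$p.value
p_value
#> [1] 0.08668731
```

Read column by column, `counts` holds 4, 6, 0, 9, which is the same matrix as in the hand-written call above.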

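Is there a way to do this systematically? Please see the simple start below and my ideal output data.frame. I would be happy with anything close to this output. Thank you.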

Code (just a start)

 library(reshape2)
 input_df <- melt(input_df, measure.vars = 6:9) # transform the data
 # can maybe use: function x {fisher.test(matrix(x, nrow=2))}
 # how to proceed?

Ideal output

 # ideally hoping to get output similar to this:
 output_df = data.frame(
 borough_1=c("Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manhattan", "etc"), 
 borough_2=c("Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn", "Queens", "Queens", "etc"),
 year=c("2012", "2013", "2014", "2015", "2012", "2013", "etc"), 
 P_both_boroughs_1=c("4", "2", "1", "2", "4", "4", "etc"), 
 P_not_both_boroughs_1=c("6", "10", "8", "6", "6", "8", "etc"), 
 Q_both_boroughs_1=c("0", "2", "3", "2", "1", "1", "etc"), 
 Q_not_both_boroughs_1=c("9", "5", "7", "9", "8", "6", "etc"), 
 fisher.test.pval=c("0.086687307", "0.602683179", "0.582043344", "1", "0.303405573", "0.602683179", "etc"), 
 check.names=FALSE)

Edit for @user2974951

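user2974951, could you help me get the same code to run smoothly on the following alternative input_df? If I use this input_df, it unfortunately throws an error because tmp3 is no longer a 2x2 table. I would sincerely appreciate your help. Thank you.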

 input_df = data.frame(
      Manhattan=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0), 
      Brooklyn=c(0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0), 
      Queens=c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0), 
      The_Bronx=c(1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0), 
      Staten_Island=c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0), 
      "2012"=c("P", "P", "P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q"), 
      "2013"=c("P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q"), 
      "2014"=c("P", "P", "P", "Q", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "P", "Q", "P", "P", "P", "Q", "Q"), 
      "2015"=c("P", "P", "P", "P", "P", "Q", "Q", "Q", "P", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q"), 
 check.names=FALSE)

Tags: r

Solution


I would approach this problem as follows. First I load the packages I will use for the analysis

# packages
library(dplyr)
library(tidyr)
library(purrr)

and create the dataset.

# data
input_df <- tibble(
  Manhattan = c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0),
  Brooklyn = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0),
  Queens = c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0),
  The_Bronx = c(1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
  Staten_Island = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
  "2012" = c("P", "P", "P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q"),
  "2013" = c("P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q"),
  "2014" = c("P", "P", "P", "Q", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "P", "Q", "P", "P", "P", "Q", "Q"),
  "2015" = c("P", "P", "P", "P", "P", "Q", "Q", "Q", "P", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q")
)
head(input_df)
#> # A tibble: 6 x 9
#>   Manhattan Brooklyn Queens The_Bronx Staten_Island `2012` `2013` `2014`
#>       <dbl>    <dbl>  <dbl>     <dbl>         <dbl> <chr>  <chr>  <chr> 
#> 1         1        0      1         1             0 P      P      P     
#> 2         1        0      1         1             0 P      P      P     
#> 3         0        0      0         0             0 P      P      P     
#> 4         1        1      0         0             0 P      P      Q     
#> 5         1        0      1         0             0 P      P      Q     
#> 6         1        1      1         0             0 P      P      P     
#> # ... with 1 more variable: `2015` <chr>

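Then I change your dataset from a wide to a long structure. The columns year and borough take the values 2012, ..., 2015 and Manhattan, ..., Staten_Island, while the columns category and flag take the corresponding values for each combination of borough and year in the dataset. I need this structure for the functions that follow.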

# tidying
tidy_input_df <- input_df %>%
  gather("year", "category", `2012`:`2015`) %>%
  gather("borough", "flag", -category, -year)
tidy_input_df
#> # A tibble: 380 x 4
#>    year  category borough    flag
#>    <chr> <chr>    <chr>     <dbl>
#>  1 2012  P        Manhattan     1
#>  2 2012  P        Manhattan     1
#>  3 2012  P        Manhattan     0
#>  4 2012  P        Manhattan     1
#>  5 2012  P        Manhattan     1
#>  6 2012  P        Manhattan     1
#>  7 2012  P        Manhattan     1
#>  8 2012  P        Manhattan     0
#>  9 2012  P        Manhattan     1
#> 10 2012  P        Manhattan     1
#> # ... with 370 more rows
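For readers on a newer tidyr: `gather()` still works but has been superseded by `pivot_longer()` since tidyr 1.0.0. As a hedged sketch, assuming tidyr >= 1.0.0 and a small made-up data frame (`toy_df` is hypothetical and much shorter than the question's data), the same two-step reshape could look like this:

```r
library(tidyr)

# Toy data with the same layout as input_df (hypothetical values)
toy_df <- data.frame(
  Manhattan = c(1, 0, 1),
  Brooklyn  = c(0, 1, 1),
  "2012"    = c("P", "P", "Q"),
  "2013"    = c("P", "Q", "Q"),
  check.names = FALSE
)

# Step 1: year columns -> (year, category)
long_years <- pivot_longer(toy_df, cols = c("2012", "2013"),
                           names_to = "year", values_to = "category")

# Step 2: borough columns -> (borough, flag)
tidy_toy_df <- pivot_longer(long_years, cols = c("Manhattan", "Brooklyn"),
                            names_to = "borough", values_to = "flag")

dim(tidy_toy_df)  # 3 rows x 2 years x 2 boroughs = 12 rows, 4 columns
```

The rows come out in a different order than with `gather()`, but the set of (year, category, borough, flag) rows is the same, which is all the later steps rely on.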

I also need a vector with all the borough names

borough <- unique(tidy_input_df$borough)

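Now I have to reshape your dataset so that, for each year, I have two columns containing every possible pair of boroughs (i.e. Manhattan - Brooklyn, Manhattan - Queens, and so on) together with the corresponding values. Since I need to repeat the same procedure for every year, I nest the data by year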

nested_input_df <- nest(tidy_input_df, -year)
nested_input_df
#> # A tibble: 4 x 2
#>   year  data             
#>   <chr> <list>           
#> 1 2012  <tibble [95 x 3]>
#> 2 2013  <tibble [95 x 3]>
#> 3 2014  <tibble [95 x 3]>
#> 4 2015  <tibble [95 x 3]>

and create a new function that performs the procedure I described above. I can now use the nest-map approach described here.
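For readers who have not seen the nest-map pattern before, a toy sketch may help: nest the data into one row per group, then map a function over the list-column. The data (`toy`) and the summary function are made up for illustration. The sketch assumes tidyr >= 1.0.0, where the columns to nest are passed as a named argument; the unnamed form `nest(-year)` used in this answer dates from an older tidyr and may emit a deprecation warning on current versions.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Toy data (hypothetical): two groups with a numeric column
toy <- tibble(
  g = c("a", "a", "b", "b", "b"),
  x = c(1, 2, 3, 4, 5)
)

# nest: one row per group, remaining columns packed into the list-column `data`
nested <- nest(toy, data = c(x))

# map: apply a function to each nested tibble, returning one number per group
toy_result <- nested %>%
  mutate(mean_x = map_dbl(data, function(d) mean(d$x))) %>%
  select(-data)
toy_result
```

The answer follows the same shape twice: first with `map()` and a function that returns a data frame per year, then with `map_dbl()` and a function that returns one p-value per year and borough pair.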

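The first part of the function adds a new column to the data frame holding a unique ID within each combination of category and borough. The second part creates a new data frame with all combinations of boroughs taken two at a time and joins the corresponding values of flag and category (i.e. 0/1 and P/Q) onto it.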

create_boroughs_combinations <- function(data, borough) {
  # Create a unique ID for all combinations of category
  # and borough
  data <- data %>%
    group_by(category, borough) %>%
    mutate(ID = 1:n()) %>%
    ungroup()

  # Create all combinations of n boroughs taken 2 at a time. 
  t(combn(length(borough), 2)) %>%
  # transform that matrix into a tibble
    as_tibble(.name_repair = ~ c("borough_1", "borough_2")) %>%
  # associate each matrix value to the corresponding borough name
    mutate(borough_1 = borough[borough_1], borough_2 = borough[borough_2]) %>%
  # join the two data frames on the name of the first borough
    inner_join(data, by = c("borough_1" = "borough")) %>%
  # join the two data frames on the name of the second borough, the category
  # and the unique ID
    inner_join(data, by = c("borough_2" = "borough", "category", "ID")) %>%
  # create a new variable that checks if the incidents occurred at the same time
    mutate(equal = factor(flag.x == 1 & flag.y == 1, levels = c(TRUE, FALSE)))
}
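To see what the `combn()` step inside the function produces on its own, here is a small standalone base-R sketch that lists the borough pairs without any joins. `idx` and `borough_pairs` are illustrative names.

```r
borough <- c("Manhattan", "Brooklyn", "Queens", "The_Bronx", "Staten_Island")

# combn() returns a 2 x 10 matrix of index pairs;
# t() turns each pair into one row
idx <- t(combn(length(borough), 2))

# Replace the indices by the borough names
borough_pairs <- data.frame(
  borough_1 = borough[idx[, 1]],
  borough_2 = borough[idx[, 2]]
)
nrow(borough_pairs)
#> [1] 10
```

Ten pairs for each of the four years is where the 40 rows of the final result come from.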

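Now I can apply that function to nested_input_df using the map function. I have to use map because I need to apply the function separately for each year. This is the result: flag.x is the value of flag for the first borough, while flag.y is the value of flag for the second borough.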

unnested_input_df <- nested_input_df %>%
  mutate(data = map(data, create_boroughs_combinations, borough = borough)) %>%
  unnest()
unnested_input_df
#> # A tibble: 760 x 8
#>    year  borough_1 borough_2 category flag.x    ID flag.y equal
#>    <chr> <chr>     <chr>     <chr>     <dbl> <int>  <dbl> <fct>
#>  1 2012  Manhattan Brooklyn  P             1     1      0 FALSE
#>  2 2012  Manhattan Brooklyn  P             1     2      0 FALSE
#>  3 2012  Manhattan Brooklyn  P             0     3      0 FALSE
#>  4 2012  Manhattan Brooklyn  P             1     4      1 TRUE 
#>  5 2012  Manhattan Brooklyn  P             1     5      0 FALSE
#>  6 2012  Manhattan Brooklyn  P             1     6      1 TRUE 
#>  7 2012  Manhattan Brooklyn  P             1     7      0 FALSE
#>  8 2012  Manhattan Brooklyn  P             0     8      0 FALSE
#>  9 2012  Manhattan Brooklyn  P             1     9      1 TRUE 
#> 10 2012  Manhattan Brooklyn  P             1    10      1 TRUE 
#> # ... with 750 more rows

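Now I can use the same idea and create a new function that computes the p-value of Fisher's test, then apply it to each combination of year and pair of boroughs. I nest my data again: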

nested_input_df <- unnested_input_df %>%
  nest(-year, -borough_1, -borough_2)
nested_input_df
#> # A tibble: 40 x 4
#>    year  borough_1 borough_2     data             
#>    <chr> <chr>     <chr>         <list>           
#>  1 2012  Manhattan Brooklyn      <tibble [19 x 5]>
#>  2 2012  Manhattan Queens        <tibble [19 x 5]>
#>  3 2012  Manhattan The_Bronx     <tibble [19 x 5]>
#>  4 2012  Manhattan Staten_Island <tibble [19 x 5]>
#>  5 2012  Brooklyn  Queens        <tibble [19 x 5]>
#>  6 2012  Brooklyn  The_Bronx     <tibble [19 x 5]>
#>  7 2012  Brooklyn  Staten_Island <tibble [19 x 5]>
#>  8 2012  Queens    The_Bronx     <tibble [19 x 5]>
#>  9 2012  Queens    Staten_Island <tibble [19 x 5]>
#> 10 2012  The_Bronx Staten_Island <tibble [19 x 5]>
#> # ... with 30 more rows

define the function:

run_fisher_test <- function(data) {
  data <- data %>%
    select(category, equal)

  fisher.test(table(data))$p.value
}
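The explicit `levels = c(TRUE, FALSE)` used for `equal` matters for data like the alternative `input_df` in the question's edit, where a pair of boroughs may never coincide in a given year. A short sketch with made-up vectors (`category` and `coincide` are illustrative, not taken from the data) shows the difference:

```r
category <- rep(c("P", "Q"), times = c(10, 9))
coincide <- rep(FALSE, 19)  # the two boroughs never coincide

# Without explicit levels the TRUE column is dropped: a 2 x 1 table,
# which fisher.test() rejects with an error
dim(table(category, coincide))
#> [1] 2 1

# With explicit levels the empty TRUE column is kept: a 2 x 2 table
equal <- factor(coincide, levels = c(TRUE, FALSE))
tab <- table(category, equal)
dim(tab)
#> [1] 2 2

p_value <- fisher.test(tab)$p.value
p_value
#> [1] 1
```

Because `run_fisher_test()` tabulates `category` against this two-level factor, it should always hand `fisher.test()` a 2x2 table as long as both P and Q occur in the year being tested.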

apply it, and this is the result:

result <- nested_input_df %>%
  mutate(p.value = map_dbl(data, run_fisher_test)) %>%
  select(-data)
result
#> # A tibble: 40 x 4
#>    year  borough_1 borough_2     p.value
#>    <chr> <chr>     <chr>           <dbl>
#>  1 2012  Manhattan Brooklyn       0.0867
#>  2 2012  Manhattan Queens         0.303 
#>  3 2012  Manhattan The_Bronx      0.303 
#>  4 2012  Manhattan Staten_Island  1     
#>  5 2012  Brooklyn  Queens         1     
#>  6 2012  Brooklyn  The_Bronx      1     
#>  7 2012  Brooklyn  Staten_Island  1     
#>  8 2012  Queens    The_Bronx      0.350 
#>  9 2012  Queens    Staten_Island  1     
#> 10 2012  The_Bronx Staten_Island  1     
#> # ... with 30 more rows

Created on 2019-09-10 by the reprex package (v0.3.0)

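I hope this is clear. If you have any questions, please comment on this post. I know this is not the simplest approach, but I really like the nest-map approach, and once you understand it, it is very flexible.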
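The ideal output in the question also lists the four cell counts next to each p-value, which the `result` tibble above does not contain. As an alternative sketch, not the nest-map pipeline of this answer, a plain base-R loop over years and borough pairs can collect both. The column names `P_both`, `P_not_both`, `Q_both`, and `Q_not_both` are shortened, assumed versions of the names in the question.

```r
input_df <- data.frame(
  Manhattan     = c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0),
  Brooklyn      = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0),
  Queens        = c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0),
  The_Bronx     = c(1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
  Staten_Island = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
  "2012" = c("P", "P", "P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q"),
  "2013" = c("P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q"),
  "2014" = c("P", "P", "P", "Q", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "P", "Q", "P", "P", "P", "Q", "Q"),
  "2015" = c("P", "P", "P", "P", "P", "Q", "Q", "Q", "P", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q"),
  check.names = FALSE
)

boroughs <- names(input_df)[1:5]
years    <- names(input_df)[6:9]
pair_idx <- t(combn(length(boroughs), 2))  # one row per borough pair

rows <- list()
for (yr in years) {
  for (i in seq_len(nrow(pair_idx))) {
    b1 <- boroughs[pair_idx[i, 1]]
    b2 <- boroughs[pair_idx[i, 2]]
    # Explicit levels keep the table 2 x 2 even when a cell is empty
    both     <- factor(input_df[[b1]] == 1 & input_df[[b2]] == 1,
                       levels = c(TRUE, FALSE))
    category <- factor(input_df[[yr]], levels = c("P", "Q"))
    tab <- table(both, category)
    rows[[length(rows) + 1]] <- data.frame(
      borough_1  = b1,
      borough_2  = b2,
      year       = yr,
      P_both     = as.integer(tab["TRUE", "P"]),
      P_not_both = as.integer(tab["FALSE", "P"]),
      Q_both     = as.integer(tab["TRUE", "Q"]),
      Q_not_both = as.integer(tab["FALSE", "Q"]),
      p.value    = fisher.test(tab)$p.value
    )
  }
}
output_df <- do.call(rbind, rows)
head(output_df)
```

The loop produces one row per year and borough pair (40 in total), in the same order as `result`.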

