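r - Complicated code for many fisher.test p-values in R
Problem description
I am a beginner in R, so the following is very complicated for me.
I have the following data.frame with data from the 5 boroughs of New York City for the years 2012-2015. For each year there are two categories: P and Q.
Data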
input_df = data.frame(
Manhattan=c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0),
Brooklyn=c(0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0),
Queens=c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0),
The_Bronx=c(1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
Staten_Island=c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
"2012"=c("P", "P", "P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q"),
"2013"=c("P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q"),
"2014"=c("P", "P", "P", "Q", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "P", "Q", "P", "P", "P", "Q", "Q"),
"2015"=c("P", "P", "P", "P", "P", "Q", "Q", "Q", "P", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q"),
check.names=FALSE)
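I would like to systematically determine whether incidents ("1") that occur in any two boroughs together are more frequent in category P than in category Q (or vice versa), using fisher.test.
So, for example: in 2012, do incidents in Manhattan and Brooklyn co-occur (both "1" in the same row) more often in category P than in category Q? That is 4 out of 10 for P and 0 out of 9 for Q, so fisher.test(matrix(c(4,6,0,9), nrow=2))$p.value equals 0.08668731.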
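The hand count above can be reproduced in base R. This is only a minimal sketch: the variable names (manhattan, brooklyn, cat_2012, both, counts) are made up for illustration, and the vectors are copied from the corresponding columns of input_df.

```r
# Columns copied from input_df (Manhattan, Brooklyn and the "2012" category)
manhattan <- c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0)
brooklyn  <- c(0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0)
cat_2012  <- rep(c("P", "Q"), times = c(10, 9))

# TRUE where both boroughs have an incident in the same row
both <- manhattan == 1 & brooklyn == 1

# Cell counts in the same order as matrix(c(4, 6, 0, 9), nrow = 2)
counts <- c(
  P_both     = sum(both  & cat_2012 == "P"),
  P_not_both = sum(!both & cat_2012 == "P"),
  Q_both     = sum(both  & cat_2012 == "Q"),
  Q_not_both = sum(!both & cat_2012 == "Q")
)
counts

p_value <- fisher.test(matrix(counts, nrow = 2))$p.value
p_value
```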
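Is there a way to do this systematically? Please see my simple start below and my ideal output data.frame. I would be happy with anything close to this output. Thank you.
Code (just a start)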
library(reshape2)
input_df <- melt(input_df, measure.vars = 6:9) # transform the data
# can maybe use: function x {fisher.test(matrix(x, nrow=2))}
# how to proceed?
Ideal output
# ideally hoping to get output similar to this:
output_df = data.frame(
borough_1=c("Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manhattan", "Manhattan", "etc"),
borough_2=c("Brooklyn", "Brooklyn", "Brooklyn", "Brooklyn", "Queens", "Queens", "etc"),
year=c("2012", "2013", "2014", "2015", "2012", "2013", "etc"),
P_both_boroughs_1=c("4", "2", "1", "2", "4", "4", "etc"),
P_not_both_boroughs_1=c("6", "11", "8", "6", "6", "8", "etc"),
Q_both_boroughs_1=c("0", "2", "3", "2", "1", "1", "etc"),
Q_not_both_boroughs_1=c("9", "5", "7", "9", "8", "6", "etc"),
fisher.test.pval=c("0.086687307", "0.586790506", "0.582043344", "1", "0.303405573", "0.602683179", "etc"),
check.names=FALSE)
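Edit for @user2974951
user2974951, could you help me get the same code to run smoothly on the following alternative input_df? If I use this input_df, it unfortunately throws an error because tmp3 is no longer a 2x2 table. I would sincerely appreciate your help. Thank you.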
input_df = data.frame(
Manhattan=c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0),
Brooklyn=c(0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0),
Queens=c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0),
The_Bronx=c(1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
Staten_Island=c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
"2012"=c("P", "P", "P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q"),
"2013"=c("P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q"),
"2014"=c("P", "P", "P", "Q", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "P", "Q", "P", "P", "P", "Q", "Q"),
"2015"=c("P", "P", "P", "P", "P", "Q", "Q", "Q", "P", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q"),
check.names=FALSE)
Solution
I would approach this problem as follows. First I load the packages that I will use for the analysis
# packages
library(dplyr)
library(tidyr)
library(purrr)
and create the dataset.
# data
input_df <- tibble(
Manhattan = c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0),
Brooklyn = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0),
Queens = c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0),
The_Bronx = c(1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
Staten_Island = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
"2012" = c("P", "P", "P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q"),
"2013" = c("P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q"),
"2014" = c("P", "P", "P", "Q", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "P", "Q", "P", "P", "P", "Q", "Q"),
"2015" = c("P", "P", "P", "P", "P", "Q", "Q", "Q", "P", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q")
)
head(input_df)
#> # A tibble: 6 x 9
#> Manhattan Brooklyn Queens The_Bronx Staten_Island `2012` `2013` `2014`
#> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <chr>
#> 1 1 0 1 1 0 P P P
#> 2 1 0 1 1 0 P P P
#> 3 0 0 0 0 0 P P P
#> 4 1 1 0 0 0 P P Q
#> 5 1 0 1 0 0 P P Q
#> 6 1 1 1 0 0 P P P
#> # ... with 1 more variable: `2015` <chr>
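Then I change your dataset from wide to long format. The columns year and borough take the values 2012, ..., 2015 and Manhattan, ..., Staten_Island, while the columns category and flag hold the corresponding values for each combination of borough and year in the dataset. I need this structure for the functions that follow.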
# tidying
tidy_input_df <- input_df %>%
gather("year", "category", `2012`:`2015`) %>%
gather("borough", "flag", -category, -year)
tidy_input_df
#> # A tibble: 380 x 4
#> year category borough flag
#> <chr> <chr> <chr> <dbl>
#> 1 2012 P Manhattan 1
#> 2 2012 P Manhattan 1
#> 3 2012 P Manhattan 0
#> 4 2012 P Manhattan 1
#> 5 2012 P Manhattan 1
#> 6 2012 P Manhattan 1
#> 7 2012 P Manhattan 1
#> 8 2012 P Manhattan 0
#> 9 2012 P Manhattan 1
#> 10 2012 P Manhattan 1
#> # ... with 370 more rows
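In current tidyr releases gather() is superseded by pivot_longer(). Assuming tidyr 1.0.0 or later, the same two-step reshaping could be sketched as below. The small toy_df is a hypothetical stand-in for input_df (two boroughs, two years, three rows), and the row order of the result differs from what gather() produces.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with the same layout as input_df
toy_df <- tibble(
  Manhattan = c(1, 0, 1),
  Brooklyn  = c(0, 0, 1),
  "2012"    = c("P", "P", "Q"),
  "2013"    = c("P", "Q", "Q")
)

tidy_toy_df <- toy_df %>%
  # year columns -> (year, category) pairs
  pivot_longer(cols = c(`2012`, `2013`),
               names_to = "year", values_to = "category") %>%
  # borough columns -> (borough, flag) pairs
  pivot_longer(cols = -c(year, category),
               names_to = "borough", values_to = "flag")

tidy_toy_df
```

With 3 rows, 2 year columns and 2 borough columns, the long table has 3 * 2 * 2 = 12 rows, in the same way that the full data gives 19 * 4 * 5 = 380.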
I also need a vector with the names of all the boroughs
borough <- unique(tidy_input_df$borough)
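Now I have to modify your dataset so that, for each year, I have two columns containing every possible pair of boroughs (i.e. Manhattan - Brooklyn, Manhattan - Queens, and so on) together with the corresponding values. Since I need to repeat the same process for each year, I nest the data by year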
nested_input_df <- nest(tidy_input_df, -year)
nested_input_df
#> # A tibble: 4 x 2
#> year data
#> <chr> <list>
#> 1 2012 <tibble [95 x 3]>
#> 2 2013 <tibble [95 x 3]>
#> 3 2014 <tibble [95 x 3]>
#> 4 2015 <tibble [95 x 3]>
and create a new function that performs the process I described above. I can now use the nest-map approach described here.
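For readers who have not seen the nest-map idiom, here is a minimal sketch on hypothetical toy data. It assumes tidyr 1.0.0 or later, where nest() takes a named argument; the column n_rows and the data in toy_long are invented for illustration.

```r
library(dplyr)
library(tidyr)
library(purrr)

# Hypothetical long-format data: three rows spread over two years
toy_long <- tibble(
  year    = c("2012", "2012", "2013"),
  borough = c("Manhattan", "Brooklyn", "Manhattan"),
  flag    = c(1, 0, 1)
)

nested <- toy_long %>%
  # one row per year; the remaining columns are packed into a list-column
  nest(data = c(borough, flag)) %>%
  # map_int() applies nrow() to each nested tibble separately
  mutate(n_rows = map_int(data, nrow))

nested
```

Each element of the data list-column is an ordinary tibble, so any function that works on a data frame can be mapped over the years in the same way.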
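The first part of the function adds a new column to the data frame that holds a unique ID within each combination of category and borough, while the second part builds a new data frame with all combinations of the boroughs taken 2 at a time and joins the corresponding flag and category values (i.e. 0/1 and P/Q) to it.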
create_boroughs_combinations <- function(data, borough) {
  # Number the rows within each combination of category and borough,
  # so that the same incident (row) can be matched across boroughs
  data <- data %>%
    group_by(category, borough) %>%
    mutate(ID = 1:n()) %>%
    ungroup()
  # Create all combinations of n boroughs taken 2 at a time.
  t(combn(length(borough), 2)) %>%
    # transform that matrix into a tibble
    as_tibble(.name_repair = ~ c("borough_1", "borough_2")) %>%
    # associate each matrix value to the corresponding borough name
    mutate(borough_1 = borough[borough_1], borough_2 = borough[borough_2]) %>%
    # join the two data frames on the name of the first borough
    inner_join(data, by = c("borough_1" = "borough")) %>%
    # join the two data frames on the name of the second borough, the category
    # and the unique ID
    inner_join(data, by = c("borough_2" = "borough", "category", "ID")) %>%
    # create a new variable that checks if the incidents occurred at the same time
    mutate(equal = factor(flag.x == 1 & flag.y == 1, levels = c(TRUE, FALSE)))
}
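The pairing step rests on combn(), which enumerates every unordered pair exactly once. A small base R sketch of that step alone, where the object name borough_pairs is made up for illustration:

```r
borough <- c("Manhattan", "Brooklyn", "Queens", "The_Bronx", "Staten_Island")

# combn() returns a 2 x choose(5, 2) matrix; transposing gives one pair per row
borough_pairs <- t(combn(borough, 2))
colnames(borough_pairs) <- c("borough_1", "borough_2")

nrow(borough_pairs)   # choose(5, 2) = 10 pairs
head(borough_pairs, 4)
```

Passing the names directly to combn() gives the same pairs as indexing with combn(length(borough), 2), which is what the function above does before mapping the indices back to names.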
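Now I can apply the function to nested_input_df using map. I have to use map because I need to apply the function to each year separately. This is the result. flag.x is the value of flag for the first borough, while flag.y is the value of flag for the second borough.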
unnested_input_df <- nested_input_df %>%
mutate(data = map(data, create_boroughs_combinations, borough = borough)) %>%
unnest()
unnested_input_df
#> # A tibble: 760 x 8
#> year borough_1 borough_2 category flag.x ID flag.y equal
#> <chr> <chr> <chr> <chr> <dbl> <int> <dbl> <fct>
#> 1 2012 Manhattan Brooklyn P 1 1 0 FALSE
#> 2 2012 Manhattan Brooklyn P 1 2 0 FALSE
#> 3 2012 Manhattan Brooklyn P 0 3 0 FALSE
#> 4 2012 Manhattan Brooklyn P 1 4 1 TRUE
#> 5 2012 Manhattan Brooklyn P 1 5 0 FALSE
#> 6 2012 Manhattan Brooklyn P 1 6 1 TRUE
#> 7 2012 Manhattan Brooklyn P 1 7 0 FALSE
#> 8 2012 Manhattan Brooklyn P 0 8 0 FALSE
#> 9 2012 Manhattan Brooklyn P 1 9 1 TRUE
#> 10 2012 Manhattan Brooklyn P 1 10 1 TRUE
#> # ... with 750 more rows
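Now I can use the same idea: create a new function that computes the p-value of Fisher's test and apply it to each combination of year and pair of boroughs. I nest my data again: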
nested_input_df <- unnested_input_df %>%
nest(-year, -borough_1, -borough_2)
nested_input_df
#> # A tibble: 40 x 4
#> year borough_1 borough_2 data
#> <chr> <chr> <chr> <list>
#> 1 2012 Manhattan Brooklyn <tibble [19 x 5]>
#> 2 2012 Manhattan Queens <tibble [19 x 5]>
#> 3 2012 Manhattan The_Bronx <tibble [19 x 5]>
#> 4 2012 Manhattan Staten_Island <tibble [19 x 5]>
#> 5 2012 Brooklyn Queens <tibble [19 x 5]>
#> 6 2012 Brooklyn The_Bronx <tibble [19 x 5]>
#> 7 2012 Brooklyn Staten_Island <tibble [19 x 5]>
#> 8 2012 Queens The_Bronx <tibble [19 x 5]>
#> 9 2012 Queens Staten_Island <tibble [19 x 5]>
#> 10 2012 The_Bronx Staten_Island <tibble [19 x 5]>
#> # ... with 30 more rows
Define the function:
run_fisher_test <- function(data) {
  data <- data %>%
    select(category, equal)
  fisher.test(table(data))$p.value
}
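Because equal was created as a factor with explicit levels c(TRUE, FALSE), table() keeps both columns even when a pair of boroughs never co-occurs, so fisher.test() still receives a 2x2 table. This appears to be the situation behind the error mentioned in the question's edit, although that is an assumption. A minimal sketch with made-up vectors (category, co_occurs):

```r
category  <- rep(c("P", "Q"), times = c(10, 9))
co_occurs <- rep(FALSE, 19)   # hypothetical pair that never co-occurs

# A plain logical vector loses the TRUE column
tab_plain <- table(category, co_occurs)
dim(tab_plain)   # 2 1

# A factor with explicit levels keeps an all-zero TRUE column
tab_fixed <- table(category, equal = factor(co_occurs, levels = c(TRUE, FALSE)))
dim(tab_fixed)   # 2 2

p_value <- fisher.test(tab_fixed)$p.value
p_value
```

Passing tab_plain to fisher.test() would fail, since the function requires at least 2 rows and 2 columns.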
Apply it; this is the result:
result <- nested_input_df %>%
mutate(p.value = map_dbl(data, run_fisher_test)) %>%
select(-data)
result
#> # A tibble: 40 x 4
#> year borough_1 borough_2 p.value
#> <chr> <chr> <chr> <dbl>
#> 1 2012 Manhattan Brooklyn 0.0867
#> 2 2012 Manhattan Queens 0.303
#> 3 2012 Manhattan The_Bronx 0.303
#> 4 2012 Manhattan Staten_Island 1
#> 5 2012 Brooklyn Queens 1
#> 6 2012 Brooklyn The_Bronx 1
#> 7 2012 Brooklyn Staten_Island 1
#> 8 2012 Queens The_Bronx 0.350
#> 9 2012 Queens Staten_Island 1
#> 10 2012 The_Bronx Staten_Island 1
#> # ... with 30 more rows
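Created on 2019-09-10 by the reprex package (v0.3.0)
I hope this is clear. If you have any questions, please comment on this post. I know this is not the simplest approach, but I really like the nest-map approach, and once you understand it, it is very flexible.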
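The result above contains only the p-values. If the four cell counts from the question's ideal output are also wanted, a plain base R loop is one possible alternative. This is a sketch, not part of the reprex above: the column names P_both, P_not_both, Q_both and Q_not_both are shortened versions of the ones in the question, and the counts are computed directly from input_df, so they may differ from hand-typed values in the question.

```r
input_df <- data.frame(
  Manhattan     = c(1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0),
  Brooklyn      = c(0, 0, 0, 1, 0, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0),
  Queens        = c(1, 1, 0, 0, 1, 1, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 1, 0, 0),
  The_Bronx     = c(1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0, 0, 1, 1, 1, 0, 0),
  Staten_Island = c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0),
  "2012" = c("P", "P", "P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q", "Q"),
  "2013" = c("P", "P", "P", "P", "P", "P", "P", "P", "Q", "Q", "P", "P", "P", "P", "Q", "Q", "Q", "Q", "Q"),
  "2014" = c("P", "P", "P", "Q", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "P", "Q", "P", "P", "P", "Q", "Q"),
  "2015" = c("P", "P", "P", "P", "P", "Q", "Q", "Q", "P", "Q", "P", "P", "Q", "Q", "Q", "Q", "Q", "Q", "Q"),
  check.names = FALSE, stringsAsFactors = FALSE
)

boroughs <- names(input_df)[1:5]
years    <- names(input_df)[6:9]
pairs    <- t(combn(boroughs, 2))   # one pair of boroughs per row

rows <- list()
for (i in seq_len(nrow(pairs))) {
  # TRUE where both boroughs of the pair have an incident in the same row
  both <- input_df[[pairs[i, 1]]] == 1 & input_df[[pairs[i, 2]]] == 1
  for (yr in years) {
    is_p   <- input_df[[yr]] == "P"
    counts <- c(sum(both & is_p), sum(!both & is_p),
                sum(both & !is_p), sum(!both & !is_p))
    rows[[length(rows) + 1]] <- data.frame(
      borough_1 = pairs[i, 1], borough_2 = pairs[i, 2], year = yr,
      P_both = counts[1], P_not_both = counts[2],
      Q_both = counts[3], Q_not_both = counts[4],
      fisher.test.pval = fisher.test(matrix(counts, nrow = 2))$p.value,
      stringsAsFactors = FALSE
    )
  }
}
output_df <- do.call(rbind, rows)
head(output_df)
```

The loop visits 10 pairs and 4 years, giving 40 rows, one for each row of the result table above.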