r - deciding on the number of significant digits in a data frame
Problem description
I have a huge data frame; a sample of 3 columns and 11 rows is given below:
df <- structure(list(A = c(61960, 273, 439, 38877, 75325, 80929,
23028, 57240, 10140, 25775, 7286), B = c(10, 12, 11, 13, 2, 1, 1,
1, 1, 1, 1), C = c(122, 140, 163, 12, 190, 16, 14, 18, 15, 17, 16
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA,
-11L))
For each column of the data frame, I would like to calculate the median number of significant digits for each order of magnitude in that column.
For example, column A above contains 3 orders of magnitude (10^3, 10^4, 10^5). The first number has 4 significant digits (the trailing zero doesn't count), the second has 3, and so on.
The output should be a list for each column: the first element is a vector containing the orders of magnitude, and the second is a vector of the median number of significant digits for each of them. Since I expect one such list per column, the overall output would be a list of lists. For example, for column A:
L[["A"]] = list(c(5,4,3), c(5, 4, 3))
Why is this the list? In column A there are 3 different orders of magnitude: 10^5, 10^4, 10^3. The median number of significant digits for the 10^5 o.o.m is 5, for 10^4, 4, and for 10^3, 3.
Is there a way to do this efficiently, with something like mutate or map (not apply, because this would be the same as using a loop)?
Solution
We can do this by looping over the columns with lapply. Within each column, group by the nchar of the value (its total digit count), remove the 0s at the end with sub, and take the median of the remaining digit counts with tapply. tapply returns a named vector whose names are the grouping values, so we return a list of those names along with the medians.
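lapply(df, function(x) {
  # x is coerced to character implicitly; this assumes fixed (non-scientific) notation
  # group by total digit count, take the median digit count after dropping trailing zeros
  x1 <- tapply(nchar(sub("0+$", "", x)), nchar(x), FUN = median)
  # names of x1 are the group keys (digit counts); values are the medians
  list(as.integer(names(x1)), as.numeric(x1))
})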
#$A
#$A[[1]]
#[1] 3 4 5
#$A[[2]]
#[1] 3 4 5
#$B
#$B[[1]]
#[1] 1 2
#$B[[2]]
#[1] 1 2
#$C
#$C[[1]]
#[1] 2 3
#$C[[2]]
#[1] 2.0 2.5
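As a hedged variation on the base R approach above, the sketch below makes two adjustments. First, it converts the numbers to text explicitly with format(scientific = FALSE), because implicit coercion can turn large round values such as 1e5 into "1e+05", which would throw off both the grouping and the trailing-zero removal. Second, it returns the groups in decreasing order, matching the c(5, 4, 3) layout requested in the question. The helper name sig_digit_medians is hypothetical, and the sketch assumes the columns hold positive whole numbers.

```r
# Sample data from the question, rebuilt as a plain data frame
df <- data.frame(
  A = c(61960, 273, 439, 38877, 75325, 80929, 23028, 57240, 10140, 25775, 7286),
  B = c(10, 12, 11, 13, 2, 1, 1, 1, 1, 1, 1),
  C = c(122, 140, 163, 12, 190, 16, 14, 18, 15, 17, 16)
)

# Hypothetical helper (not part of the original answer).
# Assumes x contains positive whole numbers.
sig_digit_medians <- function(x) {
  # Fixed notation, no padding: avoids strings such as "1e+05"
  s <- format(x, scientific = FALSE, trim = TRUE)
  # Median count of digits left after dropping trailing zeros,
  # grouped by the total number of digits
  med <- tapply(nchar(sub("0+$", "", s)), nchar(s), median)
  # Largest digit count first, as in the question's expected output
  med <- med[order(as.integer(names(med)), decreasing = TRUE)]
  list(as.integer(names(med)), as.numeric(med))
}

L <- lapply(df, sig_digit_medians)
L[["A"]]
```

For the sample data, L[["A"]] holds the digit counts 5, 4, 3 and the medians 5, 4, 3.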
Or this can also be done with tidyverse, returning the result as a single dataset
library(tidyverse)
df %>%
  mutate_all(str_remove, "0+$") %>%
  map2_dfr(., df, ~
    tibble(x = nchar(.x), grp = nchar(.y)) %>%
      group_by(grp) %>%
      summarise(x = median(x)), .id = 'colName')
# A tibble: 7 x 3
#  colName   grp     x
#  <chr>   <int> <dbl>
#1 A           3   3
#2 A           4   4
#3 A           5   5
#4 B           1   1
#5 B           2   2
#6 C           2   2
#7 C           3   2.5
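The same single-table result can also be sketched by reshaping to long format first, which avoids mutate_all (superseded in current dplyr) and the paired map2_dfr call. This is an alternative under stated assumptions, not the original answer's method: it assumes dplyr 1.0 or later (for the .groups argument) and tidyr 1.0 or later (for pivot_longer), and that all values are positive whole numbers.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- tibble(
  A = c(61960, 273, 439, 38877, 75325, 80929, 23028, 57240, 10140, 25775, 7286),
  B = c(10, 12, 11, 13, 2, 1, 1, 1, 1, 1, 1),
  C = c(122, 140, 163, 12, 190, 16, 14, 18, 15, 17, 16)
)

res <- df %>%
  # One row per (column, value) pair
  pivot_longer(everything(), names_to = "colName", values_to = "value") %>%
  mutate(
    s   = format(value, scientific = FALSE, trim = TRUE),  # fixed-notation text
    grp = nchar(s),                                        # total number of digits
    x   = nchar(sub("0+$", "", s))                         # digits without trailing zeros
  ) %>%
  group_by(colName, grp) %>%
  # as.numeric keeps the result type the same for every group
  summarise(x = median(as.numeric(x)), .groups = "drop")

res
```

Grouping by both colName and grp in one pass keeps every column's groups in a single tibble with the same colName, grp and x columns as the output shown above.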