Deciding on the number of significant digits in a data frame

Problem description

I have a huge data frame; a sample of 3 columns and 11 rows is given below:

df <- structure(list(A = c(61960, 273, 439, 38877, 75325, 80929, 
23028, 57240, 10140, 25775, 7286), B = c(10, 12, 11, 13, 2, 1, 1, 
1, 1, 1, 1), C = c(122, 140, 163, 12, 190, 16, 14, 18, 15, 17, 16
)), class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, 
-11L))

For each column of the data frame, I would like to calculate the median number of significant digits for each order of magnitude in that column.

So for example, column A above contains 3 orders of magnitude (written here as 10^3, 10^4, 10^5 for 3-, 4-, and 5-digit numbers). The first number, 61960, has 4 significant digits (the trailing zero doesn't count), the second, 273, has 3, and so on.

My output should be a list for each column, with one element a vector containing the orders of magnitude and the second a vector of the median number of significant digits. Since I expect one list per column, the overall output would be a list of lists. For example, for column A:

L[["A"]] = list(c(5,4,3), c(5, 4, 3))

Why is this the list? In column A there are 3 different orders of magnitude: 10^5, 10^4, 10^3. The median number of significant digits for the 10^5 o.o.m is 5, for 10^4, 4, and for 10^3, 3.

Is there a way to do this efficiently, with something like mutate or map (not apply, because that would be the same as using a loop)?

Tags: r, dataframe, vectorization

Solution


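We can do this by looping over the columns with lapply. For each column, we group the values by their total digit count (nchar of the column), remove the trailing 0s with sub, and take the median of the remaining digit counts with tapply. tapply returns a named vector whose names are the values of the grouping variable, so we return a list holding those names (converted to integer) along with the medians.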

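lapply(df, function(x) {
  # digits left after dropping trailing 0s, grouped by total digit count
  x1 <- tapply(nchar(sub("0+$", "", x)), nchar(x), FUN = median)
  # names(x1) holds the grouping values (the digit counts)
  list(as.integer(names(x1)), as.numeric(x1))
})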
#$A
#$A[[1]]
#[1] 3 4 5

#$A[[2]]
#[1] 3 4 5


#$B
#$B[[1]]
#[1] 1 2

#$B[[2]]
#[1] 1 2


#$C
#$C[[1]]
#[1] 2 3

#$C[[2]]
#[1] 2.0 2.5
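
One caveat with the approach above: nchar() and sub() coerce the numeric column through as.character(), which may render some values (for example 100000) in scientific notation and distort the digit counts. A minimal sketch of a workaround, assuming the columns hold positive whole numbers, is to format the values with sprintf() first. The vector x below is made up for illustration and is not part of the original data.

```r
# Hypothetical sample column (an assumption, not from the question);
# 100000 is a value that as.character() may print as "1e+05"
x <- c(100000, 123456, 61960, 273, 439)

# sprintf("%.0f", ...) gives fixed notation for whole numbers
digits <- sprintf("%.0f", x)

# significant digits: characters left after stripping trailing zeros
sig <- nchar(sub("0+$", "", digits))

# grouping variable: total number of digits
grp <- nchar(digits)

# median significant digits per digit-count group
res <- tapply(sig, grp, FUN = median)
out <- list(as.integer(names(res)), as.numeric(res))
out
```

The same two lines (sprintf followed by nchar) could replace the direct nchar(x) calls inside the lapply version if such values are expected.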

Alternatively, this can be done with the tidyverse, returning the result as a single data frame:

library(tidyverse)
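df %>%
   # strip trailing zeros (the columns become character)
   mutate_all(str_remove, "0+$") %>%
   # pair each stripped column (.x) with its original column (.y)
   map2_dfr(., df,  ~ 
        tibble(x = nchar(.x), grp = nchar(.y)) %>% 
          group_by(grp) %>%
          summarise(x = median(x)), .id = 'colName')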
# A tibble: 7 x 3
#  colName   grp     x
#  <chr>   <int> <dbl>
#1 A           3   3  
#2 A           4   4  
#3 A           5   5  
#4 B           1   1  
#5 B           2   2  
#6 C           2   2  
#7 C           3   2.5
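
In dplyr 1.0.0 and later, mutate_all is superseded, so the same table could also be built by reshaping to long format first. The following is only a sketch of that alternative, not the original answer's method; it assumes dplyr >= 1.0.0 and tidyr >= 1.0.0, and the intermediate column names value, digits and sig are chosen here for illustration.

```r
library(dplyr)
library(tidyr)

# Sample data from the question
df <- tibble(
  A = c(61960, 273, 439, 38877, 75325, 80929, 23028, 57240, 10140, 25775, 7286),
  B = c(10, 12, 11, 13, 2, 1, 1, 1, 1, 1, 1),
  C = c(122, 140, 163, 12, 190, 16, 14, 18, 15, 17, 16)
)

res <- df %>%
  # one row per (column, value) pair
  pivot_longer(everything(), names_to = "colName", values_to = "value") %>%
  mutate(
    digits = sprintf("%.0f", value),       # fixed notation, whole numbers assumed
    grp    = nchar(digits),                # total number of digits
    sig    = nchar(sub("0+$", "", digits)) # digits left after dropping trailing zeros
  ) %>%
  group_by(colName, grp) %>%
  summarise(x = median(as.numeric(sig)), .groups = "drop")

res
```

Grouping by both colName and grp in a single summarise avoids the per-column map2_dfr call, at the cost of materialising the long table.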
