首页 > 解决方案 > find n% of records in a variable in the data frame

问题描述

I have the data in a data frame, with the first column is date and the second column is individual weight. Here's a sample from the data:

df <- data.frame(
  date = c("2019-01-01", "2019-01-01", "2019-01-01", "2019-01-01",
           "2019-01-01", "2019-01-01", "2019-01-01", "2019-01-01",
           "2019-01-01", "2019-01-01", "2019-01-02", "2019-01-02", "2019-01-02",
           "2019-01-02", "2019-01-02", "2019-01-02", "2019-01-02",
           "2019-01-02", "2019-01-02", "2019-01-02"),
  weight = c(2174.8, 2174.8, 2174.8, 8896.53, 8896.53, 2133.51, 2133.51,
             2892.32, 2892.32, 2892.32, 2892.32, 5287.78, 5287.78, 6674.03,
             6674.03, 6674.03, 6674.03, 6674.03, 5535.11, 5535.11)
)

I would like to run simple summary statistic for each date first and then find number of records whose weight is in the given range, defining the category by the % of total range of weights. Finally store number of each record in a separate column

Lowest 10%
10-20%
20-40%
40-60%
60-80%
80-90%
90-100%

The logic = (MinWeight + (MaxWeight-MinWeight)*X%)

Here is my expected outcome ( I only show two columns for % range)

df %>% 
  group_by(date) %>%
  summarise(mean(weight), min(weight), max(weight))
   date       `mean(weight)` `min(weight)` `max(weight)` `Lowest 10%` `10-20%`
 2019-01-01          3726.         2134.         8897.    num records. num records.

标签: rdatatabledplyr

解决方案


Check this solution:

library(tidyverse)
library(wrapr)

df %>%
  group_by(date) %>%
  mutate(
    rn = row_number(),
    temp = weight - min(weight),
    temp = (temp / max(temp)) * 100,
    temp = cut(temp, seq(0, 100, 10), include.lowest = TRUE),
    temp = str_remove(temp, '\\(|\\[') %>%
      str_replace(',', '-') %>%
      str_replace('\\]', '%'),
    one = 1
  ) %>%
  spread(temp, one, fill = 0) %.>%
  left_join(
    summarise(.,
      `mean(weight)` = mean(weight),
      `min(weight)` = min(weight),
      `max(weight)` = max(weight)
    ),
    summarise_at(., vars(matches('\\d+-\\d+.')), sum)
  )

Output:

   date       `mean(weight)` `min(weight)` `max(weight)` `0-10%` `10-20%` `60-70%` `90-100%`
  <fct>               <dbl>         <dbl>         <dbl>   <dbl>    <dbl>    <dbl>     <dbl>
1 2019-01-01          3726.         2134.         8897.       5        3        0         2
2 2019-01-02          5791.         2892.         6674.       1        0        4         5

推荐阅读