Continuous data binning based on observation distribution/frequency to decide bin range (R, dplyr)

Problem description

I have spent days searching the internet for help with this issue, without luck. Any suggestions would be highly appreciated, especially in tidyverse-friendly syntax.

I have a tibble with approx. 4300 rows (observations) and 320 columns. One column is my dependent variable, a continuous numeric column called "RR" (Response Ratios). My goal is to bin the RR values into 10 factor levels, to be used later for machine learning classification.

I have experimented with the cut() function with this code:

df <- era.af.Al_noNaN %>%
  rationalize() %>%
  drop_na(RR) %>%
  mutate(RR_MyQuantile = cut(RR,
                              breaks = unique(quantile(RR, probs = seq.int(0,1, by = 1 / numbers_of_bins))), 
                              include.lowest = TRUE)) 

But I have had no luck: my bins come out with an equal n in each, which does not reflect the distribution of the data. I have studied the approach described at https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b but I simply cannot achieve the same in R.

[Figure: distribution of my RR data values grouped into classes; this is not what I want]
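
The equal counts described above are what quantile breaks are expected to produce, since each bin covers the same share of the observations. As a rough illustration only, here is a minimal sketch on simulated data (the `toy` tibble and its log-normal RR values are hypothetical stand-ins, not the original dataset) that contrasts quantile breaks with equal-width breaks, where the counts per bin follow the shape of the distribution instead.

```r
library(dplyr)

# Hypothetical stand-in for the RR column: right-skewed simulated data
set.seed(42)
toy <- tibble(RR = rlnorm(1000, meanlog = 0, sdlog = 0.5))

numbers_of_bins <- 10

binned <- toy %>%
  mutate(
    # Quantile breaks: every bin holds roughly the same number of rows
    RR_quantile = cut(RR,
                      breaks = unique(quantile(RR, probs = seq(0, 1, length.out = numbers_of_bins + 1))),
                      include.lowest = TRUE),
    # Equal-width breaks: a single number asks cut() for that many
    # equal-width intervals, so counts mirror the data's distribution
    RR_width = cut(RR, breaks = numbers_of_bins, include.lowest = TRUE)
  )

# Compare the number of observations per bin under each scheme
count(binned, RR_quantile)
count(binned, RR_width)
```

Whether equal-width bins are the right choice for a later classification step is a separate question; with skewed data some bins may end up nearly empty.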

Tags: r, dplyr, quantile, binning, continuous

Solution


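Thanks!

I also tried using cut() followed by count(). I then used labels = FALSE to get integer bin codes, which can be used in a further mutate() to create a new column holding character names for the interval groups.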

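library(dplyr)   # %>%, mutate(), group_by(), count()
library(tidyr)   # drop_na()
library(hablar)  # rationalize(); assumed to be the hablar version (NaN/Inf -> NA)

numbers_of_bins <- 10

df <- era.af.Al_noNaN %>%
  rationalize() %>%
  drop_na(RR) %>%
  mutate(RR_MyQuantile = cut(RR,
                             breaks = unique(quantile(RR, probs = seq.int(0, 1, by = 1 / numbers_of_bins))),
                             include.lowest = TRUE))

# Inspect the first ten bin labels
head(df$RR_MyQuantile, 10)

# Number of observations in each bin
df %>%
  group_by(RR_MyQuantile) %>%
  count()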
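
The labels = FALSE step mentioned above is not shown in the code, so here is a minimal sketch of how it might look. It runs on simulated data (the `toy` tibble is a hypothetical stand-in for the real data frame), and the "bin_1" ... "bin_10" names are purely illustrative; any naming scheme would work.

```r
library(dplyr)

# Hypothetical stand-in data, not the original dataset
set.seed(1)
toy <- tibble(RR = rnorm(200))

numbers_of_bins <- 10
breaks <- unique(quantile(toy$RR, probs = seq(0, 1, length.out = numbers_of_bins + 1)))

labelled <- toy %>%
  mutate(
    # labels = FALSE makes cut() return integer bin codes (1, 2, ...)
    RR_bin_id = cut(RR, breaks = breaks, labels = FALSE, include.lowest = TRUE),
    # Second step: turn the codes into named factor levels (illustrative names)
    RR_bin_name = factor(paste0("bin_", RR_bin_id),
                         levels = paste0("bin_", seq_len(length(breaks) - 1)))
  )

# Observations per bin, with both the code and the name
labelled %>%
  count(RR_bin_id, RR_bin_name)
```

Fixing the factor levels explicitly keeps the bins in numeric order ("bin_2" before "bin_10"), which a plain character column would not do.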
