r - Continuous data binning based on observation distribution/frequency to decide bin range r dplyr
Problem description
I have been searching the internet for days without finding a solution to this issue. Any suggestions would be highly appreciated, especially in a tidyverse-friendly syntax.
I have a tibble with approx. 4300 rows (observations) and 320 columns. One column is my dependent variable, a continuous numeric column called "RR" (Response Ratios). My goal is to bin the RR values into 10 factor levels, to be used later for machine learning classification.
I have experimented with the cut() function with this code:
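library(dplyr)   # %>% and mutate()
library(tidyr)   # drop_na()
library(hablar)  # rationalize(), assumed to be hablar's: turns NaN/Inf into NA
numbers_of_bins <- 10  # target number of bins
df <- era.af.Al_noNaN %>%
  rationalize() %>%
  drop_na(RR) %>%
  mutate(RR_MyQuantile = cut(RR,
                             breaks = unique(quantile(RR, probs = seq.int(0, 1, by = 1 / numbers_of_bins))),
                             include.lowest = TRUE))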
But I have had no luck: my bins come out with an equal n in each, which does not reflect the distribution of the data. I have studied the article at https://towardsdatascience.com/understanding-feature-engineering-part-1-continuous-numeric-data-da4e47099a7b a bit, but I simply cannot achieve the same result in R.
Here is the distribution of my RR data values grouped into classes (not what I want). [Figure not preserved]
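To make the contrast concrete, here is a minimal sketch on simulated data. It assumes that "reflecting the distribution" means equal-width bins, which is only one possible reading of the goal. The vector `rr` and the name `n_bins` are hypothetical stand-ins, not the real RR column. Breaks taken from `quantile()` give equal counts per bin by construction, whereas passing a single number as `breaks` makes `cut()` split the range into intervals of equal length, so the counts follow the shape of the data.

```r
# Simulated, right-skewed stand-in for the RR column (hypothetical data).
set.seed(42)
rr <- rlnorm(1000, meanlog = 0, sdlog = 1)
n_bins <- 10

# Equal-frequency bins: breaks sit at the deciles, so every bin
# receives (roughly) the same number of observations.
quantile_bins <- cut(rr,
                     breaks = unique(quantile(rr, probs = seq(0, 1, by = 1 / n_bins))),
                     include.lowest = TRUE)

# Equal-width bins: a single number for `breaks` divides the range of rr
# into that many intervals of equal length, so counts vary with density.
width_bins <- cut(rr, breaks = n_bins, include.lowest = TRUE)

table(quantile_bins)
table(width_bins)
```

With skewed data, the equal-width version puts most observations in the first few bins and leaves the upper bins sparse, which may or may not be acceptable for a later classification step.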
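Solution
Thanks!
I also tried using cut() followed by count(). I then used labels = FALSE to get labels that can be used in a further mutate() to create a new column containing character names for the interval groups.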
numbers_of_bins <- 10
# Bin RR at its quantiles (equal-frequency bins)
df <- era.af.Al_noNaN %>%
  rationalize() %>%
  drop_na(RR) %>%
  mutate(RR_MyQuantile = cut(RR,
                             breaks = unique(quantile(RR, probs = seq.int(0, 1, by = 1 / numbers_of_bins))),
                             include.lowest = TRUE))
# Inspect the first ten bin labels
head(df$RR_MyQuantile, 10)
# Count the observations in each bin
df %>%
  group_by(RR_MyQuantile) %>%
  count()
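The answer mentions labels = FALSE, but the code does not show it. The following is a minimal sketch of how that step might look, not the author's exact code. The tibble `toy`, the columns `RR_bin_id` and `RR_bin_name`, and the "bin_" naming scheme are all made up for illustration.

```r
library(dplyr)

set.seed(1)
toy <- tibble(RR = rnorm(200))  # hypothetical stand-in for the real data
numbers_of_bins <- 10

toy_binned <- toy %>%
  mutate(
    # labels = FALSE makes cut() return the integer bin index
    # instead of a factor with interval labels.
    RR_bin_id = cut(RR,
                    breaks = unique(quantile(RR, probs = seq.int(0, 1, by = 1 / numbers_of_bins))),
                    include.lowest = TRUE,
                    labels = FALSE),
    # Derive a character name for each bin from its index (made-up scheme).
    RR_bin_name = paste0("bin_", RR_bin_id)
  )

# One row per bin with its observation count
toy_binned %>% count(RR_bin_id, RR_bin_name)
```

Keeping the integer index alongside the name preserves the bin order, which a plain character column would otherwise sort alphabetically ("bin_10" before "bin_2").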