Create a column of buckets based on value ranges in another column (Python)

Problem description

I have a sample df

A   B
X   30
Y   150
Z   450
XX  300

I need to create another column C that buckets column B based on some breakpoints

breakpts = [50,100,250,350]

A   B    C
X   30   '0-50'
Y   150  '100-250'
Z   450  '>350'
XX  300  '250-350'

I have the following code, which works

def conditions(i): 
    if i <=50: return '0-50'
    if i > 50 and i <=100: return '50-100'
    if i > 100 and i <=250: return '100-250'
    if i > 250 and i <=350: return '250-350'
    if i > 350: return '>350'

df['C']=df['B'].apply(conditions)

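But I want to make breakpts dynamic. If I use different breakpoints, for example [100,250,300,400], the code should automatically create different buckets based on those breakpoints.

Any ideas on how to do this?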

Tags: python, pandas, dataframe, conditional-statements, apply

Solution


As pointed out in the comments, pd.cut() is the way to go. You can make the breaks dynamic and set them yourself:

import pandas as pd
import numpy as np

bins = [0,50, 100,250, 350, np.inf]
labels = ["'0-50'","'50-100'","'100-250'","'250-350'","'>350'"]
df['C'] = pd.cut(df['B'], bins=bins, labels=labels)
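The bins and labels above are still written out by hand. A minimal sketch of deriving both from the breakpoint list follows; the helper name bucket_labels is hypothetical, the lower edge is assumed to be open-ended (matching the `i <= 50` test in the question), and the labels are plain strings without the embedded quotes.

```python
import numpy as np
import pandas as pd


def bucket_labels(breakpts):
    """Build pd.cut bins and labels from an ascending list of breakpoints."""
    # Open-ended outer edges so values at or below the first breakpoint
    # and values above the last breakpoint both land in a bucket.
    bins = [-np.inf, *breakpts, np.inf]
    labels = [f"0-{breakpts[0]}"]
    labels += [f"{lo}-{hi}" for lo, hi in zip(breakpts[:-1], breakpts[1:])]
    labels.append(f">{breakpts[-1]}")
    return bins, labels


df = pd.DataFrame({"A": ["X", "Y", "Z", "XX"], "B": [30, 150, 450, 300]})

bins, labels = bucket_labels([50, 100, 250, 350])
# right=True is the pd.cut default, so each bucket is (lo, hi].
df["C"] = pd.cut(df["B"], bins=bins, labels=labels)
```

Passing a different list, such as [100,250,300,400], to bucket_labels yields the matching bins and labels without touching the rest of the code.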

Also take a look at pandas.qcut, which is a quantile-based discretization function.
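For comparison, a small sketch of pandas.qcut on the same sample data. The column name "half" and the labels "low" and "high" are made up for illustration; qcut chooses the bin edges from the data's quantiles instead of from fixed breakpoints.

```python
import pandas as pd

df = pd.DataFrame({"A": ["X", "Y", "Z", "XX"], "B": [30, 150, 450, 300]})

# q=2 splits B at its median, giving two buckets of equal size.
df["half"] = pd.qcut(df["B"], q=2, labels=["low", "high"])
```

This suits cases where each bucket should hold roughly the same number of rows; for fixed breakpoints like the ones in the question, pd.cut remains the better fit.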


Alternatively, use np.select:

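col = 'B'
conditions = [
              df[col].between(0,50),   # inclusive="both" is the default
              df[col].between(50,100),  # a shared edge goes to the first matching condition
              df[col].between(100,250),
              df[col].between(250,350),
              df[col].ge(350)
             ]
choices = ["'0-50'","'50-100'","'100-250'","'250-350'","'>350'"]

# default=None leaves unmatched rows empty; a float np.nan default
# does not share a dtype with the string choices
df["C"] = np.select(conditions, choices, default=None)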

Both print:

    A    B          C
0   X   30     '0-50'
1   Y  150  '100-250'
2   Z  450     '>350'
3  XX  300  '250-350'
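The np.select approach can also be driven by the breakpoint list. The sketch below builds the conditions and choices in a loop, assuming an ascending list and the same (lo, hi] edges as the if-chain in the question; the "unknown" default is an arbitrary placeholder for rows that match nothing (for example missing values), and the labels are plain strings without the embedded quotes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": ["X", "Y", "Z", "XX"], "B": [30, 150, 450, 300]})
breakpts = [100, 250, 300, 400]  # any ascending list of breakpoints

col = df["B"]
# First bucket: everything up to and including the first breakpoint.
conditions = [col.le(breakpts[0])]
choices = [f"0-{breakpts[0]}"]
# Middle buckets: lo < value <= hi for each pair of neighbouring breakpoints.
for lo, hi in zip(breakpts[:-1], breakpts[1:]):
    conditions.append(col.gt(lo) & col.le(hi))
    choices.append(f"{lo}-{hi}")
# Last bucket: everything above the final breakpoint.
conditions.append(col.gt(breakpts[-1]))
choices.append(f">{breakpts[-1]}")

# A string default keeps the result a single string dtype.
df["C"] = np.select(conditions, choices, default="unknown")
```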
