Complex Grouping, Sorting and Value Filtering in Pandas

Problem description

I have a pandas DataFrame in the following form:

Col1  Col2  Col3
t1    m1    1
t2    m2    7
t3    m1    4
t4    m2    8
t5    m3    4
t6    m3    6
t7    m3    11
t8    m4    9

I want to create labels (say high/low/med) in a new column, Col4, based on the following: group the data by Col2, then sort each group by Col1 (assuming ascending order). Then take the first row of the sorted data (the lowest Col1 value) within each group and check the following:

If the first value of Col3 within a group > x:

    if any of the values in the rest of the group <= x:
        assign high to Col4 for the first row      #a1
    else:
        assign med to Col4 for the first row       #a2

If the first value of Col3 within a group <= x (whether the remaining values in the group are < x or >= x):

    assign low to Col4 for the first row           #b1

Also, if there is only one element in a group (as in the case of m4 below), assign high or low depending on whether its value is > x or <= x.

At the end, drop the remaining elements of each group except the first one.

Thus assuming x = 4 and assuming (t1

Col1  Col2  Col3  Col4
t1    m1    1     low
t2    m2    7     med
t3    m1    4             # should drop
t4    m2    8             # should drop
t5    m3    5     high
t6    m3    2             # should drop
t7    m3    11            # should drop
t8    m4    9     high


So the final table will look like:

Col1  Col2  Col3  Col4
t1    m1    1     low
t2    m2    7     med
t5    m3    5     high
t8    m4    9     high

Tags: python, algorithm, pandas, sorting

Solution


IIUC, take this sample dataframe (your OP has conflicting dataframes):

import pandas as pd
import numpy as np

df = pd.DataFrame([
['t1',     'm1',   1],
['t2',     'm2',   7], 
['t3',     'm1',   4],
['t4',     'm2',   8],
['t5',     'm3',   5],
['t6',     'm3',   2],
['t7',     'm3',   11],
['t8',     'm4',    9]],
columns=['Col1',  'Col2',   'Col3'])

Yields:

  Col1 Col2  Col3
0   t1   m1     1
1   t2   m2     7
2   t3   m1     4
3   t4   m2     8
4   t5   m3     5
5   t6   m3     2
6   t7   m3    11
7   t8   m4     9

Then define your function with your logic:

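def my_function(x, val):
    # x:   the Col3 values of one group, already sorted by Col1
    # val: the threshold (x in the question)

    # Single-row group: only 'high' or 'low' is possible.
    if x.shape[0]==1:
        if x.iloc[0]>val:
            return 'high'
        else:
            return 'low'

    # First value above the threshold and some later value at or below it.
    if x.iloc[0]>val and any(i<=val for i in x.iloc[1:]):
        return 'high'
    # First value above the threshold, all later values above it too.
    elif x.iloc[0]>val:
        return 'med'
    # First value at or below the threshold, whatever the rest holds.
    elif x.iloc[0]<=val:
        return 'low'
    # Only reached when the first value is NaN (both comparisons are False).
    else:
        return np.nan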
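
As a quick check of the branches, the rule can be restated in a more compact form and run against the Col3 values of each group from the sample dataframe. This is only a sketch: the helper name label_first is hypothetical (it is not in the original answer), and it assumes Col3 has no missing values and that the values are passed in Col1 order.

```python
import pandas as pd

def label_first(col3: pd.Series, x: float) -> str:
    """Hypothetical restatement of the rule for one group's Col3 values.

    Assumes the values are already sorted by Col1 and contain no NaN.
    """
    first, rest = col3.iloc[0], col3.iloc[1:]
    if first <= x:
        return 'low'                      # case b1
    if rest.empty:
        return 'high'                     # single-row group with value > x
    # first > x: 'high' if any later value is <= x (a1), otherwise 'med' (a2)
    return 'high' if (rest <= x).any() else 'med'

# Col3 values per group, in Col1 order, taken from the sample dataframe.
labels = {
    'm1': label_first(pd.Series([1, 4]), 4),
    'm2': label_first(pd.Series([7, 8]), 4),
    'm3': label_first(pd.Series([5, 2, 11]), 4),
    'm4': label_first(pd.Series([9]), 4),
}
print(labels)
```

Each group exercises a different branch: m1 hits the low case, m2 the med case, m3 the high case with a later value at or below x, and m4 the single-row case.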

Now apply the function to your sorted and grouped dataframe, using 4 as the value of x, as stated in your OP:

df['Col4'] = df.sort_values(['Col2','Col1']).groupby('Col2')['Col3'].transform(my_function, 4)

df = df.sort_values(['Col2','Col1']).groupby('Col2').first().reset_index()

This yields:

  Col2 Col1  Col3  Col4
0   m1   t1     1   low
1   m2   t2     7   med
2   m3   t5     5  high
3   m4   t8     9  high
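
If the original column order (Col1, Col2, Col3, Col4) matters, one possible alternative for the final step is to keep the first row of each group with drop_duplicates instead of groupby(...).first(). The sketch below assumes Col4 has already been filled in for every row, as the transform step does; the labels are typed in by hand here only to keep the example self-contained. Note also that groupby(...).first() takes the first non-null value per column, so the two approaches can differ when a group contains missing values.

```python
import pandas as pd

# Labeled dataframe as it looks after the transform step
# (every row of a group carries its group's label).
df = pd.DataFrame([
    ['t1', 'm1', 1,  'low'],
    ['t2', 'm2', 7,  'med'],
    ['t3', 'm1', 4,  'low'],
    ['t4', 'm2', 8,  'med'],
    ['t5', 'm3', 5,  'high'],
    ['t6', 'm3', 2,  'high'],
    ['t7', 'm3', 11, 'high'],
    ['t8', 'm4', 9,  'high']],
    columns=['Col1', 'Col2', 'Col3', 'Col4'])

# Sort so the lowest Col1 comes first within each Col2 group,
# then keep only the first row seen for each Col2 value.
out = (df.sort_values(['Col2', 'Col1'])
         .drop_duplicates('Col2')
         .reset_index(drop=True))
print(out)
```

This keeps whole rows rather than per-column first values, and leaves Col1 as the leading column.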
