How to create new columns based on whether another group of columns exists

Problem description

My problem is as follows:

I have a dataframe df with 5 columns, say ('A', 'B', 'C', 'D', 'E').

I want to combine these columns in sets, say GP1 = ['A', 'B', 'D'] and GP2 = ['C', 'E'], and create two new columns from them:

    df['Group1'] = df[GP1].min(axis=1)

    df['Group2'] = df[GP2].max(axis=1)

However, depending on the data, column 'A' (or 'B' or 'D', or maybe all of them) may be missing from the first set, or column 'C' or 'E' (or both) may be missing from the second set.

So I want the code to check whether any column of the first or second set is missing, and create the new 'Group1' or 'Group2' column only if all columns of that group exist. If any column of a set is missing, creating the corresponding new column should be skipped.

How can I achieve that? I tried for loops, but they did not help and the logic became complicated.

An example where all the columns of both sets are present:

       df_in
              A   B   C  D   E
              1   2   3  4   5
              2   4   6  2   3
              1   0   2  4   2
              
    
      df_out 
              A   B   C  D   E   Group1  Group2
              1   2   3  4   5    1       5
              2   4   6  2   3    2       6
              1   0   2  4   2    0       2

An example where column E from the second group is missing:

        df_in 
              A   B   C  D   
              1   2   3  4   
              2   4   6  2   
              1   0   2  4   
              
    
      df_out
              A   B   C  D  Group1  
              1   2   3  4   1       
              2   4   6  2   2       
              1   0   2  4   0  

When both A and D are missing from set 1 (and only B from set/group 1 is present):

    df_in 
              B   C  E
              2   3  5
              4   6  3
              0   2  2
              
    
    df_out
              B   C   E  Group2
              2   3   5    5
              4   6   3    6
              0   2   2    2

The following case, where A is missing from set 1 and C is missing from set 2:

    df_in 
              B   D   E
              2   4   5
              4   2   3
              0   4   2
              
    
    df_out 
              B   D   E
              2   4   5
              4   2   3
              0   4   2

Any help in this direction will be immensely appreciated. Thanks

Tags: python, pandas, pandas-groupby, filtering, multiple-columns

Solution


Here you go. I think you can use reindex together with dropna, where gp1 and gp2 are your two lists of column names:

df_out = (df_in.assign(Group1=df_in.reindex(gp1, axis=1).dropna().min(axis=1), 
                      Group2=df_in.reindex(gp2, axis=1).dropna().max(axis=1))
               .dropna(axis=1, how='all'))
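
To see why this works, it can help to look at the intermediate results when a column is missing. The following is only a sketch that rebuilds the question's second case (no 'E') by hand; the names step1, step2 and step3 are made up for illustration and are not part of the one-liner above.

```python
import pandas as pd

gp2 = ['C', 'E']
# The question's second example: column 'E' is missing
df_in = pd.DataFrame({'A': [1, 2, 1], 'B': [2, 4, 0],
                      'C': [3, 6, 2], 'D': [4, 2, 4]})

# reindex adds the missing column 'E', filled with NaN
step1 = df_in.reindex(gp2, axis=1)

# dropna() removes every row containing a NaN, so no rows remain
step2 = step1.dropna()

# the row-wise max of an empty frame is an empty Series
step3 = step2.max(axis=1)

# assign aligns the empty Series on the index, so Group2 is all NaN,
# and dropna(axis=1, how='all') then removes that column again
df_out = df_in.assign(Group2=step3).dropna(axis=1, how='all')
print(list(df_out.columns))
```

Note that the final dropna(axis=1, how='all') applies to the whole frame, so an original column that happens to be entirely NaN would be dropped as well.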

MCVE:

import pandas as pd

df_in  = pd.read_clipboard() #Read df_in from the question above after copying it to the clipboard
print(df_in)

#    A  B  C  D  E
# 0  1  2  3  4  5
# 1  2  4  6  2  3
# 2  1  0  2  4  2

gp1 = ['A','B','D']
gp2 = ['C','E']

df_out = (df_in.assign(Group1=df_in.reindex(gp1, axis=1).dropna().min(axis=1), 
                      Group2=df_in.reindex(gp2, axis=1).dropna().max(axis=1))
               .dropna(axis=1, how='all'))

print(df_out)

#    A  B  C  D  E  Group1  Group2
# 0  1  2  3  4  5       1       5
# 1  2  4  6  2  3       2       6
# 2  1  0  2  4  2       0       2

df_in_copy=df_in.copy() #make a copy to reuse later
df_in = df_in.drop('E', axis=1) #Drop Col E
print(df_in)

#    A  B  C  D
# 0  1  2  3  4
# 1  2  4  6  2
# 2  1  0  2  4

df_out = (df_in.assign(Group1=df_in.reindex(gp1, axis=1).dropna().min(axis=1), 
                      Group2=df_in.reindex(gp2, axis=1).dropna().max(axis=1))
               .dropna(axis=1, how='all'))
print(df_out)

#    A  B  C  D  Group1
# 0  1  2  3  4       1
# 1  2  4  6  2       2
# 2  1  0  2  4       0


df_in = df_in_copy.copy() #Restore the original frame from the copy
df_in = df_in.drop(['A','D'], axis=1) #Drop Columns A and D
print(df_in)

#    B  C  E
# 0  2  3  5
# 1  4  6  3
# 2  0  2  2

df_out = (df_in.assign(Group1=df_in.reindex(gp1, axis=1).dropna().min(axis=1), 
                      Group2=df_in.reindex(gp2, axis=1).dropna().max(axis=1))
               .dropna(axis=1, how='all'))
print(df_out)

#    B  C  E  Group2
# 0  2  3  5       5
# 1  4  6  3       6
# 2  0  2  2       2
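
If you prefer the check to be explicit, one possible alternative is to test whether each group is a subset of the existing columns before creating the new column. This is a minimal sketch, not part of the answer above: the helper name add_group_columns and the groups dictionary are hypothetical, and the data is typed in from the question's examples. It also covers the question's last case, where A and C are both missing.

```python
import pandas as pd

def add_group_columns(df, groups):
    """Add one aggregated column per group, but only when every
    column of that group is present in df (hypothetical helper).

    groups maps new column name -> (list of source columns, aggregation name).
    """
    out = df.copy()
    for new_col, (cols, how) in groups.items():
        # Skip the group entirely if any of its columns is missing
        if set(cols).issubset(df.columns):
            out[new_col] = df[cols].agg(how, axis=1)
    return out

groups = {'Group1': (['A', 'B', 'D'], 'min'),
          'Group2': (['C', 'E'], 'max')}

# Full frame from the question's first example
full = pd.DataFrame({'A': [1, 2, 1], 'B': [2, 4, 0], 'C': [3, 6, 2],
                     'D': [4, 2, 4], 'E': [5, 3, 2]})

all_present = add_group_columns(full, groups)
a_and_c_missing = add_group_columns(full.drop(columns=['A', 'C']), groups)

print(all_present.columns.tolist())      # both groups are created
print(a_and_c_missing.columns.tolist())  # neither group is created
```

Compared with the reindex/dropna one-liner, this version leaves rows and columns that contain NaN untouched and only looks at the column names.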
