Searching for multiple values across rows for each unique value in Python

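Problem description

I have 3 fields: 1) Invoice Number, 2) Invoice Sub Number, and 3) Invoice Amount. Each unique invoice number may have several invoice sub numbers. The requirement: for each unique invoice number, looking across all of its rows, if there are sub numbers starting with 1200 and sub numbers starting with 2100, a dummy column should be added that says "Both 1200 and 2100 exist"; otherwise, if the rows include a sub number starting with 1200, the dummy column should say "Only 1200"; otherwise it should say "Only 2100". An example follows: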

S.no Invoice #    Invoice Sub Number    Amount    Dummy
----------------------------------------------
 1.   1234              1230             $100  Both 2100 and 1200 exists
 2.   1234              2100             $100  Both 2100 and 1200 exists
 3.   1234              1200             $100  Both 2100 and 1200 exists
 4.   1245              5430             $50   Only 1200 exists
 5.   1245              1200             $80   Only 1200 exists

I tried the following in Python, but it does not work. I need help with this code:

df1= df
df1['Invoice #'] = df1['Invoice #'].astype(object)
df['Invoice sub Number'] = df['Invoice sub Number'].astype(str)
df1= df1.groupby(df['Invoice sub Number','Invoice #'].size().groupby(level=0).size())

df1['dummy']= np.where(df1['Invoice sub Number'].str.startswith ('1200'),'Contains 1200 only',
               np.where(df1['Invoice sub Number'].str.startswith ('2100'),'Contains 2100 only',
                        np.where((df1['Invoice sub Number'].str.startswith ('1200'))&(df1['Invoice sub Number'].str.startswith ('2100')),
                                 'Contains both 1200 and 2100','Contains neither 1200 nor 2100')))

The error I get: KeyError: ('Invoice sub Number', 'Invoice #')

Tags: python, pandas, numpy

Solution


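I suggest using GroupBy.transform with 'any' to check whether each group contains at least one True, then using numpy.select to set the column value according to the conditions.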
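To make the role of transform('any') concrete, here is a minimal sketch on a made-up toy frame (the column names `group` and `code` are hypothetical and are not the asker's data). It shows the difference between a row-level test and the same test broadcast to every row of a group.

```python
import pandas as pd

# Hypothetical toy data, not the data from the question
toy = pd.DataFrame({
    "group": ["a", "a", "b", "b"],
    "code": ["1200", "9999", "8888", "7777"],
})

# Row-level test: True only on rows whose code starts with "1200"
row_mask = toy["code"].str.startswith("1200")

# Group-level test: True on every row of a group in which at least one row matched
group_mask = row_mask.groupby(toy["group"]).transform("any")

print(row_mask.tolist())
print(group_mask.tolist())
```

Because transform returns a result aligned with the original index, the group-level mask can be used directly as a condition on the original DataFrame.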

Use:

print (df)
    Invoice #  Invoice sub Number  Amount
0         123                1234     100
1         123                2345     200
2         123                3456     300
3         123                1200     400
4         123                2100     500
5        1234                1245     600
6        1234                2344     700
7        1234                1200     800
8        2345                 345     900
9        2345                2100    1000
10       2345                2458    1100
11       6789                2345    1200
12       6789                3421    1300
13       6789                1234    1400

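import numpy as np

# Row-level tests: does this row's sub number start with the prefix?
m1 = df['Invoice sub Number'].astype(str).str.startswith('1200')
m2 = df['Invoice sub Number'].astype(str).str.startswith('2100')

# Broadcast to every row of the invoice: True if any row in the group matched
m11 = m1.groupby(df['Invoice #']).transform('any')
m22 = m2.groupby(df['Invoice #']).transform('any')

# np.select takes the first matching condition, so the combined mask goes first
masks = [m11 & m22, m11, m22]
vals = ['Contains both 1200 and 2100', 'Contains 1200 only', 'Contains 2100 only']
default = 'Contains neither 1200 nor 2100'

df['dummy'] = np.select(masks, vals, default=default)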

print (df)
    Invoice #  Invoice sub Number  Amount                           dummy
0         123                1234     100     Contains both 1200 and 2100
1         123                2345     200     Contains both 1200 and 2100
2         123                3456     300     Contains both 1200 and 2100
3         123                1200     400     Contains both 1200 and 2100
4         123                2100     500     Contains both 1200 and 2100
5        1234                1245     600              Contains 1200 only
6        1234                2344     700              Contains 1200 only
7        1234                1200     800              Contains 1200 only
8        2345                 345     900              Contains 2100 only
9        2345                2100    1000              Contains 2100 only
10       2345                2458    1100              Contains 2100 only
11       6789                2345    1200  Contains neither 1200 nor 2100
12       6789                3421    1300  Contains neither 1200 nor 2100
13       6789                1234    1400  Contains neither 1200 nor 2100
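Two details of this approach are easy to get wrong, so here is a small sketch that illustrates them on hypothetical toy data (the columns `inv` and `sub` are made up for the example). First, a row-wise `m1 & m2` can never be True, because a single string cannot start with both prefixes; the test has to be made per group. Second, numpy.select uses the first condition that matches, so the combined mask must be listed before the single ones.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: invoice 1 has both prefixes, invoice 2 has only 1200
toy = pd.DataFrame({
    "inv": [1, 1, 2],
    "sub": ["1200", "2100", "1200"],
})

m1 = toy["sub"].str.startswith("1200")
m2 = toy["sub"].str.startswith("2100")

# Row-wise AND: always False, one value cannot start with both prefixes
row_both = m1 & m2

# Per-invoice flags, broadcast back to every row
g1 = m1.groupby(toy["inv"]).transform("any")
g2 = m2.groupby(toy["inv"]).transform("any")

# Correct order: the most specific condition comes first
good = np.select([g1 & g2, g1, g2],
                 ["both", "only 1200", "only 2100"],
                 default="neither")

# Wrong order: g1 matches first, so "both" is never reached
bad = np.select([g1, g2, g1 & g2],
                ["only 1200", "only 2100", "both"],
                default="neither")

print(row_both.any())
print(good.tolist())
print(bad.tolist())
```

The same reasoning applies to nested np.where calls: the combined test has to be evaluated before either single-prefix test.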
