首页 > 解决方案 > Pandas逐行值比较以查找字符串相似率高的2行之间的匹配率

问题描述

我正在努力计算我匹配的 2 个字符串相似度高的项目之间的属性匹配率。

我尝试了 2 个变量循环,但出现了类似“IndexError:单个位置索引器超出范围”的错误

我试过的代码是:

nuomlist = pd.DataFrame(dfn.columns, columns=['Col'])
nuomN = nuomlist[nuomlist['Col'].str.contains('-')].index.tolist()

 for i in range(int(nuomN[-1]+1),int(dfn.columns.get_loc("sim_1"))) :
 for j in dfn.index:

  sum(dfn.iloc[j,i]==dfn.iloc[j+dfn.iloc[j,dfn.columns.get_loc('Max_row')],i])/ 
  int(dfn.columns.get_loc("sim_1") - (nuomN[-1] + 1))

这是样本数据集

data = {'S_ITEMCODE':['', '81527800', '', '81527900'],
        'N':['N', '','N', ''],
        'ITEMCODE':['81527800', '81320323', '81527900', '81267337'],
        'DESC':['Store Brand (Woongjin) SB Fresh Orange Drink Orange NO P.BTL 1.5lit', 'Store Brand (Woongjin) SB Fresh Orange Drink Orange NO P.BTL 1lit', 'Store Brand (Woongjin) SB Fresh Jeju Tang. Drink Tang. NO P.B 1.5lit', 'Store Brand (Woongjin) SB Fresh Jeju Tang. Drink Tang. NO P.B 1lit'],
        'ATTR1':['1A', '1A', '1B', '1B'],
        'ATTR2':['1A', '1C', '1B', '1B'],
        'ATTR3':['1A', '1A', '1B', '1B'],
        'ROW_INDEX_SIMILAR_ITEM':[1, -1, 1, 1]}

df = pd.DataFrame(data)

列“N”代表新项目。

我想计算新项目和 Jaccard 字符串相似度高项目之间的 'N'=='N' 行的属性匹配率(S_itemcode)

(ig 81527800(新品)-81320323, 81527900(新品)-81267337)

这是我想要的结果。

data1 = {'S_ITEMCODE':['', '81527800', '', '81527900'],
        'N':['N', '','N', ''],
        'ITEMCODE':['81527800', '81320323', '81527900', '81267337'],
        'DESC':['Store Brand (Woongjin) SB Fresh Orange Drink Orange NO P.BTL 1.5lit', 'Store Brand (Woongjin) SB Fresh Orange Drink Orange NO P.BTL 1lit', 'Store Brand (Woongjin) SB Fresh Jeju Tang. Drink Tang. NO P.B 1.5lit', 'Store Brand (Woongjin) SB Fresh Jeju Tang. Drink Tang. NO P.B 1lit'],
        'ATTR1':['1A', '1A', '1B', '1B'],
        'ATTR2':['1A', '1C', '1B', '1B'],
        'ATTR3':['1A', '1A', '1B', '1B'],
        'ROW_INDEX_SIMILAR_ITEM':[1, -1, 1, 1]}
        'ATTR_MATCHING_RATE':[2/3, '', 1, '']}

df = pd.DataFrame(data1)

请帮帮我...我卡住了...

标签: pythonpandasloopscomparisonrow

解决方案


这将为您提供所需的输出:

tested_cols = ['ATTR1', 'ATTR2', 'ATTR3']
df['matches'] = 0
for col in tested_cols:
    df.loc[(df['N'] == 'N') & (df[col] == df[col].shift(-1)), 'matches'] += 1
df['ATTR_MATCHING_RATE'] = df['matches'] / len(tested_cols)
df.drop('matches', axis=1, inplace=True)

推荐阅读