首页 > 解决方案 > 根据两列比较和排列数据框中的行?

问题描述

我试图弄清楚如何根据两个条件比较和排名熊猫数据框中的多行。

这些是条件:

rule1 < rule2 

if support(rule1) <= support(rule2) and confidence(rule1) < confidence(rule2) 

or support(rule1) < support(rule2) and confidence(rule1) <= confidence(rule2)

    
rule1 = rule2 

if support(rule1) = support(rule2) and confidence(rule1) = confidence(rule2)

这就是我的数据框的设置方式:

import pandas as pd

data = {
'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651), (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722), (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038, 0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699, 0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219]
}

df = pd.DataFrame(data=data, index=data['rules']).drop(columns=['rules'])

   
  (Index)
   Rules       Support     Confidence
(4444, 5555)   0.0048      0.873015
(8747, 1254)   0.00141     0.533333
(7414, 1214)   0.0085      0.593220
(5655, 6651)   0.00106     0.012060
(4454, 3321)   0.00106     0.012060
(4893, 4923)   0.00038     0.237699
(1271, 8330)   0.00179     0.453423
(9112, 4722)   0.00913     0.097672
(4511, 6722)   0.00221     0.116983
(1102, 5789)   0.00173     0.541221
(2340, 5720)   0.00098     0.743222
(9822, 5067)   0.00024     0.378219

这就是我想要的数据框的外观(不确定排名到底是什么......这是假设的排名)

   (Index)
    Rules      Support     Confidence    Rank
(7414, 1214)   0.0085      0.593220        1
(4444, 5555)   0.0048      0.873015        2
(5655, 6651)   0.00106     0.012060        3
(4454, 3321)   0.00106     0.012060        3
(8747, 1254)   0.00141     0.533333        4
(1271, 8330)   0.00179     0.453423        5
(1102, 5789)   0.00173     0.541221        6
(2340, 5720)   0.00098     0.743222        7
(9822, 5067)   0.00024     0.378219        8
(9112, 4722)   0.00913     0.097672        9
(4511, 6722)   0.00221     0.116983        10
(4893, 4923)   0.00038     0.237699        11

我对如何让这段代码工作有一些想法,但我不确定如何将每条规则与每条规则进行比较。我希望根据条件浮动到顶部的最佳规则。它不是一个大数据框(< 1000),所以我并不关心速度只是准确性。

这是我到目前为止得到的代码:

def rank_rules(confidence, support):

    # IF / ELSE goes here
   
    df['rank'] = some_var.rank(method='max')
  
    df.sort_values(by=['rank'], ascending=False)

    return df


df = df.apply(lambda x: rank_rules(x['confidence'], x['support']), axis=1)
 

标签: pythonpandasanalyticsranking

解决方案


解决方案:建议的方法

如果我理解正确,您正在尝试创建一个基于多列(支持信心)的排名系统。您可以将这两个视为散点图上的两个正交轴 ( x, )。y在没有进一步排序逻辑的情况下,我将假设欧几里得距离是我们可以在这里用来对行进行排序以创建等级的方法。

处理数据

我在这里展示了使用 aMinMaxScaler可能是一个选项(除了可选地使用zscore)。

代码

import numpy as np
import pandas as pd
import matplotlib.pylab as plt
from sklearn.preprocessing import MinMaxScaler

%matplotlib inline 
%config InlineBackend.figure_format = 'svg' # 'svg', 'retina' 
plt.style.use('seaborn-white')

df = df.reset_index(drop=False).rename(columns={'index': 'rules'})
df['distance'] = (df.support**2 + df.confidence**2)**0.5
df['zsupport'] = (df.support - df.support.mean())/df.support.std()
df['zconfidence'] = (df.confidence - df.confidence.mean())/df.confidence.std()
df['zdistance'] = (df.zsupport**2 + df.zconfidence**2)**0.5

round_strategy = {
    'support': 5,
    'confidence': 6,
    'distance': 5,
}

scaler = MinMaxScaler()
df2 = pd.DataFrame(scaler.fit_transform(df[['zsupport', 'zconfidence']]), 
                   columns=['scaled_support', 'scaled_confidence'])
df = pd.concat([df, df2], ignore_index=False, axis=1)
df['scaled_distance'] = (df.scaled_support**2 + df.scaled_confidence**2)**0.5
df = df.sort_values(['scaled_distance'], ascending=False).reset_index(drop=True)
df['Rank'] = df.index

decimals = dict()
for col in df.columns:
    for key, value in round_strategy.items():
        if key in col:
            decimals.update({col: value})
df = df.round(decimals=decimals)

sizes = (df.shape[0] - df.Rank)/df.shape[0]
colors = round(255*sizes).astype(int)
df

在此处输入图像描述

阴谋

import plotly.express as px

fig = px.scatter(df4, x="scaled_support", y="scaled_confidence", text="Rank", 
                  log_x=False, size_max=20, 
                  color="Rank", 
                  size=(np.arange(df4.index.size) + 4)[::-1], 
                  hover_data=df4.columns)
fig.update_traces(textposition='top center')
fig.update_layout(title_text='Support vs. Confidence with Rank', title_x=0.5)
fig.show()

在此处输入图像描述

虚拟数据

import pandas as pd

data = {
'rules': [(4444, 5555), (8747, 1254), (7414, 1214), (5655, 6651), (4454, 3321), (4893, 4923), (1271, 8330), (9112, 4722), (4511, 6722), (1102, 5789), (2340, 5720), (9822, 5067)],
'support': [0.0048, 0.00141, 0.0085, 0.00106, 0.00106, 0.00038, 0.00179, 0.00913, 0.00221, 0.00173, 0.00098, 0.00024],
'confidence': [0.873015, 0.533333, 0.593220, 0.012060, 0.012060, 0.237699, 0.453423, 0.097672, 0.116983, 0.541221, 0.743222, 0.378219]
}

df = pd.DataFrame(data=data, index=data['rules']).drop(columns=['rules'])

推荐阅读