Comparing multiple columns of a massive DataFrame with complicated duplicate rows

Problem description

I have a massive dataframe df with approximately 10 million rows:

df.sort_values(['pair','x1','x2'])


x1    x1gen    x2     x2gen     y1     y1gen     y2     y2gen      pair
-------------------------------------------------------------------------------
A     male     H      female    a      male      d      male       0
A     male     W      male      a      male      d      male       0         (*)
A     male     KK     female    a      male      d      male       0         (**)
B     female   C      male      a      male      d      male       0         (-)
B     female   W      male      a      male      d      male       0         (*)
B     female   BB     female    a      male      d      male       0
B     female   KK     female    a      male      d      male       0         (**)
F     male     W      male      a      male      d      male       0         (*)
A     male     T      female    b      female    d      male       1
A     male     BB     female    b      female    d      male       1
B     female   C      male      b      female    d      male       1         (-)
D     male     E      male      b      female    d      male       1
A     male     C      male      b      female    e      female     2
...

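The columns can be read as follows (as the examples below show): x1 cites y1 and x2 cites y2, each *gen column gives the gender of the person in the corresponding column, and pair numbers the (y1, y2) combination.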

My goal is to find four values for each unique pair:

  1. male cites male
  2. male cites female
  3. female cites male
  4. female cites female

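Each citation should be counted only once within a pair.

For example, in the given sample, x2 = W appears 3 times in pair = 0 (see (*)), so it should be counted once, not 3 times. The same applies to x2 = KK in pair = 0 (see (**)). However, the same citation can be counted again when it belongs to a new pair: C -> d (see (-)) is counted once in pair = 0 and once in pair = 1.

So for the first pair, pair = 0, the target values are: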

  1. male cites male = 4 (A -> a, F -> a, W -> d, C -> d)
  2. male cites female = 0
  3. female cites male = 4 (B -> a, H -> d, KK -> d, BB -> d)
  4. female cites female = 0
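The target values for pair = 0 can be double-checked with a small sketch that collects the distinct citations into a set and tallies the gender combinations. The rows are copied from the sample above. Pooling the x1 -> y1 and x2 -> y2 citations into one set is an assumption made for this check only; the names rows, edges and counts are illustrative.

```python
from collections import Counter

# Rows with pair = 0 from the sample:
# (x1, x1gen, x2, x2gen, y1, y1gen, y2, y2gen)
rows = [
    ("A", "male",   "H",  "female", "a", "male", "d", "male"),
    ("A", "male",   "W",  "male",   "a", "male", "d", "male"),
    ("A", "male",   "KK", "female", "a", "male", "d", "male"),
    ("B", "female", "C",  "male",   "a", "male", "d", "male"),
    ("B", "female", "W",  "male",   "a", "male", "d", "male"),
    ("B", "female", "BB", "female", "a", "male", "d", "male"),
    ("B", "female", "KK", "female", "a", "male", "d", "male"),
    ("F", "male",   "W",  "male",   "a", "male", "d", "male"),
]

# A set keeps each (citing, citing gender, cited, cited gender) edge once.
edges = set()
for x1, x1gen, x2, x2gen, y1, y1gen, y2, y2gen in rows:
    edges.add((x1, x1gen, y1, y1gen))  # x1 cites y1
    edges.add((x2, x2gen, y2, y2gen))  # x2 cites y2

# Tally the gender combination of every distinct edge.
counts = Counter(f"{src_gen}_{dst_gen}" for _, src_gen, _, dst_gen in edges)
print(counts)
```

A Counter returns 0 for combinations that never occur, so counts["male_female"] and counts["female_female"] can be read directly.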

What I originally did was use a for loop with a set of if statements, creating four lists each for x1 and x2:

mm = [1]
mf = [0]
fm = [0]
ff = [0]

mm1 = 1
mf1 = 0
fm1 = 0
ff1 = 0

for i in range(1, len(df)):
  if df['pair'][i] == df['pair'][i-1]:
    if df['x1'][i] != df['x1'][i-1]:
      if df['x1gen'][i] == 'male':
        if df['y1gen'][i] == 'male':
          mm1 += 1
        else:
          mf1 += 1
      else:
        if df['y1gen'][i] == 'male':
          fm1 += 1
        else:
          ff1 += 1
...

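The rest is similar (the code itself is many lines long, but it just repeats these lines). As you can see, this is very inefficient (it takes about 120 minutes).

What is the best way to find these values without such inefficient string matching?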

Tags: python, pandas, dataframe

Solution


You can try the following:

import io  
import re
import pandas as pd

# this just recreates the dataframe 
s = '''
x1    x1gen    x2     x2gen     y1     y1gen     y2     y2gen      pair
A     male     H      female    a      male      d      male       0
A     male     W      male      a      male      d      male       0
A     male     KK     female    a      male      d      male       0
B     female   C      male      a      male      d      male       0
B     female   W      male      a      male      d      male       0
B     female   BB     female    a      male      d      male       0
B     female   KK     female    a      male      d      male       0
F     male     W      male      a      male      d      male       0
A     male     T      female    b      female    d      male       1
A     male     BB     female    b      female    d      male       1
B     female   C      male      b      female    d      male       1
D     male     E      male      b      female    d      male       1
A     male     C      male      b      female    e      female     2
'''

s = re.sub(r" +", " ", s)
df = pd.read_csv(io.StringIO(s), sep=" ")
print(df)

This gives:

   x1   x1gen  x2   x2gen y1   y1gen y2   y2gen  pair
0   A    male   H  female  a    male  d    male     0
1   A    male   W    male  a    male  d    male     0
2   A    male  KK  female  a    male  d    male     0
3   B  female   C    male  a    male  d    male     0
4   B  female   W    male  a    male  d    male     0
5   B  female  BB  female  a    male  d    male     0
6   B  female  KK  female  a    male  d    male     0
7   F    male   W    male  a    male  d    male     0
8   A    male   T  female  b  female  d    male     1
9   A    male  BB  female  b  female  d    male     1
10  B  female   C    male  b  female  d    male     1
11  D    male   E    male  b  female  d    male     1
12  A    male   C    male  b  female  e  female     2
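As a possible simplification, the regex step may not be needed: read_csv accepts a regular expression as sep, so runs of whitespace can be treated as one delimiter. A minimal sketch on three of the sample rows (df_ws is just an illustrative name):

```python
import io
import pandas as pd

s = '''
x1    x1gen    x2     x2gen     y1     y1gen     y2     y2gen      pair
A     male     H      female    a      male      d      male       0
A     male     W      male      a      male      d      male       0
B     female   C      male      b      female    d      male       1
'''

# sep=r"\s+" treats any run of whitespace as a single delimiter,
# so the columns do not have to be collapsed to single spaces first.
df_ws = pd.read_csv(io.StringIO(s), sep=r"\s+")
print(df_ws)
```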

Count the citation pairs over the whole dataframe:

# count x1-> y1 pairs
df1 = df.drop_duplicates(subset=['x1', 'y1', 'pair'])
c1 = (df1['x1gen'] + '_' + df1['y1gen']).value_counts()

# count x2-> y2 pairs
df2 = df.drop_duplicates(subset=['x2', 'y2', 'pair'])
c2 = (df2['x2gen'] + '_' + df2['y2gen']).value_counts()

# add results
c1.add(c2, fill_value=0).astype(int)

This gives:

female_female    1
female_male      6
male_female      4
male_male        6

To compute the results separately for each pair:

def cit_count(g):

    # count x1-> y1 pairs
    df1 = g.drop_duplicates(subset=['x1', 'y1'])
    c1 = (df1['x1gen'] + '_' + df1['y1gen']).value_counts()

    # count x2-> y2 pairs
    df2 = g.drop_duplicates(subset=['x2', 'y2'])
    c2 = (df2['x2gen'] + '_' + df2['y2gen']).value_counts()

    # add results
    return c1.add(c2, fill_value=0)

print(df.groupby('pair').apply(cit_count).unstack().fillna(0).astype(int))

This gives:

      female_female  female_male  male_female  male_male
pair                                                    
0                 0            4            0          4
1                 1            2            2          2
2                 0            0            2          0
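With around 10 million rows, calling a Python function once per group may become slow when there are many pairs. One possible alternative, sketched here as an assumption rather than a measured improvement, is to reshape both citation columns into one long table, drop duplicates once, and tabulate with pd.crosstab. The helper name citation_counts and the columns src, dst and slot are made up for illustration.

```python
import pandas as pd

cols = ["x1", "x1gen", "x2", "x2gen", "y1", "y1gen", "y2", "y2gen", "pair"]
data = [
    ("A", "male", "H", "female", "a", "male", "d", "male", 0),
    ("A", "male", "W", "male", "a", "male", "d", "male", 0),
    ("A", "male", "KK", "female", "a", "male", "d", "male", 0),
    ("B", "female", "C", "male", "a", "male", "d", "male", 0),
    ("B", "female", "W", "male", "a", "male", "d", "male", 0),
    ("B", "female", "BB", "female", "a", "male", "d", "male", 0),
    ("B", "female", "KK", "female", "a", "male", "d", "male", 0),
    ("F", "male", "W", "male", "a", "male", "d", "male", 0),
    ("A", "male", "T", "female", "b", "female", "d", "male", 1),
    ("A", "male", "BB", "female", "b", "female", "d", "male", 1),
    ("B", "female", "C", "male", "b", "female", "d", "male", 1),
    ("D", "male", "E", "male", "b", "female", "d", "male", 1),
    ("A", "male", "C", "male", "b", "female", "e", "female", 2),
]
df = pd.DataFrame(data, columns=cols)


def citation_counts(df: pd.DataFrame) -> pd.DataFrame:
    """Count distinct citations per pair and gender combination."""
    parts = []
    for slot in ("1", "2"):
        # Select the x -> y citation of this slot and give it common names.
        part = df[["pair", f"x{slot}", f"x{slot}gen", f"y{slot}", f"y{slot}gen"]].copy()
        part.columns = ["pair", "src", "src_gen", "dst", "dst_gen"]
        part["slot"] = slot  # keep x1 and x2 citations separate, as in cit_count
        parts.append(part)
    edges = pd.concat(parts, ignore_index=True)

    # Keep each citation once per pair, then tabulate the gender combinations.
    edges = edges.drop_duplicates(subset=["pair", "slot", "src", "dst"])
    combo = edges["src_gen"] + "_" + edges["dst_gen"]
    return pd.crosstab(edges["pair"], combo)


result = citation_counts(df)
print(result)
```

On the sample data this should reproduce the per-pair table above. Whether it is actually faster on the full dataframe would need to be timed.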
