python - Comparing multiple columns of a large DataFrame with complicated duplicate rows
Question
I have a massive dataframe df with about 10 million rows:
df.sort_values(['pair','x1','x2'])
x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair
-------------------------------------------------------------------------------
A male H female a male d male 0
A male W male a male d male 0 (*)
A male KK female a male d male 0 (**)
B female C male a male d male 0 (-)
B female W male a male d male 0 (*)
B female BB female a male d male 0
B female KK female a male d male 0 (**)
F male W male a male d male 0 (*)
A male T female b female d male 1
A male BB female b female d male 1
B female C male b female d male 1 (-)
D male E male b female d male 1
A male C male b female e female 2
...
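Each column can be interpreted as follows:
- x1gen is the gender of x1, x2gen is the gender of x2, and so on.
- x1 cites y1 and x2 cites y2.
- Each combination of y1 and y2 is assigned a unique pair value.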
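My goal is to find four values for each unique pair:
- male citing male
- male citing female
- female citing male
- female citing female
Within a pair, each citation should not be counted more than once.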
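For example, in the given sample, x2 = W appears 3 times in pair = 0 (see (*)), so it should be counted once, not 3 times. The same applies to x2 = KK in pair = 0 (see (**)). However, the same citation can be counted again when it belongs to a new pair: C -> d (see (-)) is counted once in pair = 0 and once in pair = 1.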
Thus, for the first pair, pair = 0, the target values are:
- male citing male = 4 (A -> a, F -> a, W -> d, C -> d)
- male citing female = 0
- female citing male = 4 (B -> a, H -> d, KK -> d, BB -> d)
- female citing female = 0
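The counting rule above can be checked by hand for pair = 0 with a plain Python set, which keeps each citation once no matter how many rows repeat it. This is only a minimal sketch of the rule on the sample rows; the names rows, citations, and counts are illustrative and do not come from the original post.

```python
# Rows of pair = 0 from the sample: (x1, x1gen, x2, x2gen).
# In this pair y1 = a and y2 = d, and both are male.
rows = [
    ("A", "male", "H", "female"),
    ("A", "male", "W", "male"),
    ("A", "male", "KK", "female"),
    ("B", "female", "C", "male"),
    ("B", "female", "W", "male"),
    ("B", "female", "BB", "female"),
    ("B", "female", "KK", "female"),
    ("F", "male", "W", "male"),
]

# A set stores each (source, target, source gender, target gender) once.
citations = set()
for x1, x1gen, x2, x2gen in rows:
    citations.add((x1, "a", x1gen, "male"))  # x1 cites y1
    citations.add((x2, "d", x2gen, "male"))  # x2 cites y2

# Tally the unique citations by gender combination.
counts = {}
for _src, _dst, src_gen, dst_gen in citations:
    key = f"{src_gen}_{dst_gen}"
    counts[key] = counts.get(key, 0) + 1

print(counts)
```

The tally contains 4 for male_male and 4 for female_male, and no entry for the other two combinations, which agrees with the target values listed for pair = 0.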
What I initially did was use a for loop with a set of if statements, creating four lists each for x1 and x2:
mm = [1]
mf = [0]
fm = [0]
ff = [0]
mm1 = 1
mf1 = 0
fm1 = 0
ff1 = 0
for i in range(1, len(df)):
    if df['pair'][i] == df['pair'][i-1]:
        if df['x1'][i] != df['x1'][i-1]:
            if df['x1gen'][i] == 'male':
                if df['y1gen'][i] == 'male':
                    mm1 += 1
                else:
                    mf1 += 1
            else:
                if df['y1gen'][i] == 'male':
                    fm1 += 1
                else:
                    ff1 += 1
...
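The rest follows the same pattern (the code itself has many lines, but they are just repetitions of these). As you can see, this is very inefficient (it takes about 120 minutes).
What is the best way to find these values without doing very inefficient string matching?
Answer
You can try the following: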
import io
import re
import pandas as pd
# this just recreates the dataframe
s = '''
x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair
A male H female a male d male 0
A male W male a male d male 0
A male KK female a male d male 0
B female C male a male d male 0
B female W male a male d male 0
B female BB female a male d male 0
B female KK female a male d male 0
F male W male a male d male 0
A male T female b female d male 1
A male BB female b female d male 1
B female C male b female d male 1
D male E male b female d male 1
A male C male b female e female 2
'''
s = re.sub(r" +", " ", s)
df = pd.read_csv(io.StringIO(s), sep=" ")
print(df)
This prints:
x1 x1gen x2 x2gen y1 y1gen y2 y2gen pair
0 A male H female a male d male 0
1 A male W male a male d male 0
2 A male KK female a male d male 0
3 B female C male a male d male 0
4 B female W male a male d male 0
5 B female BB female a male d male 0
6 B female KK female a male d male 0
7 F male W male a male d male 0
8 A male T female b female d male 1
9 A male BB female b female d male 1
10 B female C male b female d male 1
11 D male E male b female d male 1
12 A male C male b female e female 2
Counting the citation pairs over the whole dataframe:
# count x1-> y1 pairs
df1 = df.drop_duplicates(subset=['x1', 'y1', 'pair'])
c1 = (df1['x1gen'] + '_' + df1['y1gen']).value_counts()
# count x2-> y2 pairs
df2 = df.drop_duplicates(subset=['x2', 'y2', 'pair'])
c2 = (df2['x2gen'] + '_' + df2['y2gen']).value_counts()
# add results
c1.add(c2, fill_value=0).astype(int)
This gives:
female_female 1
female_male 6
male_female 4
male_male 6
Computing the results for each pair separately:
def cit_count(g):
    # count x1-> y1 pairs
    df1 = g.drop_duplicates(subset=['x1', 'y1'])
    c1 = (df1['x1gen'] + '_' + df1['y1gen']).value_counts()
    # count x2-> y2 pairs
    df2 = g.drop_duplicates(subset=['x2', 'y2'])
    c2 = (df2['x2gen'] + '_' + df2['y2gen']).value_counts()
    # add results
    return c1.add(c2, fill_value=0)
print(df.groupby('pair').apply(cit_count).unstack().fillna(0).astype(int))
This prints:
female_female female_male male_female male_male
pair
0 0 4 0 4
1 1 2 2 2
2 0 0 2 0
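As a possible variation, assuming the per-group apply turns out to be a bottleneck at around 10 million rows (this was not measured here), the same per-pair table can be sketched with one drop_duplicates per citation direction followed by a single pd.crosstab. The helper name citation_counts and the intermediate column kind are illustrative and not part of the original answer.

```python
import pandas as pd

cols = ["x1", "x1gen", "x2", "x2gen", "y1", "y1gen", "y2", "y2gen", "pair"]
rows = [
    ("A", "male", "H", "female", "a", "male", "d", "male", 0),
    ("A", "male", "W", "male", "a", "male", "d", "male", 0),
    ("A", "male", "KK", "female", "a", "male", "d", "male", 0),
    ("B", "female", "C", "male", "a", "male", "d", "male", 0),
    ("B", "female", "W", "male", "a", "male", "d", "male", 0),
    ("B", "female", "BB", "female", "a", "male", "d", "male", 0),
    ("B", "female", "KK", "female", "a", "male", "d", "male", 0),
    ("F", "male", "W", "male", "a", "male", "d", "male", 0),
    ("A", "male", "T", "female", "b", "female", "d", "male", 1),
    ("A", "male", "BB", "female", "b", "female", "d", "male", 1),
    ("B", "female", "C", "male", "b", "female", "d", "male", 1),
    ("D", "male", "E", "male", "b", "female", "d", "male", 1),
    ("A", "male", "C", "male", "b", "female", "e", "female", 2),
]
df = pd.DataFrame(rows, columns=cols)


def citation_counts(df):
    """Count unique gendered citations per pair without groupby.apply."""
    parts = []
    for src, dst in (("x1", "y1"), ("x2", "y2")):
        # Keep each (source, target) citation once per pair.
        uniq = df.drop_duplicates(subset=["pair", src, dst])
        parts.append(pd.DataFrame({
            "pair": uniq["pair"].to_numpy(),
            "kind": (uniq[src + "gen"] + "_" + uniq[dst + "gen"]).to_numpy(),
        }))
    stacked = pd.concat(parts, ignore_index=True)
    # One row per pair, one column per gender combination, zeros filled in.
    return pd.crosstab(stacked["pair"], stacked["kind"])


result = citation_counts(df)
print(result)
```

On the sample data this should reproduce the per-pair table above. Note that crosstab only creates columns for gender combinations that actually occur, so a combination that is absent from the whole dataframe would need to be added with reindex if all four columns are required.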