python - 如何使用循环和条件语句迭代 2 个数据帧以创建新数据帧
问题描述
我有 2 个数据框,我想使用循环来执行以下操作。
- 选择一个簇列来分析并删除两个表中的所有其他簇列
例如:如果分析 Clust 1,在 df 中,只应保留索引、G-pairs、G_A、G_B 和 Clust 1。 在 df2 中,只有 Clust _1_n 和 cluster_1_l 应该保留。
遍历 df 和 df2 的每一行,如果 df 中的 G_A 和 df2 中的 clust_1_n 之间存在 ID 匹配,则向 df 添加一个名为“G_A_val”的新列并输入关联的 Clust_1_l 值
遍历 df 和 df2 的每一行,如果 df 中的 G_B 和 df2 中的 clust_1_n 之间存在 ID 匹配,则向 df 添加一个标题为“G_B_val”的新列并输入关联的 Clust_1_l 值
向 df 添加一个名为 sums 的新列,其中包含每行的 G_A_val 和 G_B_val 的总和
最后,应用一个条件,如果 Clust 1 列中的一行 >= 0.051,则将 df 中该行的 sums 列值替换为 0
** 我可以在没有循环的情况下执行这些任务,但我正在尝试学习如何在数据帧中使用循环。
这些是数据框
import pandas as pd
# initialise data of lists.
data = {'g_pairs':['an_jk', 'tf_ha', 'ab_rh', 'et_x2','yr_po'],
'g_a':['en762','en72b','en925','en980','en009'],
'g_b':['en361','en231','en666','en771','en909'],
'clust1':[0.020,1,0.05,0.7,0.001],
'clust2':[1,1,0.1,0.0001,1],
'clust3':[0.001,0.002,0.0021,0.3,0.005],
'clust4':[0.0002,0.0043,0.0067,0.0123,0.0110],
'clust5':[0.7500,0.0540,0.0030,1,0.0081]}
# Creates pandas DataFrame.
df = pd.DataFrame(data)
# print the data
df
g_pairs g_a g_b clust1 clust2 clust3 clust4 clust5
0 an_jk en762 en361 0.020 1.0000 0.0010 0.0002 0.7500
1 tf_ha en72b en231 1.000 1.0000 0.0020 0.0043 0.0540
2 ab_rh en925 en666 0.050 0.1000 0.0021 0.0067 0.0030
3 et_x2 en980 en771 0.700 0.0001 0.3000 0.0123 1.0000
4 yr_po en009 en909 0.001 1.0000 0.0050 0.0110 0.0081
data2 = {'clust_1n':['en762','en72b','en925','en980','en009','en361','en231','en666','en771','en909'],
'clust_1l':[0.35,0.30,0.004,0.58,0.55,0.94,0.78,0.62,0.366,0.01],
'clust_2n':['en762','en72b','en925','en980','en009','en361','en231','en666','en771','en909'],
'clust_2l':[0.37,0.0130,0.385,0.57,0.6,0.19,0.41,0.92,0.46,0.44],
'clust_3n':['en762','en72b','en925','en980','en009','en361','en231','en666','en771','en909'],
'clust_3l':[0.29,0.01130,0.2385,0.457,0.16,0.149,0.411,0.992,0.146,0.944],
'clust_4n':['en762','en72b','en925','en980','en009','en361','en231','en666','en771','en909'],
'clust_4l':[0.529,0.0001130,0.12385,0.6457,0.816,0.9149,0.3411,0.0992,0.3146,0.0944],
'clust_5n':['en762','en72b','en925','en980','en009','en361','en231','en666','en771','en909'],
'clust_5l':[0.229,0.001130,0.12385,0.4457,0.616,0.1549,0.4911,0.0992,0.9146,0.2944]}
# Creates pandas DataFrame.
df2 = pd.DataFrame(data2)
# print the data
df2
clust_1n clust_1l clust_2n clust_2l clust_3n clust_3l clust_4n clust_4l clust_5n clust_5l
0 en762 0.350 en762 0.370 en762 0.2900 en762 0.529000 en762 0.22900
1 en72b 0.300 en72b 0.013 en72b 0.0113 en72b 0.000113 en72b 0.00113
2 en925 0.004 en925 0.385 en925 0.2385 en925 0.123850 en925 0.12385
3 en980 0.580 en980 0.570 en980 0.4570 en980 0.645700 en980 0.44570
4 en009 0.550 en009 0.600 en009 0.1600 en009 0.816000 en009 0.61600
5 en361 0.940 en361 0.190 en361 0.1490 en361 0.914900 en361 0.15490
6 en231 0.780 en231 0.410 en231 0.4110 en231 0.341100 en231 0.49110
7 en666 0.620 en666 0.920 en666 0.9920 en666 0.099200 en666 0.09920
8 en771 0.366 en771 0.460 en771 0.1460 en771 0.314600 en771 0.91460
9 en909 0.010 en909 0.440 en909 0.9440 en909 0.094400 en909 0.29440
这就是输出的样子
g_pairs g_a g_b clust1 g_a_val g_b_val sums
0 an_jk en762 en361 0.020 0.350 0.940 1.29
1 tf_ha en72b en231 1.000 0.300 0.780 0.00
2 ab_rh en925 en666 0.050 0.004 0.620 0.97
3 et_x2 en980 en771 0.700 0.580 0.366 0.00
4 yr_po en009 en909 0.001 0.550 0.010 0.56
非常感谢您提前。
解决方案
编辑:对不起,我错过了
“我可以在没有循环的情况下执行这些任务,但我正在尝试学习如何在数据帧中使用循环。”
这是否意味着我下面的第一个答案没用?:)) 如果可以避免,您不应该遍历 DataFrames!但是,如果您真的想这样做,这将是一种方法:
cluster = 1
df_res = df[["g_pairs", "g_a", "g_b", f"clust{cluster}"]].assign(g_a_val=0, g_b_val=0, sum=0)
for i, g_a, g_b, clust in df[["g_a", "g_b", f"clust{cluster}"]].itertuples():
for clust_n, clust_l in df2[[f"clust_{cluster}n", f"clust_{cluster}l"]].itertuples(index=False):
if g_a == clust_n:
g_a_val = clust_l
df_res.loc[i, "g_a_val"] = g_a_val
break
for clust_n, clust_l in df2[[f"clust_{cluster}n", f"clust_{cluster}l"]].itertuples(index=False):
if g_b == clust_n:
g_b_val = clust_l
df_res.loc[i, "g_b_val"] = g_b_val
break
if clust < 0.51:
df_res.loc[i, "sum"] = g_a_val + g_b_val
您可以执行以下操作:
cluster = 1
df_res = (
df[["g_pairs", "g_a", "g_b", f"clust{cluster}"]]
.merge(
df2[[f"clust_{cluster}n", f"clust_{cluster}l"]],
left_on="g_a",
right_on=f"clust_{cluster}n",
how="left"
)
.drop(columns=f"clust_{cluster}n")
.rename(columns={f"clust_{cluster}l": "g_a_val"})
.merge(
df2[[f"clust_{cluster}n", f"clust_{cluster}l"]],
left_on="g_b",
right_on=f"clust_{cluster}n",
how="left"
)
.drop(columns=f"clust_{cluster}n")
.rename(columns={f"clust_{cluster}l": "g_b_val"})
)
df_res["sum"] = df_res.g_a_val + df_res.g_b_val
df_res.loc[df[f"clust{cluster}"] >= 0.51, "sum"] = 0
结果:
g_pairs g_a g_b clust1 g_a_val g_b_val sum
0 an_jk en762 en361 0.020 0.350 0.940 1.290
1 tf_ha en72b en231 1.000 0.300 0.780 0.000
2 ab_rh en925 en666 0.050 0.004 0.620 0.624
3 et_x2 en980 en771 0.700 0.580 0.366 0.000
4 yr_po en009 en909 0.001 0.550 0.010 0.560
推荐阅读
- python - How to iterate this n dimensional array?
- iso8583 - difference between iso8583 authorize request and authorize advice request
- ruby-on-rails - Ruby On Rails - "OpenSSL::Cipher::CipherError at ..." when connecting to imported database
- python-3.x - How to improve SpaCy matcher pattern
- wordpress - Can we replicate ACF functionality in our theme?
- r - R问题:从字符串结尾提取数字字符
- spring - 如何跳过某些路线的安全检查?
- java - 即使函数签名抛出它,编译器也会抱怨未处理的 IOException
- rust - 性能分析
- python - 如何在 Django 模板语言中进行算术运算后将值重新分配给变量