首页 > 解决方案 > 如何使用循环和条件语句迭代 2 个数据帧以创建新数据帧

问题描述

我有 2 个数据框,我想使用循环来执行以下操作。

  1. 选择一个簇列来分析并删除两个表中的所有其他簇列

例如:如果分析 Clust 1,在 df 中,只应保留索引、G-pairs、G_A、G_B 和 Clust 1。 在 df2 中,只有 Clust _1_n 和 cluster_1_l 应该保留。

  1. 遍历 df 和 df2 的每一行,如果 df 中的 G_A 和 df2 中的 clust_1_n 之间存在 ID 匹配,则向 df 添加一个名为“G_A_val”的新列并输入关联的 Clust_1_l 值

  2. 遍历 df 和 df2 的每一行,如果 df 中的 G_B 和 df2 中的 clust_1_n 之间存在 ID 匹配,则向 df 添加一个标题为“G_B_val”的新列并输入关联的 Clust_1_l 值

  3. 向 df 添加一个名为 sums 的新列,其中包含每行的 G_A_val 和 G_B_val 的总和

  4. 最后,应用一个条件,如果 Clust 1 列中的一行 >= 0.051,则将 df 中该行的 sums 列值替换为 0

** 我可以在没有循环的情况下执行这些任务,但我正在尝试学习如何在数据帧中使用循环。

这些是数据框

import pandas as pd
 
# initialise data of lists.
data = {'g_pairs':['an_jk', 'tf_ha', 'ab_rh', 'et_x2','yr_po'],
        'g_a':['en762','en72b','en925','en980','en009'],
        'g_b':['en361','en231','en666','en771','en909'],
        'clust1':[0.020,1,0.05,0.7,0.001],
        'clust2':[1,1,0.1,0.0001,1],
        'clust3':[0.001,0.002,0.0021,0.3,0.005],
        'clust4':[0.0002,0.0043,0.0067,0.0123,0.0110],
        'clust5':[0.7500,0.0540,0.0030,1,0.0081]}
 
# Creates pandas DataFrame.
df = pd.DataFrame(data)
 
# print the data
df
    g_pairs g_a     g_b     clust1  clust2  clust3  clust4  clust5
0   an_jk   en762   en361   0.020   1.0000  0.0010  0.0002  0.7500
1   tf_ha   en72b   en231   1.000   1.0000  0.0020  0.0043  0.0540
2   ab_rh   en925   en666   0.050   0.1000  0.0021  0.0067  0.0030
3   et_x2   en980   en771   0.700   0.0001  0.3000  0.0123  1.0000
4   yr_po   en009   en909   0.001   1.0000  0.0050  0.0110  0.0081
data2 = {'clust_1n':['en762','en72b','en925','en980','en009','en361','en231','en666','en771','en909'],
         'clust_1l':[0.35,0.30,0.004,0.58,0.55,0.94,0.78,0.62,0.366,0.01],
         'clust_2n':['en762','en72b','en925','en980','en009','en361','en231','en666','en771','en909'],
         'clust_2l':[0.37,0.0130,0.385,0.57,0.6,0.19,0.41,0.92,0.46,0.44],
         'clust_3n':['en762','en72b','en925','en980','en009','en361','en231','en666','en771','en909'],
         'clust_3l':[0.29,0.01130,0.2385,0.457,0.16,0.149,0.411,0.992,0.146,0.944],
         'clust_4n':['en762','en72b','en925','en980','en009','en361','en231','en666','en771','en909'],
         'clust_4l':[0.529,0.0001130,0.12385,0.6457,0.816,0.9149,0.3411,0.0992,0.3146,0.0944],
         'clust_5n':['en762','en72b','en925','en980','en009','en361','en231','en666','en771','en909'],
         'clust_5l':[0.229,0.001130,0.12385,0.4457,0.616,0.1549,0.4911,0.0992,0.9146,0.2944]}
        
 
# Creates pandas DataFrame.
df2 = pd.DataFrame(data2)
 
# print the data
df2
clust_1n    clust_1l    clust_2n    clust_2l    clust_3n    clust_3l    clust_4n    clust_4l    clust_5n    clust_5l
0   en762   0.350   en762   0.370   en762   0.2900  en762   0.529000    en762   0.22900
1   en72b   0.300   en72b   0.013   en72b   0.0113  en72b   0.000113    en72b   0.00113
2   en925   0.004   en925   0.385   en925   0.2385  en925   0.123850    en925   0.12385
3   en980   0.580   en980   0.570   en980   0.4570  en980   0.645700    en980   0.44570
4   en009   0.550   en009   0.600   en009   0.1600  en009   0.816000    en009   0.61600
5   en361   0.940   en361   0.190   en361   0.1490  en361   0.914900    en361   0.15490
6   en231   0.780   en231   0.410   en231   0.4110  en231   0.341100    en231   0.49110
7   en666   0.620   en666   0.920   en666   0.9920  en666   0.099200    en666   0.09920
8   en771   0.366   en771   0.460   en771   0.1460  en771   0.314600    en771   0.91460
9   en909   0.010   en909   0.440   en909   0.9440  en909   0.094400    en909   0.29440

这就是输出的样子

   g_pairs  g_a     g_b     clust1  g_a_val g_b_val sums
0   an_jk   en762   en361   0.020   0.350   0.940   1.29
1   tf_ha   en72b   en231   1.000   0.300   0.780   0.00
2   ab_rh   en925   en666   0.050   0.004   0.620   0.97
3   et_x2   en980   en771   0.700   0.580   0.366   0.00
4   yr_po   en009   en909   0.001   0.550   0.010   0.56

非常感谢您提前。

标签: pythondataframefor-loopconditional-statements

解决方案


编辑:对不起,我错过了

“我可以在没有循环的情况下执行这些任务,但我正在尝试学习如何在数据帧中使用循环。”

这是否意味着我下面的第一个答案没用?:)) 如果可以避免,您不应该遍历 DataFrames!但是,如果您真的想这样做,这将是一种方法:

cluster = 1
df_res = df[["g_pairs", "g_a", "g_b", f"clust{cluster}"]].assign(g_a_val=0, g_b_val=0, sum=0)
for i, g_a, g_b, clust in df[["g_a", "g_b", f"clust{cluster}"]].itertuples():
    for clust_n, clust_l in df2[[f"clust_{cluster}n", f"clust_{cluster}l"]].itertuples(index=False):
        if g_a == clust_n:
            g_a_val = clust_l
            df_res.loc[i, "g_a_val"] = g_a_val
            break
    for clust_n, clust_l in df2[[f"clust_{cluster}n", f"clust_{cluster}l"]].itertuples(index=False):
        if g_b == clust_n:
            g_b_val = clust_l
            df_res.loc[i, "g_b_val"] = g_b_val
            break
    if clust < 0.51:
        df_res.loc[i, "sum"] = g_a_val + g_b_val

您可以执行以下操作:

cluster = 1
df_res = (
    df[["g_pairs", "g_a", "g_b", f"clust{cluster}"]]
    .merge(
        df2[[f"clust_{cluster}n", f"clust_{cluster}l"]],
        left_on="g_a",
        right_on=f"clust_{cluster}n",
        how="left"
    )
    .drop(columns=f"clust_{cluster}n")
    .rename(columns={f"clust_{cluster}l": "g_a_val"})
    .merge(
        df2[[f"clust_{cluster}n", f"clust_{cluster}l"]],
        left_on="g_b",
        right_on=f"clust_{cluster}n",
        how="left"
    )
    .drop(columns=f"clust_{cluster}n")
    .rename(columns={f"clust_{cluster}l": "g_b_val"})
)
df_res["sum"] = df_res.g_a_val + df_res.g_b_val
df_res.loc[df[f"clust{cluster}"] >= 0.51, "sum"] = 0

结果:

  g_pairs    g_a    g_b  clust1  g_a_val  g_b_val    sum
0   an_jk  en762  en361   0.020    0.350    0.940  1.290
1   tf_ha  en72b  en231   1.000    0.300    0.780  0.000
2   ab_rh  en925  en666   0.050    0.004    0.620  0.624
3   et_x2  en980  en771   0.700    0.580    0.366  0.000
4   yr_po  en009  en909   0.001    0.550    0.010  0.560

推荐阅读