Pandas: Concatenating a DataFrame with a distance matrix

Problem description

I am trying to concatenate two Pandas DataFrames, but the result comes out wrong.

The initial dataset looks like this:

df
>>>
            well    qoil    cum_oil         wct     top_perf    bot_perf    st  x       y
    5233    101     259     3.684131e+05    97      -2352.13    -2359.12    0   517228  5931024
    12786   102     3495    1.369303e+06    5.47    -2352.92    -2566.81    0   517192  5927187
    13062   103     2691    1.353718e+06    0.5     -2377.93    -2581.73    0   517731  5926430
    . . . .
65 rows × 9 columns

Then I compute the Euclidean distance between every pair of wells from the x and y coordinates (the last two columns):

import pandas as pd
from sklearn.neighbors import DistanceMetric

dist = DistanceMetric.get_metric('euclidean')
loc = pd.DataFrame(dist.pairwise(df[['x','y']].to_numpy()),
                   columns=df.well.unique(), index=df.well.unique())

and get a 65x65 matrix (of type pandas.core.frame.DataFrame) containing the distances between every pair of wells:

loc
>>>
    101         102         103         . . . 
101 0.000000    152.278917  270.835312  . . .
102 151.278917  0.000000    326.310146  . . .
103 270.835312  346.310146  0.000000    . . .
. . .

Then I drop the extra columns and concatenate the two DataFrames:

df_train_prep = df.drop(['well', 'wct', 'x', 'y'], axis=1)
df2 = pd.concat([df_train_prep, loc], axis=1)

As a result, instead of a DataFrame with 65 rows x (9 + 65) columns, I get one with 130 rows x 70 columns, for example:

df2
>>>
    qoil    cum_oil     top_perf    bot_perf    st  101 102 103 . . .
236 0.001   542681.0    -2427.66    -2539.25    0.0 NaN NaN NaN NaN NaN ... 
258 2291    292356.0    -2537.38    -2657.02    1.0 NaN NaN NaN NaN NaN ... 
537 3290    237163.0    -2714.32    -2741.49    0.0 NaN NaN NaN NaN NaN ... 
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
101 NaN NaN NaN NaN NaN 0.000000    157.278917  280.835312  323.423701  ...
102 NaN NaN NaN NaN NaN 154.278917  0.000000    356.310146  210.348200  518.786999  ... 

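It looks as though some of the data was joined on the right, while the rest was shifted to the bottom. Unexpected NaN values also appeared. Please help me understand what I am doing wrong.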

Tags: python, pandas, dataframe, scikit-learn, euclidean-distance

Solution


import pandas as pd
from sklearn.metrics.pairwise import euclidean_distances

# Dummy data (default RangeIndex 0..4)
df = pd.DataFrame({'x': range(5), 'y': range(5)})

# Pairwise Euclidean distances (also indexed 0..4 by default)
distance = pd.DataFrame(euclidean_distances(df[['x', 'y']]))

# Concatenate column-wise; rows line up because both indexes match
df = pd.concat([df, distance], axis=1)
print(df)

Output:

    x   y   0           1           2           3           4
0   0   0   0.000000    1.414214    2.828427    4.242641    5.656854
1   1   1   1.414214    0.000000    1.414214    2.828427    4.242641
2   2   2   2.828427    1.414214    0.000000    1.414214    2.828427
3   3   3   4.242641    2.828427    1.414214    0.000000    1.414214
4   4   4   5.656854    4.242641    2.828427    1.414214    0.000000

As you can see, the pairwise distance matrix is symmetric.
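The dummy example works because both frames share the default index 0..4. `pd.concat(..., axis=1)` aligns rows by index label, so in the question the row labels of `df` (5233, 12786, ...) and of `loc` (the well numbers 101, 102, ...) have nothing in common: the result is the union of both indexes (65 + 65 = 130 rows), padded with NaN. The sketch below reproduces this on three rows taken from the question and shows one possible fix, which is to give the distance frame the same index as `df`. The names `loc_bad`, `loc_ok`, `bad` and `ok` are made up for illustration, and the distances are computed with plain NumPy broadcasting as a stand-in for the scikit-learn call.

```python
import numpy as np
import pandas as pd

# Three rows from the question, keeping its non-default row labels
df = pd.DataFrame(
    {
        "well": [101, 102, 103],
        "qoil": [259, 3495, 2691],
        "x": [517228, 517192, 517731],
        "y": [5931024, 5927187, 5926430],
    },
    index=[5233, 12786, 13062],
)

# Pairwise Euclidean distances via broadcasting (stand-in for sklearn)
xy = df[["x", "y"]].to_numpy(dtype=float)
dist = np.linalg.norm(xy[:, None, :] - xy[None, :, :], axis=-1)

wells = df["well"].tolist()
features = df.drop(columns=["x", "y"])

# As in the question: rows labelled by well number, so no label matches df's index
loc_bad = pd.DataFrame(dist, columns=wells, index=wells)
bad = pd.concat([features, loc_bad], axis=1)

# Possible fix: reuse df's index so the rows align one-to-one
loc_ok = pd.DataFrame(dist, columns=wells, index=df.index)
ok = pd.concat([features, loc_ok], axis=1)

print(bad.shape)  # (6, 5): union of two disjoint indexes, filled with NaN
print(ok.shape)   # (3, 5): one row per well, no NaN
```

An alternative under the same assumption is to call `df.set_index('well')` before concatenating, so that both frames are labelled by well number; which one is preferable depends on whether the original row labels need to be kept.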

