Creating a co-occurrence matrix

Problem description

    | 0                 | 1                 | 2                 | 3
_______________________________________________________________________________
|0  | (-1.774, 1.145]   | (-3.21, 0.533]    | (0.0166, 2.007]   | (2.0, 3.997]
_______________________________________________________________________________
|1  | (-1.774, 1.145]   | (-3.21, 0.533]    | (2.007, 3.993]    | (2.0, 3.997]
_______________________________________________________________________________

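I am trying to create a co-occurrence matrix for a dataset like the one above, which has 800 records and 12 categorical variables. I want the matrix to count how often each category of each variable occurs together with each category of every other variable.

Tags: python, python-3.x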

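Solution


You can use OneHotEncoder() and a dot product (np.dot()):

  1. Convert every element of the dataframe to a string
  2. Use a one-hot encoder to turn the dataframe into one-hot vectors over the vocabulary of unique categorical values
  3. Take a dot product to count the co-occurrences
  4. Rebuild a dataframe from the co-occurrence matrix and the feature names provided by the one-hot encoder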
#assuming this is your dataset
                 0               1                2             3
0  (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  (2.0, 3.997]
1  (-1.774, 1.145]  (-3.21, 0.533]   (2.007, 3.993]  (2.0, 3.997]
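import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = df.astype(str)  # turn each element into a string

# get the one-hot representation of the dataframe (a SciPy sparse matrix)
encoder = OneHotEncoder()
data = encoder.fit_transform(df.values)

# get the co-occurrence matrix using a dot product; the sparse matrix's own
# .dot() is used here because np.dot() is not aware of SciPy sparse matrices
co_occurrence = data.T.dot(data)

# get the vocab (columns and index) for the co-occurrence matrix
# feature names look like "x0_(-1.774, 1.145]"; the "x<i>_" prefix is dropped for readability
# (get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it)
vocab = [name.split("_", 1)[1] for name in encoder.get_feature_names_out()]

# create the co-occurrence dataframe
ddf = pd.DataFrame(co_occurrence.toarray(), columns=vocab, index=vocab)
print(ddf)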
                 (-1.774, 1.145]  (-3.21, 0.533]  (0.0166, 2.007]  \
(-1.774, 1.145]              2.0             2.0              1.0   
(-3.21, 0.533]               2.0             2.0              1.0   
(0.0166, 2.007]              1.0             1.0              1.0   
(2.007, 3.993]               1.0             1.0              0.0   
(2.0, 3.997]                 2.0             2.0              1.0   

                 (2.007, 3.993]  (2.0, 3.997]  
(-1.774, 1.145]             1.0           2.0  
(-3.21, 0.533]              1.0           2.0  
(0.0166, 2.007]             0.0           1.0  
(2.007, 3.993]              1.0           1.0  
(2.0, 3.997]                1.0           2.0  

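As you can verify from the output above, this is exactly what the co-occurrence matrix should be: each diagonal entry is the number of rows containing that category, and each off-diagonal entry is the number of rows in which the two categories appear together.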
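
One way to check this programmatically is to compare a single variable-pair block of the matrix against pandas' own contingency table. The following is only a sketch under a few assumptions: the two example rows are typed in by hand, the column names `var0` to `var3` and the helper `block` are made up for illustration, and `pd.crosstab` is used purely as an independent reference.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Toy data copied from the example above; the column names are hypothetical.
df = pd.DataFrame(
    [
        ["(-1.774, 1.145]", "(-3.21, 0.533]", "(0.0166, 2.007]", "(2.0, 3.997]"],
        ["(-1.774, 1.145]", "(-3.21, 0.533]", "(2.007, 3.993]", "(2.0, 3.997]"],
    ],
    columns=["var0", "var1", "var2", "var3"],
)

encoder = OneHotEncoder()
onehot = encoder.fit_transform(df.values)   # sparse one-hot matrix
co = (onehot.T @ onehot).toarray()          # dense co-occurrence counts

# encoder.categories_[j] holds the categories of column j, in one-hot column order,
# so cumulative sizes give the start/end position of each variable's block.
sizes = [len(cats) for cats in encoder.categories_]
offsets = np.concatenate([[0], np.cumsum(sizes)])

def block(i, j):
    """Sub-block of `co` for variable i (rows) versus variable j (columns)."""
    return co[offsets[i]:offsets[i + 1], offsets[j]:offsets[j + 1]]

# Reference contingency table for var0 vs var2, ordered like the encoder's categories.
expected = pd.crosstab(df["var0"], df["var2"]).reindex(
    index=encoder.categories_[0], columns=encoder.categories_[2]
)

print(np.array_equal(block(0, 2), expected.to_numpy()))
```

The same comparison can be repeated for any other pair of variables by changing the two indices passed to `block` and the two columns passed to `pd.crosstab`.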

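The advantage of this approach is that you can extend it to new data with the transform method of the fitted one-hot encoder object, and that most of the processing happens on sparse matrices until the final step of creating the dataframe, which keeps it memory efficient.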

