Compressing the methods and putting the data into a one-hot encoded matrix

Problem description

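My dataframe contains purchases. A buyer (buyer_id) can purchase several items (item_id). I split the data with splitter() and put it into a DOK matrix with generate_matrix(). I then pass this data to get_train_samples() to obtain my x_train, x_test, y_train and y_test.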

How can I compress this code? And how can I combine generate_matrix() and get_train_samples() so that the result is a "real" one-hot encoded matrix?

Dataframe

d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
     'itemid': [ 3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)

   purchaseid  itemid
0           0       3
1           0       8
2           0       2
3           1      10
4           2       3

Code

import random
import numpy as np
import pandas as pd
import scipy.sparse as sp


PERCENTAGE_SPLIT = 20
NUM_NEGATIVES = 4
def splitter(df):
  df_ = pd.DataFrame()
  sum_purchase = df['purchaseid'].nunique()
  amount = round((sum_purchase / 100) * PERCENTAGE_SPLIT)

  random_list = random.sample(df['purchaseid'].unique().tolist(), amount)
  df_ = df.loc[df['purchaseid'].isin(random_list)]
  df_reduced = df.loc[~df['purchaseid'].isin(random_list)]
  return [df_reduced, df_]

def generate_matrix(df_main, dataframe, name):
  
  mat = sp.dok_matrix((df_main.shape[0], len(df_main['itemid'].unique())), dtype=np.float32)
  for purchaseid, itemid in zip(dataframe['purchaseid'], dataframe['itemid']):
    mat[purchaseid, itemid] = 1.0

  return mat

dfs = splitter(df)
df_tr = dfs[0].copy(deep=True)
df_val = dfs[1].copy(deep=True)

train_mat = generate_matrix(df, df_tr, 'train')
val_mat = generate_matrix(df, df_val, 'val')

def get_train_samples(train_mat, num_negatives):
    user_input, item_input, labels = [], [], []
    num_user, num_item = train_mat.shape
    for (u, i) in train_mat.keys():
        user_input.append(u)
        item_input.append(i)
        labels.append(1)
        # negative instances
        for t in range(num_negatives):
            j = np.random.randint(num_item)
            while (u, j) in train_mat.keys():
                j = np.random.randint(num_item)
            user_input.append(u)
            item_input.append(j)
            labels.append(0)
    return user_input, item_input, labels

num_users, num_items = train_mat.shape

model = get_model(num_users, num_items, ...)

user_input, item_input, labels = get_train_samples(train_mat, NUM_NEGATIVES)
val_user_input, val_item_input, val_labels = get_train_samples(val_mat, NUM_NEGATIVES)

What I need

Tags: python, dataframe, matrix, one-hot-encoding

Solution


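It is quite unclear which kind of one-hot encoded matrix you are looking for. Judging from get_train_samples(), it seems that in the end you don't really need a sparse matrix for model training. Besides, I'm not sure how you would one-hot encode observations that consist of three variables (user_id, item_id, purchased or not).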
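One possible reading of a "real" one-hot matrix is an indicator (multi-hot) matrix with one row per purchase and one column per item. The sketch below shows that interpretation only; the helper name build_indicator_matrix is hypothetical, and it assumes that purchase and item ids are non-negative integers starting at 0, as in the toy data of the question.

```python
import numpy as np
import scipy.sparse as sp

# Same toy data as in the question.
purchase_ids = np.array([0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5,
                         5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9])
item_ids = np.array([3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8,
                     6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11])


def build_indicator_matrix(purchase_ids, item_ids):
    """Hypothetical helper: rows are purchases, columns are items.

    Assumes ids are integers in 0..max, so max + 1 gives the matrix size.
    """
    n_purchases = int(purchase_ids.max()) + 1
    n_items = int(item_ids.max()) + 1
    data = np.ones(len(purchase_ids), dtype=np.float32)
    # Build in COO format without a Python loop, then convert to CSR.
    mat = sp.coo_matrix((data, (purchase_ids, item_ids)),
                        shape=(n_purchases, n_items)).tocsr()
    # The COO -> CSR conversion sums duplicate (purchase, item) pairs,
    # so reset every stored value to 1.0 to keep a pure indicator.
    mat.data[:] = 1.0
    return mat


indicator = build_indicator_matrix(purchase_ids, item_ids)
```

Each row of indicator can then be read as the multi-hot encoding of one purchase. Whether this is the matrix the model expects depends on get_model(), which the question does not show.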

As for combining generate_matrix() with get_train_samples(), that is simple:

def generate_matrix(df_main,df,num_negatives):
    
    n_samples,n_classes = df_main.shape[0],df_main['itemid'].nunique()
    mat = sp.dok_matrix((n_samples,n_classes), dtype=np.float32)

    user_input,item_input,labels = [],[],[]
    for purchaseid,itemid in zip(df['purchaseid'],df['itemid']):
        
        mat[purchaseid,itemid] = 1.0
        # the data with label 0 in OP's original code 
        fake_items = np.random.choice(a=np.setdiff1d(range(n_classes),itemid),size=num_negatives)
        # label the fake labels with -1.0
        mat[np.repeat(purchaseid,num_negatives),fake_items] = -1.0
        
        # the three lists
        user_input.extend([purchaseid]*(num_negatives+1))
        item_input.append(itemid);item_input.extend(fake_items.tolist())
        labels.append(1.0);labels.extend(np.zeros(num_negatives).tolist())
        
    return mat,user_input,item_input,labels

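As you can see, in my generate_matrix() the fake samples (items the user did not purchase) are encoded as -1.0 in the sparse matrix. In addition, instead of the while loop in your code, I used a much more compact way to generate the fake itemid values:

fake_items = np.random.choice(a=np.setdiff1d(range(n_classes),itemid),size=num_negatives)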

With this function, you can run

import random
import numpy as np
import pandas as pd
import scipy.sparse as sp


d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
     'itemid': [ 3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)

PERCENTAGE_SPLIT = 20
NUM_NEGATIVES = 4

def splitter(df):
    
    df_ = pd.DataFrame()
    sum_purchase = df['purchaseid'].nunique()
    amount = round(sum_purchase*(PERCENTAGE_SPLIT/100))
    random_list = random.sample(df['purchaseid'].unique().tolist(),amount)
    df_ = df.loc[df['purchaseid'].isin(random_list)]
    df_reduced = df.loc[~df['purchaseid'].isin(random_list)]
    
    return [df_reduced, df_]

dfs = splitter(df)
df_tr = dfs[0].copy(deep=True)
df_val = dfs[1].copy(deep=True)

train_mat,user_input_t,item_id_t,labels_t = generate_matrix(df, df_tr, NUM_NEGATIVES)
val_mat,user_input_v,item_id_v,labels_v = generate_matrix(df, df_val, NUM_NEGATIVES)

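By running this code you will see that the length of train_mat.keys() may differ from the length of user_input_t. This is because the same item can be chosen more than once in

fake_items = np.random.choice(a=np.setdiff1d(range(n_classes),itemid),size=num_negatives)

To avoid repeats within a single draw, pass replace=False:

fake_items = np.random.choice(a=np.setdiff1d(range(n_classes),itemid),size=num_negatives,replace=False)

Note that the two lengths can still differ when the dataframe contains duplicate rows (for example purchase 7 with item 9 twice) or when two rows of the same purchase draw the same fake item.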
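The np.setdiff1d call in the answer excludes only the itemid of the current row, so an item bought in another row of the same purchase can still be drawn as a fake item. A minimal sketch of a stricter variant follows, assuming that all items of a purchase should be excluded and that the negatives should be distinct. The helper name sample_negatives and the fixed seed are illustrative assumptions, not part of the original code.

```python
import numpy as np


def sample_negatives(purchased_items, n_items, num_negatives, rng):
    """Hypothetical helper: draw distinct items that are not in the purchase."""
    # Candidate pool: every item id except those already purchased.
    candidates = np.setdiff1d(np.arange(n_items), purchased_items)
    # Never ask for more negatives than there are candidates.
    size = min(num_negatives, len(candidates))
    # replace=False guarantees that no item is drawn twice.
    return rng.choice(candidates, size=size, replace=False)


rng = np.random.default_rng(0)  # seed chosen only for reproducibility
# Purchase 0 in the question's data contains items 3, 8 and 2.
negatives = sample_negatives(purchased_items=[3, 8, 2], n_items=14,
                             num_negatives=4, rng=rng)
```

Calling this once per purchase (for example inside a groupby over purchaseid) rather than once per row would also avoid collisions between rows of the same purchase.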

