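python - Condensing methods and putting data into a one-hot encoded matrix
Problem description
My dataframe consists of purchases. One buyer (buyer_id) can buy several items (item_id). I split the data with splitter() and put it into a DOK matrix with generate_matrix(). I then pass this data to get_train_samples() and get my x_train, x_test, y_train and y_test.
How can I condense this code? And how can I combine generate_matrix() and get_train_samples() so that they produce a "real" one-hot encoded matrix?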
Dataframe:
d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
'itemid': [ 3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)
   purchaseid  itemid
0           0       3
1           0       8
2           0       2
3           1      10
4           2       3
Code:
import random
import numpy as np
import pandas as pd
import scipy.sparse as sp

PERCENTAGE_SPLIT = 20
NUM_NEGATIVES = 4

def splitter(df):
    df_ = pd.DataFrame()
    sum_purchase = df['purchaseid'].nunique()
    amount = round((sum_purchase / 100) * PERCENTAGE_SPLIT)
    random_list = random.sample(df['purchaseid'].unique().tolist(), amount)
    df_ = df.loc[df['purchaseid'].isin(random_list)]
    df_reduced = df.loc[~df['purchaseid'].isin(random_list)]
    return [df_reduced, df_]

def generate_matrix(df_main, dataframe, name):
    mat = sp.dok_matrix((df_main.shape[0], len(df_main['itemid'].unique())), dtype=np.float32)
    for purchaseid, itemid in zip(dataframe['purchaseid'], dataframe['itemid']):
        mat[purchaseid, itemid] = 1.0
    return mat

dfs = splitter(df)
df_tr = dfs[0].copy(deep=True)
df_val = dfs[1].copy(deep=True)
train_mat = generate_matrix(df, df_tr, 'train')
val_mat = generate_matrix(df, df_val, 'val')

def get_train_samples(train_mat, num_negatives):
    user_input, item_input, labels = [], [], []
    num_user, num_item = train_mat.shape
    for (u, i) in train_mat.keys():
        user_input.append(u)
        item_input.append(i)
        labels.append(1)
        # negative instances
        for t in range(num_negatives):
            j = np.random.randint(num_item)
            while (u, j) in train_mat.keys():
                j = np.random.randint(num_item)
            user_input.append(u)
            item_input.append(j)
            labels.append(0)
    return user_input, item_input, labels

num_users, num_items = train_mat.shape
model = get_model(num_users, num_items, ...)
user_input, item_input, labels = get_train_samples(train_mat, NUM_NEGATIVES)
val_user_input, val_item_input, val_labels = get_train_samples(val_mat, NUM_NEGATIVES)
What I need:
user_input
item_input
labels
val_user_input
val_item_input
val_labels
num_users
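Solution
It is quite unclear what kind of one-hot encoded matrix you are looking for. Judging from get_train_samples, it seems that in the end you do not really need a sparse matrix for model training. Also, I am not sure how you would one-hot encode an observation that has three variables (user_id, item_id, purchased or not).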
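To illustrate the point that the sparse matrix may not be needed, here is a minimal sketch that builds the three lists straight from the dataframe. The function name sample_training_triples, the seed argument, and the assumption that item ids are 0-based column indices are mine, not part of the question. Unlike the question's loop, it collapses repeated (purchaseid, itemid) rows and draws negatives without repeats per positive, so it assumes every purchase leaves at least num_negatives unbought items.

```python
import numpy as np
import pandas as pd


def sample_training_triples(df, num_items, num_negatives, seed=0):
    """Return (user_input, item_input, labels) without building a sparse matrix.

    Hypothetical helper: assumes item ids are integers in [0, num_items).
    """
    rng = np.random.default_rng(seed)
    all_items = np.arange(num_items)
    user_input, item_input, labels = [], [], []
    # Group once so every purchase knows the full set of items it contains.
    bought_per_purchase = df.groupby("purchaseid")["itemid"].unique()
    for purchaseid, bought in bought_per_purchase.items():
        # Negatives may only come from items this purchase does not contain.
        candidates = np.setdiff1d(all_items, bought)
        for itemid in bought:
            negatives = rng.choice(candidates, size=num_negatives, replace=False)
            user_input.extend([int(purchaseid)] * (num_negatives + 1))
            item_input.append(int(itemid))
            item_input.extend(negatives.tolist())
            labels.extend([1] + [0] * num_negatives)
    return user_input, item_input, labels


d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
     'itemid': [3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)
num_items = int(df["itemid"].max()) + 1  # assumption: ids double as column indices
user_input, item_input, labels = sample_training_triples(df, num_items, num_negatives=4)
```

The same call on the train and validation frames returned by splitter() would give the two sets of lists, with no intermediate DOK matrix.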
As for combining generate_matrix with get_train_samples, that is straightforward:
def generate_matrix(df_main, df, num_negatives):
    # one row per row of df_main, one column per distinct itemid
    n_samples, n_classes = df_main.shape[0], df_main['itemid'].nunique()
    mat = sp.dok_matrix((n_samples, n_classes), dtype=np.float32)
    user_input, item_input, labels = [], [], []
    for purchaseid, itemid in zip(df['purchaseid'], df['itemid']):
        mat[purchaseid, itemid] = 1.0
        # the data with label 0 in OP's original code
        fake_items = np.random.choice(a=np.setdiff1d(range(n_classes), itemid), size=num_negatives)
        # label the fake items with -1.0
        mat[np.repeat(purchaseid, num_negatives), fake_items] = -1.0
        # the three lists
        user_input.extend([purchaseid] * (num_negatives + 1))
        item_input.append(itemid)
        item_input.extend(fake_items.tolist())
        labels.append(1.0)
        labels.extend(np.zeros(num_negatives).tolist())
    return mat, user_input, item_input, labels
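As you can see, in my generate_matrix the fake samples (items the user did not buy) are encoded as -1.0 in the sparse matrix. Also, compared with the while loop in your code, I generate the fake itemid values in a very compact way: fake_items = np.random.choice(a=np.setdiff1d(range(n_classes),itemid),size=num_negatives).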
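One detail worth checking is what the candidate pool excludes. The one-liner removes only the current itemid, whereas the while loop in the question rejects every (u, j) pair already stored in the matrix. The sketch below contrasts the two pools for purchaseid 0 of the example data. The names pool_single, pool_all and bought are illustrative only, and using np.random.default_rng instead of np.random.choice is my own choice.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes = 14                 # number of distinct itemid values in the example data
num_negatives = 4
bought = np.array([3, 8, 2])   # items of purchaseid 0 in the example data

# Pool used by the one-liner: every item except the current positive (here 3).
pool_single = np.setdiff1d(np.arange(n_classes), 3)

# Pool that mirrors the while loop: every item the purchase does not contain.
pool_all = np.setdiff1d(np.arange(n_classes), bought)

# Negatives drawn from pool_all can never be an item that was actually bought.
fake_items = rng.choice(pool_all, size=num_negatives)
```

With pool_single, items 8 and 2 remain eligible as negatives for purchaseid 0 even though they were bought, so a -1.0 can overwrite or be overwritten by a 1.0 in the matrix.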
With this function, you can run
import random
import numpy as np
import pandas as pd
import scipy.sparse as sp

d = {'purchaseid': [0, 0, 0, 1, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6, 6, 6, 7, 7, 8, 9, 9, 9, 9],
     'itemid': [ 3, 8, 2, 10, 3, 10, 4, 12, 3, 12, 3, 4, 8, 6, 3, 0, 5, 12, 9, 9, 13, 1, 7, 11, 11]}
df = pd.DataFrame(data=d)

PERCENTAGE_SPLIT = 20
NUM_NEGATIVES = 4

def splitter(df):
    df_ = pd.DataFrame()
    sum_purchase = df['purchaseid'].nunique()
    amount = round(sum_purchase * (PERCENTAGE_SPLIT / 100))
    random_list = random.sample(df['purchaseid'].unique().tolist(), amount)
    df_ = df.loc[df['purchaseid'].isin(random_list)]
    df_reduced = df.loc[~df['purchaseid'].isin(random_list)]
    return [df_reduced, df_]

dfs = splitter(df)
df_tr = dfs[0].copy(deep=True)
df_val = dfs[1].copy(deep=True)
train_mat, user_input_t, item_id_t, labels_t = generate_matrix(df, df_tr, NUM_NEGATIVES)
val_mat, user_input_v, item_id_v, labels_v = generate_matrix(df, df_val, NUM_NEGATIVES)
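If you run this code, you will see that the length of train_mat.keys() may differ from the length of user_input_t. This is because the same item can be chosen more than once in fake_items = np.random.choice(a=np.setdiff1d(range(n_classes),itemid),size=num_negatives). To avoid repeats within a single draw, pass replace=False to np.random.choice. Note that negatives drawn for different rows of the same purchaseid can still coincide, so the two lengths are not guaranteed to match exactly.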
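As a rough check of the length mismatch, the sketch below counts repeated (user, item) pairs in the sample lists and shows that a draw with replace=False contains no repeats. The helper count_duplicate_pairs is hypothetical, and the toy lists are made up for illustration.

```python
import numpy as np


def count_duplicate_pairs(user_input, item_input):
    """Number of (user, item) entries that repeat an earlier entry."""
    pairs = list(zip(user_input, item_input))
    return len(pairs) - len(set(pairs))


rng = np.random.default_rng(0)
pool = np.setdiff1d(np.arange(14), 3)   # 13 candidates: every item except 3

# replace=False guarantees four distinct negatives within this one draw.
distinct = rng.choice(pool, size=4, replace=False)
users = [0] * 5
items = [3] + distinct.tolist()
no_dupes = count_duplicate_pairs(users, items)

# A hand-made list where item 5 was sampled twice for user 0.
one_dupe = count_duplicate_pairs([0, 0, 0], [3, 5, 5])

# Asking for more distinct items than the pool holds raises ValueError.
try:
    rng.choice(pool, size=20, replace=False)
    too_many_ok = True
except ValueError:
    too_many_ok = False
```

Running count_duplicate_pairs on user_input_t and item_id_t gives a direct measure of how many list entries collapse onto the same matrix key.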