
Problem description

I'm building a recommendation system that recommends the 20 most suitable songs to a user. I've trained my model and I'm ready to recommend songs for a given playlist! However, one problem I've run into is that I need to embed that new playlist so I can use kmeans to find the closest related playlists in the embedding space.

To recommend songs, I first cluster the learned embeddings of all the training playlists, and then, for a given test playlist, I pick its "neighbor" playlists as all the other playlists in the same cluster. I then take all the tracks from those playlists and feed the test playlist embedding together with these "neighboring" tracks into my model for prediction. This ranks the "neighboring" tracks by how likely they are (under my model) to appear next in the given test playlist.

desired_user_id = 123
model_path = Path(PATH, 'model.h5')
print('using model: %s' % model_path)
model = keras.models.load_model(model_path)
print('Loaded model!')

mlp_user_embedding_weights = (next(iter(filter(lambda x: x.name == 'mlp_user_embedding', model.layers))).get_weights())

# get the latent embedding for your desired user
user_latent_matrix = mlp_user_embedding_weights[0]
one_user_vector = user_latent_matrix[desired_user_id,:]
one_user_vector = np.reshape(one_user_vector, (1,32))

print('\nPerforming kmeans to find the nearest users/playlists...')
# get 100 similar users
kmeans = KMeans(n_clusters=100, random_state=0, verbose=0).fit(user_latent_matrix)
desired_user_label = kmeans.predict(one_user_vector)
user_labels = kmeans.labels_
neighbors = []
for user_id, user_label in enumerate(user_labels):
    if user_label == desired_user_label:
        neighbors.append(user_id)
print('Found {0} neighbor users/playlists.'.format(len(neighbors)))

tracks = []
for user_id in neighbors:
    tracks += list(df[df['pid'] == int(user_id)]['trackindex'])
print('Found {0} neighbor tracks from these users.'.format(len(tracks))) 

users = np.full(len(tracks), desired_user_id, dtype='int32')
items = np.array(tracks, dtype='int32')

# and predict tracks for my user
results = model.predict([users, items], batch_size=100, verbose=0)
results = results.tolist()
print('Ranked the tracks!')

results_df = pd.DataFrame(np.nan, index=range(len(results)), columns=['probability','track_name', 'track artist'])
print(results_df.shape)

# loop through and get the probability (of being in the playlist according to my model), the track, and the track's artist 
for i, prob in enumerate(results):
    results_df.loc[i] = [prob[0], df[df['trackindex'] == i].iloc[0]['track_name'], df[df['trackindex'] == i].iloc[0]['artist_name']]
results_df = results_df.sort_values(by=['probability'], ascending=False)

results_df.head(20)

Instead of the code above, I would like to use this https://www.tensorflow.org/recommenders/examples/basic_retrieval#building_a_candidate_ann_index or the official GitHub repository from Spotify, https://github.com/spotify/annoy. Unfortunately, I don't know how to use it so that the new program gives me the top 20 tracks for the user. How do I change this?


Edit

What I have tried:

from annoy import AnnoyIndex
import random
desired_user_id = 123
model_path = Path(PATH, 'model.h5')
print('using model: %s' % model_path)
model = keras.models.load_model(model_path)
print('Loaded model!')
    
mlp_user_embedding_weights = (next(iter(filter(lambda x: x.name == 'mlp_user_embedding', model.layers))).get_weights())
    
# get the latent embedding for your desired user
user_latent_matrix = mlp_user_embedding_weights[0]
one_user_vector = user_latent_matrix[desired_user_id,:]
one_user_vector = np.reshape(one_user_vector, (1,32))

t = AnnoyIndex(desired_user_id , one_user_vector)  #Length of item vector that will be indexed
for i in range(1000):
    v = [random.gauss(0, 1) for z in range(f)]
    t.add_item(i, v)

t.build(10) # 10 trees
t.save('test.ann')

u = AnnoyIndex(desired_user_id , one_user_vector)
u.load('test.ann') # super fast, will just mmap the file
print(u.get_nns_by_item(0, 1000)) # will find the 1000 nearest neighbors
# Now how do I get the probability and the values?

Tags: python, algorithm, tensorflow, k-means

Solution


You're almost there!

In the code starting with desired_user_id = 123, you have 4 main steps:
1 (lines 1-12): retrieve the user embedding matrix (user_latent_matrix) from the saved model
2 (lines 14-23): use kmeans to find desired_user_label and list the other users in that cluster (neighbors). Users in the same cluster should listen to the songs you like.
3 (lines 25-31): list the songs those neighboring users like (tracks). The music you like should be similar to what the others in the cluster have listened to. Steps 2 and 3 just filter out 99% of all the music, so you only have to run the model on the remaining 1%, which saves time and money. Dropping steps 2 and 3 and adding every song to tracks would still work (it would just take about 100x longer); see the sketch after this list.
4 (lines 33+): use the saved model to predict how well the songs the neighboring users like fit you (results_df)
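
As an aside, the brute-force fallback mentioned in step 3 is essentially a one-liner. This is only a sketch, assuming df is the same dataframe used in your code with a 'trackindex' column:

# score every track instead of only the neighbors' tracks (skips steps 2 and 3, roughly 100x slower)
tracks = list(df['trackindex'].unique())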

Annoy is a replacement for finding similar users (step 2). Instead of using kmeans to find the user's cluster and then looking up the other users in that cluster, it uses a k-nearest-neighbors-style algorithm to find nearby users directly.

After you have found one_user_vector on line 12, replace step 2 (lines 14-23) with something like

from annoy import AnnoyIndex

user_embedding_length = 32  # each user embedding is 32-dimensional (matches the (1, 32) reshape above)
t = AnnoyIndex(user_embedding_length, 'angular')

# add the user embeddings to annoy (your annoy userids will be the row indexes)
for user_id, user_embedding in enumerate(user_latent_matrix):
    t.add_item(user_id, user_embedding)

# build the forest
t.build(10) # 10 trees

# save the forest for later if you're using this again and don't want to rebuild the trees every time
t.save('test.ann')

# find the 100 nearest neighbor users
neighbors = t.get_nns_by_item(desired_user_id, 100) 
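
A side note on the query call: get_nns_by_item looks up a point that is already in the index by its id, which works here because the desired user's embedding was added in the loop above. If you ever need to query with an embedding that is not in the index (say, a freshly embedded playlist), annoy also provides get_nns_by_vector, which takes the raw vector instead of an id. A minimal sketch, reusing one_user_vector from your code:

# query with the raw 32-dimensional vector instead of an indexed id
neighbors = t.get_nns_by_vector(one_user_vector[0], 100)

# pass include_distances=True if you also want the angular distances back
neighbors, distances = t.get_nns_by_vector(one_user_vector[0], 100, include_distances=True)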

If you want to run this again later and don't want to rebuild the trees every time (and you have already built and saved them once), replace step 2 with

from annoy import AnnoyIndex

user_embedding_length = 32  # each user embedding is 32-dimensional (matches the (1, 32) reshape above)
t = AnnoyIndex(user_embedding_length, 'angular')

# load the trees
t.load('test.ann')

# find the 100 nearest neighbor users
neighbors = t.get_nns_by_item(desired_user_id, 100) 

After replacing step 2, just run steps 3 and 4 (lines 25+) as normal.
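
To make the hand-off explicit, here is what steps 3 and 4 look like with the neighbors list returned by annoy. This is just your original code re-run unchanged, assuming df, model and desired_user_id are still in scope:

import numpy as np

# step 3: gather all tracks listened to by the neighboring users/playlists
tracks = []
for user_id in neighbors:
    tracks += list(df[df['pid'] == int(user_id)]['trackindex'])

# step 4: score each candidate track for the desired user with the saved model
users = np.full(len(tracks), desired_user_id, dtype='int32')
items = np.array(tracks, dtype='int32')
results = model.predict([users, items], batch_size=100, verbose=0)

# then build results_df, sort by probability, and take the top 20 exactly as before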

