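python - Fine-tuning semantic search
Problem description
For example, here are the sentence cosine-similarity results from a pretrained BERT model: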
======================
Query: milk with chocolate flavor
Top 10 most similar sentences in corpus:
Milka milk chocolate 100 g (Score: 0.8672)
Alpro, Chocolate soy drink 1 ltr (Score: 0.6821)
Danone, HiPRO 25g Protein chocolate flavor 330 ml (Score: 0.6692)
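In the example above I am searching for milk, so milk-related results should come first, but here the top result is chocolate. How can I fine-tune the similarity of the results?
I searched on Google but did not find a suitable solution. Please help.
Code: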
import scipy.spatial.distance  # import the submodule explicitly so cdist is available
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-multilingual-cased')

corpus = [
    "Alpro, Chocolate soy drink 1 ltr",
    "Milka milk chocolate 100 g",
    "Danone, HiPRO 25g Protein chocolate flavor 330 ml"
]
corpus_embeddings = model.encode(corpus)

queries = [
    'milk with chocolate flavor',
]
query_embeddings = model.encode(queries)

# Calculate the cosine similarity of each query against every sentence in the corpus
closest_n = 10
for query, query_embedding in zip(queries, query_embeddings):
    # cdist with "cosine" returns the cosine distance, i.e. 1 - cosine similarity
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    # Pair each corpus index with its distance and sort by smallest distance first
    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n======================\n")
    print("Query:", query)
    print("\nTop 10 most similar sentences in corpus:")

    for idx, distance in results[0:closest_n]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (1-distance))
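The script prints 1-distance because the "cosine" metric in cdist is a distance, not a similarity. The following is a minimal sketch of that relationship using toy 2-D vectors rather than real model embeddings; the helper names cosine_similarity and cosine_distance are hypothetical and exist only for this illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance, defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

# Toy vectors (assumed values, not embeddings from the model)
query_vec = [1.0, 0.0]
same_direction = [2.0, 0.0]   # parallel to the query
orthogonal = [0.0, 3.0]       # at a right angle to the query

print(cosine_similarity(query_vec, same_direction), cosine_distance(query_vec, same_direction))
print(cosine_similarity(query_vec, orthogonal), cosine_distance(query_vec, orthogonal))
```

A smaller distance therefore means a higher score, which is why sorting by ascending distance lists the most similar sentence first.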
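Solution
Try a similarity threshold: print a result only when its cosine similarity (1 - distance) is above a cutoff, 0.7 in this example.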
import scipy.spatial.distance  # import the submodule explicitly so cdist is available
from sentence_transformers import SentenceTransformer

model = SentenceTransformer('distilbert-base-multilingual-cased')

corpus = [
    "Alpro, Chocolate soy drink 1 ltr",
    "Milka milk chocolate 100 g",
    "Danone, HiPRO 25g Protein chocolate flavor 330 ml"
]
corpus_embeddings = model.encode(corpus)

queries = [
    'milk with chocolate flavor',
]
query_embeddings = model.encode(queries)

# Calculate the cosine similarity of each query against every sentence in the corpus
closest_n = 10
for query, query_embedding in zip(queries, query_embeddings):
    # cdist with "cosine" returns the cosine distance, i.e. 1 - cosine similarity
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    # Pair each corpus index with its distance and sort by smallest distance first
    results = zip(range(len(distances)), distances)
    results = sorted(results, key=lambda x: x[1])

    print("\n======================\n")
    print("Query:", query)
    print("\nTop 10 most similar sentences in corpus:")

    for idx, distance in results[0:closest_n]:
        # Keep only results whose similarity exceeds the 0.7 threshold
        if 1-distance > 0.7:
            print(corpus[idx].strip(), "(Score: %.4f)" % (1-distance))
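The thresholding step can also be separated from the model so it is easy to inspect on its own. The sketch below is only an illustration: filter_by_threshold is a hypothetical helper, and the scores are copied from the output shown in the question rather than recomputed.

```python
def filter_by_threshold(scored, threshold=0.7, top_n=10):
    """Return up to top_n (text, similarity) pairs whose similarity exceeds threshold.

    `scored` is a list of (text, similarity) pairs; this helper is hypothetical
    and not part of sentence_transformers.
    """
    # Highest similarity first
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    return [(text, score) for text, score in ranked[:top_n] if score > threshold]

# Scores taken from the output in the question
scored = [
    ("Alpro, Chocolate soy drink 1 ltr", 0.6821),
    ("Milka milk chocolate 100 g", 0.8672),
    ("Danone, HiPRO 25g Protein chocolate flavor 330 ml", 0.6692),
]

kept = filter_by_threshold(scored)
for text, score in kept:
    print(text, "(Score: %.4f)" % score)
```

With these scores only the Milka entry passes the 0.7 cutoff. Note that a threshold removes low-scoring results but does not change their order, so it will not by itself move milk products ahead of chocolate ones.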