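deep-learning - How to fine-tune a sentence transformer to understand semantic similarity
Problem description
I am using a BERT model for contextual search in Italian, but it does not capture the contextual meaning of a sentence and returns wrong results.
In the sample code below, when I compare "chocolate-flavored milk" against two other kinds of milk and one chocolate bar, it returns the highest similarity for the chocolate. It should return the highest similarity for the other kinds of milk.
Can anyone suggest how to fine-tune the sentence transformer so that it understands the semantics of the text and returns similarity scores accordingly?
Code: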
!python -m spacy download it_core_news_lg
!pip install sentence-transformers
import scipy.spatial
from sentence_transformers import SentenceTransformer

# Works with Arabic, Chinese, Dutch, English, French, German, Italian, Korean,
# Polish, Portuguese, Russian, Spanish, Turkish
model = SentenceTransformer('distiluse-base-multilingual-cased')

corpus = [
    "Alpro, Cioccolato bevanda a base di soia 1 ltr",  # Alpro, chocolate soy drink 1 ltr (soy milk)
    "Milka cioccolato al latte 100 g",  # Milka milk chocolate 100 g
    "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml",  # Danone, HiPRO 25g protein, chocolate flavor, 330 ml (chocolate-flavored milk)
]
corpus_embeddings = model.encode(corpus)

queries = [
    'latte al cioccolato',  # chocolate-flavored milk
]
query_embeddings = model.encode(queries)

# Compute the cosine distance from each query to every corpus sentence,
# then print the closest matches as similarity scores (1 - distance).
closest_n = 10
for query, query_embedding in zip(queries, query_embeddings):
    distances = scipy.spatial.distance.cdist([query_embedding], corpus_embeddings, "cosine")[0]

    # Pair each corpus index with its distance and sort by ascending distance.
    results = sorted(enumerate(distances), key=lambda x: x[1])

    print("\n======================\n")
    print("Query:", query)
    print("\nTop 10 most similar sentences in corpus:")

    for idx, distance in results[0:closest_n]:
        print(corpus[idx].strip(), "(Score: %.4f)" % (1 - distance))
Output:
======================
Query: latte al cioccolato
Top 10 most similar sentences in corpus:
Milka cioccolato al latte 100 g (Score: 0.7714)
Alpro, Cioccolato bevanda a base di soia 1 ltr (Score: 0.5586)
Danone, HiPRO 25g Proteine gusto cioccolato 330 ml (Score: 0.4569)
Solution
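One possible direction, offered as a minimal sketch rather than a definitive answer: fine-tune the same base model on scored (query, product title) pairs so that titles for milk drinks are pulled toward the query and the chocolate bar is pushed away. The sketch uses the classic `model.fit` training API of sentence-transformers with `CosineSimilarityLoss`. The similarity labels, the `fine_tune` helper, the `output_path` value and the hyperparameters are all hypothetical choices made for illustration; three pairs are far too few for real training, and a useful run would need many labeled pairs from the actual product catalog.

```python
# Hypothetical training pairs: (query, product title, target similarity in [0, 1]).
# The scores are illustrative assumptions, not data from the question.
TRAIN_PAIRS = [
    ("latte al cioccolato", "Danone, HiPRO 25g Proteine gusto cioccolato 330 ml", 0.9),
    ("latte al cioccolato", "Alpro, Cioccolato bevanda a base di soia 1 ltr", 0.8),
    ("latte al cioccolato", "Milka cioccolato al latte 100 g", 0.2),
]


def fine_tune(pairs, base_model="distiluse-base-multilingual-cased",
              output_path="fine-tuned-it-products", epochs=1):
    """Fine-tune a SentenceTransformer on scored sentence pairs (sketch)."""
    # Imported lazily so the pair list can be inspected without the heavy dependencies.
    from torch.utils.data import DataLoader
    from sentence_transformers import SentenceTransformer, InputExample, losses

    model = SentenceTransformer(base_model)

    # Each example holds two texts and the cosine similarity they should reach.
    examples = [InputExample(texts=[query, title], label=float(score))
                for query, title, score in pairs]
    loader = DataLoader(examples, shuffle=True, batch_size=16)

    # CosineSimilarityLoss regresses the cosine of the two embeddings onto the label.
    loss = losses.CosineSimilarityLoss(model)

    # warmup_steps and epochs are placeholder values, not tuned settings.
    model.fit(train_objectives=[(loader, loss)], epochs=epochs, warmup_steps=10)
    model.save(output_path)
    return model
```

Calling `fine_tune(TRAIN_PAIRS)` would download the base model, train on the pairs, and save the result to the assumed `output_path`; the saved directory can then be passed to `SentenceTransformer(...)` in the search code from the question to compare scores before and after training.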