python-3.x - How to extract sentence embeddings from the [CLS] token using a BERT model
Problem Description
I am following this link:
I want to extract sentence embeddings using the [CLS] token of a BERT model.
Here is the code:
import torch
from keras.preprocessing.sequence import pad_sequences
import tensorflow as tf
def text_to_embedding(tokenizer, model, in_text):
    '''
    Uses the provided BERT 'model' and 'tokenizer' to generate a vector
    representation of the input string, 'in_text'.

    Returns the vector stored as a numpy ndarray.
    '''

    # ===========================
    #   STEP 1: Tokenization
    # ===========================

    MAX_LEN = 510

    # 'encode' will:
    #  (1) Tokenize the sentence
    #  (2) Prepend the '[CLS]' token to the start.
    #  (3) Append the '[SEP]' token to the end.
    #  (4) Map tokens to their IDs.
    input_ids = tokenizer.encode(
        in_text,                  # sentence to encode.
        add_special_tokens=True,  # Add '[CLS]' and '[SEP]'
        max_length=MAX_LEN,       # Truncate all sentences.
        #return_tensors='pt'      # Return pytorch tensors.
    )
    print(input_ids)
    print(tokenizer.decode(input_ids))

    # Pad our input tokens. Truncation was handled above by the 'encode'
    # function, which also makes sure that the '[SEP]' token is placed at the
    # end *after* truncating.
    # Note: 'pad_sequences' expects a list of lists, but we only have one
    # piece of text, so we surround 'input_ids' with an extra set of brackets.
    results = tokenizer(in_text, max_length=MAX_LEN, truncation=True)
    input_ids = results.input_ids
    attn_mask = results.attention_mask
    print(results)

    # Cast to tensors.
    input_ids = torch.tensor(input_ids)
    attn_mask = torch.tensor(attn_mask)

    # Add an extra dimension for the "batch" (even though there is only one
    # input in this batch)
    input_ids = input_ids.unsqueeze(0)
    attn_mask = attn_mask.unsqueeze(0)

    # ===========================
    #   STEP 2: Extract Embeddings
    # ===========================

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    #model.eval()

    # Copy the inputs to the GPU
    #input_ids = input_ids.to(device)
    #attn_mask = attn_mask.to(device)

    # Telling the model not to build the backward graph will make this
    # a little quicker.
    with torch.no_grad():

        # Forward pass, returns hidden states and predictions.
        # This will return the logits rather than the loss because we have
        # not provided labels.
        outputs = model(input_ids=input_ids,
                        token_type_ids=None,
                        attention_mask=attn_mask)

        hidden_states = outputs[2]

    # Sentence Vectors
    # To get a single vector for our entire sentence we have multiple
    # application-dependent strategies, but a simple approach is to
    # average the second to last hidden layer of each token, producing
    # a single 768-length vector.
    # `hidden_states` has shape [13 x 1 x ? x 768]
    # `token_vecs` is a tensor with shape [? x 768]
    token_vecs = hidden_states[-2][0]

    # Calculate the average of all ? token vectors.
    sentence_embedding = torch.mean(token_vecs, dim=0)

    # Move to the CPU and convert to numpy ndarray.
    sentence_embedding = sentence_embedding.detach().cpu().numpy()

    return sentence_embedding
from transformers import BertTokenizer, BertModel
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',output_hidden_states = True), # Whether the model returns all hidden-states.
#model.cuda()
from transformers import BertTokenizer
# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
k=text_to_embedding(tokenizer, model, "I like to play cricket")
Output:
<ipython-input-14-f03410b60544> in text_to_embedding(tokenizer, model, in_text)
77 # This will return the logits rather than the loss because we have
78 # not provided labels.
---> 79 outputs = model(input_ids = input_ids,token_type_ids = None,attention_mask = attn_mask)
80
81
TypeError: 'tuple' object is not callable
I get an error on the line outputs = model(input_ids = input_ids,token_type_ids = None,attention_mask = attn_mask)
I want to modify the code to embed the input sentence using the [CLS] token, instead of using the average of a hidden layer.
Solution
There are 3 ways to solve your problem -
- There is a very cool tool called bert-as-service. It maps sentences to fixed-length embeddings depending on the model you choose to use. The documentation is well written. Install it:
pip install bert-serving-server # server
pip install bert-serving-client # client, independent of bert-serving-server
Download one of the pre-trained models available in the official BERT repo - link
Start the server:
bert-serving-start -model_dir /model_directory/ -num_worker=4
Generate embeddings:
from bert_serving.client import BertClient
bc = BertClient()
vectors = bc.encode(your_list_of_sentences)
- There is an academic paper called Sentence-BERT, along with its GitHub repo.
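If you go the Sentence-BERT route, a minimal sketch with the sentence-transformers package could look like the following (the checkpoint name 'all-MiniLM-L6-v2' is just one of the library's pretrained models, picked here for illustration):
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer

# Load a pretrained Sentence-BERT checkpoint (any other checkpoint works the same way).
model = SentenceTransformer('all-MiniLM-L6-v2')

# encode() returns one fixed-length numpy vector per input sentence.
embeddings = model.encode(["I like to play cricket"])
print(embeddings.shape)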
- You are doing a lot of manual work - padding the attn mask, etc. The tokenizer does all of this for you automatically; check the documentation. (Incidentally, the TypeError: 'tuple' object is not callable in your traceback comes from the trailing comma after BertModel.from_pretrained(...), which turns model into a one-element tuple; drop the comma and the call works.) Also, if you look at the implementation of the model's forward() call, it returns -
return (sequence_output, pooled_output) + encoder_outputs[1:]
For bert-base (768 hidden units), the sequence output is the embedding of all tokens in the sequence, so if your input size [max_len] is 510, then each token embedding is a vector in a 768-dimensional space, making the sequence output of size 768 * 510 * 1.
The pooled output is the output that squeezes all of those embeddings into a 768 * 1 dimensional space.
So I think you will want to use the pooled output for a simple embedding.
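For completeness, here is a minimal sketch of that approach with the transformers API (note the model is loaded without the stray trailing comma that caused your TypeError; indexing outputs[0]/outputs[1] assumes a standard BertModel forward):
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)
model = BertModel.from_pretrained('bert-base-uncased')  # no trailing comma here
model.eval()

# One tokenizer call handles special tokens, truncation and the attention mask.
inputs = tokenizer("I like to play cricket",
                   max_length=510,
                   truncation=True,
                   return_tensors='pt')

with torch.no_grad():
    outputs = model(**inputs)

# Pooled output: the [CLS] hidden state passed through a trained
# linear layer + tanh. Shape: [1, 768].
pooled_output = outputs[1]

# Alternatively, the raw [CLS] token embedding is the first position
# of the sequence output. Shape: [1, 768].
cls_embedding = outputs[0][:, 0, :]

sentence_embedding = pooled_output.squeeze(0).numpy()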