How to extract sentence embeddings from the [CLS] token with a BERT model

Problem description

I am following this link:

BERT document embeddings

I want to extract sentence embeddings with a BERT model using the [CLS] token. Here is the code:

import torch

def text_to_embedding(tokenizer, model, in_text):
    '''
    Uses the provided BERT 'model' and 'tokenizer' to generate a vector
    representation of the input string, 'in_text'.

    Returns the vector stored as a numpy ndarray.
    '''

    # ===========================
    #   STEP 1: Tokenization
    # ===========================

    MAX_LEN = 510

    # 'encode' will:
    #  (1) Tokenize the sentence
    #  (2) Prepend the '[CLS]' token to the start.
    #  (3) Append the '[SEP]' token to the end.
    #  (4) Map tokens to their IDs.
    input_ids = tokenizer.encode(
        in_text,                         # sentence to encode.
        add_special_tokens = True,       # Add '[CLS]' and '[SEP]'
        max_length = MAX_LEN,            # Truncate all sentences.
        #return_tensors = 'pt'           # Return pytorch tensors.
    )

    print(input_ids)
    print(tokenizer.decode(input_ids))

    # Tokenize again with the tokenizer's __call__ API, which handles
    # truncation for us and also returns the attention mask. It makes sure
    # that the '[SEP]' token is placed at the end *after* truncating.

    results = tokenizer(in_text, max_length=MAX_LEN, truncation=True)
    input_ids = results.input_ids
    attn_mask = results.attention_mask
    
    print(results)

    # Cast to tensors.
    input_ids = torch.tensor(input_ids)
    attn_mask = torch.tensor(attn_mask)

    # Add an extra dimension for the "batch" (even though there is only one
    # input in this batch)
    input_ids = input_ids.unsqueeze(0)
    attn_mask = attn_mask.unsqueeze(0)


    # ===========================
    #   STEP 2: Inference
    # ===========================

    # Put the model in evaluation mode--the dropout layers behave differently
    # during evaluation.
    
    #model.eval()

    # Copy the inputs to the GPU
    #input_ids = input_ids.to(device)
    #attn_mask = attn_mask.to(device)

    # telling the model not to build the backward graph will make this
    # a little quicker.
    with torch.no_grad():

        # Forward pass, returns hidden states and predictions
        # This will return the logits rather than the loss because we have
        # not provided labels.
        outputs = model(input_ids = input_ids,token_type_ids = None,attention_mask = attn_mask)
        

        hidden_states = outputs[2]

        #Sentence Vectors
        #To get a single vector for our entire sentence we have multiple 
        #application-dependent strategies, but a simple approach is to 
        #average the second to last hidden layer of each token producing 
        #a single 768 length vector.
        # `hidden_states` has shape [13 x 1 x ? x 768]

        # `token_vecs` is a tensor with shape [? x 768]
        token_vecs = hidden_states[-2][0]

        # Calculate the average of all ? token vectors.
        sentence_embedding = torch.mean(token_vecs, dim=0)
        # Move to the CPU and convert to numpy ndarray.
        sentence_embedding = sentence_embedding.detach().cpu().numpy()

        return sentence_embedding


from transformers import BertTokenizer, BertModel
# Load pre-trained model (weights)
model = BertModel.from_pretrained('bert-base-uncased',output_hidden_states = True), # Whether the model returns all hidden-states.
#model.cuda()

from transformers import BertTokenizer

# Load the BERT tokenizer.
print('Loading BERT tokenizer...')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased', do_lower_case=True)

k=text_to_embedding(tokenizer, model, "I like to play cricket")

Output:

<ipython-input-14-f03410b60544> in text_to_embedding(tokenizer, model, in_text)
     77         # This will return the logits rather than the loss because we have
     78         # not provided labels.
---> 79         outputs = model(input_ids = input_ids,token_type_ids = None,attention_mask = attn_mask)
     80 
     81 

TypeError: 'tuple' object is not callable

I get an error on this line: outputs = model(input_ids = input_ids, token_type_ids = None, attention_mask = attn_mask)

I would like to modify the code to embed the input sentence using the [CLS] token, instead of averaging over a hidden layer.

Tags: python-3.x, embedding, bert-language-model, transformer

Solution


There are three ways to solve your problem:

  1. There is a very cool tool called bert-as-service. It maps sentences to fixed-length embeddings based on the model you choose to use. The documentation is well written.

Install:
pip install bert-serving-server  # server
pip install bert-serving-client  # client, independent of bert-serving-server

Download one of the pre-trained models available in the official BERT repo (link).


Start the server:

bert-serving-start -model_dir /model_directory/ -num_worker=4 

Generate embeddings:

from bert_serving.client import BertClient
bc = BertClient()
vectors=bc.encode(your_list_of_sentences)
  2. There is an academic paper called Sentence-BERT, along with its GitHub repo; a minimal usage sketch follows.
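
If you want to try Sentence-BERT, here is a minimal sketch using the sentence-transformers package (pip install sentence-transformers). The checkpoint name 'all-MiniLM-L6-v2' is an assumed example, not something from the original answer; any Sentence-BERT model from the hub works the same way.

from sentence_transformers import SentenceTransformer

# 'all-MiniLM-L6-v2' is an assumed example checkpoint.
model = SentenceTransformer('all-MiniLM-L6-v2')

sentences = ["I like to play cricket", "BERT produces contextual embeddings"]
embeddings = model.encode(sentences)  # numpy array, shape [len(sentences), dim]

print(embeddings.shape)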

  3. You are doing a lot of manual work - padding, the attention mask, and so on. The tokenizer does all of this for you automatically; check the documentation. Also, if you look at the implementation of the model's forward() call, it returns -

 return (sequence_output, pooled_output) + encoder_outputs[1:]

For BERT base (hidden size 768), the sequence output contains the embeddings of all tokens in the sequence, so if your input size (max_len) is 510, each token embedding is a vector in 768-dimensional space, giving a sequence output of size 510 × 768 for a single input.

The pooled output condenses all of those embeddings into a single 768-dimensional vector: it is the final hidden state of the [CLS] token passed through one more dense layer with a tanh activation.

So I think you will want to use the pooled output for a simple sentence embedding.
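
For completeness, here is a minimal sketch of that approach with the Hugging Face transformers API. The example sentence and the max_length of 510 come from the question; everything else is standard usage. It also sidesteps the original TypeError: in the question's code there is a trailing comma after BertModel.from_pretrained(...), which turns model into a one-element tuple, and calling a tuple is exactly what raises "TypeError: 'tuple' object is not callable".

import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# Note: no trailing comma here. In the question, the comma after
# from_pretrained(...) made 'model' a one-element tuple, hence the
# "TypeError: 'tuple' object is not callable".
model = BertModel.from_pretrained('bert-base-uncased')
model.eval()  # dropout layers behave differently during evaluation

# A single tokenizer call adds [CLS]/[SEP], truncates, builds the
# attention mask, and returns PyTorch tensors.
inputs = tokenizer("I like to play cricket", return_tensors='pt',
                   max_length=510, truncation=True)

with torch.no_grad():
    outputs = model(**inputs)

sequence_output = outputs[0]  # shape [1, seq_len, 768]: every token
pooled_output = outputs[1]    # shape [1, 768]: [CLS] through dense + tanh

# The raw [CLS] hidden state, if you prefer it without the pooler layer:
cls_embedding = sequence_output[:, 0, :]  # shape [1, 768]

print(cls_embedding.shape, pooled_output.shape)

Indexing with outputs[0] / outputs[1] works both on older transformers versions, which return a plain tuple, and on newer ones, whose model-output objects also support integer indexing.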

