How to add a pooling layer to BERT QA for large texts

Problem description

I am trying to implement a question answering system that handles large input texts: the idea is to split the large input text into subsequences of 510 tokens, generate a representation of each subsequence independently, and then use a pooling layer to produce the final representation of the input sequence.

I am using the CamemBERT model for French.

I tried the following code:

import torch
import torch.nn as nn
from transformers import CamembertForQuestionAnswering, CamembertTokenizer


class CamemBERTQA(nn.Module):

    # the initialization of the model
    def __init__(self, do_lower_case: bool = True):
        super(CamemBERTQA, self).__init__()
        self.config_keys = ['do_lower_case']
        self.do_lower_case = do_lower_case
        self.camembert = CamembertForQuestionAnswering.from_pretrained('fmikaelian/camembert-base-fquad')
        self.tokenizer = CamembertTokenizer.from_pretrained('fmikaelian/camembert-base-fquad', do_lower_case=do_lower_case)
        self.cls_token_id = self.tokenizer.convert_tokens_to_ids([self.tokenizer.cls_token])[0]
        self.sep_token_id = self.tokenizer.convert_tokens_to_ids([self.tokenizer.sep_token])[0]
        self.pool = nn.MaxPool2d(2, 2)

    # Split a long input text into overlapping subsequences of at most
    # max_length words (511 max)
    def split_text(self, text, max_length, overlap):
        subsequences = []
        words = text.split()
        for i in range(0, len(words) - overlap, max_length - overlap):
            subsequences.append(" ".join(words[i:i + max_length]))
        return subsequences

    # Generate a representation for each subsequence in the list
    def text_representation(self, subsequences):
        result = []
        for subsequence in subsequences:
            input_ids = torch.tensor([self.tokenizer.encode(subsequence, add_special_tokens=True)])
            with torch.no_grad():
                # Use the underlying encoder (not the QA head); model outputs
                # are tuples, and [0] is the last hidden state
                last_hidden_states = self.camembert.roberta(input_ids)[0]
                result.append(last_hidden_states)
        return result

    def forward(self, text, input_ids):
        # Split the input text into subsequences of 511 words, overlapping by 10
        subsequences = self.split_text(text, 511, 10)

        # Generate the representation of each subsequence
        representations = self.text_representation(subsequences)

        # Pooling layer
        # pool = self.pool(...)

        ######  The problem is here: how can I add a pooling layer?  ######

        # input_ids = ...  # the final output of the pooling layer;
        #                  # the result should contain 510 elements/tokens

        # generate the start and end logits of the answer
        start_scores, end_scores = self.camembert(torch.tensor([input_ids]))
        start_index = torch.argmax(start_scores)      # token position where the answer starts
        end_index = torch.argmax(end_scores) + 1      # token position just after the answer ends
        outputs = (start_index, end_index)

        return outputs
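
To illustrate what the splitting step produces, here is a small standalone copy of split_text's sliding-window logic, with toy window sizes instead of the real 511/10 so it runs without downloading the model:

    # Standalone sketch of the sliding-window split, with toy sizes
    def split_text(text, max_length, overlap):
        words = text.split()
        return [" ".join(words[i:i + max_length])
                for i in range(0, len(words) - overlap, max_length - overlap)]

    # Twelve words, windows of 5 with an overlap of 2: consecutive windows
    # share their last/first two words
    print(split_text(" ".join(str(i) for i in range(12)), 5, 2))
    # ['0 1 2 3 4', '3 4 5 6 7', '6 7 8 9 10', '9 10 11']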

As I am a beginner with PyTorch, I am not sure this is how the code should look.

If you have any suggestions, or if you need more information, please let me know.

Tags: python, pytorch, bert-language-model, question-answering, max-pooling

Solution


I am quite new to all of this myself, but maybe this can help you:

    def max_pooling(input_tensor, max_sequence_length):
        mxp = nn.MaxPool2d((max_sequence_length, 1), stride=1)
        return mxp(input_tensor)
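
For example, applied to one of the subsequence representations produced by text_representation above, this helper collapses the token dimension into a single vector. This is only a sketch: the shapes (512 tokens, hidden size 768) are assumptions for illustration, not values taken from your code.

    import torch
    import torch.nn as nn

    def max_pooling(input_tensor, max_sequence_length):
        mxp = nn.MaxPool2d((max_sequence_length, 1), stride=1)
        return mxp(input_tensor)

    # A stand-in for one subsequence representation: (batch=1, tokens=512, hidden=768)
    last_hidden_states = torch.randn(1, 512, 768)

    # nn.MaxPool2d expects a (batch, channels, height, width) input, so add a
    # channel dimension; the (512, 1) kernel then spans the whole token axis
    x = last_hidden_states.unsqueeze(1)               # -> (1, 1, 512, 768)
    pooled = max_pooling(x, max_sequence_length=512)  # -> (1, 1, 1, 768)
    vector = pooled.squeeze()                         # -> (768,), one vector per subsequence

To merge the N subsequence representations into a single one, the same idea applies across the stacked subsequence dimension, e.g. torch.max(torch.cat(result, dim=0), dim=0).values, once all subsequences are padded to the same length.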
