tensorflow_hub: getting BERT embeddings on a Windows machine

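Problem description

I want to get BERT embeddings using tensorflow hub. Getting ELMo embeddings turned out to be very easy; my steps are below. Can anyone explain how to get BERT embeddings on a Windows machine? I found this, but could not get it to run on a Windows machine.

  1. Go to https://tfhub.dev/google/elmo/3 and download the module.

  2. Unzip twice, until you see "tfhub_module.pb", then supply the path of that folder to get the embeddings: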

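        import tensorflow as tf
        import tensorflow_hub as hub

        # Example input (placeholder sentences; any list of strings works)
        x = ["the cat is on the mat", "dogs are in the fog"]

        elmo = hub.Module("C:/Users/nnnn/Desktop/BERT/elmo/3.tar/3", trainable=True)

        with tf.Session() as sess:
            sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
            abc1 = sess.run(elmo(x, signature="default", as_dict=True)["default"])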
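
The "unzip twice" step can also be scripted. The following is a minimal sketch, assuming the download is a gzip-compressed tar archive; the helper name extract_hub_module is made up for illustration and is not part of tensorflow_hub.

```python
import os
import tarfile


def extract_hub_module(archive_path, dest_dir):
    """Extract a downloaded module archive and return the folder holding tfhub_module.pb.

    Hypothetical helper; only use it on archives you trust.
    """
    os.makedirs(dest_dir, exist_ok=True)
    # "r:*" lets tarfile detect the compression itself, so a single call
    # handles both the .gz layer and the .tar layer ("unzip twice").
    with tarfile.open(archive_path, "r:*") as tar:
        tar.extractall(dest_dir)
    # Walk the extracted tree and look for the module marker file.
    for root, _dirs, files in os.walk(dest_dir):
        if "tfhub_module.pb" in files:
            return root
    raise FileNotFoundError("tfhub_module.pb not found under " + dest_dir)
```

The returned folder is what would be passed to hub.Module, for example extract_hub_module("C:/Users/nnnn/Downloads/3.tar.gz", "C:/Users/nnnn/Desktop/BERT/elmo"), where both paths are placeholders.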
    

+++++++++++++++++++++++++++++++++++++++++++++++ Update 1

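The problems I am facing are listed below; I will add them one by one. This page has the complete notebook from the same author.

  1. When I try import tokenization, I get the error ModuleNotFoundError: No module named 'tokenization'. How do I get rid of it? Do I need to download tokenization.py and reference it? Please be specific.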
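
The error suggests that Python cannot find a file named tokenization.py in any folder on sys.path. A minimal sketch of checking and fixing that, assuming tokenization.py has been saved into a local folder; the helper name ensure_importable is made up for illustration:

```python
import importlib
import importlib.util
import os
import sys


def ensure_importable(module_name, script_dir):
    """Put script_dir on sys.path if needed and report whether module_name can be found."""
    script_dir = os.path.abspath(script_dir)
    if script_dir not in sys.path:
        sys.path.append(script_dir)
    # Make sure files created after interpreter start-up are noticed.
    importlib.invalidate_caches()
    return importlib.util.find_spec(module_name) is not None
```

If ensure_importable("tokenization", "C:\\Users\\nn\\Desktop\\BERT") returns True, a plain import tokenization should then succeed; if it returns False, the file is probably missing from that folder or has a different name.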

============== Update 2: I was able to get it working. The commented code is below.

# Manually copy the code from https://github.com/google-research/bert/blob/master/tokenization.py
# and paste it into a new file called C:\\Users\\nn\\Desktop\\BERT\\tokenization.py
# (for some reason, downloading the file directly doesn't work).

# Notebook: https://github.com/vineetm/tfhub-bert/blob/master/bert_tfhub.ipynb
# Importing a local file: https://stackoverflow.com/questions/44891069/how-to-import-python-file
import sys
import os

print(sys.path)

script_dir = "C:\\Users\\nn\\Desktop\\BERT"

# Add the absolute path of the directory containing
# tokenization.py to the Python path
sys.path.append(os.path.abspath(script_dir))

import tokenization

import tensorflow_hub as hub
import tensorflow as tf

# Download https://tfhub.dev/google/bert_cased_L-12_H-768_A-12/1 and unzip twice
def create_tokenizer(
        vocab_file='C:\\Users\\nn\\Desktop\\BERT\\bert_cased_L-12_H-768_A-12\\bert_cased_L-12_H-768_A-12~\\assets\\vocab.txt',
        do_lower_case=False):
    # do_lower_case=False because this is the cased model
    return tokenization.FullTokenizer(vocab_file=vocab_file, do_lower_case=do_lower_case)


tokenizer = create_tokenizer()


def convert_sentence_to_features(sentence, tokenizer, max_seq_len):
    """Return (input_ids, input_mask, segment_ids), each of length max_seq_len."""
    tokens = ['[CLS]']
    tokens.extend(tokenizer.tokenize(sentence))
    # Truncate so that there is still room for the final [SEP]
    if len(tokens) > max_seq_len-1:
        tokens = tokens[:max_seq_len-1]
    tokens.append('[SEP]')

    segment_ids = [0] * len(tokens)
    input_ids = tokenizer.convert_tokens_to_ids(tokens)
    input_mask = [1] * len(input_ids)

    # Zero-pad all three lists up to max_seq_len
    zero_mask = [0] * (max_seq_len-len(tokens))
    input_ids.extend(zero_mask)
    input_mask.extend(zero_mask)
    segment_ids.extend(zero_mask)

    return input_ids, input_mask, segment_ids

def convert_sentences_to_features(sentences, tokenizer, max_seq_len=20):
    all_input_ids = []
    all_input_mask = []
    all_segment_ids = []

    for sentence in sentences:
        input_ids, input_mask, segment_ids = convert_sentence_to_features(sentence, tokenizer, max_seq_len)
        all_input_ids.append(input_ids)
        all_input_mask.append(input_mask)
        all_segment_ids.append(segment_ids)

    return all_input_ids, all_input_mask, all_segment_ids



# Remote module URL (not used here):
# BERT_URL = 'https://tfhub.dev/google/bert_cased_L-12_H-768_A-12/1'

# Local path to the downloaded and unzipped module:
BERT_URL = 'C:\\Users\\nn\\Desktop\\BERT\\bert_cased_L-12_H-768_A-12\\bert_cased_L-12_H-768_A-12~'

module = hub.Module(BERT_URL)
sess = tf.Session()
sess.run(tf.global_variables_initializer())


input_ids = tf.placeholder(dtype=tf.int32, shape=[None, None])
input_mask = tf.placeholder(dtype=tf.int32, shape=[None, None])
segment_ids = tf.placeholder(dtype=tf.int32, shape=[None, None])

bert_inputs = dict(
    input_ids=input_ids,
    input_mask=input_mask,
    segment_ids=segment_ids)

bert_outputs = module(bert_inputs, signature="tokens", as_dict=True)


sentences = ['New Delhi is the capital of India', 'The capital of India is Delhi']
input_ids_vals, input_mask_vals, segment_ids_vals = convert_sentences_to_features(sentences, tokenizer, 10)  # third argument is max_seq_len

out = sess.run(bert_outputs, feed_dict={input_ids: input_ids_vals, input_mask: input_mask_vals, segment_ids: segment_ids_vals})

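# Inspect the outputs (wrapped in print so this also works outside a notebook)
print(out['sequence_output'].shape)  # (2, 10, 768): batch, max_seq_len, hidden size
print(out['pooled_output'].shape)    # (2, 768)
print(out.keys())
print(type(out['pooled_output']))

x1 = out['sequence_output'][0, :, :]
x2 = out['sequence_output'][1, :, :]
# The sentence length is 7; even after adding the [CLS] and [SEP] tokens the length is 9.
# The max_seq_len parameter is 10, so why are the last rows of x1 and x2 not the same?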
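
To see which rows of x1 and x2 are padding, it can help to look at input_mask. The sketch below mirrors the padding logic of convert_sentence_to_features, but uses a plain whitespace split as a stand-in for the real WordPiece tokenizer (an assumption: the real tokenizer may split words into more pieces). The helper names pad_features and real_rows are made up for illustration.

```python
def pad_features(tokens, max_seq_len):
    """Add [CLS]/[SEP], truncate if needed, and build the 0/1 input mask."""
    tokens = ['[CLS]'] + tokens[:max_seq_len - 2] + ['[SEP]']
    input_mask = [1] * len(tokens) + [0] * (max_seq_len - len(tokens))
    return tokens, input_mask


def real_rows(rows, input_mask):
    """Keep only the rows whose mask entry is 1 (real tokens, not padding)."""
    return [row for row, keep in zip(rows, input_mask) if keep == 1]


sentences = ['New Delhi is the capital of India', 'The capital of India is Delhi']
for sentence in sentences:
    # Whitespace split is only a stand-in for tokenizer.tokenize(sentence).
    tokens, mask = pad_features(sentence.split(), max_seq_len=10)
    print(len(tokens), mask)
# 9 [1, 1, 1, 1, 1, 1, 1, 1, 1, 0]
# 8 [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
```

Under this assumption the last row (index 9) is a padded position in both sentences. One plausible explanation for the difference is that the model still produces a vector for every position, padded ones included, and that vector depends on the real tokens of its own sentence. If only real tokens matter, something like real_rows(x1, input_mask_vals[0]) drops the padded rows before comparing.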

Tags: python, windows, tensorflow, tensorflow-hub, elmo
