keras - Sentence-level encoding with Keras
Problem description
My task is to build a supervised model for text similarity with Keras. My input is a pair of long-form texts, and the target is 0 or 1. I am trying to encode the words within each sentence of each text; the goal is to capture information from different levels of the document structure. To do this, I first split each text into a list of sentences and then tokenize. I also fix max_sentence_length (the maximum number of sentences in a text) and max_sequence_length (the maximum number of words in a sentence). Below is my preprocessing code:
input_text1 = [sen.split('.') for sen in input_text1]
input_text2 = [sen.split('.') for sen in input_text2]
max_words = 20000  # vocab size
max_sequence_length = 500
max_sentence_length = 100
emb_dim = 50
n_classes = 2
latent_dim = 128
lr = 0.001
epochs = 20
batch_size = 128
tokenizer = Tokenizer(num_words=max_words, filters='!"#$%&()*+,-.:;=?@[\\]^_`{|}~\t\n', lower=True, split=' ', oov_token="UNK")
encoder = LabelEncoder()
tr_sent1, te_sent1, tr_sent2, te_sent2, tr_rel, te_rel = train_test_split(input_text1, input_text2, similarity, test_size=0.2, stratify=similarity)
input_text_sen1 = tr_sent1[:2]
input_text_sen2 = tr_sent2[:2]
tr_rel = tr_rel[:2]
for sen in input_text_sen1 + input_text_sen2:
    tokenizer.fit_on_texts(sen)
encoder.fit(tr_rel)
tokenizer.word_index = {e: i for e, i in tokenizer.word_index.items() if i <= max_words}
tokenizer.word_index[tokenizer.oov_token] = max_words + 1
seqs1 = []
for sen in input_text_sen1:
    tmp = tokenizer.texts_to_sequences(sen)
    seqs1.append(tmp)
# to fix the number of sentences
seqs1_fixed = []
for sen in seqs1:
    sen = sen[:max_sentence_length]
    seqs1_fixed.append(sen)
seqs1_fixed = [pad_sequences(sen, maxlen=max_sequence_length,
                             value=0, padding='post', truncating='post') for sen in seqs1_fixed]
seqs2 = []
for sen in input_text_sen2:
    tmp = tokenizer.texts_to_sequences(sen)
    seqs2.append(tmp)
seqs2_fixed = []
for sen in seqs2:
    sen = sen[:max_sentence_length]
    seqs2_fixed.append(sen)
seqs2_fixed = [pad_sequences(sen, maxlen=max_sequence_length,
                             value=0, padding='post', truncating='post') for sen in seqs2_fixed]
categorical_y = encoder.transform(tr_rel)
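(A side observation on the preprocessing above: pad_sequences fixes the number of words per sentence, but the sentence dimension is only truncated via sen[:max_sentence_length], never padded, so a text with fewer sentences than the maximum yields a ragged structure. A minimal pure-Python sketch of padding both dimensions to a fixed grid; the pad_2d helper is hypothetical, not part of Keras:)

```python
def pad_2d(sentences, max_sentences, max_words, pad_id=0):
    """Truncate/pad a list of token-id lists to a fixed
    (max_sentences, max_words) grid, filling with pad_id."""
    grid = []
    for sen in sentences[:max_sentences]:              # drop extra sentences
        row = sen[:max_words]                          # drop extra words
        row = row + [pad_id] * (max_words - len(row))  # pad short sentences
        grid.append(row)
    while len(grid) < max_sentences:                   # pad missing sentences
        grid.append([pad_id] * max_words)
    return grid

doc = [[4, 8, 15], [16, 23]]  # two tokenized sentences
grid = pad_2d(doc, max_sentences=3, max_words=4)
# grid == [[4, 8, 15, 0], [16, 23, 0, 0], [0, 0, 0, 0]]
```

With a fully padded grid, every text maps to the same number of sentence inputs, matching the fixed list of Input layers the model expects.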
Below is the code that builds the model:
bilstm = Bidirectional(LSTM(units=latent_dim, return_sequences=True, dropout=0.2, recurrent_dropout=0.2))
encoder_input1 = [Input(shape=(max_sentence_length,), name='desc_word_' + str(i + 1)) for i in range(max_sequence_length)] # list with each shape=(None, 100)
text_embedding_input1 = Embedding(input_dim=max_words+2, output_dim=emb_dim, input_length=max_sentence_length) # list with each shape=(None, 100, 50)
embedding_input1 = [text_embedding_input1(inp) for inp in encoder_input1]
words_concat1 = concatenate(embedding_input1, axis=-1) # shape=(None, 100, 25000)
bilstm_out1 = bilstm(words_concat1) # shape=(None, 100, 256)
encoder_input2 = [Input(shape=(max_sentence_length, ), name='desc_word_' + str(i + 1)) for i in range(max_sequence_length)]
text_embedding_input2 = Embedding(input_dim=max_words+2, output_dim=emb_dim, input_length=max_sentence_length)
embedding_input2 = [text_embedding_input2(inp) for inp in encoder_input2]
words_concat2 = concatenate(embedding_input2, axis=-1)
bilstm_out2 = bilstm(words_concat2)
x1 = attention()(bilstm_out1) # shape=(None, 256)
x1 = Dropout(0.2)(x1)
x2 = attention()(bilstm_out2) # shape=(None, 256)
x2 = Dropout(0.2)(x2)
x = concatenate([x1, x2]) # shape=(None, 512)
out = Dense(units=n_classes, activation="softmax", kernel_regularizer=l2(0.01), bias_regularizer=l2(0.01))(x) # shape=(None, 2)
model = Model([encoder_input1 + encoder_input2], out)
Now I get the following error when running the model:
ValueError: Input tensors to a Model must come from `keras.layers.Input`. Received: [<tf.Tensor 'desc_word_1_16:0' shape=(None, 100) dtype=float32>, <tf.Tensor 'desc_word_2_16:0' shape=(None, 100) dtype=float32>, <tf.Tensor 'desc_word_3_16:0'....
Could you tell me what is causing this error? I have been stuck on it for a long time. Alternative suggestions for solving the problem are also welcome.
Solution
The error comes from this line:

model = Model([encoder_input1 + encoder_input2], out)

encoder_input1 + encoder_input2 is already a flat list of Input tensors; wrapping it in another pair of brackets hands Model a one-element list whose single element is itself a list rather than a keras.layers.Input tensor. Change it to:

model = Model(inputs=[encoder_input1] + [encoder_input2], outputs=out)
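(To see the difference without touching Keras at all, here is a plain-Python illustration of the two list expressions, with strings standing in for the Input tensors:)

```python
# Placeholders for the Input tensors in each list
encoder_input1 = ['in1_a', 'in1_b']
encoder_input2 = ['in2_a', 'in2_b']

# Original call: extra brackets add one more nesting level,
# so Model sees a single element that is a list, not a tensor.
wrong = [encoder_input1 + encoder_input2]
# wrong == [['in1_a', 'in1_b', 'in2_a', 'in2_b']], len(wrong) == 1

# Fixed call: a list of the two input lists, which Keras can flatten.
right = [encoder_input1] + [encoder_input2]
# right == [['in1_a', 'in1_b'], ['in2_a', 'in2_b']], len(right) == 2

# A fully flat list of tensors also works:
flat = encoder_input1 + encoder_input2
# flat == ['in1_a', 'in1_b', 'in2_a', 'in2_b'], len(flat) == 4
```

Either the fixed or the fully flat form avoids the spurious extra nesting level that triggers the "Input tensors to a Model must come from keras.layers.Input" error.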