Keras Embedding Layer in an LSTM Autoencoder

Problem Description

I am trying to implement a char2vec model that converts, or maps, person names into 50-dimensional (or any N-dimensional) vectors. The idea is very similar to FastText's get_word_vector or scikit-learn's TfidfVectorizer.

Basically, I found a supervised LSTM model in ethnicolr's notebook, and I am trying to turn it into an unsupervised autoencoder model.

Here are the model details. The input is a post-padded sequence of the character bigrams of a person's name.

Input:

person_name = ['Heynis', 'Noordewier-Reddingius', 'De Quant', 'Ahanfouf', 'Falaturi', ...]

### Convert person name to sequence with post padding
X_train = array([[101,  25, 180,  95, 443,   9, 343, 198,  38,  84,  37,   0,   0,   0,   0,   0,   0],
       [128,  27,   8,   6,  22,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [142, 350, 373,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       [999,  14,  33,  16, 512,  36,  52, 352,  14,  33,   5, 211, 143,   0,   0,   0,   0],
       [146,  54,  99,  72, 102,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0],
       ...]
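
For reference, here is a minimal sketch of how such padded bigram sequences could be produced; the bigram splitting and the Tokenizer settings are my assumptions, not taken from the original post:

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

names = ['Heynis', 'Noordewier-Reddingius', 'De Quant', 'Ahanfouf', 'Falaturi']

# Split each name into overlapping character bigrams,
# e.g. 'Heynis' -> ['He', 'ey', 'yn', 'ni', 'is'].
bigrams = [[n[i:i + 2] for i in range(len(n) - 1)] for n in names]

tokenizer = Tokenizer(lower=False, filters='')   # assumed settings: keep bigram case as-is
tokenizer.fit_on_texts(bigrams)                  # lists of tokens are accepted directly
sequences = tokenizer.texts_to_sequences(bigrams)

feature_len = 17                                 # width of the padded matrix shown above
X_train = pad_sequences(sequences, maxlen=feature_len, padding='post')
num_words = len(tokenizer.word_index) + 1        # +1 for the padding index 0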

Model:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(num_words, 32, input_length=feature_len))   # 32-d vector per bigram index
model.add(LSTM(128, dropout=0.2, recurrent_dropout=0.2))
model.add(Dense(num_classes, activation='softmax'))             # supervised classification head
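
For completeness, the supervised original would be trained along these lines; the loss, optimizer, and the one-hot label array y_train_onehot are my assumptions, not ethnicolr's exact settings:

model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])
model.fit(X_train, y_train_onehot, epochs=10, batch_size=32)   # y_train_onehot is hypothetical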

Ideally, this is what I am looking for:

inputs = Input(shape=(feature_len,))
embedded = Embedding(num_words, 32)(inputs)
encoded = LSTM(50, dropout=0.2, recurrent_dropout=0.2)(embedded)

decoded = LSTM()(encoded)
decoded_inverse_embedded = Inverse_Embedding()(decoded)   # I know it's silly.
outputs = Layer_something()   # to convert back to the original shape

autoencoder_model = Model(inputs, outputs)
encoder = Model(inputs, encoded)   # This is what I want, ultimately.

autoencoder_model.fit(X_train, X_train)

Here is what I have tried: I took the code from https://stackoverflow.com/a/59576475/3015105. There, the training data is reshaped before being fed to the model, so no Embedding layer is needed; RepeatVector and TimeDistributed layers reshape the output back. The model looks right to me, but I am not sure whether this reshape plus the TimeDistributed layer really stands in for an Embedding layer.

from tensorflow.keras.layers import Input, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.models import Model

# treat each integer bigram index as one continuous feature per timestep
sequence = X_train.reshape((len(X_train), feature_len, 1))

# define encoder
visible = Input(shape=(feature_len, 1))
encoder = LSTM(50, activation='relu')(visible)

# define reconstruct decoder
decoder1 = RepeatVector(feature_len)(encoder)
decoder1 = LSTM(50, activation='relu', return_sequences=True)(decoder1)
decoder1 = TimeDistributed(Dense(1))(decoder1)

myModel = Model(inputs=visible, outputs=decoder1)

myModel.compile(optimizer='adam', loss='mse')   # compile step was missing; optimizer/loss assumed
myModel.fit(sequence, sequence, epochs=400)
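
Note that this reshape-based model feeds the raw integer bigram indices in as if they were one continuous feature per timestep, so the network has to treat index 999 as "larger than" index 14; an Embedding layer instead learns a dense vector per index, which is usually what you want for categorical tokens. If you do stay with this model, the 50-d name vectors can be read off the bottleneck like this (a sketch, assuming the model above has been trained):

encoder_model = Model(inputs=visible, outputs=encoder)
name_vectors = encoder_model.predict(sequence)   # shape: (num_names, 50)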

The results do not look right. Is there another way to approach this problem? I have already tried FastText (via gensim) and TF-IDF models, and I am curious whether this model would do better.

Tags: python, tensorflow, machine-learning, keras, deep-learning

Solution
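
One common way to close the loop, sketched below with assumed hyperparameters: keep the Embedding on the encoder side, and replace the impossible Inverse_Embedding with a per-timestep softmax over the bigram vocabulary via TimeDistributed(Dense(num_words)). Training with sparse categorical cross-entropy lets the integer sequence X_train serve as both input and target, so no inverse embedding or one-hot encoding is needed:

from tensorflow.keras.layers import Input, Embedding, LSTM, RepeatVector, TimeDistributed, Dense
from tensorflow.keras.models import Model

inputs = Input(shape=(feature_len,))
embedded = Embedding(num_words, 32, mask_zero=True)(inputs)          # mask_zero skips padding; an assumption
encoded = LSTM(50, dropout=0.2, recurrent_dropout=0.2)(embedded)     # the 50-d name vector

decoded = RepeatVector(feature_len)(encoded)
decoded = LSTM(128, return_sequences=True)(decoded)
# Instead of inverting the embedding, predict the bigram index at every timestep.
outputs = TimeDistributed(Dense(num_words, activation='softmax'))(decoded)

autoencoder_model = Model(inputs, outputs)
encoder = Model(inputs, encoded)   # this is the model you ultimately want

# sparse_categorical_crossentropy takes the integer sequence directly as the target.
autoencoder_model.compile(optimizer='adam', loss='sparse_categorical_crossentropy')
autoencoder_model.fit(X_train, X_train, epochs=100, batch_size=32)

name_vectors = encoder.predict(X_train)   # shape: (num_names, 50)

The epochs, batch size, and layer widths above are placeholders; the key idea is only that the decoder ends in a softmax over num_words classes rather than trying to invert the embedding.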

