Building an Encoder-Decoder

Problem Description

I am new to deep learning, and I am trying to build an encoder.

import re
from keras.preprocessing.text import Tokenizer, text_to_word_sequence
from keras.preprocessing.sequence import pad_sequences

trainFromTextFile = "train.FROM"
trainToTextFile   = "train.TO"
with open(trainFromTextFile, 'r', encoding='utf-8') as f:
    trainFromText = f.read().lower()
with open(trainToTextFile, 'r', encoding='utf-8') as f:
    trainToText = f.read().lower()

trainFromSentence = trainFromText.split('\n')
trainToSentence   = trainToText.split('\n')
trainFromWords    = re.split(' |\n', trainFromText)
trainToWords      = re.split(' |\n', trainToText)

print('Found %s sentences from TrainFrom Text' % len(trainFromSentence))
print('Found %s sentences from TrainTo Text' % len(trainToSentence))
print('Found %s words from TrainFrom Text' % len(trainFromWords))
print('Found %s words from TrainTo Text' % len(trainToWords))

trainInput  = trainFromSentence[0:1000]
trainTarget = trainToSentence[0:1000]

max_len   = 100    # Cut sentences after 100 words
max_words = 10000  # Consider only the top 10,000 words in the dataset

tokenizerInput = Tokenizer(num_words=max_words)
tokenizerInput.fit_on_texts(trainInput)

# text_to_word_sequence is a module-level function that works on a single
# string, not a Tokenizer method, so apply it per sentence.
wordInput = [text_to_word_sequence(s) for s in trainInput]
sequencesInput = tokenizerInput.texts_to_sequences(trainInput)
sequencesInput = pad_sequences(sequencesInput, maxlen=max_len)  # pad so all arrays are the same size

Inputindex = tokenizerInput.word_index
Inputcount = tokenizerInput.word_counts
nInput = len(tokenizerInput.word_counts) + 1  # vocabulary size (+1 for the padding index 0)

print("Train From File:\n")
print('Found %s sentences.' % len(trainInput))
print('Found %s sequences.' % len(sequencesInput))
print('Found %s unique tokens.' % len(Inputindex))
print('Found %s unique words.' % len(Inputcount))

This is what I have so far. I would like to know how to take the data I have prepared and build an encoder that consumes it.

Tags: python, deep-learning

Solution


This is usually how you would build the various kinds of autoencoders (link). Judging from your question, though, you seem to be interested in sequence-to-sequence prediction, which uses an encoder-decoder style model built mainly on recurrent neural networks. A tutorial can be found here (link).
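To make the encoder-decoder idea concrete, here is a minimal sketch of such a model in Keras. It assumes the preprocessing from the question has already produced padded integer sequences; the vocabulary sizes and `latent_dim` below are hypothetical placeholders (in the question, `nInput` would play the role of the source vocabulary size).

```python
import numpy as np
from tensorflow.keras.models import Model
from tensorflow.keras.layers import Input, LSTM, Embedding, Dense

# Hypothetical sizes for illustration; substitute your own
# (e.g. src_vocab = nInput from the preprocessing step).
src_vocab = 5000   # source vocabulary size
tgt_vocab = 5000   # target vocabulary size
latent_dim = 256   # size of the LSTM hidden state

# Encoder: embed source token ids, run an LSTM, and keep only
# its final hidden and cell states as the sentence summary.
encoder_inputs = Input(shape=(None,))
enc_emb = Embedding(src_vocab, latent_dim)(encoder_inputs)
_, state_h, state_c = LSTM(latent_dim, return_state=True)(enc_emb)
encoder_states = [state_h, state_c]

# Decoder: initialized with the encoder states, it predicts the
# target sequence one token at a time (teacher forcing in training).
decoder_inputs = Input(shape=(None,))
dec_emb = Embedding(tgt_vocab, latent_dim)(decoder_inputs)
decoder_outputs, _, _ = LSTM(latent_dim, return_sequences=True,
                             return_state=True)(dec_emb,
                                                initial_state=encoder_states)
decoder_outputs = Dense(tgt_vocab, activation='softmax')(decoder_outputs)

model = Model([encoder_inputs, decoder_inputs], decoder_outputs)
model.compile(optimizer='rmsprop', loss='sparse_categorical_crossentropy')
model.summary()
```

For training, the model would be fed the padded source sequences and the target sequences shifted by one position (decoder input vs. decoder output), e.g. via `model.fit([sequencesInput, decoder_input_data], decoder_target_data, ...)`.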

