Using pre-trained sentence embeddings with a recurrent network

Problem description

I want to use Universal Sentence Encoder embeddings with a recurrent network.

With traditional word embeddings for an RNN, each word is encoded as a vector and the RNN's time_step is the number of words in a sentence.

What I want to do instead is use sentence embeddings to encode each sentence into a 512-dimensional vector. The RNN's time_step then becomes the number of sentences in the text, in my case an IMDB review.
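
For reference, a minimal sketch of that idea, assuming the same TF1-style tensorflow_hub API and the universal-sentence-encoder-large/2 module used in the code further down (the example review text is just a placeholder):

import tensorflow as tf
import tensorflow_hub as hub
from nltk.tokenize import sent_tokenize

# one review -> list of sentences -> one 512-d vector per sentence
embed = hub.Module("https://tfhub.dev/google/universal-sentence-encoder-large/2")
sentences = sent_tokenize("The movie was great. The acting was superb.")

with tf.Session() as sess:
    sess.run([tf.global_variables_initializer(), tf.tables_initializer()])
    vectors = sess.run(embed(sentences))   # shape (num_sentences, 512)

# `vectors` is one sample for the RNN: time_step = number of sentences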

I am trying this on IMDB binary classification. The problem is that the model does not learn, no matter how I tune the hyperparameters. Training and test accuracy stay at 50%, which means the model only ever predicts one of the two classes.

Any help would be greatly appreciated!

_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
lstm_1 (LSTM)                (None, 128)               131584
_________________________________________________________________
dense_1 (Dense)              (None, 2)                 258
=================================================================
Total params: 131,842
Trainable params: 131,842
Non-trainable params: 0
_________________________________________________________________
WARNING:tensorflow:From C:\Users\shaggyday\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
W0709 14:26:44.883890  9716 deprecation.py:323] From C:\Users\shaggyday\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow\python\ops\math_ops.py:3066: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.cast instead.
Epoch 1/10
249/249 [==============================] - 55s 220ms/step - loss: 0.6937 - acc: 0.5004 - val_loss: 0.6931 - val_acc: 0.5061
Epoch 2/10
249/249 [==============================] - 68s 274ms/step - loss: 0.6970 - acc: 0.5002 - val_loss: 0.6942 - val_acc: 0.5009
Epoch 3/10
249/249 [==============================] - 71s 285ms/step - loss: 0.6947 - acc: 0.4961 - val_loss: 0.6980 - val_acc: 0.5009
Epoch 4/10
249/249 [==============================] - 70s 279ms/step - loss: 0.6938 - acc: 0.4998 - val_loss: 0.6956 - val_acc: 0.5033
Epoch 5/10
249/249 [==============================] - 66s 267ms/step - loss: 0.6936 - acc: 0.5018 - val_loss: 0.6939 - val_acc: 0.5046
Epoch 6/10
249/249 [==============================] - 63s 251ms/step - loss: 0.6931 - acc: 0.5003 - val_loss: 0.6933 - val_acc: 0.5058

The code that pre-embeds the text is:

import os
import pickle

import pandas as pd
import tensorflow as tf
import tensorflow_hub as hub
from nltk.tokenize import sent_tokenize

file = 'train.csv'
df = pd.read_csv(file)
# df['sentiment'] = [1 if sentiment == 'positive' else 0 for sentiment in df['sentiment'].values]
x = df['review'].values
y = df['sentiment'].values

# split each review into its list of sentences
x_sent = []
for review in x:
    x_sent.append(sent_tokenize(review))

num_sample = len(x)
val_split = int(num_sample * 0.5)
x_train, y_train = x_sent, y                        # train uses all reviews
x_test, y_test = x_sent[val_split:], y[val_split:]  # test reuses the second half

module_url = "https://tfhub.dev/google/universal-sentence-encoder-large/2"
out_dir = 'use(dan)'
os.makedirs(out_dir, exist_ok=True)
embed = hub.Module(module_url)
config = tf.ConfigProto()

# embed the data in chunks so each TF session run stays manageable
num_files = 10
n_file = num_sample // num_files

for n in range(num_files):

    def batch_embed(batch, labels, lens, set_):
        """
        batch:   flat list of sentences for this chunk of reviews
        labels:  label for each review in the chunk
        lens:    number of sentences in each review (used to re-group the flat list)
        set_:    'train' | 'test'
        """
        with tf.Session(config=config) as session:
            session.run([tf.global_variables_initializer(), tf.tables_initializer()])

            print('Getting embeddings for the {} data'.format(set_))
            path = os.path.join(out_dir, 'embed_{}_{}.bin'.format(set_, n))
            if not os.path.exists(path):
                # one 512-d vector per sentence: shape (num_sentences_in_chunk, 512)
                embeddings = session.run(embed(batch))

                # re-group the flat sentence embeddings into one array per review
                offset = 0
                review_embeddings = []
                for l in lens:
                    review_embeddings.append(embeddings[offset:offset + l])
                    offset += l
                with open(path, 'wb') as f:
                    pickle.dump((review_embeddings, labels), f)

                # flag reviews that ended up with zero sentences
                for i, re in enumerate(review_embeddings):
                    if re.shape[0] == 0:
                        print(i)

    train_batch = x_train[n * n_file: min(len(x_train), (n + 1) * n_file)]
    labels = y_train[n * n_file: min(len(x_train), (n + 1) * n_file)]
    lens = [len(x) for x in train_batch]
    sent_batch = [sent for review in train_batch for sent in review]
    print(len(sent_batch))
    batch_embed(sent_batch, labels, lens, 'train')

    test_batch = x_test[n * n_file: min(len(x_test), (n + 1) * n_file)]
    labels = y_test[n * n_file: min(len(x_test), (n + 1) * n_file)]
    lens = [len(x) for x in test_batch]
    sent_batch = [sent for review in test_batch for sent in review]
    print(len(sent_batch))
    batch_embed(sent_batch, labels, lens, 'test')
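
To feed these pickle files into Keras, each chunk has to be loaded back and the per-review sentence matrices padded to a common number of sentences. A rough sketch of that step; the load_chunk helper and the max_sents cap are illustrative and not part of the original code:

import os
import pickle
import numpy as np
from tensorflow.keras.preprocessing.sequence import pad_sequences

max_sents = 30   # assumed maximum number of sentences kept per review

def load_chunk(set_, n, out_dir='use(dan)'):
    """Load one pickled chunk and pad every review to max_sents sentences of 512 dims."""
    with open(os.path.join(out_dir, 'embed_{}_{}.bin'.format(set_, n)), 'rb') as f:
        review_embeddings, labels = pickle.load(f)
    # pad/truncate along the sentence axis; result shape: (chunk_size, max_sents, 512)
    padded = pad_sequences(review_embeddings, maxlen=max_sents, dtype='float32',
                           padding='post', truncating='post')
    return padded, np.asarray(labels)

x_chunk, y_chunk = load_chunk('train', 0)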

The model is a very simple LSTM with one layer of 256 neurons. Each batch is padded, since every IMDB review has a different number of sentences.
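
A minimal version of such a model, assuming the padded (max_sents, 512) input from the sketch above, integer 0/1 labels, and the 128 LSTM units shown in the printed summary, might look like this; the Masking layer makes the LSTM skip the all-zero padded sentence slots:

from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Masking, LSTM, Dense

model = Sequential([
    Masking(mask_value=0.0, input_shape=(max_sents, 512)),  # ignore zero-padded sentences
    LSTM(128),
    Dense(2, activation='softmax'),
])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',       # integer 0/1 labels
              metrics=['acc'])
model.summary()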

Tags: python, keras, deep-learning, nlp, embedding

Solution

