Keras pad_sequences fails even after tokenization

Problem description

I tokenized the text content of my DataFrame like this:

tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(tweets_df['content'])
tweets_df['content'] = tokenizer.texts_to_sequences(tweets_df['content'])

Then I tried to pad the sequences:

X_train = tf.keras.preprocessing.sequence.pad_sequences(X_train,
                                                             maxlen=MAX_LENGTH,
                                                             dtype='int32',
                                                             padding='post',
                                                             truncating='post')

This fails with: invalid literal for int() with base 10: 'content'

I tried to find the items that are not integers:

for arr in X_test['content']:
  for num in arr:
    if not isinstance(num, int):
      print(num)

But this did not print anything. What am I missing?

Tags: python, tensorflow, keras, text

Solution


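The error suggests that something which cannot be converted to int is being converted to int. Note that the offending value, 'content', is the column name rather than a token, which points at what is being passed to pad_sequences rather than at the tokenized values themselves. Here is a working example: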

import pandas as pd
import tensorflow as tf

# Build a small DataFrame with one text column
cars = {'Brand': [' Hero Honda Civic', 'Toyota Corolla', 'Ford Focus', 'Audi A4 A3 A2 A1']}
df = pd.DataFrame(cars, columns=['Brand'])

# Tokenize the text and replace each string with its list of token ids
tokenizer = tf.keras.preprocessing.text.Tokenizer()
tokenizer.fit_on_texts(df['Brand'])
df['Brand'] = tokenizer.texts_to_sequences(df['Brand'])

Perform the padding:

sequence = df['Brand']
MAX_LENGTH = 5
tf.keras.preprocessing.sequence.pad_sequences(sequence,
                                              maxlen=MAX_LENGTH,
                                              dtype='int32',
                                              padding='post',
                                              truncating='post')

array([[ 1,  2,  3,  0,  0],
       [ 4,  5,  0,  0,  0],
       [ 6,  7,  0,  0,  0],
       [ 8,  9, 10, 11, 12]], dtype=int32)
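
To relate this back to the question: iterating over a pandas DataFrame yields its column labels, not its rows, so handing a whole DataFrame to a function that loops over its input would pass it the string 'content' instead of the token lists. The sketch below assumes that X_train in the question is such a DataFrame with a 'content' column, which the question does not actually state; the data here is made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the asker's X_train (an assumption, not from
# the question): a DataFrame whose 'content' column holds token-id lists.
X_train = pd.DataFrame({'content': [[1, 2, 3], [4, 5]]})

# Iterating over the DataFrame yields column labels, not the sequences.
items_from_frame = list(X_train)

# Iterating over the column (a Series) yields the token-id lists.
items_from_column = list(X_train['content'])

# Converting the column label to an integer reproduces the error text.
try:
    int(items_from_frame[0])
    error_message = None
except ValueError as exc:
    error_message = str(exc)

print(items_from_frame)
print(items_from_column)
print(error_message)
```

If the first print shows the column name, that would also explain why the integer check in the question found nothing: the values inside the column are fine. In that case, passing the column itself (for example X_train['content'], as the example above does with df['Brand']) is likely what pad_sequences needs.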
