Why does the Keras tokenizer apply lower() to its own tokens?

Problem description

I am running my first CNN text classifier on the built-in IMDB dataset, loaded with tf.keras.datasets.imdb.load_data().

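I understand that the error AttributeError: 'int' object has no attribute 'lower' means a lowercasing function is being applied to an int object (apparently by the tokenizer). What I do not understand is why it is raised here, since the data comes straight from the built-in tf.keras.datasets.imdb.load_data().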

I have no experience using embeddings in text classification.

The code, excluding the CNN model, is:

import tensorflow as tf
from keras.preprocessing import sequence
from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation
from keras.layers import Embedding, LSTM
from keras.layers import Conv1D, Flatten, MaxPooling1D
from keras.datasets import imdb
import wandb
from wandb.keras import WandbCallback
import numpy as np
from keras.preprocessing import text

import imdb

wandb.init(mode="disabled") # disabled for debugging
config = wandb.config

# set parameters:
config.vocab_size = 1000        
config.maxlen = 1000
config.batch_size = 32
config.embedding_dims = 10
config.filters = 16
config.kernel_size = 3
config.hidden_dims = 250
config.epochs = 10

(X_train, y_train), (X_test, y_test) = tf.keras.datasets.imdb.load_data()

tokenizer = text.Tokenizer(num_words=config.vocab_size)
tokenizer.fit_on_texts(X_train)
X_train = tokenizer.texts_to_matrix(X_train)
X_test = tokenizer.texts_to_matrix(X_test)

X_train = sequence.pad_sequences(X_train, maxlen=config.maxlen)
X_test = sequence.pad_sequences(X_test, maxlen=config.maxlen)

Line 34, mentioned in the error, is tokenizer = text.Tokenizer(num_words=config.vocab_size)

The exact error thrown (including the deprecation warnings) is:

C:\Users\Keegan\anaconda3\envs\oldK\lib\site-packages\tensorflow_core\python\keras\datasets\imdb.py:129: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  x_train, y_train = np.array(xs[:idx]), np.array(labels[:idx])

C:\Users\Keegan\anaconda3\envs\oldK\lib\site-packages\tensorflow_core\python\keras\datasets\imdb.py:130: VisibleDeprecationWarning: Creating an ndarray from ragged nested sequences (which is a list-or-tuple of lists-or-tuples-or ndarrays with different lengths or shapes) is deprecated. If you meant to do this, you must specify 'dtype=object' when creating the ndarray.
  x_test, y_test = np.array(xs[idx:]), np.array(labels[idx:])
Traceback (most recent call last):
  File "imdb-cnn.py", line 34, in <module>
    tokenizer.fit_on_texts(X_train)
  File "C:\Users\Keegan\anaconda3\envs\oldK\lib\site-packages\keras_preprocessing\text.py", line 217, in fit_on_texts
    text = [text_elem.lower() for text_elem in text]
  File "C:\Users\Keegan\anaconda3\envs\oldK\lib\site-packages\keras_preprocessing\text.py", line 217, in <listcomp>
    text = [text_elem.lower() for text_elem in text]
AttributeError: 'int' object has no attribute 'lower'
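The list comprehension on line 217 of the traceback can be reproduced without Keras. The sketch below assumes, as is the case for tf.keras.datasets.imdb.load_data(), that each review arrives as a list of integer word indices rather than as a string. The values in `sample_reviews` are made up for illustration, and `lowercase_tokens` is a hypothetical helper, not library code.

```python
# Made-up stand-in for X_train: each review is already a list of integer
# word indices, not raw text.
sample_reviews = [[1, 14, 22, 16], [1, 194, 1153]]


def lowercase_tokens(tokens):
    """Mirror the comprehension shown in the traceback."""
    return [token.lower() for token in tokens]


try:
    lowercase_tokens(sample_reviews[0])
    error_message = None
except AttributeError as exc:
    # int has no .lower() method, so the first element already fails.
    error_message = str(exc)

print(error_message)
```

The captured message is the same AttributeError text that ends the traceback above.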

The Anaconda environment has Python 3.7.1, TensorFlow 2.1.0, and Keras 2.3.1.

Tags: keras, tokenize, word-embedding, data-preprocessing

Solution


The Keras tokenizer has a lower attribute that can be set to True or False.
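As a rough illustration of what such a flag does, the sketch below gates the lowercasing step on a boolean. `prepare_tokens` is a hypothetical helper written for this example, not the Keras implementation. In Keras itself the flag is the `lower` argument of `Tokenizer`.

```python
def prepare_tokens(tokens, lower=True):
    """Return a copy of tokens, lowercased only when lower is True."""
    if lower:
        # This branch requires every token to be a string.
        return [token.lower() for token in tokens]
    # With lower=False the tokens pass through untouched, so non-string
    # tokens such as integer word indices never reach .lower().
    return list(tokens)


word_tokens = prepare_tokens(["The", "Movie"], lower=True)
index_tokens = prepare_tokens([1, 14, 22], lower=False)
print(word_tokens, index_tokens)
```

Calling `prepare_tokens([1, 14, 22], lower=True)` would raise the same AttributeError as in the question, because the lowercasing branch is applied to integers.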

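My guess is that the prepackaged IMDB data is lowercased by default because the dataset is quite small. If you do not lowercase it, uppercase and lowercase forms of a word get different embeddings, but the capitalized forms may not occur often enough in the training data to train those embeddings properly. This changes, of course, once you use pretrained embeddings or a pretrained contextual model such as BERT, which is pretrained on large amounts of data.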
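The effect of casing on vocabulary size and per-word frequency can be seen on a toy example. This is a minimal sketch using only the standard library. `build_vocab` and the sample sentence are made up for illustration and do not reflect how the IMDB data was actually prepared.

```python
from collections import Counter


def build_vocab(tokens, lower):
    """Map each distinct token to (index, count), most frequent first."""
    if lower:
        tokens = [token.lower() for token in tokens]
    counts = Counter(tokens)
    return {
        token: (index, count)
        for index, (token, count) in enumerate(counts.most_common(), start=1)
    }


tokens = "The movie was great the acting was great THE END".split()

cased = build_vocab(tokens, lower=False)
uncased = build_vocab(tokens, lower=True)

# Without lowercasing, "The", "the" and "THE" are three separate entries,
# each seen once; with lowercasing they merge into one entry seen three times.
print(len(cased), len(uncased))
print(cased["The"], cased["the"], cased["THE"], uncased["the"])
```

Each distinct index would get its own embedding row, so the cased vocabulary spreads the same evidence over more, rarer entries.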

