Handling unknown words when building an NER model

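Question

I am working on a custom named entity recognition model that I built with the Keras library in Python. I have read that I should enumerate every word that occurs so that I can get vectorized sequences. This is what I did: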

word2idx = {w: i + 1 for i, w in enumerate(words)}
label2idx = {t: i for i, t in enumerate(labels)}

# CREATING FEATURES(X) AND RESULTS(Y)
max_len = 50 
num_words = len(words)  # number of unique words in dataset
X = [[word2idx[w[0]] for w in s] for s in list_of_sentances]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=num_words-1)

y = [[label2idx[w[1]] for w in s] for s in list_of_sentances]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=label2idx["O"])
y = [to_categorical(i, num_classes=num_labels) for i in y]

This is my final model:

input_word = Input(shape=(max_len,))

model = Embedding(input_dim = num_words, output_dim = 50, input_length = max_len)(input_word)
model = SpatialDropout1D(0.2)(model)
model = Bidirectional(LSTM(units = 5, return_sequences=True, recurrent_dropout = 0.1))(model)
out = TimeDistributed(Dense(num_labels, activation = "softmax"))(model)

model = Model(input_word, out)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_1 (InputLayer)         [(None, 30)]              0         
_________________________________________________________________
embedding (Embedding)        (None, 30, 50)            2187550   
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 30, 50)            0         
_________________________________________________________________
bidirectional (Bidirectional (None, 30, 10)            2240      
_________________________________________________________________
time_distributed (TimeDistri (None, 30, 11)            121       
=================================================================
Total params: 2,189,911  # LOOK AT THIS NUMBER
Trainable params: 2,189,911
Non-trainable params: 0

My accuracy is 98% and my loss is 0.07. I am happy with these results, but I cannot make predictions because of missing words. For example:

text = "I live in the Ohio and my name is Alex Wright and I work in AvcCC LTD"
text = text.split()
text = [word2idx[w] for w in text]

text = np.array(text)
print(text)
text=text.reshape(1,text.shape[0])

max_len = 50
text = pad_sequences(maxlen=max_len, sequences=text, padding="post", value=num_words-1)
print('PREDICTION')
res = model.predict(text).argmax(axis=-1)[0]
print(res)

Error:

KeyError: 'AvcCC'

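The word "AvcCC" is not in my dataset or in my vocabulary. How do I handle that?

I want to use this code/model in production. Since my word2idx only contains words from the initial data, how do I handle words that are not in my word2idx vocabulary? For example, my word2idx vocabulary cannot possibly contain every existing first and last name, every city/location, every company name, slang, and so on.

My vocabulary had about 40k enumerated words (the number of unique words in my dataset). I then enriched it with more than 100k additional words (I built a web crawler that scrapes different kinds of news articles), so my vocabulary now has about 140k words. Instead of enumerating the unique words in the dataset, I now load my new word2idx/vocabulary.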

word2idx = open('english-vocab.json')
word2idx = json.load(word2idx)

max_len = 50 
num_words = len(num_words) #number of unique words in dataset
X = [[word2idx[w[0]] for w in s] for s in list_of_sentances]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=num_words-1)

y = [[label2idx[w[1]] for w in s] for s in list_of_sentances]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=label2idx["O"])
y = [to_categorical(i, num_classes=num_labels) for i in y]

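Accuracy and loss stayed the same, but my model became slower because of the total number of parameters (I can no longer use num_words because it raises an error, so I have to use len(word2idx)):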

input_word = Input(shape=(max_len,))

model = Embedding(input_dim = len(word2idx), output_dim = 50, input_length = max_len)(input_word)
model = SpatialDropout1D(0.2)(model)
model = Bidirectional(LSTM(units = 5, return_sequences=True, recurrent_dropout = 0.1))(model)
out = TimeDistributed(Dense(num_labels, activation = "softmax"))(model)

model = Model(input_word, out)
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
=================================================================
input_2 (InputLayer)         [(None, 30)]              0         
_________________________________________________________________
embedding_1 (Embedding)      (None, 30, 50)            5596600   
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 30, 50)            0         
_________________________________________________________________
bidirectional_1 (Bidirection (None, 30, 10)            2240      
_________________________________________________________________
time_distributed_1 (TimeDist (None, 30, 11)            121       
=================================================================
Total params: 5,598,961 # MUCH BIGGER NUMBER
Trainable params: 5,598,961
Non-trainable params: 0

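By creating my own word2idx I wanted to handle words missing from the vocabulary, but the only thing I achieved was slowing down training.

How should I deal with this kind of problem? How do I handle missing/nonexistent/unknown words?

Tags: python, tensorflow, keras, word2vec, named-entity-recognition

Solution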


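For the benefit of the community, here is the solution Patrick mentioned in the comments, posted in the answer section, together with another way of handling "OOV" (out-of-vocabulary) words.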

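Changing

text = [word2idx[w] for w in text]

to

text = [word2idx.get(w, "UNKNOWN_WORD") for w in text]

avoids the KeyError: every unknown word is replaced with "UNKNOWN_WORD", which you can add to the vocabulary as a new entry. Note that the model expects integer indices, so "UNKNOWN_WORD" needs its own index in word2idx, and that index (not the string itself) should be the fallback value. Alternatively, you can drop all unknown words with

text = [word2idx[w] for w in text if w in word2idx]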
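As a minimal sketch of the unknown-token approach, the snippet below reserves one index for padding and one for unknown words, then encodes a sentence that contains out-of-vocabulary words. The helper names (build_vocab, encode), the "<PAD>" token, and the choice of indices 0 and 1 are assumptions made for illustration; they are not part of the original post.

```python
# Hypothetical token names and index layout, chosen for illustration only.
PAD_TOKEN = "<PAD>"
UNK_TOKEN = "UNKNOWN_WORD"


def build_vocab(words):
    """Map each word to an integer, reserving 0 for padding and 1 for unknown words."""
    word2idx = {PAD_TOKEN: 0, UNK_TOKEN: 1}
    for w in words:
        if w not in word2idx:
            word2idx[w] = len(word2idx)
    return word2idx


def encode(tokens, word2idx, max_len):
    """Convert tokens to a fixed-length list of indices.

    Words missing from word2idx fall back to the UNK index instead of
    raising a KeyError; short sentences are padded at the end.
    """
    unk = word2idx[UNK_TOKEN]
    ids = [word2idx.get(w, unk) for w in tokens][:max_len]
    return ids + [word2idx[PAD_TOKEN]] * (max_len - len(ids))


word2idx = build_vocab("I live in the Ohio and my name is Alex".split())
encoded = encode("I live in AvcCC LTD".split(), word2idx, max_len=8)
print(encoded)
```

With a layout like this, the Embedding layer's input_dim would be len(word2idx). The unknown-word row only learns something useful if some training tokens are mapped to it, for example by replacing very rare training words with "UNKNOWN_WORD" before fitting.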

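Below is an illustration of a 300-dimensional embedding matrix built from loaded pretrained embedding vectors. In this matrix, words that are not in the pretrained vocabulary keep an all-zero row.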

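import numpy as np

# Assumes tokenizer, pretrained_model and vocabulary_size are already defined;
# index2word is assumed to hold the pretrained model's vocabulary.
embedding_matrix = np.zeros((vocabulary_size, 300))
for word, index in tokenizer.word_index.items():
    # Stop at the first index that falls outside the matrix.
    if index > vocabulary_size - 1:
        break
    # Words missing from the pretrained vocabulary keep their all-zero row.
    if word in index2word:
        embedding_matrix[index] = pretrained_model[word]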
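One possible variation on the matrix above is to give the unknown-word token its own non-zero row, so that out-of-vocabulary words are not indistinguishable from padding. The sketch below uses a toy dictionary in place of real pretrained vectors. The function name build_embedding_matrix, the "<PAD>" token, and the index layout are assumptions for illustration only.

```python
import numpy as np


def build_embedding_matrix(word2idx, pretrained, dim, unk_index, seed=0):
    """Build a (vocab_size, dim) matrix from a word -> vector mapping.

    Words without a pretrained vector keep an all-zero row, except the
    unknown-word token, which gets a small random vector.
    """
    rng = np.random.default_rng(seed)
    matrix = np.zeros((len(word2idx), dim), dtype="float32")
    for word, idx in word2idx.items():
        if word in pretrained:
            matrix[idx] = pretrained[word]
    # Small random row so that unknown words differ from padding.
    matrix[unk_index] = rng.normal(scale=0.1, size=dim)
    return matrix


# Toy stand-in for pretrained vectors (hypothetical data).
pretrained = {"ohio": np.ones(4), "alex": np.full(4, 2.0)}
word2idx = {"<PAD>": 0, "UNKNOWN_WORD": 1, "ohio": 2, "alex": 3, "avccc": 4}

emb = build_embedding_matrix(word2idx, pretrained, dim=4, unk_index=1)
print(emb.shape)
```

A matrix like this could be used to initialize a Keras Embedding layer, for instance through embeddings_initializer with a Constant initializer. Whether to freeze the layer or keep it trainable is a separate design choice.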
