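python - Handling unknown words when building an NER model
Problem description
I am working on a custom named entity recognition model that I built with the Keras library in Python. I read that I should enumerate every word that occurs in the data so that sentences can be turned into integer sequences. This is what I did: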
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.utils import to_categorical

# Word indices start at 1, so index 0 is left unused
word2idx = {w: i + 1 for i, w in enumerate(words)}
label2idx = {t: i for i, t in enumerate(labels)}
# CREATING FEATURES (X) AND RESULTS (Y)
max_len = 50
num_words = len(words)  # number of unique words in dataset
num_labels = len(labels)  # number of distinct tags
X = [[word2idx[w[0]] for w in s] for s in list_of_sentances]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=num_words-1)
y = [[label2idx[w[1]] for w in s] for s in list_of_sentances]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=label2idx["O"])
y = [to_categorical(i, num_classes=num_labels) for i in y]
This is my final model:
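from tensorflow.keras.layers import (Input, Embedding, SpatialDropout1D,
                                     Bidirectional, LSTM, TimeDistributed, Dense)
from tensorflow.keras.models import Model

input_word = Input(shape=(max_len,))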
model = Embedding(input_dim = num_words, output_dim = 50, input_length = max_len)(input_word)
model = SpatialDropout1D(0.2)(model)
model = Bidirectional(LSTM(units = 5, return_sequences=True, recurrent_dropout = 0.1))(model)
out = TimeDistributed(Dense(num_labels, activation = "softmax"))(model)
model = Model(input_word, out)
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_1 (InputLayer) [(None, 30)] 0
_________________________________________________________________
embedding (Embedding) (None, 30, 50) 2187550
_________________________________________________________________
spatial_dropout1d (SpatialDr (None, 30, 50) 0
_________________________________________________________________
bidirectional (Bidirectional (None, 30, 10) 2240
_________________________________________________________________
time_distributed (TimeDistri (None, 30, 11) 121
=================================================================
Total params: 2,189,911 # LOOK AT THIS NUMBER
Trainable params: 2,189,911
Non-trainable params: 0
I get 98% accuracy and a loss of 0.07. I am happy with these results, but I cannot make predictions because of missing words. For example:
text = "I live in the Ohio and my name is Alex Wright and I work in AvcCC LTD"
text = text.split()
text = [word2idx[w] for w in text]
text = np.array(text)
print(text)
text=text.reshape(1,text.shape[0])
max_len = 50
text = pad_sequences(maxlen=max_len, sequences=text, padding="post", value=num_words-1)
print('PREDICTION')
res = model.predict(text).argmax(axis=-1)[0]
print(res)
Error:
KeyError: 'AvcCC'
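The word "AvcCC" is not in my dataset or in my vocabulary. How do I handle this?
I want to use this code/model in production. Since my word2idx contains only words from the original data, how do I handle words that are not in my word2idx vocabulary? For example, my word2idx vocabulary cannot possibly contain every existing first name and surname, every city/location, every company name, every slang term, and so on.
My vocabulary had about 40k enumerated words (the number of unique words in my dataset). I then enriched it with more than 100k other words (I wrote a web crawler that scraped different kinds of news articles), so my vocabulary now has about 140k words. Now, instead of enumerating the unique words from the dataset, I load my new word2idx/vocabulary.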
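import json

# Load the enlarged vocabulary; the context manager closes the file afterwards
with open('english-vocab.json') as f:
    word2idx = json.load(f)
max_len = 50
num_words = len(words)  # number of unique words in dataset (smaller than the loaded vocabulary)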
X = [[word2idx[w[0]] for w in s] for s in list_of_sentances]
X = pad_sequences(maxlen=max_len, sequences=X, padding="post", value=num_words-1)
y = [[label2idx[w[1]] for w in s] for s in list_of_sentances]
y = pad_sequences(maxlen=max_len, sequences=y, padding="post", value=label2idx["O"])
y = [to_categorical(i, num_classes=num_labels) for i in y]
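Accuracy and loss stay the same, but my model has become slower because of the total number of parameters. (I can no longer use num_words because it raises an error; I need to use len(word2idx) instead.)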
input_word = Input(shape=(max_len,))
model = Embedding(input_dim = len(word2idx), output_dim = 50, input_length = max_len)(input_word)
model = SpatialDropout1D(0.2)(model)
model = Bidirectional(LSTM(units = 5, return_sequences=True, recurrent_dropout = 0.1))(model)
out = TimeDistributed(Dense(num_labels, activation = "softmax"))(model)
model = Model(input_word, out)
model.summary()
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
input_2 (InputLayer) [(None, 30)] 0
_________________________________________________________________
embedding_1 (Embedding) (None, 30, 50) 5596600
_________________________________________________________________
spatial_dropout1d_1 (Spatial (None, 30, 50) 0
_________________________________________________________________
bidirectional_1 (Bidirection (None, 30, 10) 2240
_________________________________________________________________
time_distributed_1 (TimeDist (None, 30, 11) 121
=================================================================
Total params: 5,598,961 # MUCH BIGGER NUMBER
Trainable params: 5,598,961
Non-trainable params: 0
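By creating my own word2idx I wanted to handle words missing from the vocabulary, but the only thing I achieved was slowing down model training.
How should I deal with this kind of problem? How do I handle missing/non-existent/unknown words?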
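The slowdown described in the question follows from how an embedding layer is sized: it stores one vector per vocabulary index, so its parameter count is input_dim × output_dim. A quick check against the two summaries (the helper name is only illustrative):

```python
def embedding_params(vocab_size: int, output_dim: int) -> int:
    """An embedding table holds one output_dim-sized vector per index."""
    return vocab_size * output_dim


OUTPUT_DIM = 50

# Number of embedding rows implied by each model summary
small = 2_187_550 // OUTPUT_DIM   # first model
large = 5_596_600 // OUTPUT_DIM   # model with the enlarged vocabulary

print(small, large)
# Extra parameters that come purely from the added vocabulary rows
print(embedding_params(large, OUTPUT_DIM) - embedding_params(small, OUTPUT_DIM))
```

The recurrent and dense layers are identical in both summaries (2,240 and 121 parameters), so the entire increase comes from the added rows. Rows for words that never occur in the training sentences also receive no gradient updates, which is consistent with accuracy and loss staying the same.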
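Solution
For the benefit of the community, the answer Patrick gave in the comments is reproduced here, together with another way of handling out-of-vocabulary (OOV) words.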
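Replacing text = [word2idx[w] for w in text] with text = [word2idx.get(w, word2idx["UNKNOWN_WORD"]) for w in text] avoids the KeyError: every unknown word is replaced by the index of an "UNKNOWN_WORD" entry, which you add to the vocabulary as a new token. (The fallback has to be that entry's integer index rather than the string itself, because the model expects indices.) Alternatively, text = [word2idx[w] for w in text if w in word2idx] simply drops all unknown words, at the cost that the dropped words can no longer be tagged.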
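A minimal sketch of the unknown-token approach, assuming two reserved entries in the vocabulary. The token names "<PAD>" and "<UNK>", the helpers build_vocab and encode, and the min_count parameter are illustrative choices, not part of the original code:

```python
from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"  # hypothetical names for the reserved tokens


def build_vocab(sentences, min_count=1):
    """Map each sufficiently frequent word to an index; 0 and 1 are reserved."""
    counts = Counter(w for s in sentences for w in s)
    vocab = {PAD: 0, UNK: 1}
    for word, n in counts.items():
        if n >= min_count:
            vocab[word] = len(vocab)
    return vocab


def encode(tokens, vocab, max_len):
    """Turn tokens into a fixed-length list of indices without raising KeyError."""
    ids = [vocab.get(t, vocab[UNK]) for t in tokens[:max_len]]
    return ids + [vocab[PAD]] * (max_len - len(ids))


train = [["I", "live", "in", "Ohio"], ["I", "work", "in", "Ohio"]]
word2idx = build_vocab(train, min_count=2)

# "live" is too rare and "AvcCC" was never seen: both map to the UNK index
ids = encode("I live in AvcCC".split(), word2idx, max_len=6)
print(ids)
```

Setting min_count above 1 sends rare training words to the unknown index as well, so the model actually sees that index during training and can learn something useful for it. With this layout the embedding layer would be sized with input_dim = len(word2idx).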
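Below is an illustration of building a 300-dimensional embedding matrix from loaded pretrained embedding vectors. Words that are not in the pretrained vocabulary keep an all-zero row in the matrix.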
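embedding_matrix = np.zeros((vocabulary_size, 300))
for word, index in tokenizer.word_index.items():
    # Stop once the index falls outside the matrix
    if index > vocabulary_size - 1:
        break
    # Copy the pretrained vector if there is one; otherwise the row stays all zeros
    if word in index2word:
        embedding_matrix[index] = pretrained_model[word]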
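The same idea as a self-contained sketch that can be run without a tokenizer or a real pretrained model. The toy `pretrained` dictionary, the vocabulary, and the helper name build_embedding_matrix are assumptions made for illustration; a real setup would load vectors such as GloVe or word2vec instead:

```python
import numpy as np


def build_embedding_matrix(word2idx, pretrained, dim):
    """Row i holds the pretrained vector of the word with index i, else zeros."""
    matrix = np.zeros((len(word2idx), dim), dtype="float32")
    hits = 0
    for word, idx in word2idx.items():
        vector = pretrained.get(word)
        if vector is not None:
            matrix[idx] = vector
            hits += 1
    return matrix, hits


# Toy stand-ins for pretrained vectors and the vocabulary (hypothetical data)
pretrained = {"Ohio": [0.1, 0.2, 0.3], "work": [0.4, 0.5, 0.6]}
word2idx = {"<PAD>": 0, "<UNK>": 1, "Ohio": 2, "work": 3, "AvcCC": 4}

matrix, hits = build_embedding_matrix(word2idx, pretrained, dim=3)
print(matrix.shape, hits)
```

Counting the hits gives a quick view of how much of the vocabulary the pretrained vectors cover. One way to use such a matrix in Keras is to pass it as the initial weights of the Embedding layer (for example through embeddings_initializer with a constant initializer) and, optionally, set trainable=False so the pretrained vectors are not changed during training.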