python - combining structured and text data in classification problem using keras
Problem description
The following code is a very simple example of using word embedding to predict the labels (see below). The example is taken from here.
from numpy import array
from keras.preprocessing.text import one_hot
from keras.preprocessing.sequence import pad_sequences
from keras.models import Sequential
from keras.layers import Dense
from keras.layers import Flatten
from keras.layers import Embedding
# define documents
docs = ['Well done!',
'Good work',
'Great effort',
'nice work',
'Excellent!',
'Weak',
'Poor effort!',
'not good',
'poor work',
'Could have done better.']
# define class labels
labels = array([1,1,1,1,1,0,0,0,0,0])
# integer encode the documents
vocab_size = 50
encoded_docs = [one_hot(d, vocab_size) for d in docs]
print(encoded_docs)
# pad documents to a max length of 4 words
max_length = 4
padded_docs = pad_sequences(encoded_docs, maxlen=max_length, padding='post')
print(padded_docs)
# define the model
model = Sequential()
model.add(Embedding(vocab_size, 8, input_length=max_length))
model.add(Flatten())
model.add(Dense(1, activation='sigmoid'))
# compile the model
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# summarize the model
print(model.summary())
# fit the model
model.fit(padded_docs, labels, epochs=50, verbose=0)
# evaluate the model
loss, accuracy = model.evaluate(padded_docs, labels, verbose=0)
print('Accuracy: %f' % (accuracy*100))
Let us say we have structured data like this:
hours_of_revision = [10, 5, 7, 3, 100, 0, 1, 0.5, 4, 0.75]
Each entry lines up with the document at the same position in docs, and it shows nicely that one should really spend more time revising to achieve good marks (-:
Could one incorporate this into the model so that it uses both the text and the structured data?
Solution
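Yes, this is possible with the Keras functional API. All you need is an additional input for hours_of_revision that is concatenated with the embeddings from the text data before it goes into the final classifier.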
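To make the idea concrete, here is a rough NumPy sketch of the tensor shapes involved, assuming the settings from the question (10 documents, max_length = 4, vocab_size = 50, embedding size 8). The random `word_ids`, `embedding_table` and `revision` arrays are stand-ins for illustration only; they are not what Keras learns.

```python
import numpy as np

rng = np.random.default_rng(0)

n_docs, max_length, vocab_size, embed_dim = 10, 4, 50, 8

# Stand-in for padded_docs: integer word ids, shape (10, 4)
word_ids = rng.integers(0, vocab_size, size=(n_docs, max_length))

# Stand-in for the Embedding weights: one 8-dim vector per word id
embedding_table = rng.normal(size=(vocab_size, embed_dim))

# Embedding lookup gives (10, 4, 8); flattening gives (10, 32)
embedded = embedding_table[word_ids]
embedding_flat = embedded.reshape(n_docs, -1)

# Structured feature as a single column, shape (10, 1)
revision = rng.normal(size=(n_docs, 1))

# Concatenate along the feature axis, giving (10, 33)
combined = np.concatenate([embedding_flat, revision], axis=1)
print(embedded.shape, embedding_flat.shape, combined.shape)
```

The final Dense layer therefore sees 33 features per document: 32 from the flattened text embedding and 1 from the structured column.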
First scale the additional data:
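import numpy as np
# additional data, as a NumPy array so the arithmetic below is element-wise
hours_of_revision = np.array([10, 5, 7, 3, 100, 0, 1, 0.5, 4, 0.75])
# Scale the data to zero mean and unit variance
mean = np.mean(hours_of_revision)
std = np.std(hours_of_revision)
# Reshape to (n_samples, 1) to match Input((1,))
hours_of_revision = ((hours_of_revision - mean) / std).reshape(-1, 1)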
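One thing worth keeping in mind: any value fed to the model later has to be scaled with the same mean and standard deviation computed from the training data, not with statistics of the new data. A minimal sketch, where `standardize` is a hypothetical helper name and the new values are made up for illustration:

```python
import numpy as np

train_hours = np.array([10, 5, 7, 3, 100, 0, 1, 0.5, 4, 0.75], dtype="float32")

# Statistics come from the training data only
mean = train_hours.mean()
std = train_hours.std()

def standardize(values, mean, std):
    """Scale values with precomputed statistics; returns shape (n, 1)."""
    values = np.asarray(values, dtype="float32")
    return ((values - mean) / std).reshape(-1, 1)

train_scaled = standardize(train_hours, mean, std)
new_scaled = standardize([2, 50], mean, std)  # hypothetical unseen values
print(train_scaled.shape, new_scaled.shape)
```

Returning a column of shape (n, 1) keeps the array aligned with an input declared as Input((1,)).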
Build the model using the functional API:
# Build model
from keras.layers import Input, Embedding, Flatten, Dense, Concatenate
from keras.models import Model
# Two input layers
integer_input = Input((max_length, ))
revision_input = Input((1,))
# Embedding layer for the words
embedding = Embedding(vocab_size, 8, input_length=max_length)(integer_input)
embedding_flat = Flatten()(embedding)
# Concatenate embedding with revision
combined_data = Concatenate()([embedding_flat, revision_input])
output = Dense(1, activation='sigmoid')(combined_data)
# compile the model - pass a list of input tensors
model = Model(inputs=[integer_input, revision_input], outputs=output)
model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['acc'])
# fit the model - pass list of input data
model.fit([padded_docs, hours_of_revision], labels, epochs=50, verbose=0)
For more examples of how to use the functional API for multi-input/multi-output models, have a look at the Keras documentation.