python - TensorFlow 2.0.0 alpha: converting CSV to tfrecord, building a Keras model that consumes pipelined data from a large source, and storing the weights in a CSV file?
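Problem description
I am learning machine learning from Andrew Ng's lectures on Coursera. The course uses Matlab, which is great for understanding and prototyping machine learning models but is rather slow. I am currently looking into TensorFlow because it supports GPU acceleration and data pipelining, which should speed up my models.
However, I am completely lost. The documentation does not go into detail, the sample code is uncommented, and on top of that TensorFlow has just released the 2.0 alpha, which changes the API significantly (so many older StackOverflow threads no longer help).
My goals are:
- Convert a large (10GB+) CSV file to tfrecords (I read somewhere that this is beneficial?)
- Create a tf.data dataset that reads the data in multiple threads and pipelines it to the model
- Create a model that learns from that dataset using my GPU
- Export the learned parameters to a file
So far, I have only managed to build the Keras model: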
from tensorflow import keras

lambd = 0.001  # L2 regularization factor (assumed value; not defined in the original snippet)

model = keras.Sequential([
    keras.layers.Conv2D(filters=3, activation='relu',
                        kernel_regularizer=keras.regularizers.l2(0.001),
                        kernel_size=28,
                        padding="same",
                        input_shape=(28, 28, 1)),
    keras.layers.Flatten(),
    keras.layers.Dropout(0.09),
    keras.layers.Dense(10, activation='softmax', kernel_regularizer=keras.regularizers.l2(lambd)),
    keras.layers.Dropout(0.09)])
model.compile(optimizer='adam',
              loss='sparse_categorical_crossentropy',
              metrics=['accuracy'])
# x_train, y_train, x_test and y_test are loaded elsewhere
model.fit(x_train, y_train, epochs=47, batch_size=256)
test_loss, test_acc = model.evaluate(x_test, y_test)
print('\nTest accuracy:', test_acc)
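Anything would help at this point! Which features should I look into that are essential for any of my goals?
Solution
After 24 hours of non-stop research, I finally put all the pieces of the puzzle together. The API is great, but the documentation is lacking.
Converting CSV to tfrecord: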
import tensorflow as tf
import numpy as np
import pandas as pd  # For reading .csv
from datetime import datetime  # For timing each read/write

def _bytes_feature(value):
    # Returns a bytes_list from a string / byte.
    if isinstance(value, type(tf.constant(0))):
        value = value.numpy()  # BytesList won't unpack a string from an EagerTensor.
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _float_feature(value):
    # Returns a float_list from a float / double.
    # If a list of values is passed, returns a float list feature holding the entire list.
    if isinstance(value, list):
        return tf.train.Feature(float_list=tf.train.FloatList(value=value))
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))

def _int64_feature(value):
    # Returns an int64_list from a bool / enum / int / uint.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))
def serialize_example(pandabase):
    # Serializes rows from a pandas CSV reader (read in chunks) into a TFRecord file.
    # Build the feature map from the header row of the file
    base_chunk = pandabase.get_chunk(0)
    num_features = len(base_chunk.columns)
    features_map = {}
    for i in range(num_features):
        features_map.update({'feature' + str(i): _float_feature(0)})
    # Set writing options with compression (compression_type is passed as a string)
    options = tf.io.TFRecordOptions(compression_type='ZLIB',
                                    compression_level=9)
    with tf.io.TFRecordWriter('test2.tfrecord.zip', options=options) as writer:
        # Convert each chunk to a numpy array and write every row as one Example
        for chunk in pandabase:
            nump = chunk.to_numpy()
            for row in nump:
                ii = 0
                for elem in row:
                    features_map['feature' + str(ii)] = _float_feature(float(elem))
                    ii += 1
                myProto = tf.train.Example(features=tf.train.Features(feature=features_map))
                writer.write(myProto.SerializeToString())
start = datetime.now()
bk1 = pd.read_csv("Book2.csv", chunksize=2048, engine='c', iterator=True)
serialize_example(bk1)
end = datetime.now()
print("- consumed time: %ds" % (end-start).seconds)
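Before training on a converted file, it can be worth reading it back to confirm that the record count and the first row look right. The following is only a minimal sketch, not part of the original answer: the helper name count_and_peek, the demo file and the three-column layout are all hypothetical, and it assumes the same 'feature<i>' naming and ZLIB compression used by the writer.

```python
import os
import tempfile

import tensorflow as tf

NUM_FEATURES = 3  # hypothetical column count for this demo


def float_feature(value):
    # Wrap a single Python float in a tf.train.Feature
    return tf.train.Feature(float_list=tf.train.FloatList(value=[value]))


def count_and_peek(path, num_features, compression_type="ZLIB"):
    # Count the records in a TFRecord file and parse the first one into a dict
    description = {
        "feature" + str(i): tf.io.FixedLenFeature([], tf.float32)
        for i in range(num_features)
    }
    dataset = tf.data.TFRecordDataset(path, compression_type=compression_type)
    first = None
    count = 0
    for raw in dataset:
        if first is None:
            parsed = tf.io.parse_single_example(raw, description)
            first = {name: float(value.numpy()) for name, value in parsed.items()}
        count += 1
    return count, first


# Demo: write two rows to a temporary file, then read them back
demo_path = os.path.join(tempfile.mkdtemp(), "demo.tfrecord")
rows = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
options = tf.io.TFRecordOptions(compression_type="ZLIB")
with tf.io.TFRecordWriter(demo_path, options=options) as writer:
    for row in rows:
        feature = {"feature" + str(i): float_feature(v) for i, v in enumerate(row)}
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())

count, first = count_and_peek(demo_path, NUM_FEATURES)
print(count, first)
```

For a real file, the same helper would be called with the path passed to TFRecordWriter and the number of CSV columns. Note that it iterates over the whole file to count records, so on a 10GB+ source it is a one-off check rather than something to run on every training job.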
For machine learning from tfrecords using the GPU: follow this guide to set everything up properly, then use the following code:
# Recreate the feature mapping (must match the one used to write the tfrecords)
_NUMCOL = 5
feature_description = {}
for i in range(_NUMCOL):
    feature_description.update({'feature' + str(i): tf.io.FixedLenFeature([], tf.float32)})

# Parse the tfrecords into the form (x, y) or (x, y, weights) to be used with keras
def _parse_function(example_proto):
    dic = tf.io.parse_single_example(example_proto, feature_description)
    y = dic['feature0']
    x = tf.stack([dic['feature1'],
                  dic['feature2'],
                  dic['feature3'],
                  dic['feature4']], axis=0)
    return x, y
# Let tensorflow autotune the training speed
AUTOTUNE = tf.data.experimental.AUTOTUNE
# Create a dataset from the recorded file (the same file name the writer used);
# set parallel reads to the number of cores for best running speed
myData = tf.data.TFRecordDataset('test2.tfrecord.zip', compression_type='ZLIB',
                                 num_parallel_reads=2)
# Map the data to a form usable by keras (using _parse_function), cache the data, shuffle, and read the data in batches
myData = myData.map(_parse_function, num_parallel_calls=AUTOTUNE)
myData = myData.cache()
myData = myData.shuffle(buffer_size=8192)
batches = 16385
myData = myData.batch(batches).prefetch(buffer_size=AUTOTUNE)
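from tensorflow import keras

lambd = 0.001  # L2 regularization factor (assumed value; not defined in the original)

model = keras.Sequential([
    keras.layers.Dense(100, activation='softmax', kernel_regularizer=keras.regularizers.l2(lambd)),
    keras.layers.Dense(10, activation='softmax', kernel_regularizer=keras.regularizers.l2(lambd)),
    keras.layers.Dense(1, activation='linear', kernel_regularizer=keras.regularizers.l2(lambd))])
model.compile(optimizer='adam',
              loss='mean_squared_error')
# Train on the pipelined dataset (the epoch count is an assumed placeholder)
model.fit(myData, epochs=10)
model.save('keras.HD5F')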
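The question also asks about storing the weights in a CSV file, which model.save does not do. One possible approach, sketched here under assumptions rather than taken from the original answer, is to dump the list of arrays returned by model.get_weights() with one CSV row per array. The function names weights_to_csv and weights_from_csv and the row layout (index, shape, flattened values) are hypothetical choices for illustration.

```python
import csv
import os
import tempfile

import numpy as np


def weights_to_csv(weights, path):
    # Write one CSV row per weight array: index, shape (e.g. "2x3"), flattened values
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle)
        for index, array in enumerate(weights):
            array = np.asarray(array)
            shape = "x".join(str(dim) for dim in array.shape)
            writer.writerow([index, shape] + array.ravel().tolist())


def weights_from_csv(path):
    # Rebuild the list of float32 arrays written by weights_to_csv
    weights = []
    with open(path, newline="") as handle:
        for row in csv.reader(handle):
            shape = tuple(int(dim) for dim in row[1].split("x")) if row[1] else ()
            values = np.array([float(v) for v in row[2:]], dtype=np.float32)
            weights.append(values.reshape(shape))
    return weights


# Demo with stand-in arrays shaped like a small Dense layer (kernel and bias)
demo_weights = [
    np.arange(6, dtype=np.float32).reshape(2, 3),
    np.array([0.5, -1.25, 2.0], dtype=np.float32),
]
csv_path = os.path.join(tempfile.mkdtemp(), "weights.csv")
weights_to_csv(demo_weights, csv_path)
restored = weights_from_csv(csv_path)
print([w.shape for w in restored])
```

With a trained Keras model, the call would be weights_to_csv(model.get_weights(), 'weights.csv'), and model.set_weights(weights_from_csv('weights.csv')) would load the values back into a model with the same architecture. Rows have different lengths, so the file is meant for this pair of functions or for inspection, not for loading as a rectangular table.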