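python - Repeating tf.data.Dataset.from_generator() from a Python generator that iterates over a database
Problem description
I have a panel dataset on which I want to train a long short-term memory (LSTM) network. The dataset comes from a PostgreSQL database. My data is structured similarly to the following: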
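So my number of time steps is 4. It is a many-to-many LSTM: both my input and my output are sequences. The input will have shape [Batch_size, 4, 23] and the output will have shape [Batch_size, 4, 2] (I am one-hot encoding the labels).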
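To make the label shape concrete, here is a minimal pure-Python sketch of one-hot encoding the labels of a single person. The label values are made up for illustration; in the real pipeline this step is done by tf.one_hot.

```python
# Hypothetical labels for one person: one binary class per time step (4 steps).
labels = [0, 1, 1, 0]
num_classes = 2

# One-hot encode: class k becomes a vector with a 1 at position k.
one_hot = [[1 if k == label else 0 for k in range(num_classes)]
           for label in labels]

# The per-person label shape goes from [4] to [4, 2].
shape = (len(one_hot), len(one_hot[0]))
print(one_hot, shape)
```

Batching several people then adds the leading Batch_size dimension, giving [Batch_size, 4, 2].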
I am using a Python generator to fetch the rows. I call fetchmany with number_of_records set to 4, because four rows correspond to one particular person.
import numpy as np
import psycopg2
import passwords_and_paths

class it_try:
    def __init__(self, sql, number_of_records):
        self.sql = sql
        self.number_of_records = number_of_records
        self.pgConnectString = "host='/var/run/postgresql' port='{}' dbname='{}' user='{}' password='{}'".format(
            passwords_and_paths.database['port'],
            passwords_and_paths.database['name'],
            passwords_and_paths.database['user'],
            passwords_and_paths.database['pass'])
        self.pgConnection = psycopg2.connect(self.pgConnectString)
        # A named cursor is a server-side cursor, so rows are streamed in chunks.
        self.pgCursor = self.pgConnection.cursor(name='fetch_large_result')
        self.pgCursor.execute(self.sql)

    def __iter__(self):
        return self

    def __next__(self):
        # Fetch the rows that belong to one person.
        row = self.pgCursor.fetchmany(self.number_of_records)
        current_obs = []
        for i in row:
            current_obs.append(i)
        # Columns 3..25 are the 23 features; the last column is the label.
        features = np.array(current_obs)[:,3:26]
        labels = np.array(current_obs)[:,-1].astype(int)
        return features, labels

    def __del__(self):
        self.pgCursor.close()
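The way fetchmany groups rows per person can be sketched on its own. This is only an illustration: the standard-library sqlite3 module stands in for psycopg2 so that it runs without a database server, and the table contents and the value column are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; table contents are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE basetable (customer_id INTEGER, quarter INTEGER, value REAL)")
rows = [(cid, q, cid * 10.0 + q) for cid in (1, 2) for q in (1, 2, 3, 4)]
conn.executemany("INSERT INTO basetable VALUES (?, ?, ?)", rows)

cursor = conn.execute("SELECT * FROM basetable ORDER BY customer_id, quarter")

first_person = cursor.fetchmany(4)   # the four quarters of customer 1
second_person = cursor.fetchmany(4)  # the four quarters of customer 2
exhausted = cursor.fetchmany(4)      # no rows left: an empty list

print(len(first_person), len(second_person), exhausted)
```

Note that once the cursor has no rows left, fetchmany does not raise; it simply returns an empty list.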
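The features have shape [4, 23] and the labels have shape [4,]. I then use TensorFlow's tf.data.Dataset.from_generator() function to initialize a dataset from the generator. The shapes and data types are defined correctly. Here I one-hot encode the labels and batch 3 people per call.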
generator = it_try(sql = 'SELECT * FROM public.basetable order by year, customer_id, quarter', number_of_records = 4)
train_dataset = tf.data.Dataset.from_generator(lambda: generator, (tf.float32, tf.int32), (tf.TensorShape([4,23]), tf.TensorShape([4,])))
train_dataset=train_dataset.map(lambda *x:(x[0], tf.cast(tf.one_hot(x[1],2),tf.int32)))
train_dataset = train_dataset.batch(3)
The output is <BatchDataset shapes: ((?, 4, 23), (?, 4, 2)), types: (tf.float32, tf.int32)>. So far, so good.
I create the iterator and initialize it, and I can successfully print batches (2 batches in this example).
iterator = tf.data.Iterator.from_structure(train_dataset.output_types,
                                           train_dataset.output_shapes)
X, y = iterator.get_next()
training_init_op = iterator.make_initializer(train_dataset)
with tf.Session() as sess:
    sess.run(training_init_op)
    for batch in range(2):
        print(sess.run([X,y]))
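However, when I want to make several passes over the training data (2 epochs in this example), I get an error. Of course, this is because I cannot reset my Python and TensorFlow iterators.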
with tf.Session() as sess:
    for epoch in range(2):
        sess.run(training_init_op)
        for batch in range(2):
            print(sess.run([X,y]))
It prints the first epoch fine, but on the second epoch I get the following error.
---------------------------------------------------------------------------
UnknownError Traceback (most recent call last)
/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1321 try:
-> 1322 return fn(*args)
1323 except errors.OpError as e:
/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _run_fn(feed_dict, fetch_list, target_list, options, run_metadata)
1306 return self._call_tf_sessionrun(
-> 1307 options, feed_dict, fetch_list, target_list, run_metadata)
1308
/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _call_tf_sessionrun(self, options, feed_dict, fetch_list, target_list, run_metadata)
1408 self._session, options, feed_dict, fetch_list, target_list,
-> 1409 run_metadata)
1410 else:
UnknownError: IndexError: too many indices for array
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/script_ops.py", line 158, in __call__
ret = func(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 410, in generator_py_func
values = next(generator_state.get_iterator(iterator_id))
File "<ipython-input-64-e6c5163f3adc>", line 26, in __next__
features = np.array(current_obs)[:,3:26]
IndexError: too many indices for array
[[Node: PyFunc = PyFunc[Tin=[DT_INT64], Tout=[DT_FLOAT, DT_INT32], token="pyfunc_46"](arg0)]]
[[Node: IteratorGetNext_23 = IteratorGetNext[output_shapes=[[?,4,23], [?,4,2]], output_types=[DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator_23)]]
During handling of the above exception, another exception occurred:
UnknownError Traceback (most recent call last)
<ipython-input-67-213eeaa1c283> in <module>()
7 sess.run(training_init_op)
8 for i in range(2):
----> 9 print(sess.run([X,y]))
/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in run(self, fetches, feed_dict, options, run_metadata)
898 try:
899 result = self._run(None, fetches, feed_dict, options_ptr,
--> 900 run_metadata_ptr)
901 if run_metadata:
902 proto_data = tf_session.TF_GetBuffer(run_metadata_ptr)
/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _run(self, handle, fetches, feed_dict, options, run_metadata)
1133 if final_fetches or final_targets or (handle and feed_dict_tensor):
1134 results = self._do_run(handle, final_targets, final_fetches,
-> 1135 feed_dict_tensor, options, run_metadata)
1136 else:
1137 results = []
/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _do_run(self, handle, target_list, fetch_list, feed_dict, options, run_metadata)
1314 if handle is None:
1315 return self._do_call(_run_fn, feeds, fetches, targets, options,
-> 1316 run_metadata)
1317 else:
1318 return self._do_call(_prun_fn, handle, feeds, fetches)
/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py in _do_call(self, fn, *args)
1333 except KeyError:
1334 pass
-> 1335 raise type(e)(node_def, op, message)
1336
1337 def _extend_graph(self):
UnknownError: IndexError: too many indices for array
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/script_ops.py", line 158, in __call__
ret = func(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/data/ops/dataset_ops.py", line 410, in generator_py_func
values = next(generator_state.get_iterator(iterator_id))
File "<ipython-input-64-e6c5163f3adc>", line 26, in __next__
features = np.array(current_obs)[:,3:26]
IndexError: too many indices for array
[[Node: PyFunc = PyFunc[Tin=[DT_INT64], Tout=[DT_FLOAT, DT_INT32], token="pyfunc_46"](arg0)]]
[[Node: IteratorGetNext_23 = IteratorGetNext[output_shapes=[[?,4,23], [?,4,2]], output_types=[DT_FLOAT, DT_INT32], _device="/job:localhost/replica:0/task:0/device:CPU:0"](Iterator_23)]]
I tried .repeat(2), to no avail.
Can anyone help me solve this? How do I run multiple epochs when I am using a Python iterator whose data comes from a database?
Solution
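Judging from the traceback, a likely cause is that `lambda: generator` hands TensorFlow the same it_try object on every initialization. Its `__iter__` returns self, so by the second epoch the cursor is already exhausted. fetchmany then returns an empty list, np.array of an empty list is one-dimensional, and the `[:,3:26]` slice raises the IndexError. `__next__` also never raises StopIteration, so nothing signals the end of the data, which would explain why .repeat(2) does not help either.

One possible fix, sketched below under assumptions, is to pass from_generator a callable that opens a fresh cursor every time it is called. To keep the sketch runnable without a database server, sqlite3 stands in for psycopg2. The names make_connection, row_chunks and chunk_source, the table contents, and the label column are all hypothetical.

```python
import sqlite3


def make_connection():
    """Hypothetical stand-in for psycopg2.connect(...): a small in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE basetable (customer_id INTEGER, quarter INTEGER, label INTEGER)")
    conn.executemany(
        "INSERT INTO basetable VALUES (?, ?, ?)",
        [(cid, q, (cid + q) % 2) for cid in range(1, 7) for q in range(1, 5)],
    )
    return conn


def row_chunks(conn, sql, number_of_records):
    """Yield lists of `number_of_records` rows, running the query anew per call."""
    cursor = conn.cursor()  # a fresh cursor, so every pass starts at the first row
    try:
        cursor.execute(sql)
        while True:
            rows = cursor.fetchmany(number_of_records)
            if len(rows) < number_of_records:
                # Cursor exhausted (or an incomplete trailing group): returning
                # ends the generator, so the consumer sees StopIteration.
                return
            yield rows
    finally:
        cursor.close()


conn = make_connection()
sql = "SELECT * FROM basetable ORDER BY customer_id, quarter"


def chunk_source():
    # Every call builds a brand-new generator, unlike `lambda: generator`,
    # which keeps returning the same, already exhausted object.
    return row_chunks(conn, sql, 4)


# Two "epochs": each one calls chunk_source() again and sees all the data.
epochs = [list(chunk_source()) for _ in range(2)]
print([len(epoch) for epoch in epochs])
```

In the original pipeline, this would mean converting each chunk to the features and labels arrays before yielding it, and passing the callable itself as the first argument of tf.data.Dataset.from_generator. With psycopg2, the named cursor would presumably be created inside the generator function in the same way. Running training_init_op at the start of each epoch should then restart from the first row.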