Writing a pandas DataFrame to a pyarrow Parquet table results in an "out of bounds timestamp" error

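Question

I get an `out of bounds timestamp` error message when trying to convert a pandas DataFrame to a pyarrow table and write it to a Parquet dataset. From some research, I believe this is the result of pandas using nanosecond precision while pyarrow can only interpret up to millisecond precision.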

import os

import cx_Oracle
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

connection = cx_Oracle.connect(os.getenv('USER'), os.getenv('__OPW'), os.getenv('DB_SERVICE'))
gen = pd.read_sql('SELECT * FROM myschema.mytable where rownum < 10001', con=connection, chunksize=1_000)
for df in gen:
    table = pa.Table.from_pandas(df)
    pq.write_to_dataset(table, root_path='/tmp/dataset', partition_cols=['my_part_col'])

ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253402214400000000

When I comment out the last line:

# pq.write_to_dataset(table, root_path='/tmp/dataset', partition_cols=['my_part_col'])

...and re-run, no error message is produced, so the failure probably occurs when the pyarrow table is converted to Parquet.

Is there a known workaround?

Thanks.

Update:

Here is the full traceback...

Traceback (most recent call last):
  File "<stdin>", line 3, in <module>
  File "/Users/myusername/miniconda3/envs/py38/lib/python3.8/site-packages/pyarrow/parquet.py", line 1754, in write_to_dataset
    df = table.to_pandas()
  File "pyarrow/array.pxi", line 715, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1565, in pyarrow.lib.Table._to_pandas
  File "/Users/myusername/miniconda3/envs/py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 779, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/Users/myusername/miniconda3/envs/py38/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1114, in _table_to_blocks
    result = pa.lib.table_to_blocks(options, block_table, categories,
  File "pyarrow/table.pxi", line 1028, in pyarrow.lib.table_to_blocks
  File "pyarrow/error.pxi", line 84, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: Casting from timestamp[us] to timestamp[ns] would result in out of bounds timestamp: 253402214400000000

Tags: python, pandas, dataframe, parquet, pyarrow

Answer


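253402214400000000 microseconds since the epoch is 9999-12-31 00:00:00, i.e. practically the year 10,000. By default pandas stores timestamps as 64-bit nanosecond counts, which run out at 2262-04-11, and the traceback shows `write_to_dataset` calling `table.to_pandas()`, so the cast from `timestamp[us]` to `timestamp[ns]` fails.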
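The arithmetic can be checked with the standard library alone. This is a minimal sketch; the variable names are chosen here for illustration and do not come from the question.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

# Value from the error message, in microseconds since the epoch.
value_us = 253402214400000000
as_datetime = EPOCH + timedelta(microseconds=value_us)
print(as_datetime)  # 9999-12-31 00:00:00

# Largest instant a signed 64-bit nanosecond counter can represent.
max_ns = 2**63 - 1
ns_limit = EPOCH + timedelta(microseconds=max_ns // 1000)
print(ns_limit)  # 2262-04-11 23:47:16.854775

# Scaling the value from microseconds to nanoseconds exceeds int64,
# which is what the "out of bounds timestamp" error reports.
print(value_us * 1000 > max_ns)  # True
```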

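Few libraries support timestamps in this range. You have a few options:

  • Truncate all out-of-range values before converting to Arrow/Parquet.
  • Convert the offending column to int64 or uint64 (instead of using timestamps).
  • Use dates instead of timestamps. If you are looking that far into the future, you probably don't care what time of day it is. Dates have a larger range.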
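The three options could look roughly like the sketch below. It assumes the database driver returned the column as plain Python `datetime` objects in an object-dtype Series; the Series `created` and the cutoff `UPPER` are hypothetical and only serve the illustration.

```python
from datetime import datetime, timedelta

import pandas as pd

# Hypothetical column: Python datetime objects in an object-dtype Series.
created = pd.Series(
    [datetime(2020, 1, 1, 12, 30), datetime(9999, 12, 31)], dtype=object
)

# Option 1: truncate (clip) values to a bound that fits in datetime64[ns].
# The cutoff is an arbitrary choice for this sketch.
UPPER = datetime(2262, 1, 1)
clipped = created.map(lambda v: min(v, UPPER))

# Option 2: store integers (microseconds since the epoch) instead of timestamps.
EPOCH = datetime(1970, 1, 1)
as_int = created.map(lambda v: (v - EPOCH) // timedelta(microseconds=1)).astype("int64")

# Option 3: keep only the calendar date and drop the time of day.
as_date = created.map(lambda v: v.date())
```

Whichever variant is chosen, the converted column would replace the original one in the DataFrame before `pa.Table.from_pandas` is called.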

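Edit:

If this is how your database represents invalid/missing dates, then you should replace all of these dates with `pd.NaT` before converting to Arrow.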
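A minimal sketch of that replacement, assuming the database uses 9999-12-31 as its "no date" marker. The marker value, the helper `sentinel_to_nat`, and the Series `expires` are hypothetical names introduced for this example.

```python
from datetime import datetime

import pandas as pd

# Assumption for illustration: 9999-12-31 marks a missing date.
SENTINEL = datetime(9999, 12, 31)


def sentinel_to_nat(value):
    """Return pd.NaT for datetimes at or past the sentinel, else the value unchanged."""
    if isinstance(value, datetime) and value >= SENTINEL:
        return pd.NaT
    return value


expires = pd.Series([datetime(2021, 6, 1), datetime(9999, 12, 31)], dtype=object)

# After the sentinel is gone, the column converts to a proper datetime dtype.
cleaned = pd.to_datetime(expires.map(sentinel_to_nat))
```

With the sentinel replaced, every remaining value fits in the nanosecond range, so the round trip through `to_pandas()` inside `write_to_dataset` should no longer hit the out-of-bounds cast.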

