python - Unable to write a valid parquet file with a pandas Timestamp
Problem description
When I write a parquet file that uses pandas.Timestamp values as the index, I cannot read it back. I get an error about timestamp conversion:
pytz.exceptions.UnknownTimeZoneError: '-05:00'
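I ran this code against pyarrow 0.11.1 and it works fine; after switching to 0.15.0 it breaks. I think the cause is the way pyarrow now handles datetime objects: it resolves the timezone through pytz.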
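To make that suspicion concrete: the timestamps involved carry only a fixed UTC offset, not a named zone. A minimal standard-library sketch (assuming Python 3.7 or later, where `%z` accepts an offset written with a colon; the names `FMT` and `ts` are illustrative only) shows what a parsed value contains:

```python
from datetime import datetime, timedelta

# Same format string that is later passed to pd.to_datetime.
FMT = "%Y-%m-%dT%H:%M:%S.%f%z"

ts = datetime.strptime("2019-10-13T00:00:09.915-05:00", FMT)

# The parsed value knows its offset from UTC ...
offset = ts.utcoffset()
print(offset == timedelta(hours=-5))

# ... but its "name" is only a label derived from that offset,
# not an entry in the timezone database such as "America/Chicago".
print(ts.tzname())
```

So any code that later treats the stored zone as a database name has nothing but the offset to work with.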
The following code works with pyarrow 0.11.1 and fails with 0.15.0:
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
import json
myjson = ['{"received_time":"2019-10-13T00:00:09.915-05:00","mydata":1.0}',
'{"received_time":"2019-10-13T00:00:10.915-05:00","mydata":1.0}',
'{"received_time":"2019-10-13T00:00:11.915-05:00","mydata":1.0}']
iterator = 0
q=[]
for i in myjson:
    q.append(json.loads(i))
    iterator = iterator + 1
df = pd.DataFrame.from_records(q)
tindex = pd.to_datetime(df['received_time'],format='%Y-%m-%dT%H:%M:%S.%f%z')
df['received_time'] = tindex
df.set_index('received_time',inplace=True)
tbl = pa.Table.from_pandas(df)
pq.write_table(tbl,'tmp.pq')
readdf = pd.read_parquet('tmp.pq')
I expected this to work; instead I get this stack trace:
Traceback (most recent call last):
File "<input>", line 21, in <module>
File "/home/<redacted>/PycharmProjects/data_analysis/venv/lib64/python3.6/site-packages/pandas/io/parquet.py", line 296, in read_parquet
return impl.read(path, columns=columns, **kwargs)
File "/home/<redacted>/PycharmProjects/data_analysis/venv/lib64/python3.6/site-packages/pandas/io/parquet.py", line 125, in read
path, columns=columns, **kwargs
File "pyarrow/array.pxi", line 468, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 1238, in pyarrow.lib.Table._to_pandas
File "/home/<redacted>/PycharmProjects/data_analysis/venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 697, in table_to_blockmanager
all_columns)
File "/home/<redacted>/PycharmProjects/data_analysis/venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 778, in _reconstruct_index
table, result_table, descr, field_name_to_metadata)
File "/home/<redacted>/PycharmProjects/data_analysis/venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 835, in _extract_index_level
.dt.tz_convert(col.type.tz))
File "/home/<redacted>/PycharmProjects/data_analysis/venv/lib64/python3.6/site-packages/pandas/core/accessor.py", line 93, in f
return self._delegate_method(name, *args, **kwargs)
File "/home/<redacted>/PycharmProjects/data_analysis/venv/lib64/python3.6/site-packages/pandas/core/indexes/accessors.py", line 109, in _delegate_method
result = method(*args, **kwargs)
File "/home/<redacted>/PycharmProjects/data_analysis/venv/lib64/python3.6/site-packages/pandas/core/accessor.py", line 93, in f
return self._delegate_method(name, *args, **kwargs)
File "/home/<redacted>/PycharmProjects/data_analysis/venv/lib64/python3.6/site-packages/pandas/core/indexes/datetimelike.py", line 813, in _delegate_method
result = operator.methodcaller(name, *args, **kwargs)(self._data)
File "/home/<redacted>/PycharmProjects/data_analysis/venv/lib64/python3.6/site-packages/pandas/core/arrays/datetimes.py", line 955, in tz_convert
tz = timezones.maybe_get_tz(tz)
File "pandas/_libs/tslibs/timezones.pyx", line 84, in pandas._libs.tslibs.timezones.maybe_get_tz
File "pandas/_libs/tslibs/timezones.pyx", line 99, in pandas._libs.tslibs.timezones.maybe_get_tz
File "/home/<redacted>/PycharmProjects/data_analysis/venv/lib64/python3.6/site-packages/pytz/__init__.py", line 181, in timezone
raise UnknownTimeZoneError(zone)
pytz.exceptions.UnknownTimeZoneError: '-05:00'
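The last frame of the trace can be reproduced without pandas or pyarrow. The sketch below is only an illustration: `resolve_zone` is a hypothetical helper, not part of any library, and it assumes pytz is installed. It shows that `pytz.timezone` looks names up in its zone database and does not interpret an offset string such as '-05:00'.

```python
from datetime import timedelta

import pytz


def resolve_zone(name):
    """Return the pytz zone for `name`, or None if pytz does not know it."""
    try:
        return pytz.timezone(name)
    except pytz.UnknownTimeZoneError:
        return None


# An offset string is not a zone name, so the lookup fails.
print(resolve_zone("-05:00"))

# A name that pytz knows resolves normally.
print(resolve_zone("UTC"))

# pytz can represent a bare offset, but only through an explicit
# constructor that takes the offset in minutes.
fixed = pytz.FixedOffset(-300)
print(fixed.utcoffset(None) == timedelta(hours=-5))
```

`pytz.FixedOffset` is a separate constructor; `pytz.timezone` does not fall back to it when given a string.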
Here is the output of pandas.show_versions():
pd.show_versions()
INSTALLED VERSIONS
------------------
commit : None
python : 3.6.6.final.0
python-bits : 64
OS : Linux
OS-release : 3.10.0-862.el7.x86_64
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 0.25.1
numpy : 1.15.4
pytz : 2019.3
dateutil : 2.7.5
pip : 10.0.1
setuptools : 39.1.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.0.2
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 0.15.0
pytables : None
s3fs : None
scipy : 1.3.0
sqlalchemy : None
tables : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
I am not sure whether this is a pyarrow problem or a pandas problem, but I do know that pyarrow switched to handling datetimes like this:
if isinstance(col.type, pa.lib.TimestampType):
    index_level = (pd.Series(values).dt.tz_localize('utc')
                   .dt.tz_convert(col.type.tz))
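This calls into the pytz library, which it did not do before. Either I am building the parquet file incorrectly (in which case I would expect a parse error at write time), or there is a problem in how the pandas or parquet libraries handle timezones.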
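One thing I could try in the meantime is to avoid storing a bare offset at all. This is a sketch under assumptions, not a confirmed fix: it assumes losing the original -05:00 offset in the stored data is acceptable, and the names `raw` and `df_utc` are made up for the example. It re-expresses the index in UTC before anything is written:

```python
import pandas as pd

raw = [
    "2019-10-13T00:00:09.915-05:00",
    "2019-10-13T00:00:10.915-05:00",
    "2019-10-13T00:00:11.915-05:00",
]

df = pd.DataFrame({"received_time": raw, "mydata": [1.0, 1.0, 1.0]})
df["received_time"] = pd.to_datetime(
    df["received_time"], format="%Y-%m-%dT%H:%M:%S.%f%z"
)
df = df.set_index("received_time")

# Convert the index to UTC so that the zone attached to it is the
# name "UTC" instead of an offset-only zone such as -05:00.
# The instants are unchanged; only their representation moves.
df_utc = df.copy()
df_utc.index = df.index.tz_convert("UTC")

print(df_utc.index)
```

If `df_utc` is then passed to `pa.Table.from_pandas`, the zone recorded for the index should be "UTC", which pytz can resolve by name. I have not verified this against every pyarrow version.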
Any ideas?