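dask - Dask ValueError: Schema was different
Question
My question is very close to this one:
I converted CSV files to parquet using the "pyarrow" engine. Reading the files back raises a schema error. Unlike the earlier question, it looks like some of the parquet files gained a new column that is not present in the original files.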
ddf = dd.read_parquet('snappywork',
columns = colnames
)
Traceback (most recent call last):
File "<input>", line 2, in <module>
File "C:\Users\gunsu.son\AppData\Local\Programs\Python\Python37\lib\site-packages\dask\dataframe\io\parquet.py", line 1397, in read_parquet
infer_divisions=infer_divisions,
File "C:\Users\gunsu.son\AppData\Local\Programs\Python\Python37\lib\site-packages\dask\dataframe\io\parquet.py", line 828, in _read_pyarrow
paths, filesystem=get_pyarrow_filesystem(fs), filters=filters
File "C:\Users\gunsu.son\AppData\Local\Programs\Python\Python37\lib\site-packages\pyarrow\parquet.py", line 1008, in __init__
self.validate_schemas()
File "C:\Users\gunsu.son\AppData\Local\Programs\Python\Python37\lib\site-packages\pyarrow\parquet.py", line 1061, in validate_schemas
dataset_schema))
ValueError: Schema in snappywork\part.129.parquet was different.
id: string
link_id: string
parent_id: string
body: string
author: string
score: string
subreddit: string
stickied: bool
created_time: string
__index_level_0__: string
metadata
--------
{b'pandas': b'{"index_columns": ["__index_level_0__"], "column_indexes": [{"na'
b'me": null, "field_name": null, "pandas_type": "unicode", "numpy_'
b'type": "object", "metadata": {"encoding": "UTF-8"}}], "columns":'
b' [{"name": "id", "field_name": "id", "pandas_type": "unicode", "'
b'numpy_type": "object", "metadata": null}, {"name": "link_id", "f'
b'ield_name": "link_id", "pandas_type": "unicode", "numpy_type": "'
b'object", "metadata": null}, {"name": "parent_id", "field_name": '
b'"parent_id", "pandas_type": "unicode", "numpy_type": "object", "'
b'metadata": null}, {"name": "body", "field_name": "body", "pandas'
b'_type": "unicode", "numpy_type": "object", "metadata": null}, {"'
b'name": "author", "field_name": "author", "pandas_type": "unicode'
b'", "numpy_type": "object", "metadata": null}, {"name": "score", '
b'"field_name": "score", "pandas_type": "unicode", "numpy_type": "'
b'object", "metadata": null}, {"name": "subreddit", "field_name": '
b'"subreddit", "pandas_type": "unicode", "numpy_type": "object", "'
b'metadata": null}, {"name": "stickied", "field_name": "stickied",'
b' "pandas_type": "bool", "numpy_type": "bool", "metadata": null},'
b' {"name": "created_time", "field_name": "created_time", "pandas_'
b'type": "unicode", "numpy_type": "object", "metadata": null}, {"n'
b'ame": null, "field_name": "__index_level_0__", "pandas_type": "u'
b'nicode", "numpy_type": "object", "metadata": null}], "creator": '
b'{"library": "pyarrow", "version": "0.14.0"}, "pandas_version": "'
b'0.25.0"}'}
vs
id: string
link_id: string
parent_id: string
body: string
author: string
score: string
subreddit: string
stickied: bool
created_time: string
metadata
--------
{b'pandas': b'{"index_columns": [{"kind": "range", "name": null, "start": 0, "'
b'stop": 248538, "step": 1}], "column_indexes": [{"name": null, "f'
b'ield_name": null, "pandas_type": "unicode", "numpy_type": "objec'
b't", "metadata": {"encoding": "UTF-8"}}], "columns": [{"name": "i'
b'd", "field_name": "id", "pandas_type": "unicode", "numpy_type": '
b'"object", "metadata": null}, {"name": "link_id", "field_name": "'
b'link_id", "pandas_type": "unicode", "numpy_type": "object", "met'
b'adata": null}, {"name": "parent_id", "field_name": "parent_id", '
b'"pandas_type": "unicode", "numpy_type": "object", "metadata": nu'
b'll}, {"name": "body", "field_name": "body", "pandas_type": "unic'
b'ode", "numpy_type": "object", "metadata": null}, {"name": "autho'
b'r", "field_name": "author", "pandas_type": "unicode", "numpy_typ'
b'e": "object", "metadata": null}, {"name": "score", "field_name":'
b' "score", "pandas_type": "unicode", "numpy_type": "object", "met'
b'adata": null}, {"name": "subreddit", "field_name": "subreddit", '
b'"pandas_type": "unicode", "numpy_type": "object", "metadata": nu'
b'll}, {"name": "stickied", "field_name": "stickied", "pandas_type'
b'": "bool", "numpy_type": "bool", "metadata": null}, {"name": "cr'
b'eated_time", "field_name": "created_time", "pandas_type": "unico'
b'de", "numpy_type": "object", "metadata": null}], "creator": {"li'
b'brary": "pyarrow", "version": "0.14.0"}, "pandas_version": "0.25'
b'.0"}'}
For the part.129 parquet file, a new column "__index_level_0__" seems to have been generated. Explicitly providing dtypes does not solve the problem. How can I fix this?
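Solution
Following @matthew-son's comment, setting the engine to fastparquet helped me get past this problem when moving parquet files onto a server.
Note: you may need to install fastparquet and python-snappy for this to work: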
pip install fastparquet python-snappy
Then in Python:
import dask.dataframe as dd
df = dd.read_parquet('*.parquet', engine='fastparquet')
# continue using dask / pandas
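If you would rather stay on the pyarrow engine, it can help to first find out which files carry the extra "__index_level_0__" column shown in the traceback. The following is a minimal sketch, not part of the original answer: the helper names (read_column_names, find_schema_outliers, scan_directory) are hypothetical, and it assumes pyarrow is installed and that the layout shared by most files is the one you want.

```python
from collections import Counter
from pathlib import Path


def read_column_names(path):
    """Return the column names stored in one parquet file's footer."""
    # Imported lazily so the comparison helper below works without pyarrow.
    import pyarrow.parquet as pq

    return pq.read_schema(str(path)).names


def find_schema_outliers(columns_by_file):
    """Return {file: (extra, missing)} for files that differ from the majority layout."""
    if not columns_by_file:
        return {}
    # Assumption: the most common column layout is the reference schema.
    layouts = Counter(tuple(cols) for cols in columns_by_file.values())
    reference = set(layouts.most_common(1)[0][0])
    outliers = {}
    for name, cols in columns_by_file.items():
        extra = sorted(set(cols) - reference)
        missing = sorted(reference - set(cols))
        if extra or missing:
            outliers[name] = (extra, missing)
    return outliers


def scan_directory(directory):
    """Compare the column names of every *.parquet file in `directory`."""
    files = sorted(Path(directory).glob("*.parquet"))
    columns = {p.name: read_column_names(p) for p in files}
    return find_schema_outliers(columns)
```

Calling scan_directory('snappywork') should list the files whose columns differ from the rest. Judging by the traceback, part.129.parquet would be expected among them with "__index_level_0__" reported as an extra column. One possible follow-up is to rewrite only those files without the index, for example with pandas DataFrame.to_parquet(..., index=False), so that all files share one schema.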