python - Fastparquet gives "TypeError: expected str, bytes or os.PathLike object, not _io.BytesIO" when using dataframe.to_parquet()
Question
I am trying to write code for an AWS Lambda function that converts CSV files to Parquet. I can do this with PyArrow, but the package is too large (around 200 MB uncompressed), so I can't fit it into Lambda's deployment package. Instead, I'm trying to write the Parquet file directly to an S3 bucket through a BytesIO buffer.
Below is my Lambda function code:
import json
import boto3
import pandas as pd
from io import BytesIO

def lambda_handler():
    s3 = boto3.client('s3')
    response = s3.list_objects_v2(
        Bucket='mybucket',
        Prefix='subfolder/'
    )
    files = get_object_keys(response)
    for file in files:
        obj = s3.get_object(Bucket='mybucket', Key=file)
        df = pd.read_csv(obj['Body'], sep='|')
        buf = BytesIO()
        df.to_parquet(buf, engine='fastparquet', index=False, compression='snappy')
        buf.seek(0)
        key = f"output/{file.split('/')[1].split('.')[0]}.parquet"
        s3.put_object(Bucket='mybucket', Body=buf.getvalue(), Key=key)

def get_object_keys(response):
    files = []
    for content in response['Contents']:
        if content['Key'].endswith('.csv'):
            files.append(content['Key'])
    return files

lambda_handler()
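As an aside, the key derivation `file.split('/')[1].split('.')[0]` in the code above assumes exactly one folder level in the source key. A sketch of a more robust version, using a hypothetical helper name `output_key` (not part of the original code):

```python
import posixpath

def output_key(src_key, out_prefix="output/"):
    # Derive "output/<stem>.parquet" from an S3 key such as
    # "subfolder/data.csv", regardless of how deeply it is nested.
    stem = posixpath.splitext(posixpath.basename(src_key))[0]
    return f"{out_prefix}{stem}.parquet"

print(output_key("subfolder/data.csv"))  # output/data.parquet
```

Using `posixpath` rather than `os.path` keeps the behavior consistent with S3's forward-slash keys even when run on Windows.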
When I use 'fastparquet' as the engine in dataframe.to_parquet(), I get the following error:
Traceback (most recent call last):
File ".\lambda_function.py", line 77, in <module>
lambda_handler()
File ".\lambda_function.py", line 64, in lambda_handler
df.to_parquet(buf, engine='fastparquet', index=False, compression='snappy')
File "C:\Users\tansingh\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\util\_decorators.py", line 214, in wrapper
return func(*args, **kwargs)
File "C:\Users\tansingh\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\core\frame.py", line 2116, in to_parquet
**kwargs,
File "C:\Users\tansingh\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\parquet.py", line 264, in to_parquet
**kwargs,
File "C:\Users\tansingh\AppData\Local\Programs\Python\Python37\lib\site-packages\pandas\io\parquet.py", line 185, in write
**kwargs,
File "C:\Users\tansingh\AppData\Local\Programs\Python\Python37\lib\site-packages\fastparquet\writer.py", line 880, in write
compression, open_with, has_nulls, append)
File "C:\Users\tansingh\AppData\Local\Programs\Python\Python37\lib\site-packages\fastparquet\writer.py", line 734, in write_simple
with open_with(fn, mode) as f:
File "C:\Users\tansingh\AppData\Local\Programs\Python\Python37\lib\site-packages\fastparquet\util.py", line 42, in default_open
return open(f, mode)
TypeError: expected str, bytes or os.PathLike object, not _io.BytesIO
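The last frame of the traceback points at the root cause: fastparquet's `default_open` passes whatever it is given straight to the built-in `open()`, which accepts only a path, not an in-memory file object. A minimal stdlib-only reproduction of the same error:

```python
from io import BytesIO

buf = BytesIO()
try:
    # This is essentially what fastparquet's util.default_open does
    # with the "path" it receives:
    open(buf, "wb")
except TypeError as e:
    msg = str(e)

print(msg)  # expected str, bytes or os.PathLike object, not ...
```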
Does anyone know how to solve this?
Solution
The error was resolved by using pyarrow as the write engine. Example code:

buffer = BytesIO()
df.to_parquet(buffer, engine="pyarrow", index=False)
s3_resource = boto3.resource('s3')
s3_resource.Object('bucketname', 'path_with_filename').put(Body=buffer.getvalue())

To read Parquet files in Python 3.6 I used fastparquet, but for writing, the pyarrow engine seems to work.
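One detail worth noting in both the question's and the answer's code: `put(Body=buffer.getvalue())` works regardless of the buffer's current position, whereas streaming the buffer object itself (e.g. via boto3's `upload_fileobj`) requires a `seek(0)` first, because `to_parquet` leaves the cursor at the end. A small stdlib illustration:

```python
from io import BytesIO

buf = BytesIO()
buf.write(b"parquet bytes here")

# After writing, the cursor sits at the end, so read() returns nothing...
assert buf.read() == b""

# ...but getvalue() returns the full contents regardless of position.
assert buf.getvalue() == b"parquet bytes here"

# To stream the buffer object itself, rewind first.
buf.seek(0)
assert buf.read() == b"parquet bytes here"
```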