How do I read a compressed CSV file with pandas read_csv in Watson Studio?

Question

To read a zip-compressed CSV file with pandas in my local Jupyter notebook, I run:

import pandas as pd
pd.read_csv('csv_file.zip')

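In Watson Studio, however, read_csv() raises an exception when I replace the file name with a Cloud Object Storage streaming object.

This is the first cell of my notebook in Watson Studio: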

import types
from ibm_botocore.client import Config
import ibm_boto3

def __iter__(self): return 0

client = ibm_boto3.client(service_name='s3', ibm_api_key_id='...',
    ibm_auth_endpoint="...", config=Config(signature_version='oauth'),
    endpoint_url='...')

body = client.get_object(Bucket='...', Key='csv_file.zip')['Body']
if not hasattr(body, "__iter__"):
    body.__iter__ = types.MethodType( __iter__, body )

Now, when I try:

import pandas as pd
df = pd.read_csv(body)

I get:

'utf-8' codec can't decode byte 0xbb in position 0: invalid start byte

If I specify compression='zip':

import pandas as pd
df = pd.read_csv(body, compression='zip')

the message is:

'StreamingBody' object has no attribute 'seek'

Is there a straightforward way in Watson Studio to read_csv() a compressed file without writing explicit unpacking code?

(pd.__version__ is 0.21.0 in both environments.)

Tags: python, pandas, dataframe, zipfile, watson-studio

Answer


The following procedure works if your file has been added as a data asset of your Watson Studio project.

  1. Create a project token for your project. In your project, go to Settings, open the Access tokens section, and click New token (selecting "Viewer" in the "Access role for project" dropdown menu is sufficient).

  2. In your notebook, in "edit" mode, click the three dots (⋮) in the top right corner of the screen and choose the option to insert your token. A new first cell containing your project credentials is added; run it.

  3. You can now use code like this:

import pandas as pd
# `project` is defined by the token cell inserted in step 2.
file = project.get_file("my_compressed_csv.zip")
df = pd.read_csv(file, compression='zip')

Calling read_csv() directly on the Cloud Object Storage stream does not work in Watson Studio, so you need to go through the project-lib library instead.
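
The 'seek' error in the question suggests that pandas needs a seekable file-like object to open a zip archive. The sketch below illustrates this locally, without Watson Studio: it builds a small zip archive in memory, exposes it through a hypothetical NonSeekableStream class (a stand-in for the streaming body, offering read() only), and copies the bytes into an io.BytesIO buffer before calling read_csv(). The class, the helper function, and the sample data are assumptions made for illustration and are not part of ibm_boto3 or project-lib; whether project.get_file() does the same thing internally is also an assumption.

```python
import io
import zipfile

import pandas as pd


def make_zipped_csv() -> bytes:
    """Build a zip archive holding one small CSV file, entirely in memory."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, mode="w", compression=zipfile.ZIP_DEFLATED) as archive:
        archive.writestr("data.csv", "a,b\n1,2\n3,4\n")
    return buffer.getvalue()


class NonSeekableStream:
    """Hypothetical stand-in for a streaming body: read() only, no seek()."""

    def __init__(self, payload: bytes):
        self._payload = payload

    def read(self) -> bytes:
        return self._payload


stream = NonSeekableStream(make_zipped_csv())

# Copy the whole payload into a seekable in-memory buffer first,
# then let pandas unpack the single CSV file inside the archive.
seekable = io.BytesIO(stream.read())
df = pd.read_csv(seekable, compression="zip")
print(df)
```

This approach loads the whole archive into memory, so it is only reasonable for files that fit comfortably in the notebook's RAM.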

