Pandas read_json() encoding = 'utf-8-sig' option does not work for BytesIO objects (file-like objects)

Problem description

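When I try to load a UTF-8 with BOM encoded JSON Lines file directly into a pandas DataFrame as byte data, I get the error 'ValueError' object has no attribute 'message' (this generic error occurs when the encoding is different). I am using azure.storage.filedatalake.DataLakeFileClient to read data from Azure Data Lake Gen2, which gives me the data as bytes, and I am trying to load those bytes directly into a pandas DataFrame. The failing code snippet is given below.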

import pandas as pd
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient
from io import BytesIO, StringIO



def initialize_storage_account_ad(storage_account_name, client_id, client_secret, tenant_id):
    
    try:  
        global service_client

        credential = ClientSecretCredential(tenant_id, client_id, client_secret)

        service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
            "https", storage_account_name), credential=credential)
    
    except Exception as e:
        # Python 3 exceptions have no .message attribute; print the exception itself
        print(e)

initialize_storage_account_ad(storage_account_name, client_id, client_secret, tenant_id)
data_folder = '/raw/data/'

file_system_client = service_client.get_file_system_client(file_system="dls")
paths = file_system_client.get_paths(path=data_folder)

directory_client = file_system_client.get_directory_client(data_folder)

file_client = directory_client.get_file_client('API_COUNTRY.json')
download = file_client.download_file()

downloaded_bytes = download.readall()
df = pd.read_json(BytesIO(downloaded_bytes),lines = True,encoding = 'utf-8-sig')
display(df) 

The same code works if I use UTF-8 encoding. It also works if I write the UTF-8 with BOM JSON Lines data to a file and load it with df = pd.read_json('country.json',lines = True,encoding = 'utf-8-sig'). Any help is greatly appreciated.

Error stack trace

ValueError                                Traceback (most recent call last)
<ipython-input-13-b150d9150c5a> in <module>
     31 
     32 downloaded_bytes = download.readall()
---> 33 df = pd.read_json(BytesIO(downloaded_bytes),lines = True,encoding = 'utf-8-sig')
     34 display(df)

C:\Program Files\Python36\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    197                 else:
    198                     kwargs[new_arg_name] = new_arg_value
--> 199             return func(*args, **kwargs)
    200 
    201         return cast(F, wrapper)

C:\Program Files\Python36\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
    294                 )
    295                 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296             return func(*args, **kwargs)
    297 
    298         return wrapper

C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression, nrows)
    616         return json_reader
    617 
--> 618     result = json_reader.read()
    619     if should_close:
    620         filepath_or_buffer.close()

C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in read(self)
    751                 data = ensure_str(self.data)
    752                 data = data.split("\n")
--> 753                 obj = self._get_object_parser(self._combine_lines(data))
    754         else:
    755             obj = self._get_object_parser(self.data)

C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in _get_object_parser(self, json)
    775         obj = None
    776         if typ == "frame":
--> 777             obj = FrameParser(json, **kwargs).parse()
    778 
    779         if typ == "series" or obj is None:

C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in parse(self)
    884 
    885         else:
--> 886             self._parse_no_numpy()
    887 
    888         if self.obj is None:

C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in _parse_no_numpy(self)
   1117         if orient == "columns":
   1118             self.obj = DataFrame(
-> 1119                 loads(json, precise_float=self.precise_float), dtype=None
   1120             )
   1121         elif orient == "split":

ValueError: Expected object or value

Start of the byte values:

('b', ['0xef', '0xbb', '0xbf', '0x7b', '0x22', '0x49', '0x44', '0x45', '0x4e', '0x54', '0x49', '0x46', '0x49', '0x45', '0x52', '0x22', '0x3a', '0x22', '0x41', '0x66', '0x67', '0x68', '0x61', '0x6e', '0x69', '0x73', '0x74', '0x61', '0x6e', '0x22', '0x2c', '0x22', '0x49', '0x44', '0x45', '0x4e', '0x54', '0x49', '0x46', '0x49', '0x45', '0x52', '0x5f', '0x49', '0x53', '0x4f', '0x32', '0x22', '0x3a', '0x22', '0x41', '0x46', '0x22', '0x2c', '0x22', '0x49', '0x44', '0x45', '0x4e', '0x54', '0x49', '0x46', '0x49', '0x45', '0x52', '0x5f', '0x49', '0x53', '0x4f', '0x33', '0x22', '0x3a', '0x22', '0x41', '0x46', '0x47', '0x22', '0x2c', '0x22', '0x49', '0x44', '0x45', '0x4e', '0x54', '0x49', '0x46', '0x49', '0x45', '0x52', '0x5f', '0x49', '0x53', '0x4f', '0x5f', '0x4e', '0x55', '0x4d', '0x45', '0x52', '0x49', '0x43', '0x22', '0x3a', '0x22', '0x30', '0x30', '0x34', '0x22', '0x2c', '0x22', '0x4f', '0x46', '0x46', '0x49', '0x43', '0x49', '0x41', '0x4c', '0x5f', '0x53', '0x48', '0x4f', '0x52', '0x54', '0x5f', '0x49', '0x44', '0x45'])

Tags: python, pandas

Solution


This looks like a bug in older versions of pandas. Using a minimal JSON Lines byte string bb encoded in utf-8-sig, I tried:

pd.read_json(io.BytesIO(bb), lines=True, encoding='utf-8-sig')  # (1)
pd.read_json(io.StringIO(bb.decode('utf-8-sig')), lines=True)   # (2)

Both work on Python 3.8 with pandas 1.2.2, but on Python 3.6 with pandas 1.0.3, (2) works while (1) raises ValueError: Expected object or value.
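
To check which behavior a given installation shows, the comparison can be reproduced with a small script. This is only a sketch: the two records in bb are made up for illustration, and whether approach (1) succeeds depends on the installed pandas version, so it is wrapped in a try/except rather than assumed to fail.

```python
import io

import pandas as pd

# Hypothetical two-record JSON Lines payload, encoded with a leading UTF-8 BOM.
bb = '{"a": 1, "b": "x"}\n{"a": 2, "b": "y"}'.encode("utf-8-sig")

print("pandas version:", pd.__version__)

# (2) Decode in Python first, then hand pandas a text buffer.
df2 = pd.read_json(io.StringIO(bb.decode("utf-8-sig")), lines=True)
print(df2)

# (1) Hand pandas the raw bytes and let it apply the encoding.
# On affected versions this raises ValueError: Expected object or value.
try:
    df1 = pd.read_json(io.BytesIO(bb), lines=True, encoding="utf-8-sig")
    print("approach (1) worked")
except ValueError as exc:
    df1 = None
    print("approach (1) failed:", exc)
```

If approach (1) fails while approach (2) prints a two-row frame, the installation is affected and decoding before the read_json call is the safer route.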

This means the workaround is simple: decode your byte string at the Python level and feed read_json a Unicode string:

...
downloaded_bytes = download.readall()
df = pd.read_json(StringIO(downloaded_bytes.decode('utf-8-sig')),lines = True)
display(df) 
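
Why the decode step helps can be seen without pandas at all. The byte dump in the question starts with 0xef 0xbb 0xbf, which is the UTF-8 byte order mark. The sketch below uses a shortened sample modeled on that dump (the truncation and the closing brace are added here for illustration) and compares the 'utf-8' and 'utf-8-sig' codecs:

```python
import codecs
import json

# Shortened sample modeled on the byte dump in the question;
# the closing brace is added so the record is valid JSON.
raw = b'\xef\xbb\xbf{"IDENTIFIER":"Afghanistan","IDENTIFIER_ISO2":"AF"}'

# The first three bytes are the UTF-8 byte order mark.
has_bom = raw.startswith(codecs.BOM_UTF8)

# Plain 'utf-8' keeps the BOM as the character U+FEFF at the start of the text.
text_plain = raw.decode("utf-8")

# 'utf-8-sig' strips the BOM, so the text starts with the opening brace.
text_sig = raw.decode("utf-8-sig")

# The BOM-free text parses as ordinary JSON.
record = json.loads(text_sig)

print(has_bom, repr(text_plain[0]), repr(text_sig[0]), record["IDENTIFIER_ISO2"])
```

Once the BOM is removed in Python, the JSON parser never sees it, which is why passing a StringIO of the decoded text sidesteps the problem regardless of how read_json handles the encoding argument for byte buffers.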
