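python - Pandas read_json() encoding='utf-8-sig' option does not work for BytesIO objects (file-like objects)
Problem description
When I try to load a UTF8-BOM encoded jsonlines file as byte data directly into a pandas DataFrame, I get the error 'ValueError' object has no attribute 'message' (this generic error occurs when the encoding is different). I am using azure.storage.filedatalake.DataLakeFileClient to read data from Azure Datalake Gen-2, which gives me byte data, and I am trying to load that data directly into a pandas DataFrame. The failing code snippet is given below.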
import pandas as pd
from azure.identity import ClientSecretCredential
from azure.storage.filedatalake import DataLakeServiceClient
from io import BytesIO, StringIO

def initialize_storage_account_ad(storage_account_name, client_id, client_secret, tenant_id):
    try:
        global service_client
        credential = ClientSecretCredential(tenant_id, client_id, client_secret)
        service_client = DataLakeServiceClient(account_url="{}://{}.dfs.core.windows.net".format(
            "https", storage_account_name), credential=credential)
    except Exception as e:
        # Python 3 exceptions have no .message attribute, so print the exception itself
        print(e)

initialize_storage_account_ad(storage_account_name, client_id, client_secret, tenant_id)
data_folder = '/raw/data/'
file_system_client = service_client.get_file_system_client(file_system="dls")
paths = file_system_client.get_paths(path=data_folder)
directory_client = file_system_client.get_directory_client(data_folder)
file_client = directory_client.get_file_client('API_COUNTRY.json')
download = file_client.download_file()
downloaded_bytes = download.readall()
df = pd.read_json(BytesIO(downloaded_bytes),lines = True,encoding = 'utf-8-sig')
display(df)
The same code works if I use UTF-8 encoding. It also works if I write the UTF8-BOM jsonlines to a file and load it with df = pd.read_json('country.json', lines=True, encoding='utf-8-sig').
Any help is greatly appreciated.
Error stack trace
ValueError Traceback (most recent call last)
<ipython-input-13-b150d9150c5a> in <module>
31
32 downloaded_bytes = download.readall()
---> 33 df = pd.read_json(BytesIO(downloaded_bytes),lines = True,encoding = 'utf-8-sig')
34 display(df)
C:\Program Files\Python36\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
197 else:
198 kwargs[new_arg_name] = new_arg_value
--> 199 return func(*args, **kwargs)
200
201 return cast(F, wrapper)
C:\Program Files\Python36\lib\site-packages\pandas\util\_decorators.py in wrapper(*args, **kwargs)
294 )
295 warnings.warn(msg, FutureWarning, stacklevel=stacklevel)
--> 296 return func(*args, **kwargs)
297
298 return wrapper
C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in read_json(path_or_buf, orient, typ, dtype, convert_axes, convert_dates, keep_default_dates, numpy, precise_float, date_unit, encoding, lines, chunksize, compression, nrows)
616 return json_reader
617
--> 618 result = json_reader.read()
619 if should_close:
620 filepath_or_buffer.close()
C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in read(self)
751 data = ensure_str(self.data)
752 data = data.split("\n")
--> 753 obj = self._get_object_parser(self._combine_lines(data))
754 else:
755 obj = self._get_object_parser(self.data)
C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in _get_object_parser(self, json)
775 obj = None
776 if typ == "frame":
--> 777 obj = FrameParser(json, **kwargs).parse()
778
779 if typ == "series" or obj is None:
C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in parse(self)
884
885 else:
--> 886 self._parse_no_numpy()
887
888 if self.obj is None:
C:\Program Files\Python36\lib\site-packages\pandas\io\json\_json.py in _parse_no_numpy(self)
1117 if orient == "columns":
1118 self.obj = DataFrame(
-> 1119 loads(json, precise_float=self.precise_float), dtype=None
1120 )
1121 elif orient == "split":
ValueError: Expected object or value
Beginning of the byte value:
('b', ['0xef', '0xbb', '0xbf', '0x7b', '0x22', '0x49', '0x44', '0x45', '0x4e', '0x54', '0x49', '0x46', '0x49', '0x45', '0x52', '0x22', '0x3a', '0x22', '0x41', '0x66', '0x67', '0x68', '0x61', '0x6e', '0x69', '0x73', '0x74', '0x61', '0x6e', '0x22', '0x2c', '0x22', '0x49', '0x44', '0x45', '0x4e', '0x54', '0x49', '0x46', '0x49', '0x45', '0x52', '0x5f', '0x49', '0x53', '0x4f', '0x32', '0x22', '0x3a', '0x22', '0x41', '0x46', '0x22', '0x2c', '0x22', '0x49', '0x44', '0x45', '0x4e', '0x54', '0x49', '0x46', '0x49', '0x45', '0x52', '0x5f', '0x49', '0x53', '0x4f', '0x33', '0x22', '0x3a', '0x22', '0x41', '0x46', '0x47', '0x22', '0x2c', '0x22', '0x49', '0x44', '0x45', '0x4e', '0x54', '0x49', '0x46', '0x49', '0x45', '0x52', '0x5f', '0x49', '0x53', '0x4f', '0x5f', '0x4e', '0x55', '0x4d', '0x45', '0x52', '0x49', '0x43', '0x22', '0x3a', '0x22', '0x30', '0x30', '0x34', '0x22', '0x2c', '0x22', '0x4f', '0x46', '0x46', '0x49', '0x43', '0x49', '0x41', '0x4c', '0x5f', '0x53', '0x48', '0x4f', '0x52', '0x54', '0x5f', '0x49', '0x44', '0x45'])
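The first three bytes in the dump above (0xef, 0xbb, 0xbf) are the UTF-8 byte order mark. As a minimal sketch using only the standard library, the following checks for that prefix and compares the two codecs; the variable names are illustrative and `head` holds just the first 16 bytes copied from the dump.

```python
import codecs

# First 16 bytes copied from the dump above: the BOM followed by {"IDENTIFIER"
head = bytes([0xEF, 0xBB, 0xBF, 0x7B, 0x22, 0x49, 0x44, 0x45,
              0x4E, 0x54, 0x49, 0x46, 0x49, 0x45, 0x52, 0x22])

# codecs.BOM_UTF8 is b'\xef\xbb\xbf'
has_bom = head.startswith(codecs.BOM_UTF8)

# "utf-8" keeps the BOM as the character U+FEFF; "utf-8-sig" drops it
as_utf8 = head.decode("utf-8")
as_utf8_sig = head.decode("utf-8-sig")

print(has_bom)                       # True
print(as_utf8.startswith("\ufeff"))  # True
print(as_utf8_sig)                   # {"IDENTIFIER"
```

A leading U+FEFF left in the text is the kind of input a JSON parser is likely to reject, which would be consistent with the "Expected object or value" error in the trace.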
Solution
It looks like a bug in older versions of Pandas. Using a minimal JSON Lines byte string bb encoded in utf-8-sig, I tried:
pd.read_json(io.BytesIO(bb), lines=True, encoding='utf-8-sig') (1)
pd.read_json(io.StringIO(bb.decode('utf-8-sig')), lines=True) (2)
Both work on Python 3.8 with Pandas 1.2.2. On Python 3.6 with Pandas 1.0.3, (2) works but (1) raises ValueError: Expected object or value.
This means the workaround is simple: decode your byte string at the Python level and feed read_json with a unicode string:
...
downloaded_bytes = download.readall()
df = pd.read_json(StringIO(downloaded_bytes.decode('utf-8-sig')),lines = True)
display(df)
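For anyone who wants to try the workaround without an Azure account, here is a self-contained sketch. The helper name `read_jsonl_bytes` and the two sample records are made up for illustration, and `sample` merely stands in for the result of download.readall().

```python
import codecs
from io import StringIO

import pandas as pd


def read_jsonl_bytes(raw: bytes) -> pd.DataFrame:
    """Hypothetical helper: decode JSON Lines bytes first, then let pandas parse text."""
    # "utf-8-sig" drops a leading BOM if one is present and otherwise acts like "utf-8"
    text = raw.decode("utf-8-sig")
    return pd.read_json(StringIO(text), lines=True)


# Stand-in for download.readall(): two made-up records prefixed with a UTF-8 BOM
sample = (
    codecs.BOM_UTF8
    + b'{"ID": "AF", "NAME": "Afghanistan"}\n'
    + b'{"ID": "AL", "NAME": "Albania"}\n'
)

df = read_jsonl_bytes(sample)
print(df.shape)          # (2, 2)
print(list(df.columns))  # ['ID', 'NAME']
```

Because the decoding happens before pandas sees the data, this should not depend on how a given pandas version applies the encoding argument to a BytesIO object.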