python - Error converting a large CSV to Parquet with Python
Problem description
I have a CSV file with 200+ columns and over 1 million rows. When I convert it from CSV to Parquet, I hit an error:
import argparse

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

chunksize = 100_000
parquet_file = 'output.parquet'

parser = argparse.ArgumentParser(description='Process Arguments')
parser.add_argument("--fname", action="store", default="",
                    help="specify <run/update>")
args = parser.parse_args()
csv_file = args.fname

csv_stream = pd.read_csv(csv_file, encoding='utf-8', sep=',',
                         chunksize=chunksize, low_memory=False)
for i, chunk in enumerate(csv_stream):
    print("Chunk", i)
    if i == 0:
        # Infer the Parquet schema from the first chunk only
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        parquet_writer = pq.ParquetWriter(parquet_file, parquet_schema,
                                          compression='snappy')
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)
parquet_writer.close()
When I run it, it produces the following error:
File "pyconv.py", line 25, in <module>
table = pa.Table.from_pandas(chunk, schema=parquet_schema)
File "pyarrow/table.pxi", line 1217, in pyarrow.lib.Table.from_pandas
File "/home/cloud-user/pydev/py36-venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 387, in dataframe_to_arrays
convert_types))
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 586, in result_iterator
yield fs.pop().result()
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 432, in result
return self.__get_result()
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/_base.py", line 384, in __get_result
raise self._exception
File "/opt/rh/rh-python36/root/usr/lib64/python3.6/concurrent/futures/thread.py", line 56, in run
result = self.fn(*self.args, **self.kwargs)
File "/home/cloud-user/pydev/py36-venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 376, in convert_column
raise e
File "/home/cloud-user/pydev/py36-venv/lib64/python3.6/site-packages/pyarrow/pandas_compat.py", line 370, in convert_column
return pa.array(col, type=ty, from_pandas=True, safe=safe)
File "pyarrow/array.pxi", line 169, in pyarrow.lib.array
File "pyarrow/array.pxi", line 69, in pyarrow.lib._ndarray_to_array
File "pyarrow/error.pxi", line 81, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: ("'utf-32-le' codec can't decode bytes in position 0-3: code point not in range(0x110000)", 'Conversion failed for column agent_number__c with type float64')
I am new to pandas/pyarrow/Python; if anyone has suggestions on how I should go about debugging this, I would appreciate it.
Solution
The CSV has about 3 million records. I managed to pin down one potential problem.
One of the columns holds string/text data. Most of its values are numeric, such as 1000, 230, 400 and so on, but a few were entered as text, like 5k, 100k, 29k.
So the code chokes when it tries to treat that column as numeric/integer.
Can you suggest a fix?
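One way forward, sketched below: force the problem column to be read as text (via `pd.read_csv`'s `dtype=` parameter) and normalize the shorthand values before writing each chunk. The `normalize_count` helper is hypothetical, and the "k means thousands" and "commas are thousands separators" rules are assumptions about this data set; adjust them to match the real values.

```python
import re

import pandas as pd


def normalize_count(value):
    """Turn strings like '5k' or '29k' into numbers; pass plain numbers through.

    Assumes 'k' means thousands and commas are thousands separators --
    both are guesses about this data set (hypothetical helper).
    """
    if pd.isna(value):
        return None
    s = str(value).strip().lower().replace(",", "")
    m = re.fullmatch(r"(\d+(?:\.\d+)?)k", s)
    if m:
        return float(m.group(1)) * 1000
    return float(s)


# Usage inside the chunk loop: pin the column to string on read, then clean it,
# so every chunk ends up float64 and matches the Parquet schema.
# for chunk in pd.read_csv(csv_file, chunksize=chunksize,
#                          dtype={"agent_number__c": str}, low_memory=False):
#     chunk["agent_number__c"] = chunk["agent_number__c"].map(normalize_count)
```

An alternative is to read every column with `dtype=str` and build an explicit all-string Parquet schema, which sidesteps inference entirely at the cost of losing numeric types in the output.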