python - How to read a CSV file while handling cases such as consecutive line breaks?
Problem description
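import codecs
from io import StringIO

import pandas as pd

# Runs inside a method; self.response[u'Body'] is the body of the S3 object.
for row in codecs.getreader(self.encoding)(self.response[u'Body']).readlines():
    row_string = StringIO(row)
    print("Row read from the data is: ")
    print(row_string.getvalue())
    df = pd.read_csv(row_string, sep=",")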
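I wrote the code above to stream a CSV file from S3 line by line. However, one row in the CSV file contains a line break (an Enter keystroke) inside a field. Pandas can read the file when it is downloaded locally, but the code above fails with this error: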
[2018-11-12 14:11:45,586] {models.py:1595} ERROR - Error tokenizing data. C error: EOF inside string starting at line 0
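Ignore the "line 0" note in the message above: as you can see in my code, I read one line at a time and build a DataFrame from it.
The full error traceback is: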
[2018-11-12 14:11:45,586] {models.py:1595} ERROR - Error tokenizing data. C error: EOF inside string starting at line 0
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/airflow/models.py", line 1493, in _run_raw_task
    result = task_copy.execute(context=context)
  File "/usr/local/lib/python3.5/dist-packages/airflow/operators/python_operator.py", line 89, in execute
    return_value = self.execute_callable()
  File "/usr/local/lib/python3.5/dist-packages/airflow/operators/python_operator.py", line 94, in execute_callable
    return self.python_callable(*self.op_args, **self.op_kwargs)
  File "/usr/local/lib/python3.5/dist-packages/pallet-0.0.0-py3.5.egg/pallet/tasks/versionator.py", line 228, in driver_de_versionator
    a.index_patch()
  File "/usr/local/lib/python3.5/dist-packages/pallet-0.0.0-py3.5.egg/pallet/tasks/versionator.py", line 202, in index_patch
    DB.process(self.form_candidate_version, self.destination_of_kch_file_to_be_downloaded)
  File "/usr/local/lib/python3.5/dist-packages/pallet-0.0.0-py3.5.egg/pallet/tasks/versionator.py", line 144, in form_candidate_version
    df = pd.read_csv(row_string, sep=",")
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 440, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 787, in __init__
    self._make_engine(self.engine)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 1014, in _make_engine
    self._engine = CParserWrapper(self.f, **self.options)
  File "/usr/local/lib/python3.5/dist-packages/pandas/io/parsers.py", line 1708, in __init__
    self._reader = parsers.TextReader(src, **kwds)
  File "pandas/_libs/parsers.pyx", line 539, in pandas._libs.parsers.TextReader.__cinit__
  File "pandas/_libs/parsers.pyx", line 737, in pandas._libs.parsers.TextReader._get_header
  File "pandas/_libs/parsers.pyx", line 932, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2112, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: EOF inside string starting at line 0
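The question describes a field that contains a line break. A minimal sketch, using only the standard library and a made-up sample (the column names and values are assumptions, not the asker's data), suggests why reading physical lines one at a time can leave a parser with an unterminated quoted string, while a CSV-aware reader that sees the whole stream keeps the record intact:

```python
import csv
import io

# Hypothetical sample: the first data record has a line break inside a quoted field.
raw = 'id,comment\n1,"first line\nsecond line"\n2,plain\n'

# Splitting on physical lines cuts the quoted field in half.
physical_lines = io.StringIO(raw).readlines()
# physical_lines[1] is '1,"first line\n': an opening quote with no closing quote.

# A CSV-aware reader consumes the stream as a whole and keeps the record together.
records = list(csv.reader(io.StringIO(raw)))

print(len(physical_lines), "physical lines")
print(len(records), "CSV records (including the header)")
```

Here the second physical line contains a single double quote, which is the kind of input that would make a parser reach end of input while still inside a quoted string.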
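Solution
A quick Google search for your error turned up an answer that suggests the following:
The solution is to use the parameter engine='python' in the read_csv function call. The Pandas CSV parser can use two different "engines" to parse CSV files: Python or C (the default).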
pandas.read_csv(filepath_or_buffer, sep=',', delimiter=None,
                header='infer', names=None,
                index_col=None, usecols=None, squeeze=False,
                ..., engine=None, ...)
engine : {'c', 'python'}, optional
Parser engine to use. The C engine is faster, while the python engine is currently more feature-complete.
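As a rough illustration of the suggestion above, the sketch below passes engine='python' to read_csv. It also assumes that the whole object is handed to pandas in one call rather than one physical line at a time, which goes beyond what the answer states. The helper name read_csv_body and the sample bytes are hypothetical; the only requirement on body is a read() method that returns bytes, which is how the response body in the question is assumed to behave.

```python
import io

import pandas as pd


def read_csv_body(body, encoding="utf-8", engine="python"):
    """Parse a complete binary CSV stream in a single read_csv call.

    `body` is any object with a read() method returning bytes
    (assumed here to match the S3 response body from the question).
    """
    raw_bytes = body.read()  # loads the whole object into memory
    return pd.read_csv(io.BytesIO(raw_bytes), sep=",",
                       encoding=encoding, engine=engine)


# Hypothetical stand-in for the S3 body, with a line break inside a quoted field.
sample_body = io.BytesIO(b'id,comment\n1,"first line\nsecond line"\n2,plain\n')
df = read_csv_body(sample_body)
print(df)
```

Reading everything into memory is the simplest option but may not suit very large objects; in that case the chunksize parameter of read_csv is one possible alternative to explore.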