首页 > 解决方案 > 为每一行读取具有不同语法的表

问题描述

我正在尝试做一些简单的事情,比如读取 .csv 文件。

该文件包含以 3 种不同方式写入的行。

"00001","Some text - and a number - 99999999","2020-01-01","Some more text","9999"
"00002",""Some text, with a separator and a number - 9999999"","2020-01-01","Some more text","9999"
"00003","Some text (and a number) - 99999999","2020-01-01",""Again, a separator"","9999"
"00013",""Comp, which does something - 999999999"","2020-01-01","Standard rgb","r k å g","","","","15","Fak E","XXX","XXX","2020-01-01","E","","","EKT","0.00","15.26","E","Å","","0","9","52"

正在做,

pd.read_csv(path, encoding='UTF-8-sig', header=None)

结果如下表。我希望所有行都像第一行一样。

在此处输入图像描述

我也尝试过提供quotechar='"'. 我对如何编写这个 .csv 文件没有任何发言权,用 Excel 或 N++ 重写不是一个合适的解决方案。由于真实文件中的斯堪的纳维亚字母,我使用编码 UTF-8-sig,但无论如何删除都不会做任何事情。

有人可以帮忙吗?

标签: pythonpython-3.xpandascsv

解决方案


在喂给熊猫之前进行预处理。使用正则表达式将双引号替换为单引号。

import re
import pandas as pd

#simulate a file  
s = '''"00001","Some text - and a number - 99999999","2020-01-01","Some more text","9999"
"00002",""Some text, with a separator and a number - 9999999"","2020-01-01","Some more text","9999"
"00003","Some text (and a number) - 99999999","2020-01-01",""Again, a separator"","9999"'''

f = io.StringIO(s)

pattern = r'""'
# in memory "file" instead of writing the change to disk
g = io.StringIO(re.sub(pattern,'"',f.read()))
df = pd.read_csv(g, quotechar='"', header=None, dtype=object, encoding='UTF-8-sig')

如果文件是test.csv并且您不想/不需要将修改后的数据写回磁盘。

import io
import re
import pandas as pd

pattern = r'""'
with open('test.csv') as f:
    g = io.StringIO(re.sub(pattern,'"',f.read()))

df = pd.read_csv(g, quotechar='"', header=None, dtype=object, encoding='UTF-8-sig')

如果您想在使用 pandas 之前修改文件并将其写入磁盘。

import io
import re
import pandas as pd

pattern = r'""'
with open('test.csv') as f, open('modified.csv','w') as g:
    g.write(re.sub(pattern,'"',f.read()))

df = pd.read_csv('modified.csv', quotechar='"', header=None, dtype=object, encoding='UTF-8-sig')

如果test.csv看起来像

"00013",""Comp, which does something - 999999999"","2020-01-01","Standard rgb","r k å g","","","","15","Fak E","XXX","XXX","2020-01-01","E","","","EKT","0.00","15.26","E","Å","","0","9","52"

然后我只能看到一种通过解析两次来规范化它的方法 - 首先用双引号空字符串替换带引号的空字符串,然后用单引号替换所有双引号。

pattern1 = r',("")(?=,)'
pattern2 = r'""'

with open('test.csv') as f:
    data = re.sub(pattern1,',"\1"',f.read())
    g = io.StringIO(re.sub(pattern2,',"\1",',data))

df = pd.read_csv(g, quotechar='"', header=None, dtype=object, encoding='UTF-8-sig')

它以那些奇怪的空字符串字符结束 -

>>> print(df.to_string())
       0    1  2     3                                  4  5    6           7             8         9  10 11 12  13     14   15   16          17 18 19 20   21    22     23 24  25 26 27 28  29
0  00013"  NaN    Comp   which does something - 999999999    NaN  2020-01-01  Standard rgb  r k å g        15  Fak E  XXX  XXX  2020-01-01  E      EKT  0.00  15.26  E  Å    0  9  52
>>>

也许有人可以想出一个更好的正则表达式模式。


推荐阅读