python - Reading a table where each row uses a different quoting syntax
Question
I'm trying to do something simple: read a .csv file.
The file contains rows written in 3 different ways.
"00001","Some text - and a number - 99999999","2020-01-01","Some more text","9999"
"00002",""Some text, with a separator and a number - 9999999"","2020-01-01","Some more text","9999"
"00003","Some text (and a number) - 99999999","2020-01-01",""Again, a separator"","9999"
"00013",""Comp, which does something - 999999999"","2020-01-01","Standard rgb","r k å g","","","","15","Fak E","XXX","XXX","2020-01-01","E","","","EKT","0.00","15.26","E","Å","","0","9","52"
Running
pd.read_csv(path, encoding='UTF-8-sig', header=None)
gives a mis-parsed table; I want every row to come out like the first one.
I have also tried passing quotechar='"'.
I have no say in how this .csv file is written, and re-saving it with Excel or Notepad++ is not an acceptable solution. I use the UTF-8-sig encoding because of the Scandinavian letters in the real file, but removing it makes no difference either way.
Can anyone help?
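To see why the doubled quotes break parsing, the early field split can be reproduced with the standard csv module (a minimal illustration, not part of the original question):

```python
import csv

# one of the problematic lines from the question
line = ('"00002",""Some text, with a separator and a number - 9999999"",'
        '"2020-01-01","Some more text","9999"')

# the doubled quote closes the second field immediately, so the embedded
# comma splits it: the line yields 6 fields instead of the expected 5
row = next(csv.reader([line]))
```

pandas hits the same problem: rows with doubled quotes and embedded commas produce extra columns compared to well-formed rows.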
Solution
Pre-process the data before feeding it to pandas: use a regular expression to collapse each doubled double quote ("") into a single one (").
import io
import re
import pandas as pd
# simulate a file
s = '''"00001","Some text - and a number - 99999999","2020-01-01","Some more text","9999"
"00002",""Some text, with a separator and a number - 9999999"","2020-01-01","Some more text","9999"
"00003","Some text (and a number) - 99999999","2020-01-01",""Again, a separator"","9999"'''
f = io.StringIO(s)
pattern = r'""'
# in-memory "file" instead of writing the change to disk
g = io.StringIO(re.sub(pattern, '"', f.read()))
df = pd.read_csv(g, quotechar='"', header=None, dtype=object, encoding='UTF-8-sig')
If the file is test.csv
and you don't want/need to write the modified data back to disk:
import io
import re
import pandas as pd
pattern = r'""'
# open with UTF-8-sig so a possible BOM is stripped on read
with open('test.csv', encoding='UTF-8-sig') as f:
    g = io.StringIO(re.sub(pattern, '"', f.read()))
df = pd.read_csv(g, quotechar='"', header=None, dtype=object)
If you want to modify the file and write it back to disk before using pandas:
import re
import pandas as pd
pattern = r'""'
with open('test.csv', encoding='UTF-8-sig') as f, open('modified.csv', 'w', encoding='UTF-8') as g:
    g.write(re.sub(pattern, '"', f.read()))
df = pd.read_csv('modified.csv', quotechar='"', header=None, dtype=object, encoding='UTF-8-sig')
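For large files, the same substitution can also be streamed line by line instead of reading everything at once with f.read() - a sketch that is not in the original answer (the file names and sample data here are just examples):

```python
import re

# simulate the input file ('test.csv' and 'modified.csv' are example names)
with open('test.csv', 'w', encoding='UTF-8') as f:
    f.write('"00002",""Some text, with a separator"","2020-01-01","x"\n')

pattern = re.compile(r'""')
with open('test.csv', encoding='UTF-8-sig') as src, open('modified.csv', 'w', encoding='UTF-8') as dst:
    for line in src:                         # one line at a time, not the whole file
        dst.write(pattern.sub('"', line))
```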
If test.csv
looks like
"00013",""Comp, which does something - 999999999"","2020-01-01","Standard rgb","r k å g","","","","15","Fak E","XXX","XXX","2020-01-01","E","","","EKT","0.00","15.26","E","Å","","0","9","52"
then the only way I can see to normalise it is to parse twice - first replace each quoted empty string with a double-quoted empty string, then replace every doubled double quote with a single one.
pattern1 = r',("")(?=,)'  # a quoted empty field between commas, e.g. ,"",
pattern2 = r'""'
with open('test.csv', encoding='UTF-8-sig') as f:
    # double the quotes of empty fields (,"" -> ,"""") so they survive pass 2
    data = re.sub(pattern1, r',"\1"', f.read())
    g = io.StringIO(re.sub(pattern2, '"', data))
df = pd.read_csv(g, quotechar='"', header=None, dtype=object)
It still ended up with those odd empty-string characters, though -
>>> print(df.to_string())
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
0 00013" NaN Comp which does something - 999999999 NaN 2020-01-01 Standard rgb r k å g 15 Fak E XXX XXX 2020-01-01 E EKT 0.00 15.26 E Å 0 9 52
>>>
Maybe someone can come up with a better regex pattern.
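On a shortened stand-in line, the two passes behave like this (a minimal sketch; the data is made up, not from the real file):

```python
import csv
import io
import re

# a short stand-in for one of the awkward lines (hypothetical data)
line = '"1",""a, b"","","x"'

# pass 1: double the quotes of empty fields (,"" -> ,"""") so they survive pass 2
step1 = re.sub(r',("")(?=,)', r',"\1"', line)
# pass 2: collapse every doubled double quote into a single one
step2 = re.sub(r'""', '"', step1)

row = next(csv.reader(io.StringIO(step2)))
```

After both passes the line is valid CSV: the empty field is preserved and the field with the embedded comma is quoted once, so csv (or pandas) parses it into exactly four fields.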