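python - Text qualifiers end up misplaced when removing extra delimiters from a CSV file with Python
Question
I am trying to remove extra delimiters from inside the data using a Python script. I usually work with large data sets. For example: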
"abc","def","ghi","jkl","mno","pqr"
"","","fds","dfs","adfadf","AAAA111"
"","","fds","df,s","adfadf","AAAA111"
If I run the script, it removes the extra delimiter in "df,s" in the second data row:
"abc","def","ghi","jkl","mno","pqr"
"","","fds","dfs","adfadf","AAAA111"
"","","fds","dfs","adfadf","AAAA111"
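The script runs correctly for one kind of data, but I noticed that for some data with text qualifiers, the qualifiers end up in the wrong place and the result looks like this: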
"abc","def","ghi","jkl","mno","pqr"
"""","""""""""","""""fds""""","""""dfs""""","""""adfadf""""","AAAA111""""
"""","""""""""","""""fds""""","""""dfs""""","""""adfadf""""","AAAA111""""
The script is:
#export the data
# with correct quoting, and that you are stuck with what you have.
import csv
from csv import DictWriter
with open("big-12.csv", newline='') as people_file:
    next(people_file)
    corrected_people = []
    for person_line in people_file:
        chomped_person_line = person_line.rstrip()
        person_tokens = chomped_person_line.split(",")
        # check that each field has the expected type
        try:
            corrected_person = {
                "abc":person_tokens[0],
                "def":person_tokens[1],
                "ghi":person_tokens[2],
                "jkl":"".join(person_tokens[3:-3]),
                "mno":person_tokens[-2],
                "pqr":person_tokens[-1]
            }
            if not corrected_person["DR_CR"].startswith(
                    "") and corrected_person["DR_CR"] !="n/a":
                raise ValueError
            corrected_people.append(corrected_person)
        except (IndexError, ValueError):
            # print the ignored lines, so manual correction can be performed later.
            print("Could not parse line: " + chomped_person_line)
with open("corrected_people.txt", "w", newline='') as corrected_people_file:
    writer = DictWriter(
        corrected_people_file,
        fieldnames=[
            "abc", "def", "ghi", "jkl", "mno", "pqr"
        ],delimiter=',',quoting=csv.QUOTE_ALL)
    writer.writeheader()
    writer.writerows(corrected_people)
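The script removes the extra delimiters inside the data, but I have a problem with the text qualifiers. It would be a great help if the text qualifier problem could be resolved. Python version: Python 3.6.0 :: Anaconda 4.3.1 (64-bit)
Answer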
    writer = DictWriter(
        corrected_people_file,
        fieldnames=[
            "abc", "def", "ghi", "jkl", "mno", "pqr"
        ],delimiter=',',quoting=csv.QUOTE_ALL)
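QUOTE_ALL forces every field to be quoted, and any double quote already present in a field is escaped with another double quote. Because split(",") leaves the original quotes inside each token, the writer wraps them in a second layer of quotes.
So try QUOTE_NONE or QUOTE_MINIMAL, or strip the quotes from the fields before writing. Note that QUOTE_MINIMAL still quotes any field that itself contains a quote character, so stripping the quotes first is the more dependable option.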
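As a minimal sketch of the effect described above, the snippet below writes the same tokens twice with QUOTE_ALL: once as split(",") would leave them (quotes still attached) and once after stripping the quotes. The helper name write_rows and the sample tokens are made up for illustration; an in-memory buffer stands in for the output file.

```python
import csv
import io

def write_rows(rows, quoting):
    """Serialize rows to CSV text using the given quoting mode."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter=",", quoting=quoting,
                        lineterminator="\n")
    writer.writerows(rows)
    return buffer.getvalue()

# Tokens as produced by line.split(","): the qualifiers are part of the value.
raw_tokens = ['""', '"fds"', '"dfs"']

# QUOTE_ALL wraps each field in quotes and doubles the quotes already inside.
doubled = write_rows([raw_tokens], csv.QUOTE_ALL)

# Stripping the qualifiers first lets the writer add exactly one layer.
stripped_tokens = [token.strip('"') for token in raw_tokens]
clean = write_rows([stripped_tokens], csv.QUOTE_ALL)

print(doubled, end="")
print(clean, end="")
```

The first line printed shows the stacked quotes from the question; the second shows single-quoted fields.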
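Regarding "I have a problem with the text qualifiers":
Also, quoting a field does not mark it as text rather than a number. Quotes exist only to allow delimiters to be embedded in a field, and they may surround numeric fields as well.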
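A small illustration of that point, using a made-up sample line: with the default settings, csv.reader returns every field as a string whether or not it was quoted, and the quotes only matter for the field that contains a comma.

```python
import csv
import io

# Hypothetical sample: a quoted number, an unquoted number, a quoted field
# with an embedded delimiter.
sample = '"123",456,"df,s"\n'

row = next(csv.reader(io.StringIO(sample)))

# Every value is a str; the quotes are consumed by the parser.
print(row)
print([type(field).__name__ for field in row])
```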
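In general, use a csv reader rather than split(). With a csv reader, the field "df,s" is read correctly as a single field because it is quoted. You can then remove the , from that one field.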
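One way this could look, as a sketch rather than a drop-in replacement for the script in the question: csv.reader parses the quoted fields, the delimiter is removed from inside each field, and csv.writer adds the qualifiers back once. The function name remove_embedded_delimiters is hypothetical, and in-memory buffers are used here in place of the real files.

```python
import csv
import io

def remove_embedded_delimiters(source, target, delimiter=","):
    """Copy CSV rows from source to target, dropping delimiters inside fields."""
    reader = csv.reader(source, delimiter=delimiter)
    writer = csv.writer(target, delimiter=delimiter,
                        quoting=csv.QUOTE_ALL, lineterminator="\n")
    for row in reader:
        # The reader has already removed the qualifiers, so each field is clean.
        writer.writerow([field.replace(delimiter, "") for field in row])

# Sample data taken from the question.
sample_input = (
    '"abc","def","ghi","jkl","mno","pqr"\n'
    '"","","fds","dfs","adfadf","AAAA111"\n'
    '"","","fds","df,s","adfadf","AAAA111"\n'
)

output = io.StringIO()
remove_embedded_delimiters(io.StringIO(sample_input), output)
result = output.getvalue()
print(result, end="")
```

With real files, the same function can be given two file objects opened with newline='', for example the big-12.csv input and corrected_people.txt output named in the question.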