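python - Troubleshooting invalid rows in a CSV
Problem description
I am working with a very large CSV file (nearly 6 GB), and it is absolutely full of errors. For example, suppose I have the following CSV file/table: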
+------------+-------------+------------+
| ID | Date | String |
+------------+-------------+------------+
| 123456 | 09-20-2019 | ABCDEFG |
| 123abc456 | 10-30-2019 | HIJKLMN |
| 7891011 | jdqhouehwf | OPQRSTU |
| 1010101 | 03-15-2018 | 8473737 |
| 4823.00 | 02-11-2015 | VWXYZ |
| 2348813.0 | 01-23-2016 | BAZ |
+------------+-------------+------------+
Or, as raw CSV:
"ID","Date","String"
123456,"09-20-2019","ABCDEFG"
123abc456,"10-30-2019","HIJKLMN"
7891011,"jdqhouehwf","OPQRSTU"
1010101,"03-15-2018",8473737
4823.00,"02-11-2015","VWXYZ"
"2348813.0","01-23-2016","BAZ"
I want a good way to troubleshoot and repair the file. Using pandas, I can read the file in:
import pandas as pd
df = pd.read_csv(inputfile)
pandas always complains:
sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False
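So I want to clean up each column. But because the file is so large, I can't just print the whole table with a mask applied and expect to read through it. I would like a simple way to take a column and check whether it conforms to a type. Also, if possible, I would like a way to delete bad rows and/or convert rows to the correct format. In the end, I would like the file to look like this (without the inline comments):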
"ID","Date","String"
123456,"09-20-2019","ABCDEFG"
# 123abc456,"10-30-2019","HIJKLMN" was deleted because the ID wasn't a number
# 7891011,"jdqhouehwf","OPQRSTU" was deleted because the data was not a date
1010101,"03-15-2018","8473737" # The last number could be converted to string
4823,"02-11-2015","VWXYZ" # The first number could be converted to integer
2348813,"01-23-2016","BAZ" # The ID number could be converted to int
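For the "take a column and check whether it conforms to a type" part, one possible approach is to stream the file with the standard `csv` module and report the line number of every value that fails a check. The sketch below is only an illustration: `is_number`, `is_date`, and `find_bad_values` are hypothetical helper names, not part of any library, and the `MM-DD-YYYY` date format is assumed from the sample data.

```python
import csv
import datetime as dt
import io


def is_number(text):
    """True if `text` parses as a float, so "4823.00" counts as numeric."""
    try:
        float(text)
        return True
    except (TypeError, ValueError):
        return False


def is_date(text):
    """True if `text` matches the MM-DD-YYYY format assumed from the sample."""
    try:
        dt.datetime.strptime(text, "%m-%d-%Y")
        return True
    except (TypeError, ValueError):
        return False


def find_bad_values(file, column, check):
    """Yield (line_number, value) for each value in `column` that fails `check`."""
    reader = csv.DictReader(file)
    for row in reader:
        value = row[column]
        if not check(value):
            # line_num counts physical lines read so far, header included
            yield reader.line_num, value


# A small in-memory sample standing in for the 6 GB file
sample = '''"ID","Date","String"
123456,"09-20-2019","ABCDEFG"
123abc456,"10-30-2019","HIJKLMN"
7891011,"jdqhouehwf","OPQRSTU"
'''

bad_ids = list(find_bad_values(io.StringIO(sample), "ID", is_number))
bad_dates = list(find_bad_values(io.StringIO(sample), "Date", is_date))
print(bad_ids)    # [(3, '123abc456')]
print(bad_dates)  # [(4, 'jdqhouehwf')]
```

Because the reader is consumed one row at a time, memory use stays flat regardless of file size. Note that `float` also accepts strings such as "nan" and "inf", so a stricter check may be needed if those can appear in the ID column.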
Solution
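import csv
import datetime as dt
import sys
from pathlib import Path


def main():
    with Path("thing.csv").open("r") as file:
        # DictReader streams one row at a time, so the 6 GB file never sits in memory
        for row in csv.DictReader(file):
            try:
                # int(float(...)) accepts "4823.00" and "2348813.0" as well as "123456"
                row["ID"] = int(float(row["ID"]))
                # strptime raises ValueError for anything that is not MM-DD-YYYY
                row["Date"] = dt.datetime.strptime(row["Date"], "%m-%d-%Y")
            except (KeyError, ValueError):
                # Skip rows with a non-numeric ID or an unparseable date
                continue
            print(*row.values())
    return 0


if __name__ == "__main__":
    sys.exit(main())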
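The code above prints the surviving rows, with the date shown as a `datetime` object. To get a file in the quoted layout the question asks for, the same validation can feed a `csv.writer` configured with `csv.QUOTE_NONNUMERIC`. This is a minimal sketch, not a definitive implementation: `clean_rows` and `clean_csv` are hypothetical names, the original date string is kept rather than reformatted, and rejecting fractional IDs such as `12.5` is an assumption (the code above would truncate them instead).

```python
import csv
import datetime as dt
import io


def clean_rows(rows):
    """Yield [id, date, string] for rows whose ID and Date both parse."""
    for row in rows:
        try:
            number = float(row["ID"])
            if not number.is_integer():
                continue  # assumption: fractional IDs (and nan/inf) are invalid
            dt.datetime.strptime(row["Date"], "%m-%d-%Y")  # validate only
        except (KeyError, TypeError, ValueError):
            continue
        yield [int(number), row["Date"], row["String"]]


def clean_csv(infile, outfile):
    """Stream rows from infile to outfile, quoting every non-numeric field."""
    writer = csv.writer(outfile, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerow(["ID", "Date", "String"])
    writer.writerows(clean_rows(csv.DictReader(infile)))


# In-memory demo using the sample data from the question
source = io.StringIO(
    '"ID","Date","String"\n'
    '123456,"09-20-2019","ABCDEFG"\n'
    '123abc456,"10-30-2019","HIJKLMN"\n'
    '7891011,"jdqhouehwf","OPQRSTU"\n'
    '1010101,"03-15-2018",8473737\n'
    '4823.00,"02-11-2015","VWXYZ"\n'
    '"2348813.0","01-23-2016","BAZ"\n'
)
result = io.StringIO()
clean_csv(source, result)
print(result.getvalue())
```

For real files, open both the input and the output with `newline=""`, as the `csv` module documentation recommends, and pass the file objects to `clean_csv`. IDs are parsed through `float`, so values with more than about 15 significant digits could lose precision; parsing with `decimal.Decimal` would be one way around that if such IDs occur.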