Troubleshooting invalid rows in a CSV file

Problem description

I am working with a very large CSV file (nearly 6 GB) that is riddled with errors. For example, suppose I have the following CSV file/table:

+------------+-------------+------------+
|     ID     |    Date     |   String   |
+------------+-------------+------------+
|  123456    |  09-20-2019 |   ABCDEFG  |
|  123abc456 |  10-30-2019 |   HIJKLMN  |
|  7891011   |  jdqhouehwf |   OPQRSTU  |
|  1010101   |  03-15-2018 |   8473737  |
|  4823.00   |  02-11-2015 |   VWXYZ    |
|  2348813.0 |  01-23-2016 |   BAZ      |
+------------+-------------+------------+

Or, as raw CSV:

"ID","Date","String"
123456,"09-20-2019","ABCDEFG"
123abc456,"10-30-2019","HIJKLMN"
7891011,"jdqhouehwf","OPQRSTU"
1010101,"03-15-2018",8473737
4823.00,"02-11-2015","VWXYZ"
"2348813.0","01-23-2016","BAZ"

I would like a good way to troubleshoot and fix the file. Using pandas, I can read the file in:

import pandas as pd

df = pd.read_csv(inputfile)

pandas always complains: sys:1: DtypeWarning: Columns (0) have mixed types. Specify dtype option on import or set low_memory=False

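So I want to clean up each column. But because the file is so large, I can't just print the whole table through a mask and expect to read the output. I would like a simple way to take a column and check whether its values conform to a type. If possible, I would also like a way to delete bad rows and/or convert rows to the correct format. In the end, I want the file to look like this (not including the inline comments):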

"ID","Date","String"
123456,"09-20-2019","ABCDEFG"
#  123abc456,"10-30-2019","HIJKLMN" was deleted because the ID wasn't a number
#  7891011,"jdqhouehwf","OPQRSTU" was deleted because the data was not a date
1010101,"03-15-2018","8473737" # The last number could be converted to string
4823,"02-11-2015","VWXYZ" # The first number could be converted to integer
2348813,"01-23-2016","BAZ" # The ID number could be converted to int

Tags: python, pandas, csv, sed, large-data

Solution
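
Before deleting anything, it may help to see which values fail in each column without printing the whole table. The following is a minimal sketch using only the standard library; the helper names `is_int_like`, `is_date`, and `find_bad_values` are hypothetical, and the `%m-%d-%Y` date format is assumed from the sample data. Note that `is_int_like` is stricter than a plain `int(float(...))` conversion: it rejects values such as `4823.5` instead of truncating them.

```python
import csv
import datetime as dt
import io
from collections import defaultdict

SAMPLE = '''"ID","Date","String"
123456,"09-20-2019","ABCDEFG"
123abc456,"10-30-2019","HIJKLMN"
7891011,"jdqhouehwf","OPQRSTU"
1010101,"03-15-2018",8473737
4823.00,"02-11-2015","VWXYZ"
"2348813.0","01-23-2016","BAZ"
'''


def is_int_like(value):
    """True for text such as "123456" or "4823.00" (a whole number)."""
    try:
        return float(value).is_integer()
    except (TypeError, ValueError):
        return False


def is_date(value, date_format="%m-%d-%Y"):
    """True if the text parses with the assumed date format."""
    try:
        dt.datetime.strptime(value, date_format)
        return True
    except (TypeError, ValueError):
        return False


def find_bad_values(src, checks, max_examples=5):
    """Return {column: [(line_number, value), ...]} for values failing a check.

    Rows are streamed one at a time, and at most max_examples offenders
    are kept per column, so memory use stays small on a large file.
    """
    reader = csv.DictReader(src)
    bad = defaultdict(list)
    for row in reader:
        for column, check in checks.items():
            value = row.get(column)
            if not check(value) and len(bad[column]) < max_examples:
                bad[column].append((reader.line_num, value))
    return dict(bad)


report = find_bad_values(io.StringIO(SAMPLE), {"ID": is_int_like, "Date": is_date})
for column, examples in report.items():
    print(column, examples)
```

For a real file, the `io.StringIO(SAMPLE)` argument would be replaced by a handle from `open(path, newline="")`. The line numbers come from the reader's `line_num` attribute, so they count physical lines including the header.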


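import csv
import datetime as dt
import sys
from pathlib import Path


def main():
    # newline="" lets the csv module handle line endings itself
    with Path("thing.csv").open("r", newline="") as file:
        # Stream row by row so the 6 GB file is never loaded into memory
        for row in csv.DictReader(file):
            try:
                # "4823.00" becomes 4823; "123abc456" raises ValueError
                row["ID"] = int(float(row["ID"]))
                # Validate only, so the original date text is kept as-is
                dt.datetime.strptime(row["Date"], "%m-%d-%Y")
            except (KeyError, ValueError):
                # Skip rows with a missing column, a bad ID, or a bad date
                continue
            print(*row.values())

    return 0


if __name__ == "__main__":
    sys.exit(main())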
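
The code above prints the surviving rows; to get a cleaned file in the layout shown in the question, the same loop could write through `csv.writer` instead. This is a sketch under a few assumptions: the function name `clean_rows` is hypothetical, the three column names are taken from the sample, and the input is assumed to have a header row. `csv.QUOTE_NONNUMERIC` quotes every string field and leaves the integer ID bare, which matches the desired output.

```python
import csv
import datetime as dt
import io

SAMPLE = '''"ID","Date","String"
123456,"09-20-2019","ABCDEFG"
123abc456,"10-30-2019","HIJKLMN"
7891011,"jdqhouehwf","OPQRSTU"
1010101,"03-15-2018",8473737
4823.00,"02-11-2015","VWXYZ"
"2348813.0","01-23-2016","BAZ"
'''


def clean_rows(src, dst, date_format="%m-%d-%Y"):
    """Copy valid rows from src to dst and return (kept, dropped) counts."""
    reader = csv.DictReader(src)
    # Quote all str fields; ints are written without quotes
    writer = csv.writer(dst, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerow(reader.fieldnames)
    kept = dropped = 0
    for row in reader:
        try:
            row_id = int(float(row["ID"]))                   # "4823.00" -> 4823
            dt.datetime.strptime(row["Date"], date_format)   # validate only
            text = row["String"]
        except (KeyError, TypeError, ValueError, OverflowError):
            dropped += 1
            continue
        writer.writerow([row_id, row["Date"], text])
        kept += 1
    return kept, dropped


out = io.StringIO()
counts = clean_rows(io.StringIO(SAMPLE), out)
print(out.getvalue(), end="")
print(counts)
```

For files on disk, both handles would be opened with `newline=""`, for example `open("thing.csv", newline="")` for reading and `open("clean.csv", "w", newline="")` for writing. Because rows are processed one at a time, memory use does not grow with the size of the input.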
