Python 2.7: reading/writing a CSV file that starts with \xef\xbb\xbf

Question

I have a question about reading and writing a CSV file in Python 2.7 with the 'utf-8-sig' encoding. The header of my CSV is

['\xef\xbb\xbfID;timestamp;CustomerID;Email']

The header I read from file A.csv contains some extra bytes ("\xef\xbb\xbfID"), and I want to write the same bytes and header to file B.csv.

My print log shows:

['\xef\xbb\xbfID;timestamp;CustomerID;Email']

But the header in the actual output file looks like

ÔªøID;timestamp

Here is the code:

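import csv
import shutil
import tempfile

# s3core, SPARKY_S3, DESTINATION_PATH and header_list are not shown in the question.
def remove_gdpr_info_from_csv(file_path, file_name, temp_folder, original_header):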
    new_temp_folder = tempfile.mkdtemp()
    new_temp_file = new_temp_folder + "/" + file_name
    # Blanked new file
    with open(new_temp_file, 'wb') as outfile:
        writer = csv.writer(outfile, delimiter=";")
        print original_header
        writer.writerow(original_header)
        # File from SFTP
        with open(file_path, 'r') as infile:
            reader = csv.reader(infile, delimiter=";")
            first_row = next(reader)
            email = first_row.index('Email')
            contract_detractor1 = first_row.index('Contact Detractor (Q21)')
            contract_detractor2 = first_row.index('Contact Detractor (Q20)')
            contract_detractor3 = first_row.index('Contact Detractor (Q43)')
            contract_detractor4 = first_row.index('Contact Detractor(Q26)')
            contract_detractor5 = first_row.index('Contact Detractor(Q27)')
            contract_detractor6 = first_row.index('Contact Detractor(Q44)')
            indexes = []
            for column_name in header_list:
                ind = first_row.index(column_name)
                indexes.append(ind)

            for row in reader:
                output_row = []
                for ind in indexes:
                    data = row[ind]
                    if ind == email:
                        data = ''
                    elif ind == contract_detractor1:
                        data = ''
                    elif ind == contract_detractor2:
                        data = ''
                    elif ind == contract_detractor3:
                        data = ''
                    elif ind == contract_detractor4:
                        data = ''
                    elif ind == contract_detractor5:
                        data = ''
                    elif ind == contract_detractor6:
                        data = ''
                    output_row.append(data)
                writer.writerow(output_row)
    s3core.upload_files(SPARKY_S3, DESTINATION_PATH, new_temp_file)
    shutil.rmtree(temp_folder)
    shutil.rmtree(new_temp_folder)

Tags: python, python-2.7, csv, file-writing, file-read

Answer


'\xef\xbb\xbf' is the UTF-8 encoded form of the Unicode character ZERO WIDTH NO-BREAK SPACE (U+FEFF). It is commonly used as a byte order mark (BOM) at the beginning of Unicode text files:

  • when the file starts with the 3 bytes '\xef\xbb\xbf', it is UTF-8 encoded
  • when the file starts with the 2 bytes '\xff\xfe', it is UTF-16 little endian
  • when the file starts with the 2 bytes '\xfe\xff', it is UTF-16 big endian
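The list above can be turned into a small check on the first bytes of a file. This is only a minimal sketch: `detect_bom` is a hypothetical helper name, it relies on the BOM constants in the standard `codecs` module, and it deliberately ignores UTF-32 (whose little-endian BOM begins with the same two bytes as UTF-16 LE).

```python
import codecs

# BOM constants from the standard library, paired with a codec name.
_BOMS = [
    (codecs.BOM_UTF8, 'utf-8-sig'),       # '\xef\xbb\xbf'
    (codecs.BOM_UTF16_LE, 'utf-16-le'),   # '\xff\xfe'
    (codecs.BOM_UTF16_BE, 'utf-16-be'),   # '\xfe\xff'
]


def detect_bom(raw):
    """Return (codec_name, bom_length) for the leading bytes, or (None, 0).

    Hypothetical helper for illustration; `raw` is the byte string read
    from the start of a file opened in binary mode.
    """
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name, len(bom)
    return None, 0
```

For the header in the question, `detect_bom('\xef\xbb\xbfID;timestamp')` would report 'utf-8-sig' with a 3-byte BOM.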

The 'utf-8-sig' encoding explicitly asks for this BOM to be written at the beginning of the file.
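To make the difference between 'utf-8' and 'utf-8-sig' concrete, here is a short sketch that encodes the header from the question both ways. The variable names are made up for illustration; only the standard codecs are used.

```python
import codecs

text = u'ID;timestamp;CustomerID;Email'

with_bom = text.encode('utf-8-sig')   # BOM bytes are prepended
without_bom = text.encode('utf-8')    # no BOM

# The only difference between the two results is the 3-byte BOM prefix.
same_apart_from_bom = (with_bom == codecs.BOM_UTF8 + without_bom)

# Decoding with 'utf-8-sig' drops a leading BOM,
# while plain 'utf-8' keeps it as the character U+FEFF.
decoded_sig = with_bom.decode('utf-8-sig')
decoded_plain = with_bom.decode('utf-8')
```

This is why a header read as plain UTF-8 bytes still starts with '\xef\xbb\xbf': the BOM is part of the data unless the reader is told to strip it.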

To handle it automatically when reading a CSV file in Python 2, you can use the codecs module:

import codecs
import csv
with open(file_path, 'r') as infile:
    reader = csv.reader(codecs.EncodedFile(infile, 'utf-8', 'utf-8-sig'), delimiter=";")

EncodedFile wraps the original file object, decoding it as utf-8-sig (which skips the BOM) and re-encoding it as utf-8 with no BOM.
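The recoding step can be checked in isolation. The sketch below uses an in-memory `io.BytesIO` as a stand-in for A.csv, and the sample row is invented for illustration; it only shows that the bytes coming out of `codecs.EncodedFile` no longer carry the BOM.

```python
import codecs
import io

# Stand-in for A.csv: UTF-8 bytes with a leading BOM (sample data is made up).
raw = (codecs.BOM_UTF8 +
       b'ID;timestamp;CustomerID;Email\r\n'
       b'1;2020-01-01;42;someone@example.com\r\n')

# Read side decodes as 'utf-8-sig' (dropping the BOM),
# and the wrapper hands back plain 'utf-8' bytes.
recoded = codecs.EncodedFile(io.BytesIO(raw), 'utf-8', 'utf-8-sig')
lines = recoded.read().splitlines()

# The first field is now 'ID' rather than '\xef\xbb\xbfID'.
header = lines[0].split(b';')
```

With the BOM removed on the read side, a lookup such as `first_row.index('ID')` in the question's code would match the first column name directly.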

