"TypeError: object is not callable" when using tokenize.detect_encoding()

Problem description

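I am reading a bunch of txt.gz files, but they have different encodings (at least UTF-8 and cp1252; they are old, dirty files). I try to detect the encoding of fIn before reading it in text mode, but I get the error: TypeError: 'GzipFile' object is not callable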

The corresponding code:

# detect encoding
with gzip.open(fIn, 'rb') as file:
    fInEncoding = tokenize.detect_encoding(file)  # this doesn't work
    print(fInEncoding)

for line in gzip.open(fIn, 'rt', encoding=fInEncoding[0], errors="surrogateescape"):
    if line.find("From ") == 0:
        if lineNum != 0:
            out.write("\n")
        lineNum += 1
        line = line.replace(" at ", "@")
    out.write(line)

Traceback

$ ./mailmanToMBox.py list-cryptography.metzdowd.com
 ('Converting ', '2015-May.txt.gz', ' to mbox format')
 Traceback (most recent call last):
  File "./mailmanToMBox.py", line 65, in <module>
    main()
  File "./mailmanToMBox.py", line 27, in main
    if not makeMBox(inFile,outFile):
  File "./mailmanToMBox.py", line 48, in makeMBox
    fInEncoding = tokenize.detect_encoding(file.readline()) #this doesn't works                                                         
  File "/Users/simon/anaconda3/lib/python3.6/tokenize.py", line 423, in detect_encoding                                                 
    first = read_or_stop()
  File "/Users/simon/anaconda3/lib/python3.6/tokenize.py", line 381, in read_or_stop                                                    
    return readline()
 TypeError: 'bytes' object is not callable

Edit: I tried using the following code:

# detect encoding
readsource =  gzip.open(fIn,'rb').__next__
fInEncoding = tokenize.detect_encoding(readsource)
print(fInEncoding)

I get no error, but it always returns utf-8, even when the file is not UTF-8. My text editor (Sublime) correctly detects the cp1252 encoding.

Tags: python, python-3.x

Solution


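As the documentation for detect_encoding() says, its argument must be a callable that provides lines of input, such as a file object's readline method. Passing the file object itself is why you get TypeError: 'GzipFile' object is not callable. Passing file.readline() (with parentheses) fails in the same way with 'bytes' object is not callable, because that hands over the first line rather than the method.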

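import gzip
import tokenize

# Pass the bound method f.readline (no parentheses), not the file object.
# gzip.open is used because fIn is a .gz file in the question.
with gzip.open(fIn, 'rb') as f:
    codec = tokenize.detect_encoding(f.readline)[0]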

...codec will then be "utf-8" or something similar.
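On the follow-up observation that the result is always utf-8: detect_encoding() is meant for Python source files. It only looks for a UTF-8 BOM or a PEP 263 coding declaration in the first two lines and otherwise returns the default 'utf-8', so it cannot tell cp1252 mail archives from UTF-8 ones. A minimal sketch of one possible alternative, assuming the only candidates are UTF-8 and cp1252 as in the question, is to try a strict decode with each in turn. The function name guess_gzip_text_encoding and its candidates parameter are hypothetical names chosen for illustration; this is a heuristic, not a general encoding detector.

```python
import gzip


def guess_gzip_text_encoding(path, candidates=("utf-8", "cp1252")):
    """Return the first candidate that strictly decodes the whole file.

    Hypothetical helper: the candidate list is an assumption taken from
    the question (UTF-8 and cp1252). Returns None if none of them fit.
    """
    # Read the decompressed bytes once; this loads the whole file into memory.
    with gzip.open(path, "rb") as f:
        raw = f.read()

    for name in candidates:
        try:
            # Strict decoding raises UnicodeDecodeError on invalid bytes.
            raw.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

UTF-8 should be tried first, because cp1252 accepts almost any byte sequence and would otherwise mask valid UTF-8 files. The returned name can then be passed as the encoding argument of gzip.open(fIn, 'rt', ...). A pure-ASCII file is reported as utf-8, which is harmless since ASCII decodes identically under both encodings.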

