首页 > 解决方案 > Make a new list in CSV without using pandas: return UnicodeDecodeError

问题描述

I am trying to make a new list in my existing csv file (not using pandas). Here is my code:

with open ('/Users/Weindependent/Desktop/dataset/albumlist.csv','r') as case0:
    reader = csv.DictReader(case0)
    album = []
    for row in reader:
        album.append(row)
print ("Number of albums is:",len(album))

The CSV file was downloaded from the Rolling Stone's Top 500 albums data set on data.world.

My logic is to create an empty list named album and have all the records in this list. But it seems the line of for row in reader has some issue.

the error message I got is:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 1040: invalid continuation byte

Can anyone let me know what I did wrong?

标签: pythonpython-3.xcsvencoding

解决方案


You need to open the file in the correct codec; UTF-8 is not the correct one. The dataset doesn't specify it, but I have determined that the most likely codec is mac_roman:

with open ('/Users/Weindependent/Desktop/dataset/albumlist.csv', 'r', encoding='mac_roman') as case0:

The original Kaggle dataset doesn't bother to document it, and the various kernels that use the set all just clobber the encoding. It's clearly a 8-bit Latin-variant (the majority of the data is ASCII with a few individual 8-bit codepoints).

So I analysed the data, and found there are just two such codepoints in 9 rows:

>>> import re
>>> eightbit = re.compile(rb'[\x80-\xff]')
>>> with open('albumlist.csv', 'rb') as bindata:
...     nonascii = [l for l in bindata if eightbit.search(l)]
...
>>> len(nonascii)
9
>>> {c for l in nonascii for c in eightbit.findall(l)}
{b'\x89', b'\xca'}

The 0x89 byte appears in just one line:

>>> sum(l.count(b'\x89') for l in nonascii)
1
>>> sum(l.count(b'\xca') for l in nonascii)
22
>>> next(l for l in nonascii if b'\x89' in l)
b'359,1972,Honky Ch\x89teau,Elton John,Rock,"Pop Rock,\xcaClassic Rock"\r\n'

That's clearly Elton John's 1972 Honky Château album, so the 0x89 byte must represent the U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX codepoint.

The 0xCA bytes all appear to represent an alternative space character, they all appear righ after commas in the genre and subgenre columns (with one album exception):

>>> import csv
>>> for row in csv.reader((l.decode('ascii', 'backslashreplace') for l in nonascii)):
...     for col in row:
...         if '\\' in col: print(col)
...
Reggae,\xcaPop,\xcaFolk, World, & Country,\xcaStage & Screen
Reggae,\xcaRoots Reggae,\xcaRocksteady,\xcaContemporary,\xcaSoundtrack
Electronic,\xcaStage & Screen
Soundtrack,\xcaDisco
Rock,\xcaBlues
Blues Rock,\xcaElectric Blues,\xcaHarmonica Blues
Garage Rock,\xcaPsychedelic Rock
Honky Ch\x89teau
Pop Rock,\xcaClassic Rock
Funk / Soul,\xcaFolk, World, & Country
Rock,\xcaPop
Stan Getz\xca/\xcaJoao Gilberto\xcafeaturing\xcaAntonio Carlos Jobim
Bossa Nova,\xcaLatin Jazz
Lo-Fi,\xcaIndie Rock

These 0xCA bytes are almost certainly representing the U+00A0 NO-BREAK SPACE codepoint.

With these two mappings, you can try to determine what 8-bit codecs would make the same mapping. Rather than manually try out all Python's codecs I used Tripleee's 8-bit codec mapping to see what codecs use these mappings. There are only two:

  • 0x89

â‎ (U+00E2): mac_arabic, mac_croatian, mac_farsi, mac_greek, mac_iceland, mac_roman, mac_romanian, mac_turkish

  • 0xca

    ‎ (U+00A0): mac_centeuro, mac_croatian, mac_cyrillic, mac_greek, mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish

There are 6 encodings that are listed in both sets:

>>> set1 = set('mac_arabic, mac_croatian, mac_farsi, mac_greek, mac_iceland, mac_roman, mac_romanian, mac_turkish'.split(', '))
>>> set2 = set('mac_centeuro, mac_croatian, mac_cyrillic, mac_greek, mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish'.split(', '))
>>> set1 & set2
{'mac_turkish', 'mac_iceland', 'mac_romanian', 'mac_greek', 'mac_croatian', 'mac_roman'}

Of these, Mac OS Roman mac_roman codec is probably the most likely to have been used as Microsoft Excel for Mac used Mac Roman to create CSV files for a long time. However, it doesn't really matter, any of those 6 would work here.

You may want to replace those U+00A0 non-breaking spaces if you want to split out the genre and subgenre columns (really the genre and style columns if these were taken from Discogs).


推荐阅读