python - Make a new list in CSV without using pandas: return UnicodeDecodeError
问题描述
I am trying to make a new list in my existing csv file (not using pandas). Here is my code:
with open ('/Users/Weindependent/Desktop/dataset/albumlist.csv','r') as case0:
reader = csv.DictReader(case0)
album = []
for row in reader:
album.append(row)
print ("Number of albums is:",len(album))
The CSV file was downloaded from the Rolling Stone's Top 500 albums data set on data.world.
My logic is to create an empty list named album and have all the records in this list. But it seems the line of for row in reader
has some issue.
the error message I got is:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xca in position 1040: invalid continuation byte
Can anyone let me know what I did wrong?
解决方案
You need to open the file in the correct codec; UTF-8 is not the correct one. The dataset doesn't specify it, but I have determined that the most likely codec is mac_roman
:
with open ('/Users/Weindependent/Desktop/dataset/albumlist.csv', 'r', encoding='mac_roman') as case0:
The original Kaggle dataset doesn't bother to document it, and the various kernels that use the set all just clobber the encoding. It's clearly a 8-bit Latin-variant (the majority of the data is ASCII with a few individual 8-bit codepoints).
So I analysed the data, and found there are just two such codepoints in 9 rows:
>>> import re
>>> eightbit = re.compile(rb'[\x80-\xff]')
>>> with open('albumlist.csv', 'rb') as bindata:
... nonascii = [l for l in bindata if eightbit.search(l)]
...
>>> len(nonascii)
9
>>> {c for l in nonascii for c in eightbit.findall(l)}
{b'\x89', b'\xca'}
The 0x89 byte appears in just one line:
>>> sum(l.count(b'\x89') for l in nonascii)
1
>>> sum(l.count(b'\xca') for l in nonascii)
22
>>> next(l for l in nonascii if b'\x89' in l)
b'359,1972,Honky Ch\x89teau,Elton John,Rock,"Pop Rock,\xcaClassic Rock"\r\n'
That's clearly Elton John's 1972 Honky Château album, so the 0x89 byte must represent the U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX codepoint.
The 0xCA bytes all appear to represent an alternative space character, they all appear righ after commas in the genre and subgenre columns (with one album exception):
>>> import csv
>>> for row in csv.reader((l.decode('ascii', 'backslashreplace') for l in nonascii)):
... for col in row:
... if '\\' in col: print(col)
...
Reggae,\xcaPop,\xcaFolk, World, & Country,\xcaStage & Screen
Reggae,\xcaRoots Reggae,\xcaRocksteady,\xcaContemporary,\xcaSoundtrack
Electronic,\xcaStage & Screen
Soundtrack,\xcaDisco
Rock,\xcaBlues
Blues Rock,\xcaElectric Blues,\xcaHarmonica Blues
Garage Rock,\xcaPsychedelic Rock
Honky Ch\x89teau
Pop Rock,\xcaClassic Rock
Funk / Soul,\xcaFolk, World, & Country
Rock,\xcaPop
Stan Getz\xca/\xcaJoao Gilberto\xcafeaturing\xcaAntonio Carlos Jobim
Bossa Nova,\xcaLatin Jazz
Lo-Fi,\xcaIndie Rock
These 0xCA bytes are almost certainly representing the U+00A0 NO-BREAK SPACE codepoint.
With these two mappings, you can try to determine what 8-bit codecs would make the same mapping. Rather than manually try out all Python's codecs I used Tripleee's 8-bit codec mapping to see what codecs use these mappings. There are only two:
- 0x89
â (U+00E2): mac_arabic, mac_croatian, mac_farsi, mac_greek, mac_iceland, mac_roman, mac_romanian, mac_turkish
0xca
(U+00A0): mac_centeuro, mac_croatian, mac_cyrillic, mac_greek, mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish
There are 6 encodings that are listed in both sets:
>>> set1 = set('mac_arabic, mac_croatian, mac_farsi, mac_greek, mac_iceland, mac_roman, mac_romanian, mac_turkish'.split(', '))
>>> set2 = set('mac_centeuro, mac_croatian, mac_cyrillic, mac_greek, mac_iceland, mac_latin2, mac_roman, mac_romanian, mac_turkish'.split(', '))
>>> set1 & set2
{'mac_turkish', 'mac_iceland', 'mac_romanian', 'mac_greek', 'mac_croatian', 'mac_roman'}
Of these, Mac OS Roman mac_roman
codec is probably the most likely to have been used as Microsoft Excel for Mac used Mac Roman to create CSV files for a long time. However, it doesn't really matter, any of those 6 would work here.
You may want to replace those U+00A0 non-breaking spaces if you want to split out the genre and subgenre columns (really the genre and style columns if these were taken from Discogs).
推荐阅读
- python - 如何使用 HuggingFace 变压器管道?
- ios - 从 View Controller 访问 subView 而无需额外的委托
- mysql - 使用基于光标的分页对实时数据进行分页的完美方法是什么?
- python - 如何将 Raspberry Pi 上的 python 更新到至少 3.6?
- node.js - 通过 Flutter 上传照片 graphql
- typescript - 扩展抽象类时接口和类合并
- android - Kotlin 如何获取和打开视频
- javascript - 如何在Vue.js中的v-for期间为给定数组中的数据属性分配随机值?
- typescript - 使用 TypeScript 在 mongoose.Schema 的 `required` 字段中使用函数
- javascript - 如何基于作为参数传递的动态模式的猫鼬模型?