TypeError when extracting bigrams with Gensim (Python)

Problem description

I want to extract and print bigrams using Gensim. To do so, I ran the following code in Google Colab:

import gensim.downloader as api
from gensim.models import Word2Vec
from gensim.corpora import WikiCorpus, Dictionary
from gensim.models import Phrases
from gensim.models.phrases import Phraser
from collections import Counter

data = api.load("text8") # wikipedia corpus
bigram = Phrases(data, min_count=3, threshold=10)


cntr = Counter()
for key in bigram.vocab.keys():
  if len(key.split('_')) > 1:
    cntr[key] += bigram.vocab[key]

for key, counts in cntr.most_common(50):
  print(key, " - ", counts)

But it raises an error:

TypeError

Then I tried this:

cntr = Counter()
for key in bigram.vocab.keys():
  if len(key.split(b'_')) > 1:
    cntr[key] += bigram.vocab[key]

for key, counts in cntr.most_common(50):
  print(key, " - ", counts)

And got:

The error again

What is going wrong?

Tags: python, machine-learning, nlp, gensim

Solution


Check the type of the keys in bigram.vocab:

bigram_token = list(bigram.vocab.keys())
type(bigram_token[0])

# output
bytes
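The mismatch can be reproduced without Gensim. The sketch below is only an illustration: `try_split`, `bytes_key`, and `str_key` are hypothetical names, and `new_york` is a made-up key rather than one taken from the corpus. It shows that `split` needs a separator of the same type as the object it is called on.

```python
def try_split(value, sep):
    """Return the split parts, or the exception name if split() rejects the separator."""
    try:
        return value.split(sep)
    except TypeError as exc:
        return type(exc).__name__

bytes_key = b"new_york"  # hypothetical key of type bytes, as reported above
str_key = "new_york"     # the same key as a str

results = {
    "bytes key, str separator": try_split(bytes_key, "_"),
    "str key, bytes separator": try_split(str_key, b"_"),
    "decoded key, str separator": try_split(bytes_key.decode("utf-8"), "_"),
}

for case, outcome in results.items():
    print(case, "->", outcome)
```

Only the last case splits cleanly; both mixed-type calls raise a TypeError.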

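Decode each key to a string, only at the point where you split it, and the problem in your code goes away:

cntr = Counter()
for key in bigram.vocab.keys():
    # decode the bytes key to str so it matches the str separator '_'
    if len(key.decode('utf-8').split('_')) > 1:
        cntr[key] += bigram.vocab[key]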
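If the code has to run in environments where the key type may differ (an assumption here: some Gensim versions may store vocab keys as str rather than bytes), a type check before decoding avoids both errors. The following is a minimal sketch, not part of Gensim: `count_bigrams` is a hypothetical helper, and `sample_vocab` is a made-up stand-in for `bigram.vocab`.

```python
from collections import Counter

def count_bigrams(vocab, delimiter="_"):
    """Count entries of a token -> frequency mapping whose key contains the delimiter.

    Keys may be bytes or str; bytes keys are decoded as UTF-8 first.
    """
    counts = Counter()
    for key, freq in vocab.items():
        # Normalize the key to str so it matches the str delimiter
        token = key.decode("utf-8") if isinstance(key, bytes) else key
        if len(token.split(delimiter)) > 1:
            counts[token] += freq
    return counts

# Hypothetical stand-in for bigram.vocab, mixing bytes and str keys on purpose
sample_vocab = {b"new": 10, b"york": 7, b"new_york": 5, "machine_learning": 4}

top = count_bigrams(sample_vocab).most_common(2)
for token, freq in top:
    print(token, " - ", freq)
```

With a real model, the call would presumably be `count_bigrams(bigram.vocab).most_common(50)`. Storing the decoded token in the Counter also means the printed keys appear as plain strings rather than as `b'...'` literals.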
