python - What mistake did I make in my bag-of-words function?
Problem description
I am new to Python. I wrote a function that builds a bag-of-words representation.
DICT_SIZE = 5000
WORDS_TO_INDEX = words_counts
"""INDEX_TO_WORDS = ####### YOUR CODE HERE #######"""
ALL_WORDS = WORDS_TO_INDEX.keys()
The function:
def my_bag_of_words(text, words_to_index, dict_size):
    """
        text: a string
        dict_size: size of the dictionary

        return a vector which is a bag-of-words representation of 'text'
    """
    result_vector = np.zeros(dict_size)
    sentence_tokens = nltk.word_tokenize(text)
    attributes = []
    for i, k in words_to_index.items():
        if k<dict_size:
            attributes.append(i)
    for i in attributes:
        for k in sentence_tokens:
            if i==k:
                result_vector[attributes.index(i)]=+1
    return result_vector
I tested the function, and the test passes:
def test_my_bag_of_words():
    words_to_index = {'hi': 0, 'you': 1, 'me': 2, 'are': 3}
    examples = ['hi how are you']
    answers = [[1, 1, 0, 1]]
    for ex, ans in zip(examples, answers):
        if (my_bag_of_words(ex, words_to_index, 4) != ans).any():
            print(my_bag_of_words(ex, words_to_index, 4))
            return "Wrong answer for the case: '%s'" % ex
    return 'Basic tests are passed.'

print(test_my_bag_of_words())
Basic tests are passed.
After that, I wanted to apply it to all the texts in the dataset:
X_train_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_train])
X_val_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_val])
X_test_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_test])
print('X_train shape ', X_train_mybag.shape)
print('X_val shape ', X_val_mybag.shape)
print('X_test shape ', X_test_mybag.shape)
In this case I get an error:
IndexError Traceback (most recent call last)
<ipython-input-30-364e76658e6f> in <module>()
----> 1 X_train_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_train])
2 X_val_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_val])
3 X_test_mybag = sp_sparse.vstack([sp_sparse.csr_matrix(my_bag_of_words(text, WORDS_TO_INDEX, DICT_SIZE)) for text in X_test])
4 print('X_train shape ', X_train_mybag.shape)
5 print('X_val shape ', X_val_mybag.shape)
1 frames
<ipython-input-25-814e004d61c2> in my_bag_of_words(text, words_to_index, dict_size)
20 for k in sentence_tokens:
21 if i==k:
---> 22 result_vector[attributes.index(i)]=+1
23 return result_vector
IndexError: index 5000 is out of bounds for axis 0 with size 5000
Could someone help me understand what mistake I made in the code of my_bag_of_words?
Solution
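The variable words_to_index contains indices higher than your vocabulary limit. You should either increase the vocabulary limit or make sure words_to_index contains only indices < 5000 (for example, by discarding the least frequent words).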
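One way to follow this advice is sketched below. It assumes that words_counts is a dict mapping each word to its frequency, which the name suggests but the question does not confirm. The helper names build_words_to_index and bag_of_words are hypothetical, and plain whitespace splitting stands in for nltk.word_tokenize so that the sketch has no NLTK dependency. The idea is to keep only the dict_size most frequent words, number them 0 to dict_size - 1, and look each token up directly instead of searching a list.

```python
from collections import Counter

import numpy as np

DICT_SIZE = 5000  # vocabulary limit, as in the question


def build_words_to_index(words_counts, dict_size):
    """Map the dict_size most frequent words to indices 0..dict_size-1.

    Assumes words_counts maps word -> frequency (an assumption, see above).
    """
    most_common = Counter(words_counts).most_common(dict_size)
    return {word: index for index, (word, _count) in enumerate(most_common)}


def bag_of_words(text, words_to_index, dict_size):
    """Return a count vector of length dict_size for 'text'."""
    result_vector = np.zeros(dict_size)
    # Whitespace split is a stand-in for nltk.word_tokenize.
    for token in text.split():
        index = words_to_index.get(token)
        # Skip unknown words and any index outside the vector.
        if index is not None and index < dict_size:
            result_vector[index] += 1  # '+= 1' increments; '=+1' assigns 1
    return result_vector


# Hypothetical toy counts standing in for the real words_counts.
toy_counts = {'hi': 10, 'you': 8, 'me': 5, 'are': 3, 'how': 1}
toy_index = build_words_to_index(toy_counts, 4)
vector = bag_of_words('hi how are you', toy_index, 4)
print(toy_index)
print(vector)
```

With the real data, the call would presumably look like WORDS_TO_INDEX = build_words_to_index(words_counts, DICT_SIZE). Because every stored index is then below DICT_SIZE, the vector can be indexed directly and the out-of-bounds access cannot occur.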