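python - Bag of words tokenizes every letter instead of every word
Problem description
I found a bag-of-words implementation online. I am reading a text file that contains many sentences, and I pass them through generate_bagOfWords: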
import re
import numpy

def stopword_clean(sentence):
    # Stop words to drop; the comparison below lowercases each word first
    ignore = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself",
              "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself",
              "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these",
              "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do",
              "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while",
              "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before",
              "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",
              "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
              "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",
              "too", "very", "s", "t", "can", "will", "just", "don", "should", "now", "This", "It", "I"]
    # Replace every non-word character with a space, then split on whitespace
    words = re.sub(r"[^\w]", " ", sentence).split()
    clean_text = [w for w in words if w.lower() not in ignore]
    return clean_text

def tokenize(sentences):
    words = []
    for sentence in sentences:
        x = stopword_clean(sentence)
        words.extend(x)
    words = sorted(list(set(words)))
    return words

def generate_bagOfWords(finalsentences):
    vocab = tokenize(finalsentences)
    print("Word list for document \n{0} \n".format(vocab))
    for sentence in finalsentences:
        words = stopword_clean(sentence)
        bag_vector = numpy.zeros(len(vocab))
        # Count how often each vocabulary word occurs in this sentence
        for w in words:
            for i, word in enumerate(vocab):
                if word == w:
                    bag_vector[i] += 1
        print("{0} \n{1}\n".format(sentence, numpy.array(bag_vector)))
The problem is that I read in many sentences like this:
trainingFile = open(r"D:\Desktop\\1565964985_2925534_train_file.data", "r")
# arrays for the sentiments and reviews
sentiment = []
review = []
# for loop that reads each line
for line in trainingFile:
    # data field array separated by tab
    dataFields = line.split("\t")
    # sentiment holds the positive or negative sentiment of the review
    sentiment.append(dataFields[0])
    # review holds the text from the review
    review.append(dataFields[1])
That gives me entries indexed like this:
Review[0]: This book is such a life saver. It has been so helpful to be able to go back to track trends, answer pediatrician questions, or communicate with each other when you are up at different times of the night with a newborn. I think it is one of those things that everyone should be required to have before they leave the hospital. We went through all the pages of the newborn version, then moved to the infant version, and will finish up the second infant book (third total) right as our baby turns 1. See other things that are must haves for baby at [...]
For this entry I get the following output:
['1', 'W', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 'u', 'v', 'w', 'y']
T
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
h
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
i
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
s
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
and so on, one character at a time, for the whole sentence.
However, if I do this:
review[0] = ["This book is such a life saver. It has been so helpful to be able to go back to track trends, answer pediatrician questions, or communicate with each other when you are up at different times of the night with a newborn. I think it is one of those things that everyone should be required to have before they leave the hospital. We went through all the pages of the newborn version, then moved to the infant version, and will finish up the second infant book (third total) right as our baby turns 1. See other things that are must haves for baby at [...]"]
it works fine:
['1', 'See', 'able', 'answer', 'baby', 'back', 'book', 'communicate', 'different', 'everyone', 'finish', 'go', 'haves', 'helpful', 'hospital', 'infant', 'leave', 'life', 'moved', 'must', 'newborn', 'night', 'one', 'pages', 'pediatrician', 'questions', 'required', 'right', 'saver', 'second', 'things', 'think', 'third', 'times', 'total', 'track', 'trends', 'turns', 'version', 'went']
This book is such a life saver. It has been so helpful to be able to go back to track trends, answer pediatrician questions, or communicate with each other when you are up at different times of the night with a newborn. I think it is one of those things that everyone should be required to have before they leave the hospital. We went through all the pages of the newborn version, then moved to the infant version, and will finish up the second infant book (third total) right as our baby turns 1. See other things that are must haves for baby at [...]
[1. 1. 1. 1. 2. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 2. 1. 1. 1.
1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 2. 1.]
What is happening here, and how can I turn the entries of the whole array into strings that work correctly?
Solution
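The symptom is consistent with generate_bagOfWords receiving a single string (for example review[0]) rather than a list of strings. The call itself is not shown in the question, so this is an assumption. Iterating over a Python str yields one character at a time, so every letter is treated as its own "sentence"; letters such as i, a, s and t then vanish from the vocabulary because they appear in the stop-word list. A minimal sketch of the effect and one possible fix, using a shortened stop-word list and a hypothetical helper as_sentence_list that is not part of the original code:

```python
import re

# Shortened stop-word list for illustration only; the question uses a longer one.
IGNORE = {"this", "is", "a", "it", "i", "s", "t"}

def stopword_clean(sentence):
    # Replace non-word characters with spaces, split, then drop stop words.
    words = re.sub(r"[^\w]", " ", sentence).split()
    return [w for w in words if w.lower() not in IGNORE]

def tokenize(sentences):
    words = []
    for sentence in sentences:
        words.extend(stopword_clean(sentence))
    return sorted(set(words))

def as_sentence_list(data):
    """Hypothetical helper: wrap a lone string in a list of one sentence."""
    return [data] if isinstance(data, str) else list(data)

text = "This book is a life saver"

# A str is iterated character by character, so each letter is a "sentence".
vocab_from_string = tokenize(text)

# Wrapped in a list, the loop sees the whole sentence once.
vocab_from_list = tokenize(as_sentence_list(text))

print(vocab_from_string)
print(vocab_from_list)
```

With the variables from the question, this would correspond to calling generate_bagOfWords(review) to process every review, or generate_bagOfWords([review[0]]) to process only the first one. It also explains why assigning a one-element list to review[0] appeared to fix the problem.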