Bag of words tokenizes each letter instead of each word

Problem description

I found a bag-of-words implementation online. I am reading a text file that contains many sentences, and I pass them to generate_bagOfWords:

import re

import numpy


def stopword_clean(sentence):
    ignore = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself",
              "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself",
              "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these",
              "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do",
              "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while",
              "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before",
              "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",
              "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
              "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",
              "too", "very", "s", "t", "can", "will", "just", "don", "should", "now", "This", "It", "I"]

    words = re.sub(r"[^\w]", " ", sentence).split()
    clean_text = [w for w in words if w.lower() not in ignore]
    return clean_text


def tokenize(sentences):
    words = []
    for sentence in sentences:
        x = stopword_clean(sentence)
        words.extend(x)

    words = sorted(list(set(words)))
    return words


def generate_bagOfWords(finalsentences):
    vocab = tokenize(finalsentences)
    print("Word list for document \n{0} \n".format(vocab))

    for sentence in finalsentences:
        words = stopword_clean(sentence)
        bag_vector = numpy.zeros(len(vocab))
        for w in words:
            for i, word in enumerate(vocab):
                if word == w:
                    bag_vector[i] += 1

        print("{0} \n{1}\n".format(sentence, numpy.array(bag_vector)))      

The problem is that I read in many sentences like this:

trainingFile = open(r"D:\Desktop\\1565964985_2925534_train_file.data", "r")

# arrays for the sentiments and reviews
sentiment = []
review = []

# for loop that reads each line
for line in trainingFile:
    # data field array separated by tab
    dataFields = line.split("\t")

    # sentiment holds the positive or negative sentiment of the review
    sentiment.append(dataFields[0])
    # review holds the text from the review
    review.append(dataFields[1])

This gives me entries indexed like this:

Review[0]: This book is such a life saver.  It has been so helpful to be able to go back to track trends, answer pediatrician questions, or communicate with each other when you are up at different times of the night with a newborn.  I think it is one of those things that everyone should be required to have before they leave the hospital.  We went through all the pages of the newborn version, then moved to the infant version, and will finish up the second infant book (third total) right as our baby turns 1.  See other things that are must haves for baby at [...]

When I pass review[0] to generate_bagOfWords, I get this output:

['1', 'W', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 'u', 'v', 'w', 'y']
T 
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

h 
[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

i 
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

s 
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

and so on, one character at a time, for the entire sentence.

However, if I do this:

    review[0] = ["This book is such a life saver.  It has been so helpful to be able to go back to track trends, answer pediatrician questions, or communicate with each other when you are up at different times of the night with a newborn.  I think it is one of those things that everyone should be required to have before they leave the hospital.  We went through all the pages of the newborn version, then moved to the infant version, and will finish up the second infant book (third total) right as our baby turns 1.  See other things that are must haves for baby at [...]"]

it works correctly:

['1', 'See', 'able', 'answer', 'baby', 'back', 'book', 'communicate', 'different', 'everyone', 'finish', 'go', 'haves', 'helpful', 'hospital', 'infant', 'leave', 'life', 'moved', 'must', 'newborn', 'night', 'one', 'pages', 'pediatrician', 'questions', 'required', 'right', 'saver', 'second', 'things', 'think', 'third', 'times', 'total', 'track', 'trends', 'turns', 'version', 'went']
This book is such a life saver.  It has been so helpful to be able to go back to track trends, answer pediatrician questions, or communicate with each other when you are up at different times of the night with a newborn.  I think it is one of those things that everyone should be required to have before they leave the hospital.  We went through all the pages of the newborn version, then moved to the infant version, and will finish up the second infant book (third total) right as our baby turns 1.  See other things that are must haves for baby at [...] 
[1. 1. 1. 1. 2. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 2. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 2. 1.]

What is happening here, and how can I convert the whole array into strings that work properly?

Tags: python, string, list

Solution
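
A likely explanation, based on the code in the question: tokenize and generate_bagOfWords both loop with "for sentence in ...", so they expect a list of strings. review[0] is a single string, and iterating over a Python string yields one character at a time, so each character is treated as a separate "sentence". Letters such as "i", "s" and "t" are then dropped as stopwords, which is why the vocabulary is a list of single letters. Wrapping the string in a list, as in the working example, gives the functions the shape they expect. The sketch below reproduces the effect with a shortened stopword list; the names IGNORE and as_sentence_list are hypothetical and not part of the original post.

```python
import re

# Shortened stopword list, for illustration only (the post uses a longer one).
IGNORE = {"this", "is", "a", "such", "it", "i"}


def stopword_clean(sentence):
    """Split a sentence into words and drop stopwords."""
    words = re.sub(r"[^\w]", " ", sentence).split()
    return [w for w in words if w.lower() not in IGNORE]


def tokenize(sentences):
    """Build a sorted vocabulary from an iterable of sentences."""
    words = []
    for sentence in sentences:
        words.extend(stopword_clean(sentence))
    return sorted(set(words))


def as_sentence_list(data):
    """Hypothetical helper: wrap a lone string so it counts as one sentence."""
    return [data] if isinstance(data, str) else list(data)


text = "This book is such a life saver."

# Iterating over a str yields characters, so every letter becomes a "sentence".
char_vocab = tokenize(text)

# Wrapping the string in a list makes the whole text a single sentence.
word_vocab = tokenize(as_sentence_list(text))

print(char_vocab)  # single characters only
print(word_vocab)  # ['book', 'life', 'saver']
```

With the code from the question, this would correspond to calling generate_bagOfWords([review[0]]) for a single review, or generate_bagOfWords(review) to build one vocabulary across all reviews.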

