I found a bag-of-words implementation online. I am reading a text file that contains many sentences, and I pass them to generate_bagOfWords:

import re
import numpy

def stopword_clean(sentence):
    """Tokenize a sentence and drop common English stop words."""
    ignore = ["i", "me", "my", "myself", "we", "our", "ours", "ourselves", "you", "your", "yours", "yourself",
              "yourselves", "he", "him", "his", "himself", "she", "her", "hers", "herself", "it", "its", "itself",
              "they", "them", "their", "theirs", "themselves", "what", "which", "who", "whom", "this", "that", "these",
              "those", "am", "is", "are", "was", "were", "be", "been", "being", "have", "has", "had", "having", "do",
              "does", "did", "doing", "a", "an", "the", "and", "but", "if", "or", "because", "as", "until", "while",
              "of", "at", "by", "for", "with", "about", "against", "between", "into", "through", "during", "before",
              "after", "above", "below", "to", "from", "up", "down", "in", "out", "on", "off", "over", "under", "again",
              "further", "then", "once", "here", "there", "when", "where", "why", "how", "all", "any", "both", "each",
              "few", "more", "most", "other", "some", "such", "no", "nor", "not", "only", "own", "same", "so", "than",
              "too", "very", "s", "t", "can", "will", "just", "don", "should", "now", "This", "It", "I"]

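    # Replace every non-word character with a space, then split on whitespace.
    words = re.sub(r"[^\w]", " ", sentence).split()
    # Compare in lowercase so "This" and "this" are both dropped.
    clean_text = [w for w in words if w.lower() not in ignore]
    return clean_text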

def tokenize(sentences):
    words = []
    for sentence in sentences:
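        x = stopword_clean(sentence)
        words.extend(x)  # collect the cleaned words from every sentence

    # Deduplicate and sort to get a stable vocabulary order.
    words = sorted(set(words))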
    return words

def generate_bagOfWords(finalsentences):
    vocab = tokenize(finalsentences)
    print("Word list for document \n{0} \n".format(vocab))

    for sentence in finalsentences:
        words = stopword_clean(sentence)
        bag_vector = numpy.zeros(len(vocab))
        for w in words:
            for i, word in enumerate(vocab):
                if word == w:
                    bag_vector[i] += 1

        print("{0} \n{1}\n".format(sentence, bag_vector))


trainingFile = open(r"D:\Desktop\1565964985_2925534_train_file.data", "r")

# arrays for the sentiments and reviews
sentiment = []
review = []

# for loop that reads each line
for line in trainingFile:
    # data field array separated by tab
    dataFields = line.split("\t")

    # sentiment holds the positive or negative sentiment of the review
    # review holds the text from the review
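
The loop above is cut off before the fields are stored. A minimal sketch of how that step might look, assuming each line has the form `sentiment<TAB>review text` (the column order is a guess, the post does not confirm it) and using a hypothetical in-memory sample in place of the real file:

```python
import io

# Hypothetical stand-in for the training file. The real column order is not
# shown in the post, so "sentiment first, review second" is an assumption.
sample_file = io.StringIO(
    "+1\tThis book is such a life saver.\n"
    "-1\tThe pages fell out after a week.\n"
)

def read_reviews(handle):
    """Split each tab-separated line into a sentiment label and a review text."""
    sentiment, review = [], []
    for line in handle:
        fields = line.rstrip("\n").split("\t", 1)
        if len(fields) != 2:
            continue  # skip blank or malformed lines
        sentiment.append(fields[0])
        review.append(fields[1])
    return sentiment, review

sentiment, review = read_reviews(sample_file)
print(sentiment)
print(review[0])
```

With this layout every element of `review` is a plain string, not a list of sentences.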


Review[0]: This book is such a life saver.  It has been so helpful to be able to go back to track trends, answer pediatrician questions, or communicate with each other when you are up at different times of the night with a newborn.  I think it is one of those things that everyone should be required to have before they leave the hospital.  We went through all the pages of the newborn version, then moved to the infant version, and will finish up the second infant book (third total) right as our baby turns 1.  See other things that are must haves for baby at [...]


['1', 'W', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 'u', 'v', 'w', 'y']
[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

[0. 0. 0. 0. 0. 0. 0. 0. 1. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]

[0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0. 0.]
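
The single-character vocabulary is what you would expect if generate_bagOfWords received one string instead of a list of strings, because a `for` loop over a `str` yields its characters. A small sketch of the difference, using a shortened version of the review text (the helper name `units_seen` is made up for illustration):

```python
def units_seen(sentences):
    """Return the items a `for sentence in sentences` loop would visit."""
    return [item for item in sentences]

text = "This book"

# Passing the bare string: the loop visits one character at a time.
print(units_seen(text))    # ['T', 'h', 'i', 's', ' ', 'b', 'o', 'o', 'k']

# Wrapping it in a list: the loop visits the whole sentence once.
print(units_seen([text]))  # ['This book']
```

This is consistent with the four vectors printed above: 'T', 'i' and 's' are removed by the stop-word filter (lowercased, they match "t", "i" and "s" in the ignore list), while 'h' sits at index 8 of the vocabulary.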



    review[0] = ["This book is such a life saver.  It has been so helpful to be able to go back to track trends, answer pediatrician questions, or communicate with each other when you are up at different times of the night with a newborn.  I think it is one of those things that everyone should be required to have before they leave the hospital.  We went through all the pages of the newborn version, then moved to the infant version, and will finish up the second infant book (third total) right as our baby turns 1.  See other things that are must haves for baby at [...]"]


['1', 'See', 'able', 'answer', 'baby', 'back', 'book', 'communicate', 'different', 'everyone', 'finish', 'go', 'haves', 'helpful', 'hospital', 'infant', 'leave', 'life', 'moved', 'must', 'newborn', 'night', 'one', 'pages', 'pediatrician', 'questions', 'required', 'right', 'saver', 'second', 'things', 'think', 'third', 'times', 'total', 'track', 'trends', 'turns', 'version', 'went']
This book is such a life saver.  It has been so helpful to be able to go back to track trends, answer pediatrician questions, or communicate with each other when you are up at different times of the night with a newborn.  I think it is one of those things that everyone should be required to have before they leave the hospital.  We went through all the pages of the newborn version, then moved to the infant version, and will finish up the second infant book (third total) right as our baby turns 1.  See other things that are must haves for baby at [...] 
[1. 1. 1. 1. 2. 1. 2. 1. 1. 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 2. 1. 1. 1.
 1. 1. 1. 1. 1. 1. 2. 1. 1. 1. 1. 1. 1. 1. 2. 1.]
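
For comparison, the counting step can also be written with a word-to-index dictionary instead of scanning the whole vocabulary for every word. This is only a sketch, not the implementation from the post: `bag_of_words`, `clean`, the trimmed stop-word set, and the `str` guard are all assumptions added for illustration, and plain lists are used so it runs without NumPy.

```python
import re

# Trimmed stop-word set for the example; the post uses a much longer list.
STOPWORDS = {"this", "is", "a", "the", "it", "to", "such"}

def clean(sentence):
    """Split on non-word characters and drop stop words (case-insensitive)."""
    words = re.sub(r"[^\w]", " ", sentence).split()
    return [w for w in words if w.lower() not in STOPWORDS]

def bag_of_words(sentences):
    """Return (vocab, matrix) where matrix[r][c] counts vocab[c] in sentence r."""
    if isinstance(sentences, str):
        sentences = [sentences]  # guard: treat a bare string as one sentence
    cleaned = [clean(s) for s in sentences]
    vocab = sorted({w for words in cleaned for w in words})
    index = {word: i for i, word in enumerate(vocab)}  # word -> column
    matrix = [[0] * len(vocab) for _ in cleaned]
    for row, words in enumerate(cleaned):
        for w in words:
            matrix[row][index[w]] += 1
    return vocab, matrix

vocab, matrix = bag_of_words("This book is such a life saver. The book helps.")
print(vocab)
print(matrix)
```

The guard means a bare string and a one-element list give the same result, which avoids the character-level vocabulary.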


