首页 > 解决方案 > 我如何用这个 python 代码过滤一个段落它可以工作但只过滤一半的单词并留下其余的,我不知道原因

问题描述

首先,我创建空 DICTIONARY 来计算段落中的单词以及段落的变量

D={}
x="Wow, first off, a huge congrats. You're about to start the final project of the course. Amazing job making it all the way here. I hope you've been having as much fun as I have on this journey. We're going to be working on our final project using Jupyter notebooks. You might be starting to feel pretty confident using them. But remember if you have any issues you can always ask for help in the discussion forums. Okay, before we dive in, we're going to chat a little bit about what you'll be doing for the project, it's going to be really fun. The goal of the project is to create a word cloud. A word cloud is an image that's made up of different sized words. Usually the sizes of the words are determined by how many times each word appears in a specific text. To create the image itself, we're going to use an external Python module called creatively Word cloud. [LAUGH] Your job is to create a script that would go through the text and count how many times each word appears. We've done this a few times already, any ideas how we should tackle this one? If you're thinking of using a dictionary to count how many times each word appears, then. Ding ding ding, good answer! You're going to prepare a dictionary and use that as a parameter for the word cloud module, not too tricky, right? I think you can handle a little more, so two things you have to watch out for. One, punctuation marks, before counting the frequency of the words, you need to make sure that there are no punctuation marks in the text. If you don't, a string example with a comma at the end would be different from a string example with a dot at the end. So before you put words into the dictionary, you have to clean up the text to remove any punctuation marks. And the second thing we want to keep our word cloud interesting. Certain words in our language crop up a lot and if we include all of these we're going to get a pretty dull word cloud. Think about words like, the, two or if. They usually appear a whole lot in text but aren't too relevant to the text's overall message. We want our Cloud to show words that are relevant to the text we're using for the input. So you need to find a way to exclude irrelevant or uninteresting words when processing the text. For the input, you're going to upload a text file. You can choose any text file you like for your input. It could be the contents of a website, a full novel or even everything that one author has ever written. You just need to make sure that it's one text file, so that it can be processed by the code. Okay, before jumping into the project, remember you can re-watch this video If something isn't clear. Yep, I'm starting to sound like a broken record, but this time it's extra extra important. This final project is the real test of how much you've gotten your head around and can highlight areas you need to brush up on. So we want you to be super clear on what you need to do on that point. You'll find an overview of what you have to do in the next reading. Can you guess the best way of tackling this problem? Yep, you got it, our step-by-step approach that we outlined earlier. Understand the problem statement, research available options, plan your approach, write your code and finally execute. Okay, feeling good? Ready to go? Remember, you know this stuff and you've totally got it."

我想过滤掉这些符号组合的单词

t= "!()-[]{};:\,<>./?@#$%^&*_~'"

这是我编写的第一个代码,它使段落变为小写并将其拆分为列表

b=x.lower()
y=b.split()
print(y)

这是它过滤带有符号的单词的第二个代码,但它只过滤其中的一部分,除非我再执行几次”

    for i in y:
      for o in range(len(t)):
        if t[o] in i:
            if i not in y:
                continue
            W=y.index(i)
            y.pop(W)
    #print(y)
    print(len(y))

这些是我在符号之后过滤掉的词

uninteresting_words = ['a', "the", "to", "if", "is", "it", "of", "and", "or", "an", "as", "i", "me", "my", \
        "we", "our", "ours", "you", "your", "yours", "he", "she", "him", "his", "her", "hers", "its", "they", "them", \
        "their", "what", "which", "who", "whom", "this", "that", "am", "are", "was", "were", "be", "been", "being", \
        "have", "has", "had", "do", "does", "did", "but", "at", "by", "with", "from", "here", "when", "where", "how", \
        "all", "any", "both", "each", "few", "more", "some", "such", "no", "nor", "too", "very", "can", "will", "just"]

这是我为它编写的代码,但除非我执行几次,否则这段代码也不会过滤掉所有的单词

    count=0
    for n in y:
        for n in uninteresting_words:
            h=y.index(n)
            count+=1
            y.pop(h)
    print(count)        
    #print(y)
    print(len(y))

这是将单词添加到 dic 的代码

  for words in y:
            D[words]=D.get(words,0)+1
  #print(D)

标签: python

解决方案


问题:

问题是list.index并且list.pop只删除列表中第一次出现的项目

解决方案

改为使用列表推导:

代替

W=y.index(i)
y.pop(W)

h=y.index(n)
y.pop(h)

y = [x for x in y if x != i]

y = [x for x in y if x != n]

推荐阅读