Remove punctuation and stop words from a data frame

Problem description

My data frame looks like this:

State                           text
Delhi                  170 kw for330wp, shipping and billing in delhi...
Gujarat                4kw rooftop setup for home Photovoltaic Solar...
Karnataka              language barrier no requirements 1kw rooftop ...
Madhya Pradesh         Business PartnerDisqualified Mailed questionna...
Maharashtra            Rupdaypur, panskura(r.s) Purba Medinipur 150kw...

I want to remove punctuation and stop words from this data frame. I have written the following code, but it does not work:

import nltk
nltk.download('stopwords')

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import string
import collections
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
import matplotlib.cm as cm
%matplotlib inline
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import CountVectorizer
import re

def message_cleaning(message):
    Test_punc_removed = [char for char in message if char not in string.punctuation]
    Test_punc_removed_join = ''.join(Test_punc_removed)
    Test_punc_removed_join_clean = [word for word in Test_punc_removed_join.split() if word.lower() not in stopwords.words('english')]
    return Test_punc_removed_join_clean

df['text'] = df['text'].apply(message_cleaning)

AttributeError: 'set' object has no attribute 'words'

Tags: python-3.x, pandas, scikit-learn, nltk

Solution


Problem: I believe you have a name collision on stopwords. There is probably a line somewhere in your notebook where you assigned:

stopwords = set(stopwords.words("english"))

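That would explain the error, because the name stopwords becomes ambiguous: after such an assignment it refers to your variable (a set, according to the error message) rather than to the nltk.corpus.stopwords module, so stopwords.words no longer exists.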
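The shadowing can be reproduced without NLTK installed. The sketch below is only an illustration: `SimpleNamespace` is a hypothetical stand-in for `nltk.corpus.stopwords`, and the three words are made up for the demonstration.

```python
from types import SimpleNamespace

# Hypothetical stand-in for nltk.corpus.stopwords: any object with a words() method.
stopwords = SimpleNamespace(words=lambda language: ["the", "and", "in"])

# At this point the name still refers to the corpus-like object, so this works.
print(stopwords.words("english"))

# Rebinding the same name to a set shadows the original object.
stopwords = set(stopwords.words("english"))

try:
    stopwords.words("english")  # the set has no words() method
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)
```

Once the name has been rebound, every later cell that calls `stopwords.words(...)` fails in the same way until the import is run again.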

Solution: make things explicit:

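  1. First, assign the stop words to a variable with its own name (this is also faster than calling stopwords.words("english") for every word):
from nltk.corpus import stopwords
english_stop_words = set(stopwords.words("english"))
  2. Then use it in your function:
Test_punc_removed_join_clean = [
    word for word in Test_punc_removed_join.split()
    if word.lower() not in english_stop_words
]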
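Putting both steps together, the whole function might look like the sketch below. Two things here are assumptions rather than part of the answer: the small hand-written stop-word set is only a placeholder so that the example runs without the NLTK data (in the notebook it would be `set(stopwords.words("english"))`), and `str.translate` is swapped in for the character-by-character list comprehension from the question.

```python
import string

# Placeholder stop-word set (assumption) so the sketch runs without NLTK data.
# In the notebook: english_stop_words = set(stopwords.words("english"))
english_stop_words = {"for", "and", "in", "no", "the"}

# Translation table that deletes every ASCII punctuation character.
PUNCTUATION_TABLE = str.maketrans("", "", string.punctuation)


def message_cleaning(message, stop_words=english_stop_words):
    """Return the words of message with punctuation and stop words removed."""
    no_punctuation = message.translate(PUNCTUATION_TABLE)
    return [word for word in no_punctuation.split()
            if word.lower() not in stop_words]


# First row of the example data frame, as shown in the question.
cleaned = message_cleaning("170 kw for330wp, shipping and billing in delhi...")
print(cleaned)
```

With the data frame from the question, the function would be applied exactly as before: `df['text'] = df['text'].apply(message_cleaning)`.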
