Removing nonsense words in Python

Problem description

I want to remove meaningless (non-English) words from my dataset.

I tried the following, which I found on StackOverflow:

import nltk
words = set(nltk.corpus.words.words())

sent = "Io andiamo to the beach with my amico."
" ".join(w for w in nltk.wordpunct_tokenize(sent) \
     if w.lower() in words or not w.isalpha())

But now that my data is in a DataFrame, how do I apply this to every row of a column?

I tried something like this:

import nltk
words = set(nltk.corpus.words.words())

sent = df['Chats']
df['Chats'] = df['Chats'].apply(lambda w:" ".join(w for w in 
nltk.wordpunct_tokenize(sent) \
     if w.lower() in words or not w.isalpha()))

But I get the error TypeError: expected string or bytes-like object

Tags: python, machine-learning, nlp, nltk

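Solution

The error occurs because the lambda passes the whole column (sent, which is df['Chats']) to the tokenizer instead of its own argument, so the tokenizer receives a Series rather than a string. Something like the following creates a column Clean by applying your function to each row of the column Chats: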

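import nltk

# The word list must be downloaded once: nltk.download('words')
words = set(nltk.corpus.words.words())

def clean_sent(sent):
    # Keep tokens that are known English words or are not purely alphabetic
    # (punctuation, numbers); drop every other alphabetic token.
    return " ".join(w for w in nltk.wordpunct_tokenize(sent)
                    if w.lower() in words or not w.isalpha())

df['Clean'] = df['Chats'].apply(clean_sent)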
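
To see why the tokenizer rejects the whole column, the failure can be reproduced without pandas or NLTK. The sketch below is only an illustration under some assumptions: the list `column` stands in for df['Chats'], and `TOKEN_RE` is a hypothetical regex stand-in that approximates what nltk.wordpunct_tokenize does (runs of word characters, or runs of punctuation).

```python
import re

# Hypothetical stand-in for nltk.wordpunct_tokenize: runs of word
# characters, or runs of non-space punctuation.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

# Stand-in for df['Chats']: a collection of strings, one per row.
column = ["hello world", "ciao mondo"]

# Buggy pattern from the question: every call receives the whole column
# instead of the current row, so the regex engine raises TypeError.
try:
    [TOKEN_RE.findall(column) for row in column]
    error = None
except TypeError as exc:
    error = exc

# Fixed pattern: tokenize the row that was actually passed in.
tokens = [TOKEN_RE.findall(row) for row in column]
```

In the buggy version the loop variable is never used, which mirrors the lambda in the question ignoring its own argument.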

To update the Chats column itself, overwrite it with the result:

df['Chats'] = df['Chats'].apply(clean_sent)
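
The same TypeError can also appear when some rows are not strings at all, for example missing values. One possible guard is sketched below, not as the definitive fix but as an illustration: `VOCAB` is a small hypothetical vocabulary standing in for set(nltk.corpus.words.words()), `TOKEN_RE` approximates nltk.wordpunct_tokenize, and returning an empty string for non-string rows is an assumed design choice.

```python
import re

# Hypothetical stand-in vocabulary; the real code would use
# set(nltk.corpus.words.words()).
VOCAB = {"to", "the", "beach", "with", "my"}

# Hypothetical stand-in for nltk.wordpunct_tokenize.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+")

def clean_sent_safe(sent, vocab=VOCAB):
    # Non-string rows (None, NaN, numbers) cannot be tokenized;
    # here they are mapped to an empty string (an assumption).
    if not isinstance(sent, str):
        return ""
    return " ".join(
        w for w in TOKEN_RE.findall(sent)
        if w.lower() in vocab or not w.isalpha()
    )

# Stand-in for a Chats column that contains missing values.
chats = ["Io andiamo to the beach with my amico.", None, float("nan")]
cleaned = [clean_sent_safe(c) for c in chats]
```

With pandas, a guarded function of this shape can be passed to df['Chats'].apply(...) in the same way as clean_sent. Whether missing rows should become empty strings or stay missing depends on the downstream processing.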
