Removing stop-word-like words from a column

Problem description

I have a DataFrame with an object-dtype column and more than 100,000 rows, like this:

    df['words']
 0 the
 1 to
 2 of
 3 a
 4 with
 5 as
 6 job
 7 mobil
 8 market
 9 think
 10....

Desired output without stop words:

   df['words']
 0 way
 1 http
 2 internet
 3 car
 4 do
 5 want
 6 work
 7 uber
 8....

Is there a way to use gensim, spacy, or nltk to filter common stop words out of a single column?

I tried:

from gensim.parsing.preprocessing import remove_stopwords
stopwords.words('english')

df['words'] = df['words'].apply(lambda x: gensim.parsing.preprocessing.remove_stopwords(" ".join(x)))

But this results in:

TypeError: can only join an iterable

Tags: python, pandas, nltk, gensim, stop-words

Solution


Use nltk to remove the stop words. Import the packages:

import pandas as pd
from nltk.corpus import stopwords

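Create the stop-word list (the corpus must already be available locally, for example via nltk.download('stopwords')):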

stop_words = stopwords.words('english')
stop_words[:10]

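Then filter each row. This expression assumes every row of df.words is a list of tokens; a row that is a plain string would be filtered character by character: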

df['newword'] = list(map(lambda line: list(filter(lambda word: word not in stop_words, line)), df.words))
df
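If each row holds a single word, as in the question's sample, a boolean mask may be a simpler fit than mapping over rows. The sketch below is one possible approach, not the answer author's method. The helper names `drop_stop_word_rows` and `filter_token_lists` and the small `SAMPLE_STOP_WORDS` set are hypothetical, chosen so the example runs without the nltk corpus; in practice you would pass `stopwords.words('english')`. Both helpers also skip missing values, since `str.join` raises "can only join an iterable" when given a non-iterable such as NaN, which is a plausible cause of the error in the question.

```python
import pandas as pd

# Hypothetical hand-picked sample so the sketch runs without nltk data;
# in practice pass stopwords.words('english') instead.
SAMPLE_STOP_WORDS = {"the", "to", "of", "a", "with", "as"}


def drop_stop_word_rows(df, column, stop_words):
    """Drop rows whose single-word value in `column` is missing or a stop word."""
    stop_set = set(stop_words)  # a set gives fast membership tests
    keep = df[column].notna() & ~df[column].isin(stop_set)
    return df[keep].reset_index(drop=True)


def filter_token_lists(series, stop_words):
    """For rows holding lists of tokens, remove stop words inside each list.

    Rows that are not lists (for example missing values) become empty lists.
    """
    stop_set = set(stop_words)
    return series.apply(
        lambda tokens: [t for t in tokens if t not in stop_set]
        if isinstance(tokens, list)
        else []
    )


# One word per row, including a missing value.
df = pd.DataFrame({"words": ["the", "to", "job", None, "mobil", "of", "market"]})
cleaned = drop_stop_word_rows(df, "words", SAMPLE_STOP_WORDS)
print(cleaned["words"].tolist())

# One list of tokens per row, including a missing value.
tokens = pd.Series([["the", "job", "market"], None, ["think", "of", "a", "car"]])
filtered = filter_token_lists(tokens, SAMPLE_STOP_WORDS).tolist()
print(filtered)
```

The mask-based version removes whole rows, which matches the desired output shown in the question; the list-based version keeps every row and only shortens the token lists.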
