Removing stop words from each tokenized row of a dataframe

Problem Description

I am trying to remove stop words from each row of a dataframe and put the result into a new dataframe column S.

I tried the code below, but it doesn't seem to work...

from nltk.corpus import stopwords
stopwords = stopwords.words('english')

df['S'] = df.apply(lambda row: (word for word in row['remarks_tokenized'] if word.lower() not in stopwords), axis=1)


Tags: python, pandas, nltk, stop-words

Solution


I tried this on a different corpus and it works. Note that your lambda returns a generator expression, so the new column ends up holding generator objects rather than lists of words; building the filtered list explicitly avoids that.

from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Build the stop-word set once; membership checks on a set are O(1)
stop_words = set(stopwords.words('english'))

def remove_stopwords(sentence):
    # Tokenize the raw text, then keep only tokens that are not stop words
    word_tokens = word_tokenize(sentence)
    clean_tokens = [w for w in word_tokens if w not in stop_words]
    return clean_tokens

df['S'] = df['remarks'].apply(remove_stopwords)

Output:

0     [microsoft, word, arma2011paper353, prediction...
1     [2504, 0478, matava, qxd, gulf, mexico, mature...
2     [lithospheric, structure, texas, gulf, mexico,...
4     [int, see, discussions, stats, author, profile...
5     [bltn9556, authors, thomas, r, taylor, shell, ...
7     [high, resolution, reservoir, characterization...
8     [untitled, journal, sedimentary, research, v, ...
9     [doi, j, epsl, www, elsevier, com, locate, eps...
10    [authors, dale, e, bird, department, geoscienc...
11    [spe, ms, spe, ms, taking, co2, enhanced, oil,...
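
If the remarks_tokenized column already holds lists of tokens, as the question suggests, re-tokenizing is not strictly necessary. A minimal sketch under that assumption (the column contains lists of token strings) filters each token list directly with a list comprehension:

from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# Assumes 'remarks_tokenized' already contains lists of token strings;
# the list comprehension materializes the filtered tokens as a list
df['S'] = df['remarks_tokenized'].apply(
    lambda tokens: [w for w in tokens if w.lower() not in stop_words]
)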
