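python - Removing stopwords from each tokenized row of a DataFrame
Question
I am trying to remove the stopwords from each row of a DataFrame and store the result in a new column, S.
I tried the code below, but it does not seem to work: instead of lists of words, the new column ends up holding generator objects.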
from nltk.corpus import stopwords
stopwords = stopwords.words('english')
df['S'] = df.apply(lambda row: (word for word in row['remarks_tokenized'] if word.lower() not in stopwords), axis=1)
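Answer
I tried the following on a different corpus and it worked. Note that it starts from the raw text column (remarks) and tokenizes it itself, rather than using the pre-tokenized column.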
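from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Needs the NLTK 'stopwords' corpus and tokenizer data (fetched once via nltk.download).
# The NLTK list is lowercase, so the match below is case-sensitive.
stop_words = set(stopwords.words('english'))

def remove_stopwords(sentence):
    # Split the raw string into tokens, then keep only the non-stopwords.
    word_tokens = word_tokenize(sentence)
    clean_tokens = [w for w in word_tokens if w not in stop_words]
    return clean_tokens

# Apply to the raw-text column; each cell of 'S' becomes a list of tokens.
df['S'] = df['remarks'].apply(remove_stopwords)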
Output:
0 [microsoft, word, arma2011paper353, prediction...
1 [2504, 0478, matava, qxd, gulf, mexico, mature...
2 [lithospheric, structure, texas, gulf, mexico,...
4 [int, see, discussions, stats, author, profile...
5 [bltn9556, authors, thomas, r, taylor, shell, ...
7 [high, resolution, reservoir, characterization...
8 [untitled, journal, sedimentary, research, v, ...
9 [doi, j, epsl, www, elsevier, com, locate, eps...
10 [authors, dale, e, bird, department, geoscienc...
11 [spe, ms, spe, ms, taking, co2, enhanced, oil,...
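If the column is already tokenized, as remarks_tokenized is in the question, re-tokenizing is not needed. The likely cause of the original problem is that the lambda wraps the comprehension in parentheses, which builds a lazy generator per row rather than a list. The sketch below shows the difference in plain Python. The small STOP_WORDS set, the sample rows, and the helper name remove_stopwords_from_tokens are assumptions made for illustration; in practice the set would come from stopwords.words('english').

```python
# Small stand-in stopword set (an assumption for illustration); in practice
# this would be set(stopwords.words('english')) from nltk.corpus.
STOP_WORDS = {"the", "a", "of", "in", "and", "is"}

def remove_stopwords_from_tokens(tokens, stop_words=STOP_WORDS):
    """Return a new list with stopwords dropped (case-insensitive match)."""
    return [word for word in tokens if word.lower() not in stop_words]

# Hypothetical rows standing in for df['remarks_tokenized'].
rows = [
    ["The", "structure", "of", "the", "Gulf"],
    ["reservoir", "characterization", "in", "Texas"],
]

# Parentheses create a generator expression: a lazy object, not a list.
lazy = (w for w in rows[0] if w.lower() not in STOP_WORDS)
is_list = isinstance(lazy, list)  # False

# Square brackets (inside the helper) produce real lists of tokens.
cleaned = [remove_stopwords_from_tokens(tokens) for tokens in rows]

# With pandas, the same helper could be applied column-wise:
# df['S'] = df['remarks_tokenized'].apply(remove_stopwords_from_tokens)
```

Applying the helper to the single column also avoids axis=1, which is not needed when only one column is read.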