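python-3.x - Removing custom stop words from a pandas DataFrame does not work
Problem description
I am trying to remove a custom list of stop words, but it is not working.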
desc = pd.DataFrame(description, columns =['description'])
print(desc)
This gives the following result:
description
188693 The Kentucky Cannabis Company and Bluegrass He...
181535 Ohio County Sheriff
11443 According to new reports from federal authorit...
213919 KANSAS CITY, Mo. (AP)The Chiefs will be withou...
171509 The crew of Insight, WCNY's weekly public affa...
... ...
2732 The Arkansas Supreme Court on Thursday cleared...
183367 Larry Pegram, co-owner of Pure Ohio Wellness, ...
134291 Joe Biden will spend the next five months pres...
239270 Find out where your Texas representatives stan...
246070 SAN TAN VALLEY — Two men have been charged wit...
[9875 rows x 1 columns]
I found the following code here, but it does not seem to work:
remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]
pat = '|'.join([r'\b{}\b'.format(w) for w in remove_words])
desc.assign(new_desc=desc.replace(dict(string={pat: ''}), regex=True))
It produces the following result:
description new_desc
188693 The Kentucky Cannabis Company and Bluegrass He... The Kentucky Cannabis Company and Bluegrass He...
181535 Ohio County Sheriff Ohio County Sheriff
11443 According to new reports from federal authorit... According to new reports from federal authorit...
213919 KANSAS CITY, Mo. (AP)The Chiefs will be withou... KANSAS CITY, Mo. (AP)The Chiefs will be withou...
171509 The crew of Insight, WCNY's weekly public affa... The crew of Insight, WCNY's weekly public affa...
... ... ...
2732 The Arkansas Supreme Court on Thursday cleared... The Arkansas Supreme Court on Thursday cleared...
183367 Larry Pegram, co-owner of Pure Ohio Wellness, ... Larry Pegram, co-owner of Pure Ohio Wellness, ...
134291 Joe Biden will spend the next five months pres... Joe Biden will spend the next five months pres...
239270 Find out where your Texas representatives stan... Find out where your Texas representatives stan...
246070 SAN TAN VALLEY — Two men have been charged wit... SAN TAN VALLEY — Two men have been charged wit...
9875 rows × 2 columns
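As you can see, the stop words are not removed. Any help you can provide would be greatly appreciated.
Solution
Handle the letter case and simplify the pattern. The stop words are lowercase, while the text contains capitalized forms such as "Cannabis", so lowercase the column before replacing: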
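remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]
# Plain alternation without \b, so the words are also removed inside longer words
pat = '|'.join(remove_words)
# Lowercase the text first so that capitalized forms such as "Cannabis" match
desc['new_desc'] = desc.description.str.lower().replace(pat,'', regex=True)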
description new_desc
0 The Kentucky Cannabis Company and Bluegrass He... the kentucky company and bluegrass he...
1 Ohio County Sheriff ohio county sheriff
2 According to new reports from federal authorit... according to new reports from federal authorit...
3 KANSAS CITY, Mo. (AP)The Chiefs will be mariju... kansas city, mo. (ap)the chiefs will be witho...
4 The crew of Insight, WCNY's weekly public affa... the crew of insight, wcny's weekly public affa...
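The approach above lowercases the whole text and, without `\b`, also strips the stop words from inside longer words. If the original casing should be kept, one possible variant is to keep the word boundaries and match case-insensitively through `Series.str.replace` with `flags=re.IGNORECASE`. The sketch below is not part of the original answer: the sample rows are made up for illustration, and the whitespace clean-up step is an added assumption about the desired output. Note also that the question's `dict(string={pat: ''})` appears to target a column named `string`, which does not exist in `desc`, so that call likely left every row untouched regardless of case.

```python
import re

import pandas as pd

# Hypothetical sample rows standing in for the real `description` data
desc = pd.DataFrame({"description": [
    "The Kentucky Cannabis Company and Bluegrass Hemp Oil",
    "Ohio County Sheriff",
    "CBD and THC products",
    "Hempstead police",
]})

remove_words = ["marijuana", "cannabis", "hemp", "thc", "cbd"]

# One non-capturing group wrapped in \b...\b so only whole words match;
# re.escape guards against stop words that contain regex metacharacters.
pat = r"\b(?:" + "|".join(map(re.escape, remove_words)) + r")\b"

desc["new_desc"] = (
    desc["description"]
    # Case-insensitive removal that leaves the rest of the text as written
    .str.replace(pat, "", flags=re.IGNORECASE, regex=True)
    # Assumed clean-up: collapse the gaps left behind by removed words
    .str.replace(r"\s+", " ", regex=True)
    .str.strip()
)

print(desc)
```

With word boundaries in place, "Hempstead" is left alone while the standalone "Hemp" is removed. Whether substrings should also be stripped depends on the data, so this is a design choice rather than a correction of the answer above.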