python - Select the longest spans in a DataFrame when a condition is met
Problem description
Suppose I have a list of stop words:
STOP = ['under', 'its', 'agreement', 'financed']
And a given data frame:
import pandas as pd

lst = ['Kan.-based National', 'Kan.-based National Pizza', 'stock market',
       'Pittsburg Kan.-based National Pizza', 'the stock market', 'revolving credit',
       'revolving credit agreement', 'its revolving credit agreement', 'under its revolving credit agreement',
       'financed under its revolving credit agreement']
df = pd.DataFrame(lst)
which is:
0 Kan.-based National
1 Kan.-based National Pizza
2 stock market
3 Pittsburg Kan.-based National Pizza
4 the stock market
5 revolving credit
6 revolving credit agreement
7 its revolving credit agreement
8 under its revolving credit agreement
9 financed under its revolving credit agreement
I want to get:
out = ['Pittsburg Kan.-based National Pizza', 'the stock market', 'revolving credit',
'revolving credit agreement', 'its revolving credit agreement', 'under its revolving credit agreement',
'financed under its revolving credit agreement']
df_out = pd.DataFrame(out)
which is:
0 Pittsburg Kan.-based National Pizza
1 the stock market
2 revolving credit
3 revolving credit agreement
4 its revolving credit agreement
5 under its revolving credit agreement
6 financed under its revolving credit agreement
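Note: the order of the rows does not matter.
Explanation:
'Kan.-based National' and 'Kan.-based National Pizza' differ by only one word, 'Pizza', and no word from the STOP list is present, so we want to select the longest span, i.e. 'Kan.-based National Pizza'. In turn, 'Pittsburg Kan.-based National Pizza' and 'Kan.-based National Pizza' differ by only one word, 'Pittsburg', and again no word from the STOP list is present, so we want to select the longest span, i.e. 'Pittsburg Kan.-based National Pizza'.
We cannot select 'financed under its revolving credit agreement' as the longest span starting from 'revolving credit', because words from the STOP list are present in it. Hence, we do not delete its smaller spans.
As a side case, the same applies when a string starts with (a|an|the) and the difference from its common span is just that one word. For example, given "stock market" and "the stock market", we want to select the longest span, i.e. "the stock market".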
I tried doing:
delete_from_best_constituents = []
for u in best_parse_constituents:
    for v in best_parse_constituents:
        if u.lower().startswith('the') or v.lower().startswith('the'):
            u_part = u.lower().split('the')[-1].strip()
            v_part = v.lower().split('the')[-1].strip()
            cond1 = all([w.lower() not in STOP for w in u_part.split()])
            cond2 = all([w.lower() not in STOP for w in v_part.split()])
            if u_part == v.lower() or v_part == u.lower() and cond1 and cond2:
                if not u.lower().startswith('the'):
                    delete_from_best_constituents.append(u)
Solution
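No answer is recorded here, so what follows is only a minimal sketch of one way to read the rule in the question, not a definitive implementation. It assumes that a span is dropped when some other span in the list equals it plus exactly one extra word at the start or the end, and that extra word is not in STOP. The helper names `extra_word` and `keep_longest_spans` are hypothetical, and the interpretation of "no STOP word is present" as applying to the extra word is an assumption.

```python
STOP = {'under', 'its', 'agreement', 'financed'}


def extra_word(short, long):
    """Return the one word by which `long` extends `short` at either end.

    Returns None when `long` is not `short` plus exactly one leading or
    trailing word. (Hypothetical helper, not taken from the question.)
    """
    s, l = short.split(), long.split()
    if len(l) != len(s) + 1:
        return None
    if l[1:] == s:       # one word was prepended
        return l[0]
    if l[:-1] == s:      # one word was appended
        return l[-1]
    return None


def keep_longest_spans(spans, stop=STOP):
    """Drop every span that a one-word-longer span absorbs.

    Assumption: a span is absorbed only when the extra word is not a stop word.
    """
    kept = []
    for u in spans:
        absorbed = False
        for v in spans:
            w = extra_word(u, v)
            if w is not None and w.lower() not in stop:
                absorbed = True
                break
        if not absorbed:
            kept.append(u)
    return kept


lst = ['Kan.-based National', 'Kan.-based National Pizza', 'stock market',
       'Pittsburg Kan.-based National Pizza', 'the stock market', 'revolving credit',
       'revolving credit agreement', 'its revolving credit agreement',
       'under its revolving credit agreement',
       'financed under its revolving credit agreement']

kept = keep_longest_spans(lst)
print(kept)
```

On the sample list this keeps the seven spans listed in the desired output. To apply it to the data frame, something like `df[df[0].isin(kept)].reset_index(drop=True)` should work. The pairwise comparison is quadratic in the number of rows, which is fine for short lists but may need an index on word sequences for large ones. Unlike the attempt above, it compares whole words rather than splitting on the substring 'the', so the article case needs no special handling.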