Selecting the longest contiguous span in a DataFrame when a condition is met

Problem description

Suppose I have a list of stop words:

STOP = ['under', 'its', 'agreement', 'financed'] 

and the following DataFrame:

import pandas as pd

lst = ['Kan.-based National', 'Kan.-based National Pizza', 'stock market', 
   'Pittsburg Kan.-based National Pizza', 'the stock market', 'revolving credit',
   'revolving credit agreement', 'its revolving credit agreement', 'under its revolving credit agreement', 
   'financed under its revolving credit agreement']

df = pd.DataFrame(lst)

which gives:

0   Kan.-based National
1   Kan.-based National Pizza
2   stock market
3   Pittsburg Kan.-based National Pizza
4   the stock market
5   revolving credit
6   revolving credit agreement
7   its revolving credit agreement
8   under its revolving credit agreement
9   financed under its revolving credit agreement

I would like to get:

out = ['Pittsburg Kan.-based National Pizza', 'the stock market', 'revolving credit',
       'revolving credit agreement', 'its revolving credit agreement', 'under its revolving credit agreement', 
       'financed under its revolving credit agreement']

df_out = pd.DataFrame(out)

which gives:

0   Pittsburg Kan.-based National Pizza
1   the stock market
2   revolving credit
3   revolving credit agreement
4   its revolving credit agreement
5   under its revolving credit agreement
6   financed under its revolving credit agreement

Note: the order of the rows does not matter.

Explanation:

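Since 'Kan.-based National' and 'Kan.-based National Pizza' differ by only one word, 'Pizza', and none of their words is in the STOP list, we want to select the longest span, i.e. 'Kan.-based National Pizza'. However, 'Kan.-based National Pizza' and 'Pittsburg Kan.-based National Pizza' also differ by only one word, 'Pittsburg', and again none of their words is in the STOP list, so we want to select the longest span, i.e. 'Pittsburg Kan.-based National Pizza'.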

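We cannot select the longest span, 'financed under its revolving credit agreement', for the spans starting from 'revolving credit', because words from the STOP list are present. Therefore, we do not drop its smaller spans.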

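Alternatively, as a side rule: if a string starts with an article (a|an|the) and differs from its common span by only that one word, for example "stock market" and "the stock market", we want to select the longest span, i.e. "the stock market".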

I tried:

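# Assumption: best_parse_constituents is the list of spans (lst above).
best_parse_constituents = lst
delete_from_best_constituents = []
for u in best_parse_constituents:
    for v in best_parse_constituents:
        if u.lower().startswith('the') or v.lower().startswith('the'):
            # Take the text after the last occurrence of 'the' in each span.
            u_part = u.lower().split('the')[-1].strip()
            v_part = v.lower().split('the')[-1].strip()
            # Neither remainder may contain a stop word.
            cond1 = all(w.lower() not in STOP for w in u_part.split())
            cond2 = all(w.lower() not in STOP for w in v_part.split())
            # Parentheses added: 'and' binds tighter than 'or'.
            if (u_part == v.lower() or v_part == u.lower()) and cond1 and cond2:
                if not u.lower().startswith('the'):
                    delete_from_best_constituents.append(u)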

Tags: python, pandas, dataframe

Solution
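
No answer text survives on this page, so the following is only a minimal sketch of one way to read the rule in the explanation, not a confirmed answer. It assumes that a span is dropped when the list contains another span equal to it plus exactly one extra word at the start or the end, and that the longer span contains no stop word. The function name drop_absorbed_spans and its parameters are hypothetical. It works on a plain list of strings so that it runs without pandas.

```python
STOP = {'under', 'its', 'agreement', 'financed'}

lst = ['Kan.-based National', 'Kan.-based National Pizza', 'stock market',
       'Pittsburg Kan.-based National Pizza', 'the stock market', 'revolving credit',
       'revolving credit agreement', 'its revolving credit agreement',
       'under its revolving credit agreement',
       'financed under its revolving credit agreement']


def drop_absorbed_spans(spans, stop_words):
    """Hypothetical helper: keep only spans that are not absorbed by a longer one.

    Assumed rule: a span is absorbed when another span in the list is the same
    span with exactly one extra word at the start or the end, and that longer
    span contains no stop word.
    """
    tokenized = [s.split() for s in spans]
    kept = []
    for span, words in zip(spans, tokenized):
        absorbed = False
        for other in tokenized:
            # The longer span must have exactly one more word.
            if len(other) != len(words) + 1:
                continue
            # The extra word must sit at the start or at the end.
            if other[1:] != words and other[:-1] != words:
                continue
            # A stop word in the longer span blocks the absorption.
            if any(w.lower() in stop_words for w in other):
                continue
            absorbed = True
            break
        if not absorbed:
            kept.append(span)
    return kept


result = drop_absorbed_spans(lst, STOP)
```

With the DataFrame from the question, the column can be passed in as df[0].tolist() and the result wrapped again with pd.DataFrame(result). The article case ("stock market" vs. "the stock market") is covered by the same check here, because "the" is a single extra word that is not in STOP. The nested loop is quadratic in the number of spans, which should be acceptable for short lists like this one.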

