Regular expression string pattern matching

Problem description

I have a list of https links that look like this:

list1 = ['https://wvva.com/news/top-stories/2018/12/10/w-va-gov-appoints-former-beckley-council-member-to-parole-board/','https://www.starbreeze.com/2018/12/starbreeze-appoints-claes-wenthzel-as-acting-cfo/','https://www.streetinsider.com/corporate+news/perkinelmer+%28pki%29+appoints+prahlad+singh+as+president+%26+coo/']

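I want to keep only the links that contain "appoints" as a required keyword plus at least one of these other keywords: 'chief-operating-officer', 'ceo', 'chief-executive-officer', 'coo', 'cfo', 'chief-financial-officer', 'chief-marketing-officer', 'cmo', 'chief-technology-officer', 'cto'. In other words, a link should be selected if it contains the word "appoints" together with any one of the words above (cto, ceo, coo, and so on).

My expected output would look like this: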

['https://www.starbreeze.com/2018/12/starbreeze-appoints-claes-wenthzel-as-acting-cfo/','https://www.streetinsider.com/corporate+news/perkinelmer+%28pki%29+appoints+prahlad+singh+as+president+%26+coo/']

A regular expression for this problem would be much appreciated.

Tags: python

Solution


You can iterate over the links and search each one for the keywords:

import re
from pprint import pprint

# "appoints" is mandatory; at least one of the titles must also appear.
required = 'appoints'
titles = [
    'chief-operating-officer',
    'ceo',
    'chief-executive-officer',
    'coo',
    'cfo',
    'chief-financial-officer',
    'chief-marketing-officer',
    'cmo',
    'chief-technology-officer',
    'cto',
]

links = [
    'https://wvva.com/news/top-stories/2018/12/10/w-va-gov-appoints-former-beckley-council-member-to-parole-board/',
    'https://www.starbreeze.com/2018/12/starbreeze-appoints-claes-wenthzel-as-acting-cfo/',
    'https://www.streetinsider.com/corporate+news/perkinelmer+%28pki%29+appoints+prahlad+singh+as+president+%26+coo/',
]

# \b keeps short titles such as "cto" from matching inside longer words
# such as "director".
required_re = re.compile(r'\b' + re.escape(required) + r'\b', flags=re.IGNORECASE)
titles_re = re.compile(
    r'\b(?:' + '|'.join(map(re.escape, titles)) + r')\b',
    flags=re.IGNORECASE,
)

# Keep a link only if it has the required keyword AND at least one title.
new_links = [
    link for link in links
    if required_re.search(link) and titles_re.search(link)
]

pprint(new_links)
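
Since the question asks for a regular expression, the filter can also be written as a single pattern. The following is a minimal sketch, not the original answer's method: it uses two lookaheads, one requiring "appoints" and one requiring any of the title keywords, and it assumes the keywords should match as whole words. The names TITLES, PATTERN, and filter_links are illustrative only.

```python
import re

# Illustrative names; not part of the original answer.
TITLES = [
    'chief-operating-officer', 'ceo', 'chief-executive-officer', 'coo',
    'cfo', 'chief-financial-officer', 'chief-marketing-officer', 'cmo',
    'chief-technology-officer', 'cto',
]

# First lookahead: "appoints" must appear somewhere in the link.
# Second lookahead: at least one title keyword must appear somewhere.
# \b is an assumption that keywords should match only as whole words.
PATTERN = re.compile(
    r'^(?=.*\bappoints\b)'
    r'(?=.*\b(?:' + '|'.join(map(re.escape, TITLES)) + r')\b)',
    flags=re.IGNORECASE,
)


def filter_links(links):
    """Return the links containing "appoints" and at least one title keyword."""
    return [link for link in links if PATTERN.search(link)]


links = [
    'https://wvva.com/news/top-stories/2018/12/10/w-va-gov-appoints-former-beckley-council-member-to-parole-board/',
    'https://www.starbreeze.com/2018/12/starbreeze-appoints-claes-wenthzel-as-acting-cfo/',
    'https://www.streetinsider.com/corporate+news/perkinelmer+%28pki%29+appoints+prahlad+singh+as+president+%26+coo/',
]

print(filter_links(links))
```

One caveat with this sketch: \b treats the underscore as a word character, so a slug such as "appoints_cfo" would not match. If underscores can separate words in your links, the boundaries would need to be adjusted.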
