Regex to extract all URLs from a dict-like string

Problem description

This is the string I need to extract the URLs from:

s = "'0352442':{url:'https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442'},'0370009':{url:'https://www.riteaid.com/shop/rite-aid-pharmacy-epsom-salt-first-aid-6-lb-2-72-kg-0370009'},'0303249':{url:'https://www.riteaid.com/shop/huggies-natural-care-unscented-baby-wipes-soft-pack-56-count-0303249'},'0398568':{url:'https://www.riteaid.com/shop/rite-aid-sterile-pads-4-x4-25-ea-0398568'},}"

This is the code I have tried so far:

urls = re.findall('https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+', s)

But it only prints this URL, repeated:

    ['https://www.riteaid.com']

Tags: python, regex, python-3.x, python-2.7, list-comprehension

Solution


Since your string is dict-like, as you mention, you need a regex tailored to your specific case. This one works:

import re

s = "'0352442':{url:'https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442'},'0370009':{url:'https://www.riteaid.com/shop/rite-aid-pharmacy-epsom-salt-first-aid-6-lb-2-72-kg-0370009'},'0303249':{url:'https://www.riteaid.com/shop/huggies-natural-care-unscented-baby-wipes-soft-pack-56-count-0303249'},'0398568':{url:'https://www.riteaid.com/shop/rite-aid-sterile-pads-4-x4-25-ea-0398568'},}"

urls = re.findall(r"url:'(https?://.*?)'}", s)

Result:
['https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442',
 'https://www.riteaid.com/shop/rite-aid-pharmacy-epsom-salt-first-aid-6-lb-2-72-kg-0370009',
 'https://www.riteaid.com/shop/huggies-natural-care-unscented-baby-wipes-soft-pack-56-count-0303249',
 'https://www.riteaid.com/shop/rite-aid-sterile-pads-4-x4-25-ea-0398568']
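It may help to run both patterns side by side. The sketch below uses only two entries from the question's string to keep it short; the variable names host_only and full_urls are illustrative. It suggests why the first attempt stops early: its character class has no "/", so each match ends right after the host name.

```python
import re

# Two entries taken from the question's string, shortened for readability.
s = ("'0352442':{url:'https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442'},"
     "'0398568':{url:'https://www.riteaid.com/shop/rite-aid-sterile-pads-4-x4-25-ea-0398568'},}")

# Pattern from the question: [-\w.] does not include "/", so matching stops at the host.
host_only = re.findall(r"https?://(?:[-\w.]|(?:%[\da-fA-F]{2}))+", s)

# Pattern from the answer: capture lazily up to the closing "'}".
full_urls = re.findall(r"url:'(https?://.*?)'}", s)

print(host_only)  # one host-only match per entry
print(full_urls)  # the complete URLs
```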

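Explanation

url:' : literal string that precedes each URL

( ... ) : capture group; re.findall returns only the text matched inside it

http : literal string

s? : optional literal character "s"

:// : literal string

.*? : any character, matched non-greedily (as few characters as possible)

'} : literal string that ends each entry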
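As a variation on the same idea, a negated character class can stand in for the non-greedy match, and a second capture group can keep the numeric key next to each URL. This is only a sketch under two assumptions: no URL contains a single quote, and every key consists of digits. The names pairs and url_by_id are illustrative, and the sample uses two entries from the question's string.

```python
import re

# Two entries taken from the question's string, shortened for readability.
s = ("'0352442':{url:'https://www.riteaid.com/shop/nexium-24hr-42-ct-capsules-0352442'},"
     "'0398568':{url:'https://www.riteaid.com/shop/rite-aid-sterile-pads-4-x4-25-ea-0398568'},}")

# Group 1: the numeric key. Group 2: the URL.
# [^']+ consumes everything up to the next single quote
# (assumption: a URL never contains "'").
pairs = re.findall(r"'(\d+)':\{url:'(https?://[^']+)'\}", s)

# With two groups, findall returns (key, url) tuples, which dict() accepts directly.
url_by_id = dict(pairs)
print(url_by_id)
```

If only the URLs are needed, [url for _, url in pairs] gives the same list as the single-group pattern.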

