How do I get the href from a list of tags using a Python script?

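Problem description

I have an HTML file, 'links.html', and I want to extract from it the href //www.medicineindia.org/medicine-brand-details/8414/capicare, which belongs to the string CAPICARE. How can I do this with a Python script?

The contents of 'links.html' are: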

<a itemprop="url" href="//www.medicineindia.org/medicine-brand-details/12220/cholstig"><span itemprop="name">CHOLSTIG</span></a>
<a itemprop="url" href="//www.medicineindia.org/medicine-brand-details/8414/capicare"><span itemprop="name">CAPICARE</span></a>
<a itemprop="url" href="//www.medicineindia.org/medicine-brand-details/230/cyclozobid"><span itemprop="name">CYCLOZOBID</span></a>
<a itemprop="url" href="//www.medicineindia.org/medicine-brand-details/6855/cinkona"><span itemprop="name">CINKONA</span></a>

Tags: html, python-3.x

Solution


You can accomplish this with a "simple" regular expression that uses capturing (and non-capturing) groups:

import re

# The anchor tags from links.html, joined into one string.
html = ('<a itemprop="url" href="//www.medicineindia.org/medicine-brand'
        '-details/12220/cholstig"><span itemprop="name">CHOLSTIG</span></a><a '
        'itemprop="url" href="//www.medicineindia.org/medicine-brand-details'
        '/8414/capicare"><span itemprop="name">CAPICARE</span></a><a '
        'itemprop="url" href="//www.medicineindia.org/medicine-brand-details'
        '/230/cyclozobid"><span itemprop="name">CYCLOZOBID</span></a><a '
        'itemprop="url" href="//www.medicineindia.org/medicine-brand-details'
        '/6855/cinkona"><span itemprop="name">CINKONA</span></a>')

# Group 1 captures the href value, group 2 the text inside the span.
# The (?:...) groups match the surrounding markup without capturing it.
regex = r'(?:href=")([^"]+)(?:.*?<span.*?>)(.*?)(?:</span>)'

# With two capturing groups, findall returns a list of (href, name) tuples.
matches = re.findall(regex, html)

for m in matches:
    print(f'Brand: {m[1]}, URL: {m[0]}')

This outputs the following:

Brand: CHOLSTIG, URL: //www.medicineindia.org/medicine-brand-details/12220/cholstig
Brand: CAPICARE, URL: //www.medicineindia.org/medicine-brand-details/8414/capicare
Brand: CYCLOZOBID, URL: //www.medicineindia.org/medicine-brand-details/230/cyclozobid
Brand: CINKONA, URL: //www.medicineindia.org/medicine-brand-details/6855/cinkona

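This is the formatted output of iterating over matches, a list of tuples in which each link is paired with the content of its corresponding span.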
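Since the question asks only for the href that belongs to CAPICARE, the tuples can be turned into a name-to-href mapping and queried directly. The following is a minimal sketch, not a definitive implementation: the names LINK_RE, href_for and sample are hypothetical, the pattern is a slightly simplified variant of the one above (without the non-capturing groups), and the span text is assumed to match the brand name exactly.

```python
import re

# Group 1 is the href value, group 2 is the text inside the span.
# re.DOTALL lets the pattern cross line breaks between the two.
LINK_RE = re.compile(r'href="([^"]+)".*?<span[^>]*>(.*?)</span>', re.DOTALL)


def href_for(brand, html):
    """Return the href whose span text equals `brand`, or None if absent."""
    # findall yields (href, name) tuples; flip them into a name -> href dict.
    links = {name.strip(): href for href, name in LINK_RE.findall(html)}
    return links.get(brand)


# Shortened sample; in practice the string could be read from the file,
# e.g. with open('links.html', encoding='utf-8') as f: html = f.read()
sample = (
    '<a itemprop="url" href="//www.medicineindia.org/medicine-brand-details'
    '/12220/cholstig"><span itemprop="name">CHOLSTIG</span></a>'
    '<a itemprop="url" href="//www.medicineindia.org/medicine-brand-details'
    '/8414/capicare"><span itemprop="name">CAPICARE</span></a>'
)

print(href_for('CAPICARE', sample))
```

With the sample above this prints //www.medicineindia.org/medicine-brand-details/8414/capicare. A brand that does not appear in the markup yields None rather than raising an error. For markup less regular than this, a real HTML parser is usually a safer choice than a regular expression.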

