How to use regular expressions to extract data from two similar HTML class elements?

Problem description

How can I use a Python regular expression to extract the upvote count (215) and the downvote count (82) from the following HTML snippet?

<span class="vote-actions">
    <a class="btn btn-default vote-action-good">
        <span class="icon thumb-up black black-hover">&nbsp;</span>
        <span class="rating-inbtn">215</span>
    </a>
    <a class="btn btn-default vote-action-bad">
        <span class="icon thumb-down grey black-hover">&nbsp;</span>
        <span class="rating-inbtn">82</span>
    </a>
</span>

I have formatted the HTML here for readability, but the original markup contains no '\n' or '\t' characters.

FYI, I am not looking for a Beautiful Soup solution. I am looking for one that uses the search functions of Python's re module.

Tags: python, regex, web-scraping

Solution


To find the two numbers, I would do:

text = '''<span class="vote-actions">
    <a class="btn btn-default vote-action-good">
        <span class="icon thumb-up black black-hover">&nbsp;</span>
        <span class="rating-inbtn">215</span>
    </a>
    <a class="btn btn-default vote-action-bad">
        <span class="icon thumb-down grey black-hover">&nbsp;</span>
        <span class="rating-inbtn">82</span>
    </a>
</span>'''

import re

# Raw string so that \d reaches the regex engine as a digit class
a = re.findall(r'rating-inbtn">(\d+)', text)
print(a)

['215', '82']

In the HTML, the first number is the upvote count and the second is the downvote count, so I don't need anything more elaborate:

up = a[0]
down = a[1]
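If relying on the order of the matches feels fragile, each count can instead be tied to its own button class with re.search. The following is only a minimal sketch: the helper name vote_count is hypothetical, and the pattern assumes that vote-action-good or vote-action-bad is the last class in the attribute and that every link contains a rating-inbtn span. The markup is written on one line to mirror the original, which has no '\n' or '\t'.

```python
import re

# Same markup as in the question, collapsed to a single line.
html = (
    '<span class="vote-actions">'
    '<a class="btn btn-default vote-action-good">'
    '<span class="icon thumb-up black black-hover">&nbsp;</span>'
    '<span class="rating-inbtn">215</span></a>'
    '<a class="btn btn-default vote-action-bad">'
    '<span class="icon thumb-down grey black-hover">&nbsp;</span>'
    '<span class="rating-inbtn">82</span></a>'
    '</span>'
)

def vote_count(kind, markup):
    """Hypothetical helper: count inside the vote-action-<kind> link, or None."""
    # Lazily skip from the link's class to the first rating-inbtn span after it.
    pattern = r'vote-action-' + re.escape(kind) + r'".*?rating-inbtn">(\d+)'
    match = re.search(pattern, markup, re.DOTALL)
    return int(match.group(1)) if match else None

up = vote_count('good', html)
down = vote_count('bad', html)
print(up, down)  # 215 82
```

Because the lazy .*? stops at the first rating-inbtn span after the class name, a link without such a span would pick up the next link's number, which is why the second assumption above matters.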

If that is not enough, I would use an HTML parser:

text = '''<span class="vote-actions">
    <a class="btn btn-default vote-action-good">
        <span class="icon thumb-up black black-hover">&nbsp;</span>
        <span class="rating-inbtn">215</span>
    </a>
    <a class="btn btn-default vote-action-bad">
        <span class="icon thumb-down grey black-hover">&nbsp;</span>
        <span class="rating-inbtn">82</span>
    </a>
</span>'''

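import lxml.html

# Parse the fragment into an element tree
tree = lxml.html.fromstring(text)

# Upvotes: the rating span inside the "vote-action-good" link
up = tree.xpath('//a[@class="btn btn-default vote-action-good"]/span[@class="rating-inbtn"]')
up = up[0].text
print(up)

# Downvotes: the rating span inside the "vote-action-bad" link
down = tree.xpath('//a[@class="btn btn-default vote-action-bad"]/span[@class="rating-inbtn"]')
down = down[0].text
print(down)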
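Where lxml is not installed, the same idea can be sketched with the standard library's html.parser. This is an illustrative alternative, not part of the original answer: the class name VoteParser and its attributes are hypothetical, and it assumes the nesting shown in the question (a rating-inbtn span inside a link whose class list contains vote-action-good or vote-action-bad).

```python
from html.parser import HTMLParser

# Same markup as in the question, collapsed to a single line.
html = (
    '<span class="vote-actions">'
    '<a class="btn btn-default vote-action-good">'
    '<span class="icon thumb-up black black-hover">&nbsp;</span>'
    '<span class="rating-inbtn">215</span></a>'
    '<a class="btn btn-default vote-action-bad">'
    '<span class="icon thumb-down grey black-hover">&nbsp;</span>'
    '<span class="rating-inbtn">82</span></a>'
    '</span>'
)

class VoteParser(HTMLParser):
    """Hypothetical parser: collects rating-inbtn text per vote-action-* link."""

    def __init__(self):
        super().__init__()
        self.votes = {}          # filled as {'good': '215', 'bad': '82'}
        self._kind = None        # suffix of the vote link we are inside, if any
        self._in_rating = False  # True while inside a rating-inbtn span

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get('class') or '').split()
        if tag == 'a':
            for name in classes:
                if name.startswith('vote-action-'):
                    self._kind = name[len('vote-action-'):]
        elif tag == 'span' and 'rating-inbtn' in classes and self._kind:
            self._in_rating = True

    def handle_data(self, data):
        if self._in_rating:
            self.votes[self._kind] = data.strip()

    def handle_endtag(self, tag):
        if tag == 'span':
            self._in_rating = False
        elif tag == 'a':
            self._kind = None

parser = VoteParser()
parser.feed(html)
parser.close()
print(parser.votes)  # {'good': '215', 'bad': '82'}
```

Splitting the class attribute on whitespace means the match does not depend on the exact order of the btn, btn-default, and vote-action-* classes, unlike the exact @class comparison in the XPath version.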
