How do I extract the text between two different closed HTML tags when the text is not inside a tag?

Problem description

On a web page that has many b tags sharing the same class name, I want to extract the text between two different closed HTML 'b' tags, specifically these b tags:

 <b style="display:block">Print Method:</b>
 "
                                On-demand inkjet (piezoelectric)"
<b style="display:block">Minimum Ink Droplet Volume:</b>

I tried using the Beautiful Soup library, building a list of the tags with findAll and then printing

b.text

This prints all the text inside every b tag. Is there any way to get only the text that sits between these tags?

This is the website I am getting the HTML from.

Tags: python, html, web-scraping

Solution


See below (note that this code is not efficient, because it scans every top-level entry of the parsed document).

from bs4 import BeautifulSoup

html = ''' <b style="display:block">Print Method:</b>
 "
                                On-demand inkjet (piezoelectric)"
<b style="display:block">Minimum Ink Droplet Volume:</b>'''

soup = BeautifulSoup(html, 'html.parser')
idx_lst = []
data_idx = -1
for idx, entry in enumerate(soup.contents):
    if entry.name == 'b':
        idx_lst.append(idx)
        if len(idx_lst) == 2:
            if idx_lst[1] - idx_lst[0] == 2:
                data_idx = idx_lst[0] + 1
                break
            else:
                # Not adjacent: keep the later <b> as the new starting candidate
                idx_lst = [idx_lst[1]]

if data_idx != -1:
    print(soup.contents[data_idx])

Output

 "
                                On-demand inkjet (piezoelectric)"
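When the label text is known in advance, the scan over all entries can be avoided by locating the label tag and reading the text node that follows it. The sketch below is one possible way to do that, not the answer's original method. The helper name `text_after_label` is hypothetical, and the sketch assumes the value is the immediate next sibling of the `<b>` tag, as in the sample string above.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''<b style="display:block">Print Method:</b>
 "
                                On-demand inkjet (piezoelectric)"
<b style="display:block">Minimum Ink Droplet Volume:</b>'''


def text_after_label(soup, label):
    """Return the bare text directly after the <b> whose text equals `label`.

    Hypothetical helper: returns None if the label is missing or if the
    next sibling is not a text node.
    """
    b_tag = soup.find('b', string=label)
    if b_tag is None:
        return None
    sibling = b_tag.next_sibling
    if isinstance(sibling, NavigableString):
        # Remove surrounding whitespace and the literal quotes of the sample.
        return sibling.strip().strip('"').strip()
    return None


soup = BeautifulSoup(html, 'html.parser')
print(text_after_label(soup, 'Print Method:'))
```

Here `next_sibling` returns the text node between the two `<b>` tags, so no index bookkeeping is needed. For the last label in the sample there is no following node, and the helper returns None.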

The code below handles the real HTML.

import requests
from bs4 import BeautifulSoup

URL = 'https://www.epson.co.in/For-Home/Printers/EcoTank-Printers/EcoTank-L1110-Single-function-InkTank-Printer/p/C11CG89504'

findings = set()
r = requests.get(URL)
if r.status_code == 200:
    soup = BeautifulSoup(r.text, 'html.parser')
    idx_lst = []
    data_idx = -1
    b_lst = soup.find_all('b', style='display:block')
    for entry in b_lst:
        for idx, x in enumerate(entry.parent.contents):
            if x.name == 'b' and idx not in idx_lst:
                idx_lst.append(idx)
            if len(idx_lst) == 2:
                if idx_lst[1] - idx_lst[0] == 2 or idx_lst[1] - idx_lst[0] == 3:
                    data_idx = idx_lst[0] + 1
                    findings.add(entry.parent.contents[data_idx].strip())
                    idx_lst = []
                else:
                    idx_lst = []

for idx, p in enumerate(findings, 1):
    print('{}) {}'.format(idx, p))

Output

1) 215.9 x 1200 mm (8.5 x 47.24")
2) 1
3) ESC / P-R
4) 5760 x 1440 dpi (with Variable-Sized Droplet Technology)
5) Friction feed
6) Sound Power Level (Black / Colour): 6.6 B(A) / 6.3 B(A)
7) 180 nozzles Black, 59 nozzles per colour (Cyan, Magenta, Yellow)
8) On-demand inkjet (piezoelectric)
9) Bi-directional printing
10) Up to 33 ppm / 15 ppm
11) Legal, Indian-Legal (215 x 345 mm), 8.5 x 13", Letter, A4, 16K (195 x 270 mm), B5, A5, B6, A6, Hagaki (100 x 148 mm), 5 x 7", 4 x 6", Envelopes: #10, DL, C6
12) 3 pl
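Because the findings are stored in a set, the output above is in no particular order and the labels are lost. A possible variation, sketched below, keeps each label together with its value in a dict by walking `next_siblings` until the next `<b>` tag. It runs on a small inline fragment rather than the live page. The fragment, the pairing of labels with values, the "Print Direction:" label, and the helper name `specs_by_label` are all assumptions made for illustration.

```python
from bs4 import BeautifulSoup, NavigableString, Tag

# Hypothetical fragment imitating the assumed page layout:
# each label <b> is followed by bare text inside the same parent.
sample = '''<div>
<b style="display:block">Print Method:</b>
On-demand inkjet (piezoelectric)
<b style="display:block">Minimum Ink Droplet Volume:</b>
3 pl
<b style="display:block">Print Direction:</b>
Bi-directional printing
</div>'''


def specs_by_label(soup):
    """Map each label <b> to the bare text found before the next <b>."""
    specs = {}
    for b_tag in soup.find_all('b', style='display:block'):
        parts = []
        for node in b_tag.next_siblings:
            if isinstance(node, Tag) and node.name == 'b':
                break  # the next label starts here
            if isinstance(node, NavigableString):
                parts.append(node.strip())
        value = ' '.join(p for p in parts if p)
        if value:
            # Use the label without its trailing colon as the key.
            specs[b_tag.get_text(strip=True).rstrip(':')] = value
    return specs


specs = specs_by_label(BeautifulSoup(sample, 'html.parser'))
for label, value in specs.items():
    print('{}: {}'.format(label, value))
```

Each label is visited once and only its own following siblings are read, so the parent is not rescanned for every `<b>` tag. Whether this matches the live page depends on its actual markup, which may differ from the fragment assumed here.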
