find_next_siblings is not returning a value. How can I extract the two classes I need without the rest of the classes?

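Problem description

I want to extract the item weight and product dimensions from the `content` below. What am I missing here? My script does not find what I am looking for. Is there a simpler way to extract the item weight and product dimensions? Thanks.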

import bs4 as bs

content = '''
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item Weight
</th>
<td class="a-size-base prodDetAttrValue">
0.16 ounces
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Product Dimensions
</th>
<td class="a-size-base prodDetAttrValue">
4.8 x 3.4 x 0.5 inches
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Included?
</th>
<td class="a-size-base prodDetAttrValue">
No
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Required?
</th>
<td class="a-size-base prodDetAttrValue">
No
</td>
</tr>
'''
soup = bs.BeautifulSoup(content, features='lxml')


try:
    product = {
        'weight': soup.find(text='Item Weight').parent.find_next_siblings(),
        'dimension': soup.find(text='Product Dimensions').parent.find_next_siblings()
    }
except:
    product = {
        'weight': 'item unavailable',
        'dimension': 'item unavailable'
    }
print(product)

Output:

{'weight': 'item unavailable', 'dimension': 'item unavailable'}

Tags: python, html, beautifulsoup, python-requests, find

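Solution


You are using find_next_siblings incorrectly. The td tag is a sibling of the th tag, not of the parent tr tag. Note also that text='Item Weight' only matches a string that is exactly 'Item Weight', while the cell text in this HTML is surrounded by newlines. The lookup therefore returns None, and the bare except hides the resulting AttributeError. The code below matches the th with a regular expression and then takes its next td sibling.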
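The exact-match behaviour can be checked in isolation. The following is a minimal sketch, not part of the original answer: the one-row `snippet` is a shortened stand-in for the question's HTML, and `string=` is the newer name for the `text=` argument (Beautiful Soup 4.4.0 and later).

```python
from bs4 import BeautifulSoup
import re

# Shortened stand-in for one row of the question's table (illustrative only).
snippet = '<tr><th>\nItem Weight\n</th><td>\n0.16 ounces\n</td></tr>'
soup = BeautifulSoup(snippet, 'html.parser')

# A plain string must equal the whole text node, newlines included,
# so this lookup finds nothing.
exact = soup.find(string='Item Weight')

# A compiled pattern is searched within the text node, so it matches.
pattern = soup.find(string=re.compile(r'Item Weight'))

# The matched string's parent is the <th>; its next <td> sibling holds the value.
value = pattern.parent.find_next_sibling('td').get_text(strip=True)

print(exact)
print(value)
```

Because `exact` is `None`, calling `.parent` on it raises `AttributeError`, which is the error the question's `try`/`except` swallows.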

from bs4 import BeautifulSoup
import re

content = '''
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item Weight
</th>
<td class="a-size-base prodDetAttrValue">
0.16 ounces
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Product Dimensions
</th>
<td class="a-size-base prodDetAttrValue">
4.8 x 3.4 x 0.5 inches
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Batteries Included?
</th>
<td class="a-size-base prodDetAttrValue">
No
</td>
</tr>
'''

soup = BeautifulSoup(content, 'html.parser')
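# Match each <th> by its text (raw-string patterns avoid invalid-escape warnings),
# then read the text of the <td> that follows it in the same row.
d = {
    'weight': soup.find('th', text=re.compile(r'\s*Item Weight\s*')).find_next_sibling('td').text.strip(),
    'dimension': soup.find('th', text=re.compile(r'\s*Product Dimensions\s*')).find_next_sibling('td').text.strip()
}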

print(d)
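As for the "simpler way" asked about in the question, one possible alternative is to collect every th/td pair into a dictionary once and then look up the labels of interest. This is only a sketch under assumptions: the helper name `extract_details`, the `html` sample (a two-row excerpt wrapped in a `<table>`), and the `'item unavailable'` fallback reused from the question are illustrative, not part of the original answer.

```python
from bs4 import BeautifulSoup

# Two-row excerpt of the question's markup, wrapped in a <table> for illustration.
html = '''
<table>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Item Weight
</th>
<td class="a-size-base prodDetAttrValue">
0.16 ounces
</td>
</tr>
<tr>
<th class="a-color-secondary a-size-base prodDetSectionEntry">
Product Dimensions
</th>
<td class="a-size-base prodDetAttrValue">
4.8 x 3.4 x 0.5 inches
</td>
</tr>
</table>
'''


def extract_details(markup):
    """Map each <th> label to the text of the <td> that follows it (hypothetical helper)."""
    soup = BeautifulSoup(markup, 'html.parser')
    details = {}
    for th in soup.find_all('th'):
        td = th.find_next_sibling('td')
        if td is not None:  # skip header cells that have no value cell
            details[th.get_text(strip=True)] = td.get_text(strip=True)
    return details


details = extract_details(html)
product = {
    'weight': details.get('Item Weight', 'item unavailable'),
    'dimension': details.get('Product Dimensions', 'item unavailable'),
}
print(product)
```

Using `dict.get` with a default replaces the bare `except`, so a missing row yields the fallback text without hiding unrelated errors.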
