Finding sibling tags without attributes in BeautifulSoup

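Problem description

Sorry, this is a beginner question about BeautifulSoup, but I can't find an answer.

I can't figure out how to scrape HTML tags that have no attributes.

Here is the relevant section of the HTML: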

<tr bgcolor="#ffffff">
 <td>
  No-Lobbying List
 </td>
 <tr bgcolor="#efefef">
  <td rowspan="2" valign="top">
   6/24/2019
  </td>
  <td>
   <a href="document.cfm?id=322577" target="_blank">
    Brian Manley, Chief of Police, Austin Police Department
   </a>
   <a href="document.cfm?id=322577" target="_blank">
    <img alt="Click here to download the PDF document" border="0"     height="16"     src="https://assets.austintexas.gov/edims/images/pdf_icon.gif"     width="16"/>
   </a>
  </td>
  <tr bgcolor="#efefef">
   <td>
    Preliminary 2018 Annual Crime Report - Executive Summary
   </td>
  </tr>
 </tr>
</tr>

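How do I navigate to the tag containing the text "Preliminary 2018 Annual Crime Report - Executive Summary"?

I tried starting from the a tag, which does have attributes, and using .next_sibling, but I failed miserably.

Thank you.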

trgrewy = soup.findAll('tr', {'bgcolor':'#efefef'}) #the cells alternate colors
trwhite = soup.findAll('tr', {'bgcolor':'#ffffff'}) 
trs = trgrewy + trwhite #merge them into a list
for item in trs:
    mdate = item.find('td', {'rowspan':'2'}) #find if it's today's date
    if mdate:
        datetime_object = datetime.strptime(mdate.text, '%m/%d/%Y')
        if datetime_object.date() == now.date():
            sender = item.find('a').text
            pdf = item.find('a')['href']
            link = baseurl + pdf
            title = item.findAll('td')[2] #this is where i've failed

Tags: python, python-3.x, beautifulsoup

Solution


You can use CSS selectors:

data = '''
<tr bgcolor="#ffffff">
 <td>
  No-Lobbying List
 </td>
 <tr bgcolor="#efefef">
  <td rowspan="2" valign="top">
   6/24/2019
  </td>
  <td>
   <a href="document.cfm?id=322577" target="_blank">
    Brian Manley, Chief of Police, Austin Police Department
   </a>
   <a href="document.cfm?id=322577" target="_blank">
    <img alt="Click here to download the PDF document" border="0"     height="16"     src="https://assets.austintexas.gov/edims/images/pdf_icon.gif"     width="16"/>
   </a>
  </td>
  <tr bgcolor="#efefef">
   <td>
    Preliminary 2018 Annual Crime Report - Executive Summary
   </td>
  </tr>
 </tr>
</tr>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')

# This will find date
print(soup.select_one('td[rowspan="2"]').get_text(strip=True))

# This will find next row after the row with date
print(soup.select_one('tr:has(td[rowspan="2"]) + tr').get_text(strip=True))

Prints:

6/24/2019
Preliminary 2018 Annual Crime Report - Executive Summary
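
As for why .next_sibling did not work: the title's td is not a sibling of the a tag. It sits in the following row, and .next_sibling on a tag often returns just a whitespace text node. The 'tr + tr' selector above also relies on the rows ending up as siblings in the parsed tree, which may depend on how the chosen parser repairs the nested tr markup. As an alternative, here is a minimal sketch that walks forward in document order with find_next() instead. It is not part of the original answer: the function name extract_memo and the BASE_URL value are hypothetical placeholders, and the markup is a trimmed copy of the snippet from the question.

```python
from datetime import datetime
from urllib.parse import urljoin

from bs4 import BeautifulSoup

# Placeholder base URL (assumption); substitute the real site prefix.
BASE_URL = 'https://example.com/edims/'

# Trimmed copy of the markup from the question, nesting kept as posted.
html = '''
<tr bgcolor="#ffffff">
 <td>No-Lobbying List</td>
 <tr bgcolor="#efefef">
  <td rowspan="2" valign="top">6/24/2019</td>
  <td>
   <a href="document.cfm?id=322577" target="_blank">
    Brian Manley, Chief of Police, Austin Police Department
   </a>
  </td>
  <tr bgcolor="#efefef">
   <td>Preliminary 2018 Annual Crime Report - Executive Summary</td>
  </tr>
 </tr>
</tr>
'''


def extract_memo(markup, base_url=BASE_URL):
    """Return date, sender, link and title for the first dated row, or None."""
    soup = BeautifulSoup(markup, 'html.parser')

    # The date cell is the only td carrying rowspan="2".
    date_td = soup.find('td', attrs={'rowspan': '2'})
    if date_td is None:
        return None

    # find_next() searches forward in document order, so it does not matter
    # whether the target is a sibling, a child, or inside a later row.
    link = date_td.find_next('a')
    title_td = link.find_next('td')

    return {
        'date': datetime.strptime(date_td.get_text(strip=True), '%m/%d/%Y').date(),
        'sender': link.get_text(strip=True),
        'link': urljoin(base_url, link['href']),
        'title': title_td.get_text(strip=True),
    }


memo = extract_memo(html)
print(memo['title'])
```

Because find_next() only follows document order, this sketch should behave the same whether the parser keeps the tr elements nested or flattens them into siblings.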

Further reading:

CSS Selectors Reference

