Extract specific values from a row if a cell meets a condition

Problem description

I have this HTML file, which I obtained from a website with financial data.

    <table class="tableFile2" summary="Results">
     <tr>
      <td nowrap="nowrap">
       13F-HR
      </td>
      <td nowrap="nowrap">
       <a href="URL" id="documentsbutton">
        Documents
       </a>
      </td>
      <td>
       2019-05-15
      </td>
      <td nowrap="nowrap">
       <a href="URL">
        028-10098
       </a>
       <br/>
       19827821
      </td>
     </tr>
     <tr class="blueRow">
      <td nowrap="nowrap">
       13F-HR
      </td>
      <td nowrap="nowrap">
       <a href="URL" id="documentsbutton">
        Documents
       </a>
      </td>
      <td>
       2019-02-14
      </td>
      <td nowrap="nowrap">
       <a href="URL">
        028-10098
       </a>
       <br/>
       19606811
      </td>
     </tr>
     <tr>
      <td nowrap="nowrap">
       SC 13G/A
      </td>
      <td nowrap="nowrap">
       <a href="URL" id="documentsbutton">
        Documents
       </a>
      </td>
      <td>
       2019-02-13
      </td>
      <td>
      </td>
     </tr>
     <tr class="blueRow">
      <td nowrap="nowrap">
       SC 13G/A
      </td>
      <td nowrap="nowrap">
       <a href="URL" id="documentsbutton">
        Documents
       </a>
      </td>
      <td>
       2019-02-13
      </td>
      <td>
      </td>
     </tr>
     <tr>
      <td nowrap="nowrap">
       SC 13G/A
      </td>
      <td nowrap="nowrap">
       <a href="URL" id="documentsbutton">
        Documents
       </a>
      </td>
      <td>
       2019-02-13
      </td>
      <td>
      </td>
     </tr>
    </table>

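I am trying to extract only the rows in which one of the cells contains the string 13F. Once I have the right rows, I want to save the date and the href to a list for later processing. So far I have managed to build my scraper so that it locates the specific table, but I cannot filter the rows by my condition. When I add the condition, it seems to be ignored and all rows are still included.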

r = requests.get(url)
soup = BeautifulSoup(open("data/testHTML.html"), 'html.parser')

table = soup.find('table', {"class": "tableFile2"})
rows = table.findChildren("tr")
for row in rows:
    cell = row.findNext("td")
    if cell.text.find('13F'):
        print(row)

Ideally, I am trying to get output similar to this:

[13F-HR, URL, 2019-05-15],[13F-HR, URL, 2019-02-14]

Tags: python, beautifulsoup

Solution
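
A likely reason the condition in the question appears to be ignored is that `str.find` returns an index rather than a boolean: it gives the position of the match, or -1 when there is no match, and -1 is truthy. The short sketch below illustrates this with two sample strings (`matching_cell` and `other_cell` are hypothetical names, and the strings only approximate the whitespace-padded cell text from the table).

```python
# Hypothetical cell texts, padded with whitespace like the <td> contents in the question.
matching_cell = "\n   13F-HR\n  "
other_cell = "\n   SC 13G/A\n  "

# str.find returns the index of the first match, or -1 when nothing matches.
found_at = matching_cell.find("13F")
missing = other_cell.find("13F")
print(found_at, missing)

# Both values are truthy (only index 0 would be falsy), so
# `if cell.text.find('13F'):` lets every row through.
print(bool(found_at), bool(missing))

# A membership test expresses the intended condition directly.
print("13F" in matching_cell, "13F" in other_cell)
```

Using `in` avoids the pitfall entirely, which is what the solution below does.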


Optimized solution:

...

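# Walk every row of the target table.
for tr in soup.select('table.tableFile2 tr'):
    tds = tr.find_all('td')
    # Skip rows without <td> cells; test membership instead of using str.find().
    if tds and '13F' in tds[0].text:
        # Form type, link text, and filing date from the first three cells.
        print([td.text.strip() for td in tds[:3]])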

Output:

['13F-HR', 'Documents', '2019-05-15']
['13F-HR', 'Documents', '2019-02-14']
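
The output above contains the link text `Documents`, whereas the question asked for the href. The following is a minimal sketch of one way to collect the form type, href, and date into a list. It assumes a trimmed copy of the table from the question embedded as a string, keeps the `URL` placeholder as given there, and uses `filings` as a hypothetical name for the result list.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the table from the question; "URL" is the placeholder used there.
html = """
<table class="tableFile2" summary="Results">
 <tr>
  <td nowrap="nowrap">13F-HR</td>
  <td nowrap="nowrap"><a href="URL" id="documentsbutton">Documents</a></td>
  <td>2019-05-15</td>
 </tr>
 <tr class="blueRow">
  <td nowrap="nowrap">13F-HR</td>
  <td nowrap="nowrap"><a href="URL" id="documentsbutton">Documents</a></td>
  <td>2019-02-14</td>
 </tr>
 <tr>
  <td nowrap="nowrap">SC 13G/A</td>
  <td nowrap="nowrap"><a href="URL" id="documentsbutton">Documents</a></td>
  <td>2019-02-13</td>
 </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")

filings = []  # hypothetical name for the collected rows
for tr in soup.select("table.tableFile2 tr"):
    tds = tr.find_all("td")
    # Skip rows that are too short (for example header rows) or are not 13F filings.
    if len(tds) < 3 or "13F" not in tds[0].get_text():
        continue
    # Read the href from the link in the second cell, if there is one.
    link = tds[1].find("a")
    href = link["href"] if link is not None and link.has_attr("href") else None
    filings.append([tds[0].get_text(strip=True), href, tds[2].get_text(strip=True)])

print(filings)
# [['13F-HR', 'URL', '2019-05-15'], ['13F-HR', 'URL', '2019-02-14']]
```

Storing each row as a small list keeps the result close to the format requested in the question; a list of dictionaries or tuples would work equally well for later processing.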
