Python web scraping tables with multiple header rows

Problem description

I am working through an issue with scraping a web table using Python. I have been scraping what I would call 'standard' tables for a while, and I feel I understand them reasonably well. I define a standard table as one with a structure like:

<table>
<tr class="row-class">
  <th>Bill</th>
  <td>1</td>
  <td>2</td>
  <td>3</td>
  <td>4</td>
</tr>
<tr class="row-class">
  <th>Ben</th>
  <td>2</td>
  <td>3</td>
  <td>4</td>
  <td>1</td>
</tr>
<tr class="row-class">
  <th>Barry</th>
  <td>3</td>
  <td>4</td>
  <td>1</td>
  <td>2</td>
</tr>
</table>

I have now come across a table with a slightly different structure, and I can't figure out how to get the data out of it in the format I need. The format I am now trying to scrape is:

<table>
<tr class="row-class">
  <th>Bill</th></tr>
  <tr><td>1</td>
  <td>2</td>
  <td>3</td>
  <td>4</td>
</tr>
<tr class="row-class">
  <th>Ben</th></tr>
  <tr>
  <td>2</td>
  <td>3</td>
  <td>4</td>
  <td>1</td>
</tr>
<tr class="row-class">
  <th>Barry</th></tr>
  <tr>
  <td>3</td>
  <td>4</td>
  <td>1</td>
  <td>2</td>
</tr>
</table>

The output I am trying to achieve is:

Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2

I assume the problem is that, because the header is stored in a separate tr row, I only get this output:

Bill
Ben
Barry

I am wondering whether the solution is to traverse the rows, determine whether the next tag is a th or a td, and then perform the appropriate action. I'd appreciate any advice on how the code I am using to test this could be modified to achieve the desired output. The code is:

from bs4 import BeautifulSoup

t_obj = """<tr class="row-class">
  <th>Bill</th></tr>
  <tr><td>1</td>
  <td>2</td>
  <td>3</td>
  <td>4</td>
</tr>
<tr class="row-class">
  <th>Ben</th></tr>
  <tr>
  <td>2</td>
  <td>3</td>
  <td>4</td>
  <td>1</td>
</tr>
<tr class="row-class">
  <th>Barry</th></tr>
  <tr>
  <td>3</td>
  <td>4</td>
  <td>1</td>
  <td>2</td>
</tr>"""


soup = BeautifulSoup(t_obj, 'html.parser')

trs = soup.find_all("tr", {"class": "row-class"})

for tr in trs:
    for th in tr.find_all('th'):
        print(th.get_text())
        for td in tr.find_all('td'):
            print(td.get_text())

Tags: python, web-scraping, beautifulsoup

Solution


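Here I use three methods to pair each header <tr> with the data <tr> that follows it:

  • The first method uses zip() and CSS selectors
  • The second method uses BeautifulSoup's find_next_sibling() method
  • The third method uses zip() with simple slicing and a custom step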

from bs4 import BeautifulSoup

t_obj = """<tr class="row-class">
  <th>Bill</th></tr>
  <tr><td>1</td>
  <td>2</td>
  <td>3</td>
  <td>4</td>
</tr>
<tr class="row-class">
  <th>Ben</th></tr>
  <tr>
  <td>2</td>
  <td>3</td>
  <td>4</td>
  <td>1</td>
</tr>
<tr class="row-class">
  <th>Barry</th></tr>
  <tr>
  <td>3</td>
  <td>4</td>
  <td>1</td>
  <td>2</td>
</tr>"""


soup = BeautifulSoup(t_obj, 'html.parser')

# Method 1: zip() the header rows with the data rows matched by a CSS selector
for tr1, tr2 in zip(soup.select('tr.row-class'), soup.select('tr.row-class ~ tr:not(.row-class)')):
    print(','.join(tag.get_text() for tag in tr1.select('th') + tr2.select('td')))

print()

# Method 2: for each header row, take the <td> cells of the next sibling <tr>
for tr in soup.select('tr.row-class'):
    print(','.join(tag.get_text() for tag in tr.select('th') + tr.find_next_sibling('tr').select('td')))

print()

# Method 3: pair every even-indexed row with the odd-indexed row after it
trs = soup.select('tr')
for tr1, tr2 in zip(trs[::2], trs[1::2]):
    print(','.join(tag.get_text() for tag in tr1.select('th') + tr2.select('td')))

Prints:

Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2

Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2

Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2
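
The idea raised in the question, walking the rows and checking whether each one holds a th or td cells, can also be sketched directly. The version below is only an illustration under some assumptions: the helper name rows_from_split_headers is hypothetical, the sample markup is a compacted copy of the table from the question, and a header row is assumed to contain a single th. Unlike the pairing methods above, it does not rely on a strict header/data alternation, so a header followed by several data rows would be merged into one record.

```python
from bs4 import BeautifulSoup

# Compacted copy of the table from the question (assumed structure).
html = """<table>
<tr class="row-class"><th>Bill</th></tr>
<tr><td>1</td><td>2</td><td>3</td><td>4</td></tr>
<tr class="row-class"><th>Ben</th></tr>
<tr><td>2</td><td>3</td><td>4</td><td>1</td></tr>
<tr class="row-class"><th>Barry</th></tr>
<tr><td>3</td><td>4</td><td>1</td><td>2</td></tr>
</table>"""


def rows_from_split_headers(markup):
    """Merge each <th> row with the <td> rows that follow it (hypothetical helper)."""
    soup = BeautifulSoup(markup, "html.parser")
    records = []
    for tr in soup.find_all("tr"):
        header = tr.find("th")
        if header is not None:
            # A row containing a <th> starts a new record.
            records.append([header.get_text(strip=True)])
        elif records:
            # A data row extends the most recent record;
            # data rows that appear before any header are skipped.
            records[-1].extend(td.get_text(strip=True) for td in tr.find_all("td"))
    return records


records = rows_from_split_headers(html)
for record in records:
    print(",".join(record))
```

Because each record is a plain list of strings, it could be passed to csv.writer instead of being joined by hand if any cell might itself contain a comma.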
