python - Python webscraping tables with multiple header rows
Problem description
I am working through an issue scraping a web table with Python. I have been scraping what I would call 'standard' tables for a while, and I feel I understand that reasonably well. I define a standard table as one with a structure like:
<table>
<tr class="row-class">
<th>Bill</th>
<td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>
<tr class="row-class">
<th>Barry</th>
<td>3</td>
<td>4</td>
<td>1</td>
<td>2</td>
</tr>
</table>
I have now come across a table instance which has a slightly different structure and I can't figure out how to get the data out of it in the format I need. The format I am now trying to scrape is:
<table>
<tr class="row-class">
<th>Bill</th></tr>
<tr><td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th></tr>
<tr>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>
<tr class="row-class">
<th>Barry</th></tr>
<tr>
<td>3</td>
<td>4</td>
<td>1</td>
<td>2</td>
</tr>
</table>
The output I am trying to achieve is:
Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2
I assume the problem is that the header is stored in a separate tr row, so the only output I get is:
Bill
Ben
Barry
I am wondering if the solution is to traverse the rows and determine if the next tag is a th or td and then perform an appropriate action? I'd appreciate any advice on how the code I am using to test this could be modified to achieve the desired output. The code is:
from bs4 import BeautifulSoup
t_obj = """<tr class="row-class">
<th>Bill</th></tr>
<tr><td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th></tr>
<tr>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>
<tr class="row-class">
<th>Barry</th></tr>
<tr>
<td>3</td>
<td>4</td>
<td>1</td>
<td>2</td>
</tr>"""
soup = BeautifulSoup(t_obj, "html.parser")
trs = soup.find_all("tr", {"class":"row-class"})
for tr in trs:
    for th in tr.find_all('th'):
        print(th.get_text())
    for td in tr.find_all('td'):
        print(td.get_text())
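Solution
Here I use three methods to pair each header <tr> with the <tr> that follows it:
- The first uses zip() with CSS selectors
- The second uses BeautifulSoup's find_next_sibling() method
- The third uses zip() with simple slicing and a custom step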
from bs4 import BeautifulSoup
t_obj = """<tr class="row-class">
<th>Bill</th></tr>
<tr><td>1</td>
<td>2</td>
<td>3</td>
<td>4</td>
</tr>
<tr class="row-class">
<th>Ben</th></tr>
<tr>
<td>2</td>
<td>3</td>
<td>4</td>
<td>1</td>
</tr>
<tr class="row-class">
<th>Barry</th></tr>
<tr>
<td>3</td>
<td>4</td>
<td>1</td>
<td>2</td>
</tr>"""
soup = BeautifulSoup(t_obj, 'html.parser')
# Method 1: zip() the header rows with the data rows matched by a CSS sibling selector
for tr1, tr2 in zip(soup.select('tr.row-class'), soup.select('tr.row-class ~ tr:not(.row-class)')):
    print(','.join(tag.get_text() for tag in tr1.select('th') + tr2.select('td')))
print()
# Method 2: for each header row, take the <tr> that immediately follows it
for tr in soup.select('tr.row-class'):
    print(','.join(tag.get_text() for tag in tr.select('th') + tr.find_next_sibling('tr').select('td')))
print()
# Method 3: slice all rows into even (header) and odd (data) positions and zip() them
trs = soup.select('tr')
for tr1, tr2 in zip(trs[::2], trs[1::2]):
    print(','.join(tag.get_text() for tag in tr1.select('th') + tr2.select('td')))
Prints:
Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2
Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2
Bill,1,2,3,4
Ben,2,3,4,1
Barry,3,4,1,2
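The question also asks whether one could traverse the rows and act on whether each one holds a th or a td. Here is a minimal sketch of that idea, not part of the original answer. The helper name group_rows and the compact html sample are made up for illustration. It assumes every record starts with a row containing a th, and that any td-only rows after it belong to that record. Unlike the zip()-based methods, it does not rely on header and data rows strictly alternating.

```python
from bs4 import BeautifulSoup

# Sample markup in the same shape as the question's second table (illustrative).
html = """<table>
<tr class="row-class"><th>Bill</th></tr>
<tr><td>1</td><td>2</td><td>3</td><td>4</td></tr>
<tr class="row-class"><th>Ben</th></tr>
<tr><td>2</td><td>3</td><td>4</td><td>1</td></tr>
<tr class="row-class"><th>Barry</th></tr>
<tr><td>3</td><td>4</td><td>1</td><td>2</td></tr>
</table>"""


def group_rows(markup):
    """Walk the rows in document order and group them into records.

    A row containing a <th> starts a new record; the <td> cells of that row
    and of any following th-less rows are appended to the current record.
    """
    soup = BeautifulSoup(markup, "html.parser")
    records = []
    for tr in soup.find_all("tr"):
        header = tr.find("th")
        cells = [td.get_text(strip=True) for td in tr.find_all("td")]
        if header is not None:
            # Header row: start a new record (also covers the 'standard' layout,
            # where the <td> cells sit in the same row as the <th>).
            records.append([header.get_text(strip=True)] + cells)
        elif records:
            # Data-only row: attach its cells to the most recent header.
            records[-1].extend(cells)
    return records


rows = group_rows(html)
for row in rows:
    print(",".join(row))
```

Because the th and td cells are collected per row, the same function should also handle the 'standard' layout from the question, where each name and its values share one tr.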