python - Extracting data from a table returns only the last row
Problem description
I have the following table from this website:
<table id="sample">
<tbody>
<tr class="toprow">
<td></td>
<td colspan="5">Number of Jurisdictions</td>
</tr>
<tr class="toprow">
<td>Region</td>
<td>Jurisdictions in the region</td>
<td>Jurisdictions that require IFRS Standards <br>
for all or most domestic publicly accountable entities</td>
<td>Jurisdictions that require IFRS Standards as % of total jurisdictions in the region</td>
<td>Jurisdictions that permit or require IFRS Standards for at least some (but not all or most) domestic publicly accountable entities</td>
<td>Jurisdictions that neither require nor permit IFRS Standards for any domestic publicly accountable entities</td>
</tr>
<tr>
<td class="leftcol">Europe</td>
<td class="data">44</td>
<td class="data">43</td>
<td class="data">98%</td>
<td class="data">1</td>
<td class="data">0</td>
</tr>
<tr>
<td class="leftcol">Africa</td>
<td class="data">23</td>
<td class="data">19</td>
<td class="data">83%</td>
<td class="data">1</td>
<td class="data">3</td>
</tr>
<tr>
<td class="leftcol">Middle East</td>
<td class="data">13</td>
<td class="data">13</td>
<td class="data">100%</td>
<td class="data">0</td>
<td class="data">0</td>
</tr>
<tr>
<td class="leftcol">Asia-Oceania</td>
<td class="data">33</td>
<td class="data">24</td>
<td class="data">73%</td>
<td class="data">3</td>
<td class="data">6</td>
</tr>
<tr>
<td class="leftcol">Americas</td>
<td class="data">37</td>
<td class="data">27</td>
<td class="data">73%</td>
<td class="data">8</td>
<td class="data">2</td>
</tr>
<tr>
<td class="leftcol" style="border-top:2px solid #000000"><strong>Totals</strong></td>
<td class="data" style="border-top:2px solid #000000"><strong>150</strong></td>
<td class="data" style="border-top:2px solid #000000"><strong>126</strong></td>
<td class="data" style="border-top:2px solid #000000"><strong>84%</strong></td>
<td class="data" style="border-top:2px solid #000000"><strong>13</strong></td>
<td class="data" style="border-top:2px solid #000000"><strong>11</strong></td>
</tr>
<tr>
<td class="leftcol"><strong>As % <br>
of 150</strong></td>
<td class="data"><strong>100%</strong></td>
<td class="data"><strong>84%</strong></td>
<td class="data"><strong> </strong></td>
<td class="data"><strong>9%</strong></td>
<td class="data"><strong>7%</strong></td>
</tr>
</tbody>
</table>
Here is my attempt:
from bs4 import BeautifulSoup
import pandas as pd
import requests

# Site URL
url = "http://archive.ifrs.org/Use-around-the-world/Pages/Analysis-of-the-IFRS-jurisdictional-profiles.aspx"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")
# print(soup.prettify())  # print the parsed data of html
# Select the table with id "sample"
gdp = soup.select("table#sample")[0]
rows = []
cols = []
# Collect the header cells from every row with class "toprow"
for g in gdp.select('tr.toprow'):
    for c in g.select('td'):
        cols.append(c.text)
# Collect the cells of every remaining (data) row
for g in gdp.select('tr:not(.toprow)'):
    row = []
    for item in g.select('td'):
        row.append(item.text)
    rows.append(row)
pd.DataFrame(rows, columns=cols)
cols comes out with the result I expect:
['', 'Number of Jurisdictions', 'Region', 'Jurisdictions in the region', 'Jurisdictions that require IFRS\xa0Standards\xa0\r\n
for all or most domestic publicly accountable entities', 'Jurisdictions that require IFRS Standards\xa0as % of total jurisdictions in the region', 'Jurisdictions that permit or require IFRS\xa0Standards for at least some (but not all or most) domestic publicly accountable entities', 'Jurisdictions that neither require nor permit IFRS Standards for any domestic publicly accountable entities']
The problem is with the rows; I only get the last row:
['As % \r\n of 150', '100%', '84%', '\xa0', '9%', '7%']
I get this error:
ValueError: 8 columns passed, passed data had 6 columns
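Solution
There are two tr elements with the class toprow. The first one holds only a blank cell and the "Number of Jurisdictions" banner, so collecting cells from both gives 8 column names for rows that have 6 cells. Skip the first .toprow: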
for g in gdp.select('tr.toprow')[1:]:
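To see where the 8 and the 6 in the error come from, here is a minimal sketch on a trimmed-down copy of the table. The inline HTML, the shortened header labels A to E, and the variable names are assumptions made for illustration; the sketch uses the built-in "html.parser" so it runs without lxml or a network request.

```python
from bs4 import BeautifulSoup

# Trimmed-down stand-in for the table in the question:
# two header rows (class "toprow") and a single data row.
# The labels A..E are placeholders for the long header texts.
html = """
<table id="sample">
<tr class="toprow"><td></td><td colspan="5">Number of Jurisdictions</td></tr>
<tr class="toprow"><td>Region</td><td>A</td><td>B</td><td>C</td><td>D</td><td>E</td></tr>
<tr><td class="leftcol">Europe</td><td class="data">44</td><td class="data">43</td>
<td class="data">98%</td><td class="data">1</td><td class="data">0</td></tr>
</table>
"""

table = BeautifulSoup(html, "html.parser").select_one("table#sample")
header_rows = table.select("tr.toprow")

# Taking cells from every header row mixes the 2-cell banner row
# with the 6-cell label row.
all_header_cells = [td.get_text() for tr in header_rows for td in tr.select("td")]

# Taking only the second header row keeps the real column labels.
label_cells = [td.get_text() for td in header_rows[1].select("td")]

# Cells of the first data row.
data_cells = [td.get_text() for td in table.select("tr:not(.toprow)")[0].select("td")]

print(len(all_header_cells), len(label_cells), len(data_cells))  # 8 6 6
```

Only the 6-label version matches the width of the data rows, which is what the [1:] slice achieves.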
Your solution would then look like this:
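from bs4 import BeautifulSoup
import pandas as pd
import requests

url = "http://archive.ifrs.org/Use-around-the-world/Pages/Analysis-of-the-IFRS-jurisdictional-profiles.aspx"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
gdp = soup.select("table#sample")[0]
rows = []
cols = []
# Skip the first .toprow (the banner row) and keep the 6 column labels
for g in gdp.select('tr.toprow')[1:]:
    for c in g.select('td'):
        cols.append(c.text)
# Each remaining row has 6 cells, matching the 6 labels
for g in gdp.select('tr:not(.toprow)'):
    row = []
    for item in g.select('td'):
        row.append(item.text)
    rows.append(row)
pd.DataFrame(rows, columns=cols)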
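The header and cell texts printed in the question still contain \xa0 and \r\n characters. As an optional extra step that is not part of the original answer, a small helper can normalize them before building the DataFrame. The name clean_cell is hypothetical; the sketch relies only on str.split(), which treats the non-breaking space as whitespace in Python 3.

```python
def clean_cell(text):
    # str.split() with no arguments splits on any Unicode whitespace,
    # including "\xa0", "\r" and "\n"; joining with a single space
    # collapses each run and trims both ends.
    return " ".join(text.split())


# Sample strings copied from the output shown in the question.
header = ("Jurisdictions that require IFRS\xa0Standards\xa0\r\n"
          " for all or most domestic publicly accountable entities")
label = "As % \r\n of 150"
blank = "\xa0"

print(repr(clean_cell(header)))
print(repr(clean_cell(label)))
print(repr(clean_cell(blank)))
```

In the loops above this would mean appending clean_cell(c.text) and clean_cell(item.text) instead of the raw text.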