Extracting data from a table only gives the last row

Problem description

I have the following table from this website:

<table id="sample">
    <tbody>
        <tr class="toprow">
            <td></td>
            <td colspan="5">Number of Jurisdictions</td>
        </tr>
        <tr class="toprow">
            <td>Region</td>
            <td>Jurisdictions in the region</td>
            <td>Jurisdictions that require IFRS&nbsp;Standards&nbsp;<br>
            for all or most domestic publicly accountable entities</td>
            <td>Jurisdictions that require IFRS Standards&nbsp;as % of total jurisdictions in the region</td>
            <td>Jurisdictions that permit or require IFRS&nbsp;Standards for at least some (but not all or most) domestic publicly accountable entities</td>
            <td>Jurisdictions that neither require nor permit IFRS Standards for any domestic publicly accountable entities</td>
        </tr>
        <tr>
            <td class="leftcol">Europe</td>
            <td class="data">44</td>
            <td class="data">43</td>
            <td class="data">98%</td>
            <td class="data">1</td>
            <td class="data">0</td>
        </tr>
        <tr>
            <td class="leftcol">Africa</td>
            <td class="data">23</td>
            <td class="data">19</td>
            <td class="data">83%</td>
            <td class="data">1</td>
            <td class="data">3</td>
        </tr>
        <tr>
            <td class="leftcol">Middle East</td>
            <td class="data">13</td>
            <td class="data">13</td>
            <td class="data">100%</td>
            <td class="data">0</td>
            <td class="data">0</td>
        </tr>
        <tr>
            <td class="leftcol">Asia-Oceania</td>
            <td class="data">33</td>
            <td class="data">24</td>
            <td class="data">73%</td>
            <td class="data">3</td>
            <td class="data">6</td>
        </tr>
        <tr>
            <td class="leftcol">Americas</td>
            <td class="data">37</td>
            <td class="data">27</td>
            <td class="data">73%</td>
            <td class="data">8</td>
            <td class="data">2</td>
        </tr>
        <tr>
            <td class="leftcol" style="border-top:2px solid #000000"><strong>Totals</strong></td>
            <td class="data" style="border-top:2px solid #000000"><strong>150</strong></td>
            <td class="data" style="border-top:2px solid #000000"><strong>126</strong></td>
            <td class="data" style="border-top:2px solid #000000"><strong>84%</strong></td>
            <td class="data" style="border-top:2px solid #000000"><strong>13</strong></td>
            <td class="data" style="border-top:2px solid #000000"><strong>11</strong></td>
        </tr>
        <tr>
            <td class="leftcol"><strong>As % <br>
            of 150</strong></td>
            <td class="data"><strong>100%</strong></td>
            <td class="data"><strong>84%</strong></td>
            <td class="data"><strong>&nbsp;</strong></td>
            <td class="data"><strong>9%</strong></td>
            <td class="data"><strong>7%</strong></td>
        </tr>
    </tbody>
</table>

Here is my attempt:

from bs4 import BeautifulSoup
import pandas as pd
import requests
# Site URL
url = "http://archive.ifrs.org/Use-around-the-world/Pages/Analysis-of-the-IFRS-jurisdictional-profiles.aspx"
# Make a GET request to fetch the raw HTML content
html_content = requests.get(url).text
# Parse HTML code for the entire site
soup = BeautifulSoup(html_content, "lxml")
# print(soup.prettify()) # print the parsed data of html
# Select the table with the id "sample"
gdp = soup.select("table#sample")[0]
rows = []
cols = []
for g in gdp.select('tr.toprow'):
    for c in g.select('td'):
        cols.append(c.text)
    
for g in gdp.select('tr:not(.toprow)'):
    row = []
    for item in g.select('td'):
        row.append(item.text)
    rows.append(row)
pd.DataFrame(rows, columns=cols)

As far as I can tell, cols comes out correctly:

['', 'Number of Jurisdictions', 'Region', 'Jurisdictions in the region', 'Jurisdictions that require IFRS\xa0Standards\xa0\r\n            for all or most domestic publicly accountable entities', 'Jurisdictions that require IFRS Standards\xa0as % of total jurisdictions in the region', 'Jurisdictions that permit or require IFRS\xa0Standards for at least some (but not all or most) domestic publicly accountable entities', 'Jurisdictions that neither require nor permit IFRS Standards for any domestic publicly accountable entities']

The problem is with the rows, where I only get the last row:

['As % \r\n            of 150', '100%', '84%', '\xa0', '9%', '7%']

I get this error:

ValueError: 8 columns passed, passed data had 6 columns

Tags: python

Solution


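There are two tr elements with the class .toprow. The first contributes 2 cells and the second 6, so cols ends up with 8 names while every data row has only 6 cells. Skip the first .toprow: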

for g in gdp.select('tr.toprow')[1:]:
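
To see where the 8 comes from, the header cells can be counted directly. The following is a minimal sketch, assuming a shortened copy of the table (the header texts are abbreviated, only two data rows are kept, and html.parser stands in for lxml so that nothing beyond bs4 is needed). It illustrates the mismatch and is not the page's real markup.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the real table: two .toprow rows and two data rows.
html = """
<table id="sample"><tbody>
<tr class="toprow"><td></td><td colspan="5">Number of Jurisdictions</td></tr>
<tr class="toprow"><td>Region</td><td>In region</td><td>Require</td>
<td>Require %</td><td>Permit</td><td>Neither</td></tr>
<tr><td class="leftcol">Europe</td><td class="data">44</td><td class="data">43</td>
<td class="data">98%</td><td class="data">1</td><td class="data">0</td></tr>
<tr><td class="leftcol">Africa</td><td class="data">23</td><td class="data">19</td>
<td class="data">83%</td><td class="data">1</td><td class="data">3</td></tr>
</tbody></table>
"""

table = BeautifulSoup(html, "html.parser").select_one("table#sample")
top_rows = table.select("tr.toprow")

# Cells collected from both header rows versus only the second one.
all_header_cells = [td.text for tr in top_rows for td in tr.select("td")]
header_cells = [td.text for tr in top_rows[1:] for td in tr.select("td")]

# Number of cells in each data row.
row_widths = [len(tr.select("td")) for tr in table.select("tr:not(.toprow)")]

print(len(all_header_cells), len(header_cells), row_widths)
```

Only the count taken after slicing with [1:] matches the width of the data rows, which is what the DataFrame constructor requires.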

Your solution would then look like this:

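from bs4 import BeautifulSoup
import pandas as pd
import requests

# Fetch and parse the page, as in the question
url = "http://archive.ifrs.org/Use-around-the-world/Pages/Analysis-of-the-IFRS-jurisdictional-profiles.aspx"
html_content = requests.get(url).text
soup = BeautifulSoup(html_content, "lxml")
gdp = soup.select("table#sample")[0]
rows = []
cols = []
# Skip the first .toprow row so that only the six column names are collected
for g in gdp.select('tr.toprow')[1:]:
    for c in g.select('td'):
        cols.append(c.text)

# Every row without the .toprow class is a data row with six cells
for g in gdp.select('tr:not(.toprow)'):
    row = []
    for item in g.select('td'):
        row.append(item.text)
    rows.append(row)
df = pd.DataFrame(rows, columns=cols)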
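
The cell texts collected this way still carry the \xa0 and \r\n debris visible in the question's output. As an optional follow-up, here is a sketch of the same loop with whitespace normalised before the DataFrame is built. The helper names clean and table_to_frame are hypothetical, the HTML is a shortened stand-in for the real page, and html.parser is used in place of lxml only to keep the example light on dependencies.

```python
from bs4 import BeautifulSoup
import pandas as pd

# Shortened stand-in for the real table, keeping the &nbsp; and <br> quirks.
html = """
<table id="sample"><tbody>
<tr class="toprow"><td></td><td colspan="5">Number of Jurisdictions</td></tr>
<tr class="toprow"><td>Region</td><td>Jurisdictions in the region</td>
<td>Require IFRS&nbsp;Standards&nbsp;<br>
    for all or most</td><td>Require %</td><td>Permit</td><td>Neither</td></tr>
<tr><td class="leftcol">Europe</td><td class="data">44</td><td class="data">43</td>
<td class="data">98%</td><td class="data">1</td><td class="data">0</td></tr>
<tr><td class="leftcol"><strong>As % <br>
    of 150</strong></td><td class="data"><strong>100%</strong></td>
<td class="data"><strong>84%</strong></td><td class="data"><strong>&nbsp;</strong></td>
<td class="data"><strong>9%</strong></td><td class="data"><strong>7%</strong></td></tr>
</tbody></table>
"""


def clean(cell):
    # str.split() with no arguments splits on any whitespace, including
    # non-breaking spaces and line breaks, so joining collapses them.
    return " ".join(cell.get_text(" ").split())


def table_to_frame(table):
    # Column names come from every .toprow row except the first (the banner).
    cols = [clean(td) for tr in table.select("tr.toprow")[1:] for td in tr.select("td")]
    # Each remaining row becomes one list of cleaned cell texts.
    rows = [[clean(td) for td in tr.select("td")] for tr in table.select("tr:not(.toprow)")]
    return pd.DataFrame(rows, columns=cols)


table = BeautifulSoup(html, "html.parser").select_one("table#sample")
df = table_to_frame(table)
print(df)
```

Note that rows in the original loop already holds every data row. The single list shown in the question is presumably the loop variable row after the final iteration, which is why only the last row appeared.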
