Trying to use BeautifulSoup to get data from a site that has no API

Question

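I'm building a scraper that pulls tabular data from a website and uploads it to an Azure database. I'm trying to use Beautiful Soup to scrape the data. The site is https://www.pgcb.org.bd/PGCB/?a=pages/hourly_generation_loadshed_display.php and the problem is that its HTML is messy: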

</div>
<!-- main container-->
<div class="grid_18" id="main_container">
<div style="padding-left: 10px; padding-top: 5px;"><img 
src="images/hgen&amp;loadshed.jpg"/></div>
<head>
<style>
        tr:nth-child(even){
            background-color: #ccc;
    }
    tr:hover
    {
        background: #f7dcdf;
    }
</style>
</head>
<table class="layout display responsive-table"><tr>
<th style="text-align: center;">Date</th>
<th style="text-align: center;">Time</th>
<th style="text-align: center;">Generation</th>
<th style="text-align: center;">Demand</th>
<th style="text-align: center;">Shortage</th>
<th style="text-align: center;">Loadshed</th>
<th style="text-align: center;">Remark</th>
</tr> <tr>
<td style="text-align: center;">10-10-2019</td>
<td style="text-align: center;">09:00:00</td>
<td style="text-align: center;">7600.4</td>
<td style="text-align: center;">7600</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;"></td>
</tr>
<tr>
<td style="text-align: center;">10-10-2019</td>
<td style="text-align: center;">08:00:00</td>
<td style="text-align: center;">7165.2</td>
<td style="text-align: center;">7165</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;">0</td>
<td style="text-align: center;"></td>
</tr>
<tr>

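So far I have tried the code further down, which returns the markup above along with some other text that I can strip out later. What I need, though, is the text from the date and time cells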

<td style="text-align: center;">10-10-2019</td>
<td style="text-align: center;">09:00:00</td>

in a tabular format, for example:

Date | Time |

10-10-2019 | 9:00:00|

This is what I have so far:

from bs4 import BeautifulSoup as soup
from urllib.request import urlopen as uReq  # web client

# page to scrape
page_url = "https://www.pgcb.org.bd/PGCB/?a=pages/hourly_generation_loadshed_display.php"
uclient = uReq(page_url)

# parse the HTML
page_soup = soup(uclient.read(), "html.parser")
uclient.close()

table1 = page_soup.find_all("table", {"class": "layout display responsive-table"})

Please let me know how I can improve this and get the expected result.

Tags: python, html, python-3.x, web-scraping, beautifulsoup

Answer


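BeautifulSoup is a great tool. In this particular case, though, you can either go the long way with BeautifulSoup or do what I do whenever I see <table> tags: use pandas.read_html() to do the work (it parses the page with lxml or BeautifulSoup under the hood) and then clean up the table a little. It returns a list of DataFrames, one per <table> tag. Here there are 2 table tags, and the table you want is at index position 1: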

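import pandas as pd

url = 'https://www.pgcb.org.bd/PGCB/?a=pages/hourly_generation_loadshed_display.php'

# read_html returns one DataFrame per <table> tag on the page
tables = pd.read_html(url)
df = tables[1]

df = df[:-1]                        # drop the last row
df = df.dropna(axis=1,how='all')    # drop columns that are entirely NaN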

Output:

print (df.to_string())
          Date      Time Generation Demand Shortage Loadshed        Remark
0   10-10-2019  18:00:00       9182   9182        0        0           NaN
1   10-10-2019  17:00:00     8091.3   8091        0        0           NaN
2   10-10-2019  16:00:00     8277.7   8278        0        0           NaN
3   10-10-2019  15:00:00     8465.8   8466        0        0           NaN
4   10-10-2019  14:00:00     8394.7   8395        0        0           NaN
5   10-10-2019  13:00:00     8553.4   8553        0        0           NaN
6   10-10-2019  12:00:00       8376   8376        0        0      Day Peak
7   10-10-2019  11:00:00     8169.9   8170        0        0           NaN
8   10-10-2019  10:00:00     7900.9   7901        0        0           NaN
9   10-10-2019  09:00:00     7600.4   7600        0        0           NaN
10  10-10-2019  08:00:00     7165.2   7165        0        0           NaN
11  10-10-2019  07:00:00     6980.4   6980        0        0           NaN
12  10-10-2019  06:00:00     7017.1   7017        0        0           NaN
13  10-10-2019  05:00:00       7328   7328        0        0           NaN
14  10-10-2019  04:00:00       7504   7504        0        0           NaN
15  10-10-2019  03:00:00       7877   7877        0        0           NaN
16  10-10-2019  02:00:00       8071   8071        0        0           NaN
17  10-10-2019  01:00:00       8400   8400        0        0           NaN
18  09-10-2019  24:00:00       8847   8847        0        0           NaN
19  09-10-2019  23:00:00       9093   9093        0        0           NaN
20  09-10-2019  22:00:00       9483   9483        0        0           NaN
21  09-10-2019  21:00:00       9852   9852        0        0           NaN
22  09-10-2019  20:00:00      10284  10284        0        0  Evening Peak
23  09-10-2019  19:30:00      10229  10229        0        0           NaN
24  09-10-2019  19:00:00      10211  10211        0        0           NaN
25  09-10-2019  18:00:00       9538   9538        0        0           NaN
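
The question only asks for the Date and Time columns. Here is a minimal sketch of the cleanup and the column selection, run on a small hand-built DataFrame rather than the live page. The sample values are copied from the output above; the all-empty `Unnamed: 7` column and the trailing junk row are assumptions standing in for whatever debris the real page produces.

```python
import numpy as np
import pandas as pd

# Hand-built stand-in for tables[1]. The extra empty column and the
# trailing junk row are assumptions, not copied from the real page.
raw = pd.DataFrame(
    {
        "Date": ["10-10-2019", "10-10-2019", "10-10-2019", None],
        "Time": ["09:00:00", "08:00:00", "07:00:00", None],
        "Generation": ["7600.4", "7165.2", "6980.4", None],
        "Unnamed: 7": [np.nan, np.nan, np.nan, np.nan],
    }
)

df = raw[:-1]                        # drop the trailing junk row
df = df.dropna(axis=1, how="all")    # drop columns that are entirely empty

# Keep only the two columns the question asks for
date_time = df[["Date", "Time"]]
print(date_time.to_string(index=False))
```

Selecting with a list of column names returns a new DataFrame, so `df` itself still holds the full table if the other columns are needed later.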

Additional

If you want to see how this works with BeautifulSoup directly, here is how to iterate over the rows. QHarr also offers another, better approach.

import pandas as pd
from bs4 import BeautifulSoup
import requests

url = 'https://www.pgcb.org.bd/PGCB/?a=pages/hourly_generation_loadshed_display.php'

response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')

# The page has two <table> tags; the data table is the second one
tables = soup.find_all('table')
table = tables[1]

# Column names come from the <th> cells of the header row
headers = table.find_all('th')
columns = [ th.text for th in headers ]

# Collect the cell text of every row; the header row has no <td> cells
data = []
rows = table.find_all('tr')
for row in rows:
    tds = row.find_all('td')
    if tds:
        data.append([ td.text for td in tds ])

# DataFrame.append was removed in pandas 2.0, so build the frame in one step
df = pd.DataFrame(data)

df = df.dropna(axis=1,how='all')
df = df.dropna(axis=0,how='all')
df.columns = columns
df = df[:-1]
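
To try the row iteration without hitting the live site, here is a minimal offline sketch. It parses a shortened copy of the markup from the question (three columns instead of seven) and prints only the date and time, in the format the question asks for. Taking the first two cells of each row assumes Date and Time stay the first two columns, as they are in the question's markup.

```python
from bs4 import BeautifulSoup

# Shortened copy of the table markup shown in the question
html = """
<table class="layout display responsive-table"><tr>
<th style="text-align: center;">Date</th>
<th style="text-align: center;">Time</th>
<th style="text-align: center;">Generation</th>
</tr> <tr>
<td style="text-align: center;">10-10-2019</td>
<td style="text-align: center;">09:00:00</td>
<td style="text-align: center;">7600.4</td>
</tr>
<tr>
<td style="text-align: center;">10-10-2019</td>
<td style="text-align: center;">08:00:00</td>
<td style="text-align: center;">7165.2</td>
</tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# class_ matches any one of the space-separated class names
table = soup.find("table", class_="responsive-table")

rows = []
for tr in table.find_all("tr"):
    cells = [td.get_text(strip=True) for td in tr.find_all("td")]
    if cells:                      # the header row has no <td> cells
        rows.append(cells[:2])     # assumed: Date and Time come first

print("Date | Time")
for date, time in rows:
    print(f"{date} | {time}")
```

The same loop should work on the `table` object from the live page, provided the column order there matches the excerpt.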
