
Problem description

I wrote a script that scrapes tables from a website and saves them to an Excel workbook:

import pandas as pd
import requests
from bs4 import BeautifulSoup
from pandas import ExcelWriter
import os.path
path = "C:...."
url= 'https://zoek.officielebekendmakingen.nl/kst-35570-2.html'
page = requests.get(url)
soup = BeautifulSoup(page.text, 'html.parser')

tables_df = pd.read_html(url, attrs = {'class': 'kio2 portrait'})

tables = soup.find_all('table', class_="kio2 portrait")

titles = []
for table in tables:
    print(table)
    title = table.find_all("caption", class_="table-title")
    titles.append(title)
titles = []

writer = pd.ExcelWriter('output.xlsx')
for i, df in enumerate(tables_df, 1):
    df.to_excel(writer, index=True,sheet_name=f'sheetName_{i}')
writer.save()

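This works, but now I want to find the title of each table so that I can name each sheet after it. For example, the first table contains the following text that I am interested in: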

<table cellpadding="0" cellspacing="0" class="kio2 portrait" summary="Tabel 1.1 Budgettaire kerngegevens"><caption class="table-title">Tabel 1.1 Budgettaire kerngegevens</caption>

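Now I want to scrape the text between <caption class="table-title"> and </caption>. Alternatively, the summary attribute of the table could be used. How can I do this? I already tried it in the code above, but I have not found anything yet.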

Tags: python, html, pandas

Solution


Try:

import io

import pandas as pd
import requests
from bs4 import BeautifulSoup


url = "https://zoek.officielebekendmakingen.nl/kst-35570-2.html"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# The context manager saves and closes the workbook on exit
# (ExcelWriter.save() was removed in pandas 2.0).
with pd.ExcelWriter("output.xlsx") as writer:
    for i, table in enumerate(soup.find_all("table", class_="kio2 portrait"), 1):
        # Wrap the markup in StringIO: newer pandas versions deprecate
        # passing a literal HTML string to read_html.
        df = pd.read_html(io.StringIO(str(table)))[0]

        # Use the summary attribute as the sheet name; ":" is not allowed there.
        caption = table.get("summary", "").replace(":", "").strip()
        # Some tables don't have a summary, so fall back to a generic sheet name.
        if not caption:
            caption = f"table {i}"

        df.to_excel(writer, sheet_name=caption)

This creates output.xlsx with 185 sheets (at least when I open it in LibreOffice).
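
The question also asks about the text inside <caption class="table-title">. A minimal sketch of that variant follows, using the HTML fragment quoted in the question as inline sample data so that it runs without a network request. The helper name table_title, the filler rows, the closing tags, and the second caption-less table are assumptions added for illustration; they are not part of the original page or answer.

```python
from bs4 import BeautifulSoup

# Sample markup: the first opening tag and its caption are quoted from the
# question; the filler rows, closing tags, and second table are assumptions
# added so the fragment is complete and the fallback path is exercised.
html = (
    '<table cellpadding="0" cellspacing="0" class="kio2 portrait" '
    'summary="Tabel 1.1 Budgettaire kerngegevens">'
    '<caption class="table-title">Tabel 1.1 Budgettaire kerngegevens</caption>'
    '<tr><td>filler</td></tr></table>'
    '<table class="kio2 portrait"><tr><td>filler</td></tr></table>'
)


def table_title(table, index):
    """Return the caption text, else the summary attribute, else a generic name."""
    caption = table.find("caption", class_="table-title")
    if caption is not None:
        text = caption.get_text(strip=True)
        if text:
            return text
    summary = table.get("summary", "").strip()
    return summary or f"table {index}"


soup = BeautifulSoup(html, "html.parser")
titles = [
    table_title(table, i)
    for i, table in enumerate(soup.find_all("table", class_="kio2 portrait"), 1)
]
print(titles)
```

In the loop from the answer, table_title(table, i) could take the place of the table.get("summary", "") lookup; the result would still need the same cleanup before being used as a sheet name.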

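One caveat when table titles are used as sheet names: Excel limits a sheet name to 31 characters, rejects the characters \ / ? * [ ] :, and requires names to be unique within a workbook. The summary quoted in the question is already 34 characters long, and how a longer name is handled depends on the writer engine and on the application that opens the file. The following is a rough sketch of one way to normalize names before calling to_excel. The helper safe_sheet_name, the numbering scheme for duplicates, and the third sample title are hypothetical and not part of the original answer.

```python
import re

# Characters that Excel does not allow in a sheet name.
INVALID_CHARS = re.compile(r"[\[\]:*?/\\]")
MAX_LEN = 31  # Excel's limit on sheet-name length


def safe_sheet_name(title, used):
    """Strip forbidden characters, truncate to 31 characters, keep names unique."""
    base = INVALID_CHARS.sub("", title).strip()[:MAX_LEN] or "table"
    name = base
    n = 2
    # Excel compares sheet names case-insensitively, so track lowercase forms.
    while name.lower() in used:
        suffix = f" ({n})"
        name = base[:MAX_LEN - len(suffix)] + suffix
        n += 1
    used.add(name.lower())
    return name


used = set()
sample_titles = [
    "Tabel 1.1 Budgettaire kerngegevens",  # from the question (34 characters)
    "Tabel 1.1 Budgettaire kerngegevens",  # duplicate, to show the numbering
    "Inkomsten: 2020/2021",                # hypothetical title with forbidden characters
]
names = [safe_sheet_name(t, used) for t in sample_titles]
print(names)
```

The set of used names is passed in explicitly so that one set can be shared across all tables written to the same workbook.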

