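python-3.x - Loop over pages and save the detail contents as a dataframe in Python

Question

Suppose I need to scrape the detail contents from this link:

The goal is to extract the contents of the elements on the linked page and append all entries to a dataframe.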

from bs4 import BeautifulSoup
import requests
import os
from urllib.parse import urlparse

url = 'http://www.jscq.com.cn/dsf/zc/cjgg/202101/t20210126_30144.html'
r = requests.get(url)
soup = BeautifulSoup(r.content, "html.parser")
text = soup.find_all(text=True)

output = ''
blacklist = [
    '[document]',
    'noscript',
    'header',
    'html',
    'meta',
    'head', 
    'input',
    'script'
]

for t in text:
    if t.parent.name not in blacklist:
        output += '{} '.format(t)
print(output)

Out:

南京市玄武区锁金村10-30号房屋公开招租成交公告-成交公告-江苏产权市场 
body{font-size:100%!important;}
.main_body{position:relative;width:1000px;margin:0 auto;background-color:#fff;}
.main_content_p img{max-width:90%;display:block;margin:0 auto;}
.m_con_r_h{padding-left: 20px;width: 958px;height: 54px;line-height: 55px;font-size: 12px;color: #979797;}
.m_con_r_h a{color: #979797;}
.main_content_p{min-height:200px;width:90%;margin:0 auto;line-height: 30px;text-indent:0;}
.main_content_p table{margin:0 auto!important;width:900px!important;}
.main_content_h1{border:none;width:93%;margin:0 auto;}
.tit_h{font-size:22px;font-family:'微软雅黑';color:#000;line-height:30px;margin-bottom:10px;padding-bottom:20px;text-align:center;}
.doc_time{font-size:12px;color:#555050;height:28px;line-height:28px;text-align:center;background:#F2F7FD;border-top:1px solid #dadada;}
.doc_time span{padding:0 5px;}
.up_dw{width:100%;border-top:1px solid #ccc;padding-top:10px;padding-bottom:10px;margin-top:30px;clear:both;}
.pager{width:50%;float:left;padding-left:0;text-align:center;}

.bshare-custom{position:absolute;top:20px;right:40px;}
.pager{width:90%;padding-left: 50px;float:inherit;text-align: inherit;}
 页头部分开始 
 页头部分结束 
  START body  
 南京市玄武区锁金村10-30号房屋公开招租成交公告 
 组织机构：江苏省产权交易所 
 发布时间：2021-01-26  
 项目编号 
 17FCZZ20200125 
 转让/出租标的名称 
 南京市玄武区锁金村10-30号房屋公开招租 
 转让方/出租方名称 
 南京邮电大学资产经营有限责任公司 
 转让标的评估价/年租金评估价（元） 
 64800.00 
 转让底价/年租金底价（元） 
 97200.00 
 受让方/承租方名称 
 马尕西木 
 成交价/成交年租金（元） 
 97200.00 
 成交日期 
 2021年01月15日 
 附件： 
  END body  
 页头部分开始 
 页头部分结束

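But how can I loop over all the pages, extract their contents, and append them to a single dataframe? Thanks.

Update: appending the dfs to one dataframe: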

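# Relies on the imports, get_main_urls() and get_follow_urls() from the solution code
frames = []

with requests.Session() as connection_session:  # reuse your connection!
    for follow_url in get_follow_urls(get_main_urls(), connection_session):
        key = follow_url.rsplit("/")[-1].replace(".html", "")
        # print(f"Fetching data for {key}...")
        dfs = pd.read_html(
            connection_session.get(follow_url).content.decode("utf-8"),
            flavor="bs4",
        )
        # pd.read_html returns a list of DataFrames, see
        # https://stackoverflow.com/questions/39710903/pd-read-html-imports-a-list-rather-than-a-dataframe
        # Take the first table once per page, transpose it, and drop the label row
        frames.append(dfs[0].T.iloc[1:, :].copy())

# DataFrame.append was removed in pandas 2.0; concatenate the collected rows instead
updated_df = pd.concat(frames, ignore_index=True)
print(updated_df)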

cols = ['项目编号', '转让/出租标的名称', '转让方/出租方名称', '转让标的评估价/年租金评估价（元）', 
        '转让底价/年租金底价（元）', '受让方/承租方名称', '成交价/成交年租金（元）', '成交日期']
updated_df.columns = cols
updated_df.to_excel('./data.xlsx', index = False)
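The transpose-and-append step above can also be sketched without hardcoding the column names. This is a minimal sketch, assuming each detail page yields one two-column label/value table, as the printed output suggests. `table_to_row`, `page_a` and `page_b` are hypothetical names, and the values are made up to stand in for what `pd.read_html` would return.

```python
import pandas as pd


def table_to_row(table: pd.DataFrame) -> pd.DataFrame:
    """Turn a two-column label/value table into a single-row DataFrame."""
    # Column 0 holds the labels and column 1 the values; after
    # set_index + transpose the labels become the column names.
    row = table.set_index(0).T
    row.columns.name = None
    return row.reset_index(drop=True)


# Made-up tables (assumption): stand-ins for dfs[0] of two detail pages.
page_a = pd.DataFrame([["项目编号", "A-001"], ["成交日期", "2021年01月15日"]])
page_b = pd.DataFrame([["项目编号", "B-002"], ["成交日期", "2021年02月25日"]])

# Build one row per page, then concatenate once at the end.
combined = pd.concat(
    [table_to_row(page_a), table_to_row(page_b)],
    ignore_index=True,
)
print(combined)
```

Collecting the rows in a list and calling `pd.concat` once avoids growing a dataframe inside the loop, and taking the headers from the page keeps the columns correct if a page lists its fields in a different order.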

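Tags: python-3.x, web-scraping, beautifulsoup, python-requests, web-crawler

Solution

Here is how I would do it:

1. Build all the main urls
2. Visit every main page
3. Get the follow urls
4. Visit each follow url
5. Grab the table from the follow url
6. Parse the table with pandas
7. Add the table to a dictionary of pandas dataframes
8. Process the tables (not included -> implement your own logic)

Repeat steps 2 - 7 to continue scraping the data.

The code: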

import pandas as pd
import requests
from bs4 import BeautifulSoup

BASE_URL = "http://www.jscq.com.cn/dsf/zc/cjgg"


def get_main_urls() -> list:
    start_url = f"{BASE_URL}/index.html"
    return [start_url] + [f"{BASE_URL}/index_{i}.html" for i in range(1, 6)]


def get_follow_urls(urls: list, session: requests.Session) -> iter:
    for url in urls[:1]:  # remove [:1] to scrape all the pages
        body = session.get(url).content
        s = BeautifulSoup(body, "lxml").find_all("td", {"width": "60%"})
        yield from [f"{BASE_URL}{a.find('a')['href'][1:]}" for a in s]


dataframe_collection = {}

with requests.Session() as connection_session:  # reuse your connection!
    for follow_url in get_follow_urls(get_main_urls(), connection_session):
        key = follow_url.rsplit("/")[-1].replace(".html", "")
        print(f"Fetching data for {key}...")
        df = pd.read_html(
            connection_session.get(follow_url).content.decode("utf-8"),
            flavor="bs4",
        )
        dataframe_collection[key] = df

    # process the dataframe_collection here

# print the dictionary of dataframes (optional and can be removed)
for key in dataframe_collection.keys():
    print("\n" + "=" * 40)
    print(key)
    print("-" * 40)
    print(dataframe_collection[key])

Output:

Fetching data for t20210311_30347...
Fetching data for t20210311_30346...
Fetching data for t20210305_30338...
Fetching data for t20210305_30337...
Fetching data for t20210303_30323...
Fetching data for t20210225_30306...
Fetching data for t20210225_30305...
Fetching data for t20210225_30304...
Fetching data for t20210225_30303...
Fetching data for t20210209_30231...

and then ...
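The follow urls in `get_follow_urls` are built by slicing the leading dot off each `href` and gluing the rest onto `BASE_URL`. A slightly more defensive variant, sketched here, lets `urllib.parse.urljoin` resolve the link against the index page. It assumes the hrefs are relative links of the form `./202101/t20210126_30144.html` (inferred from the `[1:]` slice, not confirmed against the site). `resolve_follow_url` and `page_key` are hypothetical helper names.

```python
from pathlib import PurePosixPath
from urllib.parse import urljoin, urlparse

BASE_URL = "http://www.jscq.com.cn/dsf/zc/cjgg"


def resolve_follow_url(index_url: str, href: str) -> str:
    """Resolve a link found on an index page against that page's url."""
    return urljoin(index_url, href)


def page_key(url: str) -> str:
    """Return the file name of the page without its .html extension."""
    return PurePosixPath(urlparse(url).path).stem


index_url = f"{BASE_URL}/index.html"
# Assumed href format: relative to the index page, starting with "./"
follow = resolve_follow_url(index_url, "./202101/t20210126_30144.html")
key = page_key(follow)
print(follow)
print(key)
```

`urljoin` also copes with absolute or parent-relative links, which plain string slicing would turn into a broken url.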
