How to use a while loop with BeautifulSoup to keep checking for data changes

Problem description

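I am scraping a page from a website whose data is updated roughly every 6-8 hours. If the data has not changed, the page stays the same. I would like the script to check for changes by itself, so that I do not have to keep clicking Run to see whether the data has changed.

In addition, I want to save the scraped data to a CSV file. Here is the code I use to scrape the site: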

import csv
import re
import requests
from bs4 import BeautifulSoup

url = "https://www.ndbc.noaa.gov/station_page.php?station=56003"
request_headers = {
    "user-agent": ("Mozilla/5.0 (Windows NT 10.0; Win64; x64) "
                   "AppleWebKit/537.36 (KHTML, like Gecko) "
                   "Chrome/88.0.4324.150 Safari/537.36 Edg/88.0.705.63")
}
response = requests.get(url, headers=request_headers)
response.raise_for_status()
soup = BeautifulSoup(response.text, "html.parser")
headers = ["Year", "Month", "Day", "Hour", "Minute", "Second", "T", "Height"]

with open("station-56003.csv", "w") as f:
    writer = csv.writer(f, lineterminator="\n")
    writer.writerow(headers)

    # Keep only 30-character lines made of digits, dots and spaces
    # that split into exactly one field per header.
    for line in soup.select_one("#data").text.split("\n"):
        if re.fullmatch(r"[\d. ]{30}", line) and len(line.split()) == len(headers):
            writer.writerow(line.split())

Tags: python, beautifulsoup

Solution


Consider checking the checksum of the CSV file. If the checksum has changed, there is new data.
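
The checksum idea can be combined with a polling loop along the following lines. This is only a minimal sketch, not a definitive implementation: the names checksum, watch, fetch, on_change, interval_seconds and max_checks are hypothetical, SHA-256 is just one possible checksum, and the one-hour default interval is an assumption based on the 6-8 hour update cycle mentioned in the question. Here fetch is assumed to return the scraped data as bytes (for example the contents that would go into the CSV), and on_change is assumed to do the actual writing.

```python
import hashlib
import time


def checksum(data: bytes) -> str:
    """Return the SHA-256 hex digest of the given bytes."""
    return hashlib.sha256(data).hexdigest()


def watch(fetch, on_change, interval_seconds=3600, max_checks=None, sleep=time.sleep):
    """Poll fetch() and call on_change(data) whenever its checksum differs.

    All parameter names are hypothetical. With max_checks=None the loop
    runs forever, like a plain `while True`.
    """
    last_digest = None
    checks = 0
    while max_checks is None or checks < max_checks:
        data = fetch()
        digest = checksum(data)
        if digest != last_digest:
            # The first check always counts as a change, so the initial
            # data is handed to on_change as well.
            on_change(data)
            last_digest = digest
        checks += 1
        # Wait before the next check, but not after the final one.
        if max_checks is None or checks < max_checks:
            sleep(interval_seconds)
```

With the code from the question, fetch could wrap the requests.get call and the line filtering and return the joined rows encoded as bytes, while on_change could write them to station-56003.csv. Passing max_checks and sleep as parameters is only there to make the loop easy to try out without waiting for real updates.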

