Python loop to download serially sequenced website URL ids

Problem description

Although I try to loop over "urlpage" in ascending sequential order, this only gives me the 0021 zip file, and only after Firefox asks me to download it. What is wrong with my code, and how do I make it loop over and open all of the URLs built from the sequence numbers in my loop?

import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd
import os

j=''
k=1
while k < 4:
    j='002'+ str(k)
    print(str(j))
    if k>0:
        urlpage = 'https://www150.statcan.gc.ca/n1/tbl/csv/3210'+j+'-eng.zip'
        print(urlpage)
    k+=1
    # run firefox webdriver from executable path of your choice
    driver = webdriver.Firefox()
    # get web page
    driver.get(urlpage)
    # execute script to scroll down the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    # sleep for 30s
    time.sleep(30)
    driver.quit()
Output:

0021
https://www150.statcan.gc.ca/n1/tbl/csv/32100021-eng.zip

Tags: python

Solution


So I don't understand why you are scrolling down on that particular urlpage. You cannot scroll down a zip file; your link takes you straight to a zip file that has to be downloaded. I once did something similar with chromedriver, so maybe this will help. I'm not sure the Firefox driver would behave any differently (at least not beyond the chrome_options part).

Python = 3.6, selenium.__version__ = 3.14.1

import time
import zipfile
import pathlib
from selenium import webdriver

cwd = pathlib.Path.cwd()
chrome_driver = cwd / 'chromedriver.exe'
download_folder = cwd / 'downloads' # make sure this folder exists

# You could use an f"" string on urlpage
j=''
k=1
while k < 4:
    j='002'+ str(k)
    print(str(j))

    if k>0: # may not be necessary
        urlpage = 'https://www150.statcan.gc.ca/n1/tbl/csv/3210'+j+'-eng.zip' 
        print(urlpage)

    k+=1
    # run chrome instead - the only reason for this is because I used it before :)
    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", {"download.default_directory": str(download_folder)})
    driver = webdriver.Chrome(str(chrome_driver), chrome_options=options)

    # get web page
    driver.get(urlpage)

    # Your page is not a WEBPAGE. it is a ZIP file. You cannot scroll anywhere on a zip file
    # driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")

    # sleep for 30s
    time.sleep(30)

    # you can unzip here if you want
    downloaded_file = urlpage.split('/')[-1]
    directory_to_unzip_to = download_folder / downloaded_file.split('.')[0]
    with zipfile.ZipFile(download_folder / downloaded_file) as zip_ref:
        zip_ref.extractall(directory_to_unzip_to)

    driver.quit()
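The comment above about f-strings can be made concrete; here is a minimal sketch (the `build_url` helper name is my own, not from the answer) that zero-pads the table number instead of hard-coding the '002' prefix:

```python
def build_url(k: int) -> str:
    # zero-pad the numeric suffix to four digits, e.g. k=21 -> '0021'
    return f'https://www150.statcan.gc.ca/n1/tbl/csv/3210{k:04d}-eng.zip'

for k in range(21, 24):
    # prints the three zip URLs, 32100021 through 32100023
    print(build_url(k))
```

With `{k:04d}` the padding adjusts automatically, so the same loop keeps working past table 0099 where the string-concatenation version would break.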

Output:

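Since each link points straight at a zip file, the download does not strictly need a browser at all. A minimal sketch using only the standard library (the `fetch_and_unzip` helper is my own naming, and it assumes the same URL pattern as above):

```python
import pathlib
import urllib.request
import zipfile

def fetch_and_unzip(urlpage: str, download_folder: pathlib.Path) -> pathlib.Path:
    """Download the zip at `urlpage` into `download_folder` and extract it."""
    download_folder.mkdir(parents=True, exist_ok=True)
    target = download_folder / urlpage.split('/')[-1]
    urllib.request.urlretrieve(urlpage, target)  # direct download, no webdriver
    out_dir = download_folder / target.stem      # e.g. downloads/32100021-eng
    with zipfile.ZipFile(target) as zf:
        zf.extractall(out_dir)
    return out_dir

# usage (hits the network):
#   for k in range(21, 24):
#       fetch_and_unzip(f'https://www150.statcan.gc.ca/n1/tbl/csv/3210{k:04d}-eng.zip',
#                       pathlib.Path.cwd() / 'downloads')
```

This also sidesteps the fixed 30-second sleep, since `urlretrieve` returns only once the file is fully written.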

