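python - Python loop for downloading sequentially numbered website URL IDs
Question
I am trying to loop over "urlpage" with sequentially increasing IDs, but this only gives me the 0021 zip file, and only after Firefox asks me whether to download it. What is wrong with my code, and how can I make it open every URL built from the sequence numbers in my loop?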
import urllib.request
from bs4 import BeautifulSoup
from selenium import webdriver
import time
import pandas as pd
import os

j=''
k=1
while k < 4:
    j='002'+ str(k)
    print(str(j))
    if k>0:
        urlpage = 'https://www150.statcan.gc.ca/n1/tbl/csv/3210'+j+'-eng.zip'
        print(urlpage)
    k+=1
    # run firefox webdriver from executable path of your choice
    driver = webdriver.Firefox()
    # get web page
    driver.get(urlpage)
    # execute script to scroll down the page
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    # sleep for 30s
    time.sleep(30)
    driver.quit()
Output:
0021
https://www150.statcan.gc.ca/n1/tbl/csv/32100021-eng.zip
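Answer
I don't understand why you want to scroll down on that particular urlpage. You cannot scroll down a zip file. Your link takes you straight to a zip file that has to be downloaded. I did something similar with chromedriver once, so maybe this will help. I am not sure whether the Firefox driver would be any different (at the very least there would be no chrome_options).
This was written with Python = 3.6 and selenium.__version__ = 3.14.1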
import time
import zipfile
import pathlib
from selenium import webdriver

cwd = pathlib.Path.cwd()
chrome_driver = cwd / 'chromedriver.exe'
download_folder = cwd / 'downloads'  # make sure this folder exists

# You could use an f"" string on urlpage
j = ''
k = 1
while k < 4:
    j = '002' + str(k)
    print(str(j))
    if k > 0:  # may not be necessary
        urlpage = 'https://www150.statcan.gc.ca/n1/tbl/csv/3210' + j + '-eng.zip'
        print(urlpage)
    k += 1
    # run chrome instead - the only reason for this is because I used it before :)
    options = webdriver.ChromeOptions()
    options.add_experimental_option("prefs", {"download.default_directory": str(download_folder)})
    driver = webdriver.Chrome(str(chrome_driver), chrome_options=options)
    # get web page
    driver.get(urlpage)
    # Your page is not a WEBPAGE. it is a ZIP file. You cannot scroll anywhere on a zip file
    # driver.execute_script("window.scrollTo(0, document.body.scrollHeight);var lenOfPage=document.body.scrollHeight;return lenOfPage;")
    # sleep for 30s
    time.sleep(30)
    # you can unzip here if you want
    downloaded_file = urlpage.split('/')[-1]
    directory_to_unzip_to = download_folder / downloaded_file.split('.')[0]
    zip_ref = zipfile.ZipFile(download_folder / downloaded_file, 'r')
    zip_ref.extractall(directory_to_unzip_to)
    zip_ref.close()
    driver.quit()
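Since each link points directly at a zip file, a browser may not be needed at all. The following is only a minimal sketch of that alternative, not what the answer above does: it swaps Selenium for the standard library's urllib.request and builds the URLs with an f-string, as the comment in the code suggests. The names build_urls and download_and_unzip are hypothetical helpers introduced here for illustration, and it is an assumption that the server hands out the files to a plain HTTP client.

```python
import urllib.request
import zipfile
from pathlib import Path

# Same URL pattern as in the question; only the last digit changes.
BASE_URL = "https://www150.statcan.gc.ca/n1/tbl/csv/"


def build_urls(start=1, stop=4):
    """Return the zip URLs for table IDs 3210002<k>, with k in [start, stop)."""
    return [f"{BASE_URL}3210002{k}-eng.zip" for k in range(start, stop)]


def download_and_unzip(url, folder):
    """Hypothetical helper: save one zip into `folder`, then extract it."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)  # create the folder if missing
    zip_path = folder / url.rsplit("/", 1)[-1]  # e.g. 32100021-eng.zip
    urllib.request.urlretrieve(url, str(zip_path))
    # Extract into a sibling directory named after the archive (without .zip).
    with zipfile.ZipFile(str(zip_path)) as archive:
        archive.extractall(str(folder / zip_path.stem))
    return zip_path
```

A loop such as `for url in build_urls(): download_and_unzip(url, "downloads")` would then fetch the three files one after another. Note that the single-digit scheme mirrors the original '002' + str(k) and would need adjusting for k of 10 or more.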