Error when trying to loop through web pages for data scraping

Problem description

I have written code that extracts the data from the first page, but I am running into problems when I try to extract data from all of the pages.

Here is my code that extracts the data from page "a":

from bs4 import BeautifulSoup
import urllib
import urllib.request
import os
from string import ascii_lowercase


def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

playerdatasaved = ""

soup = make_soup('https://www.basketball-reference.com/players/a/')

for record in soup.findAll("tr"): 
    playerdata = "" 
    for data in record.findAll(["th","td"]): 
        playerdata = playerdata + "," + data.text 

    playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

print(playerdatasaved)

header = "player, from, to, position, height, weight, dob, year, colleges" + "\n"
file = open(os.path.expanduser("basketballstats.csv"), "wb")
file.write(bytes(header, encoding="ascii", errors="ignore"))
file.write(bytes(playerdatasaved[1:], encoding="ascii", errors="ignore"))
file.close()

Now, to loop over the pages, my logic is this code:

from bs4 import BeautifulSoup
import urllib
import urllib.request
import os
from string import ascii_lowercase

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

playerdatasaved = ""
for letter in ascii_lowercase:
    soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
    for record in soup.findAll("tr"):
        playerdata = "" 
        for data in record.findAll(["th","td"]): 
            playerdata = playerdata + "," + data.text 

        playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "player, from, to, position, height, weight, dob, year, colleges" + "\n"
file = open(os.path.expanduser("basketball.csv"), "wb")
file.write(bytes(header, encoding="ascii", errors="ignore"))
file.write(bytes(playerdatasaved[1:], encoding="ascii", errors="ignore"))
file.close()

However, this runs into an error on the line: soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")

Tags: python, python-3.x

Solution


I tried running your code and ran into an SSL certificate error, CERTIFICATE_VERIFY_FAILED, which appears to be a problem with the site you are trying to scrape rather than with your code.

Perhaps this Stack Overflow question can help clear things up: "SSL: certificate_verify_failed" error when scraping https://www.thenewboston.com/
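As a quick workaround (a sketch, not the asker's code; the helper name open_url_insecure is hypothetical), each page can be opened through an SSL context with certificate verification disabled. Note that this removes the protection that certificate checking provides, so the cleaner fix discussed in the linked question, updating your local certificate store, is preferable for anything beyond a one-off scrape:

```python
import ssl
import urllib.request


def open_url_insecure(url):
    # Build an SSL context that skips certificate verification.
    # WARNING: this disables a real security check; only use it when
    # the site's certificate chain cannot be validated locally.
    context = ssl.create_default_context()
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    return urllib.request.urlopen(url, context=context)
```

make_soup would then become BeautifulSoup(open_url_insecure(url), 'html.parser'), leaving the rest of the loop unchanged.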

