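python - Error when trying to loop through web pages for data scraping
Problem description
I have written code that extracts the data from the first page, but I run into a problem when I try to extract the data from all of the pages.
This is my code for extracting the data from page "a":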
from bs4 import BeautifulSoup
import urllib
import urllib.request
import os
from string import ascii_lowercase

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

playerdatasaved = ""
soup = make_soup('https://www.basketball-reference.com/players/a/')
for record in soup.findAll("tr"):
    playerdata = ""
    for data in record.findAll(["th","td"]):
        playerdata = playerdata + "," + data.text
    playerdatasaved = playerdatasaved + "\n" + playerdata[1:]
print(playerdatasaved)

header = "player, from, to, position, height, weight, dob, year, colleges"+"\n"
file = open(os.path.expanduser("basketballstats.csv"),"wb")
file.write(bytes(header, encoding = "ascii", errors = "ignore"))
file.write(bytes(playerdatasaved[1:], encoding = "ascii", errors = "ignore"))
Now, to loop through the pages, my logic is this code:
from bs4 import BeautifulSoup
import urllib
import urllib.request
import os
from string import ascii_lowercase

def make_soup(url):
    thepage = urllib.request.urlopen(url)
    soupdata = BeautifulSoup(thepage, 'html.parser')
    return soupdata

playerdatasaved = ""
for letter in ascii_lowercase:
    soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
    for record in soup.findAll("tr"):
        playerdata = ""
        for data in record.findAll(["th","td"]):
            playerdata = playerdata + "," + data.text
        playerdatasaved = playerdatasaved + "\n" + playerdata[1:]

header = "player, from, to, position, height, weight, dob, year, colleges"+"\n"
file = open(os.path.expanduser("basketball.csv"),"wb")
file.write(bytes(header, encoding = "ascii", errors = "ignore"))
file.write(bytes(playerdatasaved[1:], encoding = "ascii", errors = "ignore"))
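However, this runs into an error related to the line: soup = make_soup("https://www.basketball-reference.com/players/" + letter + "/")
Solution
I tried running your code and got an SSL certificate error, CERTIFICATE_VERIFY_FAILED. That looks like a problem with the website you are trying to scrape rather than with your code.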
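To check whether you are hitting the same thing, you could catch the exception and look at its cause before changing the loop. This is only a rough sketch: describe_fetch_error and try_fetch are hypothetical helper names, not part of your code, and it assumes Python 3.7 or later (for ssl.SSLCertVerificationError).

```python
import ssl
import urllib.error
import urllib.request


def describe_fetch_error(exc):
    """Return a short label for the kind of failure `exc` represents."""
    # urlopen wraps low-level failures in URLError and keeps the cause in .reason
    reason = getattr(exc, "reason", exc)
    if isinstance(reason, ssl.SSLCertVerificationError):
        return "certificate verification failed"
    return "other error: " + str(reason)


def try_fetch(url):
    """Hypothetical helper: return (html_bytes, None) or (None, error_label)."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read(), None
    except urllib.error.URLError as exc:
        return None, describe_fetch_error(exc)
```

Calling try_fetch on one of the letter URLs and printing the second value should tell you whether the certificate check is what fails, or something else such as an HTTP error.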
Perhaps this Stack Overflow question can help clear things up: "SSL: certificate_verify_failed" error when scraping https://www.thenewboston.com/
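One approach that is often suggested for this error is to pass an explicit SSL context to urlopen while keeping verification switched on. The sketch below uses only the standard library. build_context, letter_urls and fetch are names I made up for illustration, and the cafile argument assumes you have a CA bundle file to point at (for example one shipped by your OS). Whether this actually resolves the error depends on your local certificate store, so treat it as something to try rather than a confirmed fix.

```python
import ssl
import urllib.request
from string import ascii_lowercase

BASE_URL = "https://www.basketball-reference.com/players/"


def build_context(cafile=None):
    """Create an SSL context that still verifies certificates.

    cafile is an optional path to a CA bundle (assumption: you have one).
    With cafile=None the system's default trust store is used.
    """
    return ssl.create_default_context(cafile=cafile)


def letter_urls():
    """Return the player index URL for each letter a-z."""
    return [BASE_URL + letter + "/" for letter in ascii_lowercase]


def fetch(url, context):
    """Download one page using the given SSL context."""
    with urllib.request.urlopen(url, context=context) as response:
        return response.read()
```

In your loop, make_soup could then call fetch(url, build_context()) and hand the returned bytes to BeautifulSoup instead of calling urlopen directly.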