python-3.x - Python crawler (bs4, urlopen) malfunction
Problem description
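I am playing with a web page that contains MTG cards, and I am trying to extract some information about them. The following program works fine: I can crawl one page and retrieve all the information I need: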
import re
from math import ceil
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup


def NumOfNextPages(TotalCardNum, CardsPerPage):
    pages = ceil(TotalCardNum / CardsPerPage)
    return pages


URL = "xyz.com"
NumOfCrawledPages = 0
UClient = uReq(URL)  # downloading the url
page_html = UClient.read()
UClient.close()
# html parsing
page_soup = soup(page_html, "html.parser")
# Finds all the cards that exist in the webpage and stores them as a bs4 object
cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})
CardsPerPage = len(cards)
# Selects the card name, Power and Toughness, and the type line of each card
for card in cards:
    card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")
    if len(card.div.contents) > 3:
        cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
    else:
        cardP_T = "Does not exist"
    cardType = card.contents[3].text
    print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")
# Try to extract the URL of the next page. There is not always a next page,
# in which case findAll returns an empty list and indexing it with [0]
# raises an IndexError.
try:
    URL_Next = "xyz.com/" + page_soup.findAll("li", {"class": "next"})[0].contents[0].get("href")
except IndexError:
    # IndexError means that there is no next page to crawl
    print("Crawling process completed! No more information to retrieve!")
else:
    print("The next URL is: " + URL_Next + "\n")
    NumOfCrawledPages += 1
finally:
    print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")
# We need the overall number of cards to work out how many pages to crawl.
# That information comes from a "div" tag with class "summary".
OverallCardInfo = (page_soup.find("div", {"class": "summary"})).text
# Raw string so that \d is passed to the regex engine unchanged
TotalCardNum = int(re.findall(r"\d+", OverallCardInfo)[2])
NumOfPages = NumOfNextPages(TotalCardNum, CardsPerPage)
With this, I can crawl the first page, which I supply manually, and extract the total number of pages I need to crawl as well as the next URL.
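The two derived values mentioned above (the page count and the next URL) can be sketched in isolation. This is only an illustration under stated assumptions: the helper names, the "https://" scheme and the "cards?page=N" paths are hypothetical and do not come from the original program, which builds the next URL by plain string concatenation. urljoin from the standard library is shown here as one possible alternative, because it resolves both relative and root-relative href values against the page they were found on.

```python
from math import ceil
from urllib.parse import urljoin


def number_of_pages(total_cards, cards_per_page):
    """Number of result pages needed to show total_cards items."""
    return ceil(total_cards / cards_per_page)


def absolute_next_url(current_url, href):
    """Resolve the href of the "next" link against the page it was found on."""
    return urljoin(current_url, href)


# Hypothetical values, for illustration only
print(number_of_pages(95, 30))  # 4
print(absolute_next_url("https://xyz.com/cards?page=1", "/cards?page=2"))
# https://xyz.com/cards?page=2
```

With concatenation, "xyz.com" + href and "xyz.com/" + href give different results depending on whether href starts with a slash; resolving against the current page URL avoids having to know that in advance.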
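Ultimately I want to give a starting point (a web page) and have the crawler move on to the other pages by itself. So I used the following for loop: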
for i in range(0, NumOfPages):
    # The number of items shown by the search option on xyz.com can
    # not be more than 10000
    if ((NumOfCrawledPages + 1) * CardsPerPage) >= 10000:
        print("Number of results provided can not exceed 10000!\nEnd of the crawling!")
        break
    if i == 0:
        Url = InitURL
    else:
        Url = URL_Next
    # opening up the connection and grabbing the page
    UClient = uReq(Url)  # downloading the url
    page_html = UClient.read()
    UClient.close()
    # html parsing
    page_soup = soup(page_html, "html.parser")
    # Finds all the cards that exist in the webpage and stores them as a bs4
    # object
    cards = page_soup.findAll("div", {"class": ["iso-item", "item-row-view"]})
    # Selects the card name, Power and Toughness, and the type line of each card
    for card in cards:
        card_name = card.div.div.strong.span.contents[3].contents[0].replace("\xa0 ", "")
        if len(card.div.contents) > 3:
            cardP_T = card.div.contents[3].contents[1].text.replace("\n", "").strip()
        else:
            cardP_T = "Does not exist"
        cardType = card.contents[3].text
        print(card_name + "\n" + cardP_T + "\n" + cardType + "\n")
    # Try to extract the URL of the next page. There is not always a next page,
    # in which case indexing the empty result list with [0] raises an IndexError.
    try:
        URL_Next = "xyz.com" + page_soup.findAll("li", {"class": "next"})[0].contents[0].get("href")
    except IndexError:
        # IndexError means that there is no next page to crawl
        print("Crawling process completed! No more information to retrieve!")
    else:
        print("The next URL is: " + URL_Next + "\n")
        NumOfCrawledPages += 1
        Url = URL_Next
    finally:
        print("Moving to page : " + str(NumOfCrawledPages + 1) + "\n")
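The second version, with the additional for loop, runs without errors, but the result is not what I expected. It returns the crawl results for the first page, which I entered manually, and does not move on to the other pages...
Why does this happen?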
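The expected output is something like:
Dragonspeaker Shaman P/T: 2/2 Creature - Human Barbarian Shaman
Dragonspeaker Shaman P/T: 2/2 Creature - Human Barbarian Shaman
Dragonstalker P/T: 3/3 Creature - Bird Soldier
The next URL is: xyz.com/......
Moving to page : 2
---------------------------------------------End of crawling for the first page
Dragonspeaker Shaman P/T: 2/2 Creature - Human Barbarian Shaman
Dragonspeaker Shaman P/T: 2/2 Creature - Human Barbarian Shaman
Dragonstalker P/T: 3/3 Creature - Bird Soldier
The next URL is: xyz.com/......
Moving to page : 3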
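After retrieving this information from the manually supplied web page, the crawler should continue to the next page, which is stored in the Url variable inside the for loop. Instead, it keeps crawling the same page again and again. The counter works fine, since it counts the number of crawled pages, but the Url variable does not seem to change its value.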
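One way to check the symptom described above is to separate the "follow the next link" loop from the download and parsing steps, and to record every URL that is visited. The sketch below is an assumption-laden illustration, not the original program: crawl, fetch and find_next_url are hypothetical names, and the two callables stand in for the urlopen download and the BeautifulSoup lookup of the "next" link. The loop stops when no next URL is returned, when a URL repeats, or when max_pages is reached.

```python
def crawl(start_url, fetch, find_next_url, max_pages=100):
    """Follow "next" links from start_url and return the URLs visited, in order."""
    visited = []
    url = start_url
    while url is not None and len(visited) < max_pages:
        if url in visited:
            # The "next" link points at a page already crawled: stop instead of looping
            print("Already crawled, stopping:", url)
            break
        visited.append(url)
        page_html = fetch(url)                 # stand-in for urlopen(url).read()
        url = find_next_url(page_html, url)    # stand-in for the "next" link lookup
    return visited


# Stand-in data instead of real HTTP requests: each page maps to its next URL
fake_site = {"p1": "p2", "p2": "p3", "p3": None}
print(crawl("p1", fetch=lambda u: fake_site[u], find_next_url=lambda html, u: html))
# ['p1', 'p2', 'p3']
```

If the list returned for the real site contains only the starting URL, the "next" link extracted from each page is resolving to a page that was already fetched, which narrows the problem down to how the next URL is built rather than to the loop counter.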
Solution