Why does my code only loop through the first web page with BeautifulSoup?

Problem Description

I recently learned about BeautifulSoup and have been playing around with it, testing it on different websites. Right now I'm trying to iterate over multiple pages rather than just the first one. I can append or write the information pulled from any specific page I want, but of course I'd like to automate that.

Here is my current code, which attempts to work through the fifth page. At the moment it only goes through the first web page and writes the same information to my Excel file five times. Inside my nested for loop I have some print statements just to check in the console that it's working before looking at the file.

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup
import unicodecsv as csv

f = open("on_sale_games.csv", "w", encoding='utf-8')
headers = "Game name, Original price, Final price, Percent off\n"
f.write(headers)

for i in range(5):
    my_url = 'https://store.steampowered.com/specials#p={}&tab=TopSellers'.format(i+1)

    uClient = uReq(my_url)  # open up the url and download the page.
    page_html = uClient.read()  # reading the html page and storing the info into page_html.
    uClient.close()  # closing the page.

    page_soup = soup(page_html, 'html.parser')  # html parsing

    containers = page_soup.findAll("a", {"class": "tab_item"})

    for container in containers:
        name_stuff = container.findAll("div", {"class": "tab_item_name"})
        name = name_stuff[0].text
        print("Game name:", name)

        original_price = container.findAll("div", {"class": "discount_original_price"})
        original = original_price[0].text
        print("Original price:", original)

        discounted_price = container.findAll("div", {"class": "discount_final_price"})
        final = discounted_price[0].text
        print("Discounted price:", final)

        discount_pct = container.findAll("div", {"class": "discount_pct"})
        pct = discount_pct[0].text
        print("Percent off:", pct)

        f.write(name.replace(':', '').replace("™", " ") + ',' + original + ',' + final + ',' + pct + '\n')

f.close()

Tags: python, beautifulsoup, export-to-csv

Solution

Inspecting the requests the browser makes, I noticed a background request that fetches the data and returns JSON, so you can work from there. The reason your loop never advances is that everything after # in a URL is a fragment: the browser handles it client-side and never sends it to the server, so every request fetched the same first page.
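A quick standard-library check illustrates this (a minimal sketch; the URL is just the one from your loop):

from urllib.parse import urlsplit

parts = urlsplit('https://store.steampowered.com/specials#p=2&tab=TopSellers')
print(parts.path)      # '/specials' -- all the server ever sees
print(parts.fragment)  # 'p=2&tab=TopSellers' -- kept client-side, never sent

The background request that actually pages through the data can be used like this: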

import json  # needed to decode the JSON response

for i in range(5):
    my_url = 'https://store.steampowered.com/contenthub/querypaginated/specials/NewReleases/render/?query=&start={}'.format(i*15)
    uClient = uReq(my_url)
    page_html = uClient.read()
    uClient.close()
    data = json.loads(page_html)["results_html"]  # the rendered markup is embedded in the JSON payload
    page_soup = soup(data, 'html.parser')
    # Rest of the code

This works like an API that returns 15 items per page, so the start offset goes 0, 15, 30, and so on.
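Putting the pieces together, a complete version of the scraper might look like the sketch below. The endpoint and the "results_html" field come from the observed background request, and the CSS class names come from the original code; Steam can change any of these at any time, so treat them as assumptions rather than a documented API.

import json
from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

with open("on_sale_games.csv", "w", encoding="utf-8") as f:
    f.write("Game name, Original price, Final price, Percent off\n")

    for i in range(5):
        # The endpoint serves 15 items per page, so the offset is i * 15.
        my_url = ('https://store.steampowered.com/contenthub/querypaginated/'
                  'specials/NewReleases/render/?query=&start={}'.format(i * 15))

        uClient = uReq(my_url)
        page_html = uClient.read()
        uClient.close()

        # The result markup is embedded as a string in the JSON payload.
        data = json.loads(page_html)["results_html"]
        page_soup = soup(data, 'html.parser')

        for container in page_soup.findAll("a", {"class": "tab_item"}):
            name = container.findAll("div", {"class": "tab_item_name"})[0].text
            original = container.findAll("div", {"class": "discount_original_price"})[0].text
            final = container.findAll("div", {"class": "discount_final_price"})[0].text
            pct = container.findAll("div", {"class": "discount_pct"})[0].text
            f.write(name.replace(':', '').replace('™', ' ') + ',' +
                    original + ',' + final + ',' + pct + '\n')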

