I have a problem moving from one page to the next using BeautifulSoup in Python

Problem description

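I run into a problem moving from one page to the next while scraping data. The code runs without errors, but the URL being visited should update from 1 to max_pages, and it does not. The URL looks like this: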

https://www.yelp.com/user_details_reviews_self?userid=_NpJZ0q8KVI-d2YLL_VpCA&rec_pagestart=00

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

url = input("Enter the URL : ")
max_pages = int(input("Enter the Maximum Number of Pages you want to Extract : "))

for i in range(1, max_pages+1):
    my_url = url[::-1].replace('1',str(i) ,1)[::-1]
    uClient = uReq(my_url)
    page_html = uClient.read()
    page_soup = soup(page_html, "html.parser")

Tags: python, python-3.x

Solution


The error is here:

my_url = url[::-1].replace('1',str(i) ,1)[::-1]

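You are trying to replace 1 with str(i), but there is no 1 anywhere in the URL https://www.yelp.com/user_details_reviews_self?userid=_NpJZ0q8KVI-d2YLL_VpCA&rec_pagestart=00, so nothing gets updated. str.replace does not raise an error when the substring is missing; it simply returns the string unchanged, which is why the code runs without complaint.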
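
A quick check with the URL from the question illustrates this. The loop below is a small sketch that reuses the question's expression and compares each result with the original URL:

```python
url = ("https://www.yelp.com/user_details_reviews_self"
       "?userid=_NpJZ0q8KVI-d2YLL_VpCA&rec_pagestart=00")

for i in range(1, 4):
    # Reverse the URL, replace the first '1' found (the last one in the
    # original string), then reverse it back.
    my_url = url[::-1].replace('1', str(i), 1)[::-1]
    # The URL contains no '1', so replace() leaves it untouched every time.
    print(i, my_url == url)
```

Every iteration prints True for the comparison, meaning the same page would be requested each time.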

In any case, I don't see a good general solution to your problem. If you let the user enter any address they like, you could get URLs such as:

https://www.url1.com?n=1&p=1

where p is the page number, and others such as:

https://www.url11.com?p=1&n=1
https://www.url111.com?n=1&p=1

where this time n is the page number.

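Good luck finding a way to change the page number automatically for all of these URLs: a plain character replacement cannot know which parameter holds the page number.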
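
If arbitrary URLs really had to be supported, one option would be to ask the user for the name of the page parameter as well and edit the query string by name instead of by character. The following is only a sketch under that assumption; set_query_param is a hypothetical helper, not something from the question or the answer, and it uses only the standard urllib.parse module.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def set_query_param(url, name, value):
    """Return url with the query parameter `name` set to `value`.

    Hypothetical helper: the caller must know which parameter
    holds the page number.
    """
    parts = urlsplit(url)
    pairs = parse_qsl(parts.query, keep_blank_values=True)
    new_pairs = []
    found = False
    for key, old_value in pairs:
        if key == name:
            new_pairs.append((key, str(value)))
            found = True
        else:
            new_pairs.append((key, old_value))
    if not found:
        # The parameter was absent, so append it.
        new_pairs.append((name, str(value)))
    return urlunsplit(parts._replace(query=urlencode(new_pairs)))

# p is the page number in the first URL, n in the second.
print(set_query_param("https://www.url1.com?n=1&p=1", "p", 3))
print(set_query_param("https://www.url11.com?p=1&n=1", "n", 3))
```

Only the named parameter changes, so the 1 characters in the host name and in the other parameter are left alone.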

If your parser is written specifically for Yelp, I would do something like this:

from urllib.request import urlopen as uReq
from bs4 import BeautifulSoup as soup

user_id = input("Enter the YELP user id : ")
max_pages = int(input("Enter the Maximum Number of Pages you want to Extract : "))

url_base = "https://www.yelp.com/user_details_reviews_self?userid={}".format(user_id)

for i in range(0, max_pages):
    # The offset advances by 10 per page: 00, 10, 20, ...
    page = "&rec_pagestart={:02d}".format(i*10)
    url = url_base + page
    print(url)
    #do parsing stuff

With max_pages set to 10, it generates 10 different page URLs:

Enter the YELP user id : _NpJZ0q8KVI-d2YLL_VpCA
Enter the Maximum Number of Pages you want to Extract : 10
https://www.yelp.com/user_details_reviews_self?userid=_NpJZ0q8KVI-d2YLL_VpCA&rec_pagestart=00
https://www.yelp.com/user_details_reviews_self?userid=_NpJZ0q8KVI-d2YLL_VpCA&rec_pagestart=10
https://www.yelp.com/user_details_reviews_self?userid=_NpJZ0q8KVI-d2YLL_VpCA&rec_pagestart=20
https://www.yelp.com/user_details_reviews_self?userid=_NpJZ0q8KVI-d2YLL_VpCA&rec_pagestart=30
https://www.yelp.com/user_details_reviews_self?userid=_NpJZ0q8KVI-d2YLL_VpCA&rec_pagestart=40
https://www.yelp.com/user_details_reviews_self?userid=_NpJZ0q8KVI-d2YLL_VpCA&rec_pagestart=50
https://www.yelp.com/user_details_reviews_self?userid=_NpJZ0q8KVI-d2YLL_VpCA&rec_pagestart=60
https://www.yelp.com/user_details_reviews_self?userid=_NpJZ0q8KVI-d2YLL_VpCA&rec_pagestart=70
https://www.yelp.com/user_details_reviews_self?userid=_NpJZ0q8KVI-d2YLL_VpCA&rec_pagestart=80
https://www.yelp.com/user_details_reviews_self?userid=_NpJZ0q8KVI-d2YLL_VpCA&rec_pagestart=90
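
As a variation, the query string can be assembled with urllib.parse.urlencode, which writes every parameter as name=value and escapes special characters. This is a minimal sketch, assuming the offset advances by 10 per page as in the loop above; build_page_urls and its page_size parameter are hypothetical names introduced for illustration.

```python
from urllib.parse import urlencode

BASE = "https://www.yelp.com/user_details_reviews_self"

def build_page_urls(user_id, max_pages, page_size=10):
    """Return the list of review-page URLs for one user (hypothetical helper)."""
    urls = []
    for i in range(max_pages):
        query = urlencode({
            "userid": user_id,
            # Two-digit offset, matching the 00 in the question's URL.
            "rec_pagestart": "{:02d}".format(i * page_size),
        })
        urls.append("{}?{}".format(BASE, query))
    return urls

for page_url in build_page_urls("_NpJZ0q8KVI-d2YLL_VpCA", 3):
    print(page_url)
```

Each URL in the returned list can then be passed to uReq and parsed with soup, as in the question's loop.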
