Scraping web pages with Python

Problem description

I am scraping all of the reviews from https://www.consumeraffairs.com/privacy/transunion.html

    page_list = []
    def pagination(soup):
        for i in range(0,32):
            domain = "https://www.consumeraffairs.com/privacy/transunion.html?page="+str(i)                        
            page_list.append(domain)
        return page_list
    pages = pagination(soup)

    print(pages)

How can I capture the reviews shown on each of these pages?

    import time
    comment_list = []
    def get_comments(urls):
        for url in urls:
            try:
                print(url)
                #comment = soup.find_all('div',{'class':'rvw-bd'})
                comment = soup.find_all('div',{'class':'rvw-bd'})             
                print(len(comment))
                for x in range(len(comment)):
                    comment_list.append(comment[x].p.text.strip())            
            except:
                continue
                time.sleep(30)
        return comment_list
    comments = get_comments(pages)

I used this code, but it only scrapes the first 10 reviews from the first page. How can I fix this?

Tags: python

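Solution

I think you are right to change the "page=" value in the URL, but from the code you posted, it looks like you never update the soup object to represent the content of each new page. Every iteration therefore searches the same first page again. I rewrote some of your code so that each page is fetched and parsed inside the loop: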

    from bs4 import BeautifulSoup
    import requests
    import time

    # Build the URL of every results page.
    base_url = "https://www.consumeraffairs.com/privacy/transunion.html?page="
    page_list = [base_url + str(i) for i in range(0, 32)]

    comment_list = []
    for page in page_list:
        try:
            # Fetch the current page; this is the step the original code skipped.
            content = requests.get(page, timeout=30).content
        except requests.RequestException as exc:
            # Back off after a failed request, then move on to the next page.
            print(page, exc)
            time.sleep(30)
            continue

        # Parse the page that was just downloaded.
        soup = BeautifulSoup(content, 'html.parser')
        comment = soup.find_all('div', {'class': 'rvw-bd'})
        print(len(comment))

        for div in comment:
            # Skip review blocks that have no <p> child.
            if div.p is not None:
                comment_list.append(div.p.text.strip())

    print(comment_list)
    print(len(comment_list))
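
If you want to check the idea without hitting the site, here is a minimal offline sketch of the same loop. It is not the code above: to keep it free of network access and third-party packages it swaps BeautifulSoup for the standard library's `html.parser`, and the names `ReviewParser`, `get_comments`, `fetch`, and `fake_pages` are hypothetical, made up for this illustration. The `rvw-bd` class is taken from the question; the sample HTML is invented. The parser is also a simplification (it takes the first `<p>` inside each review div and does not treat nested `rvw-bd` divs separately).

```python
from html.parser import HTMLParser


class ReviewParser(HTMLParser):
    """Collects the text of the first <p> inside each <div class="rvw-bd">."""

    def __init__(self):
        super().__init__()
        self.comments = []
        self._div_depth = 0    # nesting depth inside a review div (0 = outside)
        self._in_p = False     # currently inside the paragraph being captured
        self._took_p = False   # first <p> of this review already captured
        self._buf = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div":
            if self._div_depth or "rvw-bd" in classes:
                if self._div_depth == 0:
                    self._took_p = False   # a new review block starts here
                self._div_depth += 1
        elif tag == "p" and self._div_depth and not self._took_p:
            self._in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self.comments.append("".join(self._buf).strip())
            self._in_p = False
            self._took_p = True
        elif tag == "div" and self._div_depth:
            self._div_depth -= 1

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def get_comments(urls, fetch):
    """Fetch and parse every URL; `fetch` maps a URL to its HTML text."""
    comments = []
    for url in urls:
        parser = ReviewParser()    # fresh parser per page, like a fresh soup
        parser.feed(fetch(url))    # parse the page that was just fetched
        parser.close()
        comments.extend(parser.comments)
    return comments


# Invented stand-in pages so the sketch runs offline (assumption, not real data).
fake_pages = {
    "page=1": '<div class="rvw-bd"><p> First review </p><p>extra</p></div>',
    "page=2": '<div class="rvw-bd"><p>Second review</p></div>'
              '<div class="other"><p>ad</p></div>',
}

result = get_comments(["page=1", "page=2"], fake_pages.__getitem__)
print(result)
```

Passing the fetch step in as a function means you could later hand `get_comments` a small wrapper around `requests.get(url).text` instead of the dictionary lookup, while the per-page parse stays inside the loop.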

Let me know if this helps!

