
Problem Description

I'm trying to scrape pages with BeautifulSoup from this site - https://concreteplayground.com/auckland/events. I can extract everything from page 1, but when I want to move to the next page, I can't find any link to the next page to follow or parse. When I inspect the page while moving to page 2, I find content like this:

<a rel="nofollow" class="page-numbers" href="">2</a>

I'm not sure how to handle this type of web page, so it would be great if someone could help me with this. The next page's content is fetched and displayed at the same URL, and I'm not sure what's happening in the background. Thanks and regards.

Tags: python-3.x, web-scraping, beautifulsoup, screen-scraping

Solution


Sorry for my earlier rubbish answer; I had just discovered Selenium's click function, haha. Anyway, the page you want is Ajax-heavy and needs a different approach from traditional HTML scraping. See the following link to learn more about the kind of URLs you'll have to deal with: Handling Ajax. So basically, a script runs that paginates without changing the main URL.
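To make that concrete, here is a minimal sketch of what such pagination looks like from the scraper's side: every "page" is a request to the same ajax.php endpoint, and only the `paged` query parameter changes (the parameter names here are taken from the request URL used in the script further down):

```python
from urllib.parse import urlencode

# Each "page" is the same endpoint with a different 'paged' value;
# the URL shown in the browser's address bar never changes.
base = 'https://concreteplayground.com/ajax.php'
params = {
    'post_type': 'tribe_events',
    'place_type': 'event',
    'region': 'auckland',
    'sort': 'all',
    'action': 'directory_search',
    'user_lat': '',
    'user_lon': '',
}
for page in (1, 2, 3):
    print(base + '?' + urlencode({**params, 'paged': page}))
```

You can spot these requests yourself in the network tab of your browser's developer tools while clicking through the pages.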

Below is my attempt at producing the output you asked for. If anyone finds improvements to simplify it, I'd appreciate them.

#Import essentials
import requests
from bs4 import BeautifulSoup


#Not necessary, but always useful just in case
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/66.0.3359.181 Safari/537.36'}


#Read url, parse using BeautifulSoup, and dynamically find no of pages
temp_page = requests.get('https://concreteplayground.com/auckland/events', headers=headers)
soup = BeautifulSoup(temp_page.content, 'html.parser')
PgNos = len(soup.findAll('li', {'class': 'page'}))


#Now for the interesting part!

#Form the url to which requests are to be sent. This url is used to GET every
#json response, which is later parsed and printed. You can find this url in the
#network tab of your browser's developer tools (like Firebug)
for i in range(1, PgNos + 1):    #page numbers start at 1, so skip paged=0
    u = 'https://concreteplayground.com/ajax.php?post_type=tribe_events&place_type=event&region=auckland&sort=all&paged='
    r = str(i)
    l = '&action=directory_search&user_lat=&user_lon='
    url = u+r+l
    response = requests.get(url, headers=headers)
    data = response.json()

    #Now, iterate through the main body of the json to get what you want
    for each in data['results']:

        event_name = each['post_title']
        event_excerpt = each['post_excerpt']

        #There's a little HTML in here, so use BeautifulSoup to parse that.
        rdata = each['info']
        raw = BeautifulSoup(rdata, 'lxml')
        date = raw.p.text
        rawvenue = raw.findAll('span', {'itemprop':'name'})
        venuename = rawvenue[0].text
        venueaddress = rawvenue[0].meta['content']

        #Obviously, you can also write to a file in lieu of the below. 
        print('Event : ' + event_name + '\n' + 'Excerpt : ' + event_excerpt +
              '\n' + 'Date : ' + date + '\n' + 'Venue : ' + venuename +
              '\n' + 'Address : ' + venueaddress + '\n\n')
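For reference, the inner parsing step can be tried in isolation. The fragment below is a hypothetical stand-in for the HTML carried in each result's 'info' field (the real markup on the site will differ in detail); it shows how the date, venue name, and address fall out:

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the HTML inside a result's 'info' field;
# the real fragment from the site will differ.
rdata = ('<p>Sat 1 Jun, 7pm</p>'
         '<span itemprop="name">Town Hall'
         '<meta itemprop="address" content="301 Queen St, Auckland"/></span>')

raw = BeautifulSoup(rdata, 'html.parser')
date = raw.p.text                                   # first <p> holds the date
rawvenue = raw.findAll('span', {'itemprop': 'name'})
venuename = rawvenue[0].text                        # visible text of the span
venueaddress = rawvenue[0].meta['content']          # address sits in a <meta>
print(date + ' | ' + venuename + ' | ' + venueaddress)
```

The `<meta>` tag contributes no visible text, so `.text` on the span yields only the venue name, while the address has to be read from the tag's `content` attribute.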

These resources were also useful while rebuilding my answer: GET and POST explanation and JSON iteration

