python - BeautifulSoup scraping packtpub HTML element returns empty item
Problem description
I have been trying to scrape the following
<li class="ais-pagination--item ais-pagination--item__next">
tag from https://www.packtpub.com/all-products/all-books, which represents the next page button,
using the following code:
import requests
import re
import bs4
headers = {"User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36"}
start_url="https://www.packtpub.com/all-products/all-books"
req = requests.get(start_url, headers=headers)
soup = bs4.BeautifulSoup(req.content, "lxml")
next_page_li = soup.find("li", class_="ais-pagination--item ais-pagination--item__next")
I get None in return, please help.
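Solution
The reason is that the site loads its content dynamically: once the page has loaded, it sends an XHR POST request to a back-end API, and the data comes back as JSON. The HTML that requests receives therefore does not contain the element you are looking for.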
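So, to verify whether a site's content is dynamic or static, view the page source and search for the element you need. If it is there, the content is static. If it is not, you are dealing with dynamic content and have to track down where it is loaded from.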
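Note that viewing the page source is not the same as inspecting the element directly, because Inspect Element shows the element in both cases. You will notice, however, that any dynamic content has an "event" flag next to it.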
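The static-versus-dynamic check described above can also be done in code rather than by eye. The following is only a minimal sketch: the helper names ClassFinder and in_page_source are hypothetical, and it uses the standard-library HTML parser on a string of raw HTML so that it has no dependencies.

```python
from html.parser import HTMLParser


class ClassFinder(HTMLParser):
    """Records whether any <tag> element carries all of the wanted CSS classes."""

    def __init__(self, tag, wanted_classes):
        super().__init__()
        self.tag = tag
        self.wanted = set(wanted_classes)
        self.found = False

    def handle_starttag(self, tag, attrs):
        if tag != self.tag:
            return
        # The class attribute may be missing or valueless, hence the `or ""`.
        classes = set((dict(attrs).get("class") or "").split())
        if self.wanted <= classes:
            self.found = True


def in_page_source(html, tag, wanted_classes):
    """Return True if the raw HTML (the "page source") contains the element."""
    finder = ClassFinder(tag, wanted_classes)
    finder.feed(html)
    return finder.found
```

To use it on the real page, pass in the text of the response, for example in_page_source(requests.get(start_url, headers=headers).text, "li", ["ais-pagination--item__next"]). A False result suggests the element is added later by JavaScript, which matches the None returned by soup.find in the question.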
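Below is a direct call to the API. If you want to loop over all the results, use hitsPerPage, which can return up to 1000 hits per call, and then iterate over the pages. Since the site holds 5660 results, that takes 6 calls (5660 / 1000, rounded up), so loop over range(6), because the site's pagination starts from 0.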
import requests


def main(url):
    with requests.Session() as req:
        params = {
            "x-algolia-agent": "Algolia for vanilla JavaScript (lite) 3.27.0;instantsearch.js 2.10.2;Magento2 integration (1.13.3);JS Helper 2.26.0",
            "x-algolia-application-id": "VIVZZXFQG1",
            "x-algolia-api-key": "MjBiNTIwZWM0MmE4MWQ0MDQwNzIxY2Q5ZTQ0ZjE0ZDNkMzI4ZDVkZWJiYzcxNGI1NjA2MWYzNmUyNTQxY2ViZnRhZ0ZpbHRlcnM9"
        }
        data = {"requests": [{"indexName": "store_prod_us_products_packt_rank_asc", "params": "query=&hitsPerPage=24&maxValuesPerFacet=10&page=0&ruleContexts=%5B%22%22%2C%22magento-category-7164%22%5D&clickAnalytics=true&facets=%5B%22product_type_filter%22%2C%22released%22%2C%22language%22%2C%22concept%22%2C%22tool%22%2C%22vendor%22%2C%22categories.level0%22%2C%22categories.level1%22%2C%22categories.level2%22%2C%22categories.level0%22%2C%22categories.level1%22%2C%22categories.level2%22%5D&tagFilters=&facetFilters=%5B%5B%22released%3AAvailable%22%5D%2C%5B%22categories.level1%3AAll%20Products%20%2F%2F%2F%20All%20Books%22%5D%5D&numericFilters=%5B%22visibility_catalog%3D1%22%5D"}, {"indexName": "store_prod_us_products_packt_rank_asc", "params": "query=&hitsPerPage=1&maxValuesPerFacet=10&page=0&ruleContexts=%5B%22%22%2C%22magento-category-7164%22%5D&clickAnalytics=false&attributesToRetrieve=%5B%5D&attributesToHighlight=%5B%5D&attributesToSnippet=%5B%5D&tagFilters=&analytics=false&facets=released&numericFilters=%5B%22visibility_catalog%3D1%22%5D&facetFilters=%5B%5B%22categories.level1%3AAll%20Products%20%2F%2F%2F%20All%20Books%22%5D%5D"}, {
            "indexName": "store_prod_us_products_packt_rank_asc", "params": "query=&hitsPerPage=1&maxValuesPerFacet=10&page=0&ruleContexts=%5B%22%22%2C%22magento-category-7164%22%5D&clickAnalytics=false&attributesToRetrieve=%5B%5D&attributesToHighlight=%5B%5D&attributesToSnippet=%5B%5D&tagFilters=&analytics=false&facets=%5B%22categories.level0%22%2C%22categories.level1%22%5D&numericFilters=%5B%22visibility_catalog%3D1%22%5D&facetFilters=%5B%5B%22released%3AAvailable%22%5D%2C%5B%22categories.level0%3AAll%20Products%22%5D%5D"}, {"indexName": "store_prod_us_products_packt_rank_asc", "params": "query=&hitsPerPage=1&maxValuesPerFacet=10&page=0&ruleContexts=%5B%22%22%2C%22magento-category-7164%22%5D&clickAnalytics=false&attributesToRetrieve=%5B%5D&attributesToHighlight=%5B%5D&attributesToSnippet=%5B%5D&tagFilters=&analytics=false&facets=%5B%22categories.level0%22%5D&numericFilters=%5B%22visibility_catalog%3D1%22%5D&facetFilters=%5B%5B%22released%3AAvailable%22%5D%5D"}]}
        r = req.post(url, params=params, json=data).json()
        for x in r['results'][0]['hits']:
            print(x['name'])


if __name__ == "__main__":
    main('https://vivzzxfqg1-dsn.algolia.net/1/indexes/*/queries')
Output:
C# 9 and .NET 5 – Modern Cross-Platform Development - Fifth Edition
40 Algorithms Every Programmer Should Know
Machine Learning for Algorithmic Trading - Second Edition
Learning C# by Developing Games with Unity 2020 - Fifth Edition
Solutions Architect's Handbook
Python Machine Learning - Third Edition
The Python Workshop
Kubernetes and Docker - An Enterprise Guide
Django 3 By Example - Third Edition
Full-Stack React, TypeScript, and Node
Responsive Web Design with HTML5 and CSS - Third Edition
Learn Python Programming - Second Edition
CompTIA Security+: SY0-601 Certification Guide - Second Edition
Hands-On Quantum Information Processing with Python
Node.js Design Patterns - Third Edition
ASP.NET Core 5 for Beginners
Learning Tableau 2020 - Fourth Edition
AWS Penetration Testing
Python 3 Object-Oriented Programming - Third Edition
Hands-On Unity 2020 Game Development
Software Architecture with C# 9 and .NET 5 - Second Edition
Mastering Blockchain - Third Edition
The Docker Workshop
Data Engineering with Python
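The pagination described in the answer (hitsPerPage of 1000, pages numbered from 0, range(6) for 5660 results) might be sketched as follows. This is an illustration under assumptions, not the answer's own code: page_count, build_payload and iter_payloads are hypothetical helper names, the total of 5660 is the figure quoted above and may have changed, and the params string is deliberately reduced to query, hitsPerPage and page, so the facet filters from the original payload would still need to be added to restrict results to "All Books".

```python
import math
from urllib.parse import urlencode

TOTAL_HITS = 5660      # figure quoted in the answer; may be out of date
HITS_PER_PAGE = 1000   # per-call maximum mentioned in the answer


def page_count(total_hits, hits_per_page):
    """Number of calls needed to cover every hit (pages are 0-based)."""
    return math.ceil(total_hits / hits_per_page)


def build_payload(index_name, page, hits_per_page=HITS_PER_PAGE):
    """Build a JSON body for one page; filters are omitted (assumption)."""
    params = urlencode({"query": "", "hitsPerPage": hits_per_page, "page": page})
    return {"requests": [{"indexName": index_name, "params": params}]}


def iter_payloads(index_name, total_hits=TOTAL_HITS, hits_per_page=HITS_PER_PAGE):
    """Yield one payload per page: pages 0 to 5 for the default values."""
    for page in range(page_count(total_hits, hits_per_page)):
        yield build_payload(index_name, page, hits_per_page)
```

Each payload would then be sent the same way as in main above, for example req.post(url, params=params, json=payload).json(), and the names read from r['results'][0]['hits'] on every page.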