python - Scraping a URL in Python
Question
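I'm trying to get the links to Adidas shoes from a search page, but I can't figure out what I'm doing wrong.
I tried tags = soup.find("section", {"class": "productList"}).findAll("a")
but it doesn't work :(
I also tried printing every href, but the link I need isn't among them :(
So I expect this to be printed: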
https://www.tennisexpress.com/adidas-mens-adizero-ubersonic-50-yrs-ltd-tennis-shoes-off-white-and-signal-blue-62138
from bs4 import BeautifulSoup
import requests
url = "https://www.tennisexpress.com/search.cfm?searchKeyword=BB6892"
# Getting the webpage, creating a Response object.
response = requests.get(url)
# Extracting the source code of the page.
data = response.text
# Passing the source code to BeautifulSoup to create a BeautifulSoup object for it.
soup = BeautifulSoup(data, 'lxml')
# Extracting all the <a> tags into a list.
tags = soup.find("section", {"class": "productList"}).findAll("a")
# Extracting URLs from the attribute href in the <a> tags.
for tag in tags:
    print(tag.get('href'))
This is the HTML for that link:
<section class="productList">
  <article class="productListing">
    <a class="product" href="//www.tennisexpress.com/adidas-mens-adizero-ubersonic-50-yrs-ltd-tennis-shoes-off-white-and-signal-blue-62138" title="Men`s Adizero Ubersonic 50 Yrs LTD Tennis Shoes Off White and Signal Blue" onmousedown="return nxt_repo.product_x('38698770','1');">
      <span class="sale">SALE</span>
      <span class="image">
        <img src="//www.tennisexpress.com/prodimages/78091-DEFAULT-m.jpg" alt="Men`s Adizero Ubersonic 50 Yrs LTD Tennis Shoes Off White and Signal Blue">
      </span>
      <span class="brand"> Adidas </span>
      <span class="name"> Men`s Adizero Ubersonic 50 Yrs LTD Tennis Shoes Off White and Signal Blue </span>
      <span class="pricing">
        <strong class="listPrice">$140.00</strong>
        <strong class="percentOff">0% OFF</strong>
        <strong class="salePrice">$139.95</strong>
      </span>
      <br>
    </a>
  </article>
</section>
Solution
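If you inspect the Network tab in Chrome DevTools, you will notice that the products you searched for are fetched by a separate request to https://tennisexpress-com.ecomm-nav.com/search.js. The response it returns is a mess, so I would not go down that route.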
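Your code does not see the products because that request is made by JavaScript (running in your browser) after the initial page load. Neither urllib nor requests on its own can render that content. You can, however, do it with Requests-HTML, which supports JavaScript (it uses Chromium behind the scenes).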
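To make the difference between the raw response and the rendered page concrete, here is a minimal sketch that uses only the standard library. ProductLinkParser and product_links are hypothetical helpers, and the raw string is only an assumption about what the server sends before JavaScript runs (an empty container); the rendered string is shortened from the markup in the question. It also shows that the protocol-relative href ("//www...") needs to be resolved against the page URL.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ProductLinkParser(HTMLParser):
    """Collect href values of <a class="product"> tags (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        href = attrs.get("href")
        if "product" in classes and href:
            # urljoin turns a protocol-relative "//host/path" into a full URL.
            self.links.append(urljoin(self.base_url, href))


def product_links(html, base_url):
    """Return absolute product links found in an HTML string."""
    parser = ProductLinkParser(base_url)
    parser.feed(html)
    return parser.links


BASE = "https://www.tennisexpress.com/search.cfm?searchKeyword=BB6892"

# Markup as the browser shows it after JavaScript has run (shortened).
rendered = (
    '<section class="productList"><article class="productListing">'
    '<a class="product" href="//www.tennisexpress.com/adidas-mens-adizero-'
    'ubersonic-50-yrs-ltd-tennis-shoes-off-white-and-signal-blue-62138">'
    '</a></article></section>'
)
# ASSUMPTION: the raw response has the container but no products in it.
raw = '<section class="productList"></section>'

print(product_links(rendered, BASE))  # one absolute https:// link
print(product_links(raw, BASE))       # nothing to extract
```

Running the same check on response.text from the question's code is a quick way to confirm whether the links are present before any JavaScript executes.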
Code:
from itertools import chain
from requests_html import HTMLSession
session = HTMLSession()
url = 'https://www.tennisexpress.com/search.cfm?searchKeyword=adidas+boost'
r = session.get(url)
r.html.render()
links = list(chain(*[prod.absolute_links for prod in r.html.find('.product')]))
I used chain to join all the sets of absolute links together and built a list from the result.
>>> links
['https://www.tennisexpress.com/adidas-mens-barricade-2018-boost-tennis-shoes-black-and-night-metallic-62110',
'https://www.tennisexpress.com/adidas-mens-barricade-2018-boost-tennis-shoes-white-and-matte-silver-62109',
...
'https://www.tennisexpress.com/adidas-mens-supernova-glide-7-running-shoes-black-and-white-41636',
'https://www.tennisexpress.com/adidas-womens-adizero-boston-6-running-shoes-solar-yellow-and-midnight-gray-45268']
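For readers unfamiliar with chain, the flattening step can be tried in isolation. This is only an illustration: the per_product sets below are made-up stand-ins for the sets that absolute_links returns, and the example.com URLs are placeholders. Because sets are unordered and the same link may appear under more than one element, deduplicating and sorting the result gives a stable list.

```python
from itertools import chain

# Made-up stand-ins for the set of links found under each '.product' element.
per_product = [
    {"https://example.com/shoe-a"},
    {"https://example.com/shoe-b", "https://example.com/shoe-a"},
    set(),
]

# chain.from_iterable(x) is equivalent to chain(*x) but avoids unpacking.
flat = list(chain.from_iterable(per_product))

# Remove duplicates and fix the order so repeated runs give the same list.
unique = sorted(set(flat))

print(len(flat))
print(unique)
```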
Don't forget to install it with pip install requests-html.