python - Scraping Google featured snippets with Python
Question
https://www.google.com/search?q=LAPTOP+ACER+I3/4/1TB/8GEN+full+specs
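For example, I want to search for this product and scrape its specs directly from the featured snippet. How can I get everything inside that box?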
Solution
You can do this with:
- requests-html
- beautifulsoup
- Google Direct Answer Box API
Using requests-html (example in an online IDE):
from requests_html import HTMLSession
session = HTMLSession()
response = session.get('https://www.google.com/search?q=Acer+Aspire+3+A315-53+Laptop++specs')
# '.Crs1tb' is the CSS class of the snippet table at the time of writing; it may change
specs_no_split = response.html.find('.Crs1tb', first=True).text
# split on newlines and keep every second line (the title, then each value)
specs_split = specs_no_split.split('\n')[0::2]
# join the list into one string (no separator, so the values run together)
specs_filtered = ''.join(specs_split)
print(specs_no_split)
print(specs_split)
print(specs_filtered)
# output:
'''
Acer Aspire 3 A315-53 (NX.H38SI.002) Laptop (Core i3 8th Gen/4 GB/1 TB/Windows 10) Specifications
display type
LED
display size
15.6 Inches (39.62 cm)
display resolution
1920 x 1080 Pixels
display touchscreen
No
display features
Full HD LED Backlit IPS Display
ще 1 рядок
['Acer Aspire 3 A315-53 (NX.H38SI.002) Laptop (Core i3 8th Gen/4 GB/1 TB/Windows 10) Specifications', 'LED', '15.6 Inches (39.62 cm)', '1920 x 1080 Pixels', 'No', 'Full HD LED Backlit IPS Display']
Acer Aspire 3 A315-53 (NX.H38SI.002) Laptop (Core i3 8th Gen/4 GB/1 TB/Windows 10) SpecificationsLED15.6 Inches (39.62 cm)1920 x 1080 PixelsNoFull HD LED Backlit IPS Display
'''
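Taking every second line keeps the values but drops their labels. If the label/value pairs are what you need, the alternating lines can be paired up instead. This is only a minimal sketch, assuming the snippet text has the shape shown in the output above (a title line, then alternating label and value lines); `parse_specs` is a hypothetical helper, not part of requests-html:

```python
def parse_specs(text):
    """Split snippet text into a title and a {label: value} dict.

    Assumes the first line is the title and the remaining lines alternate
    label, value. A trailing unpaired line (such as a "1 more row" link)
    is ignored.
    """
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    if not lines:
        return '', {}
    title, rest = lines[0], lines[1:]
    # zip stops at the shorter sequence, so an unpaired last line is dropped
    specs = dict(zip(rest[0::2], rest[1::2]))
    return title, specs


# Sample text copied from the output above
sample = '\n'.join([
    'Acer Aspire 3 A315-53 (NX.H38SI.002) Laptop (Core i3 8th Gen/4 GB/1 TB/Windows 10) Specifications',
    'display type', 'LED',
    'display size', '15.6 Inches (39.62 cm)',
    'display resolution', '1920 x 1080 Pixels',
    'display touchscreen', 'No',
    'display features', 'Full HD LED Backlit IPS Display',
    'ще 1 рядок',
])
title, specs = parse_specs(sample)
print(specs['display size'])  # 15.6 Inches (39.62 cm)
```

With a live response you would pass `specs_no_split` instead of `sample`. If Google changes the layout of the box, the alternating-lines assumption no longer holds and the pairs will be wrong.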
Using BeautifulSoup (example in an online IDE):
from bs4 import BeautifulSoup
import requests
import lxml  # not used directly; BeautifulSoup needs it installed for the 'lxml' parser
headers = {
'User-agent':
"Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('https://www.google.com/search?q=Acer+Aspire+3+A315-53+Laptop+specs', headers=headers).text
soup = BeautifulSoup(html, 'lxml')
specifications = soup.find('div', class_='wDYxhc').text
print(specifications)
# Output:
'''
Acer Aspire 3 A315-53-35ZY 15.6" Notebook, Intel i3, 4GB Memory, Windows 10Operating System. Windows 10.Hard Drive. HDD.Memory. 4GB RAM DDR4.Graphics card. Intel UHD Graphics 620.Processor. 2.2 Ghz Intel i3 processor.Display. 15.6-inch 1920 x 1080 resolution.
'''
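The search URLs in both examples are written by hand, which makes it easy to mistype them or leave characters such as `/` unescaped. One way to avoid that, shown here as a sketch, is to build the URL from the query text with the standard library. `google_search_url` is a hypothetical helper name, and passing `hl=en` (Google's interface-language parameter) is my own assumption about how to get English labels rather than localized ones:

```python
from urllib.parse import urlencode


def google_search_url(query, **extra):
    """Build a Google search URL with the query percent-encoded."""
    params = {'q': query, **extra}
    # urlencode turns spaces into '+' and escapes reserved characters like '/'
    return 'https://www.google.com/search?' + urlencode(params)


url = google_search_url('Acer Aspire 3 A315-53 Laptop specs', hl='en')
print(url)
# https://www.google.com/search?q=Acer+Aspire+3+A315-53+Laptop+specs&hl=en
```

`requests.get` also accepts a `params=` dictionary and does the same encoding for you, so `requests.get('https://www.google.com/search', params={'q': query}, headers=headers)` is an equivalent option.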
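Alternatively, you can use the Google Direct Answer Box API from SerpApi. It's a paid API with a free plan.
The difference is that you only have to think about which data you want to extract and which query parameters to use, rather than figuring out how to bypass blocks or how to pull the content out of the page.
Code to integrate (example in an online IDE):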
from serpapi import GoogleSearch
import os
params = {
"api_key": os.getenv("API_KEY"),
"engine": "google",
"q": "Acer Aspire 3 A315-53 Laptop specs",
"google_domain": "google.com",
}
search = GoogleSearch(params)
results = search.get_dict()
specs = results['answer_box']['contents']['formatted']
print(specs)
# output:
'''
[{'display_type': 'display size', 'led': '15.6 Inches (39.62 cm)'}, {'display_type': 'display resolution', 'led': '1920 x 1080 Pixels'}, {'display_type': 'display touchscreen', 'led': 'No'}, {'display_type': 'display features', 'led': 'Full HD LED Backlit IPS Display'}]
'''
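In the output above, each row is a two-item dictionary whose keys (`display_type`, `led`) appear to come from the first table row, so the keys themselves are not very useful. A small sketch of flattening the rows into a single label-to-value dictionary follows. It assumes every row holds exactly two values in label, value order; `get_formatted` and `rows_to_dict` are hypothetical helpers, not part of the `serpapi` package, and the key path is simply the one used in the code above:

```python
def get_formatted(results):
    """Return the formatted rows, or an empty list if there is no answer box."""
    return results.get('answer_box', {}).get('contents', {}).get('formatted', [])


def rows_to_dict(rows):
    """Turn [{'k1': label, 'k2': value}, ...] into {label: value}."""
    specs = {}
    for row in rows:
        values = list(row.values())
        if len(values) != 2:
            continue  # skip rows that don't fit the two-column assumption
        label, value = values
        specs[label] = value
    return specs


# Sample data copied from the output above, wrapped in the same key path
results = {'answer_box': {'contents': {'formatted': [
    {'display_type': 'display size', 'led': '15.6 Inches (39.62 cm)'},
    {'display_type': 'display resolution', 'led': '1920 x 1080 Pixels'},
    {'display_type': 'display touchscreen', 'led': 'No'},
    {'display_type': 'display features', 'led': 'Full HD LED Backlit IPS Display'},
]}}}

specs = rows_to_dict(get_formatted(results))
print(specs['display resolution'])  # 1920 x 1080 Pixels
```

Using `.get` with defaults avoids a `KeyError` for queries where no answer box is returned.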
Disclaimer: I work for SerpApi.