Scraping Google featured snippets with Python

Problem description

https://www.google.com/search?q=LAPTOP+ACER+I3/4/1TB/8GEN+full+specs

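For example: I want to search for this product and scrape its specifications directly from the featured snippet. How can I get everything that is inside that box?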

Tags: python, django, database, beautifulsoup, python-requests

Solution


You can do this with any of the following:

  • requests-html
  • BeautifulSoup
  • Google Direct Answer Box API

Using requests-html (with an example in an online IDE):

from requests_html import HTMLSession

session = HTMLSession()
response = session.get('https://www.google.com/search?q=Acer+Aspire+3+A315-53+Laptop++specs')

specs_no_split = response.html.find('.Crs1tb', first=True).text
# splitting by a new line and grabbing every second value
specs_split = response.html.find('.Crs1tb', first=True).text.split('\n')[0::2]
# converting from list to a string
specs_filtered = ''.join(specs_split)
print(specs_no_split)
print(specs_split)
print(specs_filtered)

# output:
'''
Acer Aspire 3 A315-53 (NX.H38SI.002) Laptop (Core i3 8th Gen/4 GB/1 TB/Windows 10) Specifications
display type
LED
display size
15.6 Inches (39.62 cm)
display resolution
1920 x 1080 Pixels
display touchscreen
No
display features
Full HD LED Backlit IPS Display
ще 1 рядок

['Acer Aspire 3 A315-53 (NX.H38SI.002) Laptop (Core i3 8th Gen/4 GB/1 TB/Windows 10) Specifications', 'LED', '15.6 Inches (39.62 cm)', '1920 x 1080 Pixels', 'No', 'Full HD LED Backlit IPS Display']

Acer Aspire 3 A315-53 (NX.H38SI.002) Laptop (Core i3 8th Gen/4 GB/1 TB/Windows 10) SpecificationsLED15.6 Inches (39.62 cm)1920 x 1080 PixelsNoFull HD LED Backlit IPS Display
'''
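
The `[0::2]` slice above keeps the values but discards the labels. If both are needed, a minimal sketch is to pair them up into a dictionary. This assumes the snippet text has the layout shown in the output above (a title line followed by alternating label and value lines); `parse_spec_text` is a hypothetical helper name, not part of requests-html.

```python
def parse_spec_text(text):
    """Split snippet text into a title and a {label: value} dict."""
    lines = [line.strip() for line in text.split('\n') if line.strip()]
    if not lines:
        return None, {}
    title, rest = lines[0], lines[1:]
    # Assumption: labels sit at even positions and values at odd ones.
    # zip() stops at the shorter sequence, so an unpaired trailing line
    # (such as the localized "1 more row" marker) is dropped.
    specs = dict(zip(rest[0::2], rest[1::2]))
    return title, specs


# Sample text copied from the output above (shortened).
sample = '\n'.join([
    'Acer Aspire 3 A315-53 (NX.H38SI.002) Laptop (Core i3 8th Gen/4 GB/1 TB/Windows 10) Specifications',
    'display type', 'LED',
    'display size', '15.6 Inches (39.62 cm)',
    'display resolution', '1920 x 1080 Pixels',
    'ще 1 рядок',
])
title, specs = parse_spec_text(sample)
print(specs)
```

If Google renders the box differently (for example, with extra lines between rows), the pairing will shift, so it is worth checking the result against the raw text.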

Using BeautifulSoup (with an example in an online IDE):

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

# Send a browser-like User-Agent along with the search request
html = requests.get('https://www.google.com/search?q=Acer+Aspire+3+A315-53+Laptop+specs', headers=headers).text
soup = BeautifulSoup(html, 'lxml')

# 'wDYxhc' is a class name generated by Google and may change over time
specifications = soup.find('div', class_='wDYxhc').text
print(specifications)

# Output:
'''
Acer Aspire 3 A315-53-35ZY 15.6" Notebook, Intel i3, 4GB Memory, Windows 10Operating System. Windows 10.Hard Drive. HDD.Memory. 4GB RAM DDR4.Graphics card. Intel UHD Graphics 620.Processor. 2.2 Ghz Intel i3 processor.Display. 15.6-inch 1920 x 1080 resolution.
'''
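
The search URLs in the examples above are written by hand, with spaces encoded as `+`. As a small sketch, the same URL can be built from a plain query string with the standard library's `urllib.parse.urlencode`, which reduces the risk of a malformed URL. `build_search_url` is a hypothetical helper name used only for illustration.

```python
from urllib.parse import urlencode


def build_search_url(query):
    """Return a Google search URL for a plain-text query."""
    # urlencode() escapes reserved characters and turns spaces into '+'.
    return 'https://www.google.com/search?' + urlencode({'q': query})


url = build_search_url('Acer Aspire 3 A315-53 Laptop specs')
print(url)
```

The resulting string can be passed to `session.get(...)` or `requests.get(...)` in place of the hard-coded URL.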

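Alternatively, you can use the Google Direct Answer Box API from SerpApi. It is a paid API with a free plan.

The difference is that you only need to think about which data you want to extract and which query parameters to use, rather than working out how to get around blocking or how to locate the content in the HTML.

Code to integrate (with an example in an online IDE):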

from serpapi import GoogleSearch
import os

params = {
  "api_key": os.getenv("API_KEY"),
  "engine": "google",
  "q": "Acer Aspire 3 A315-53 Laptop specs",
  "google_domain": "google.com",
}

search = GoogleSearch(params)
results = search.get_dict()

specs = results['answer_box']['contents']['formatted']
print(specs)

# output:
'''
[{'display_type': 'display size', 'led': '15.6 Inches (39.62 cm)'}, {'display_type': 'display resolution', 'led': '1920 x 1080 Pixels'}, {'display_type': 'display touchscreen', 'led': 'No'}, {'display_type': 'display features', 'led': 'Full HD LED Backlit IPS Display'}]
'''
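
Judging from the output above, the first table row appears to have been used as the dictionary keys (`display_type`, `led`), so every row carries its label and value under those two keys. A minimal sketch for flattening such rows into a single `{label: value}` dictionary follows. It assumes the result has the shape shown above and that the answer box may be missing for some queries; `answer_box_specs` is a hypothetical helper, not part of the SerpApi client.

```python
def answer_box_specs(results):
    """Return {label: value} from a result dict, or {} if no answer box."""
    # Chained .get() calls avoid a KeyError when a level is missing.
    rows = (
        results.get('answer_box', {})
        .get('contents', {})
        .get('formatted', [])
    )
    specs = {}
    for row in rows:
        # Each row holds two entries: the label first, then the value.
        values = list(row.values())
        if len(values) >= 2:
            specs[values[0]] = values[1]
    return specs


# Sample data copied from the output above (shortened).
sample_results = {
    'answer_box': {
        'contents': {
            'formatted': [
                {'display_type': 'display size', 'led': '15.6 Inches (39.62 cm)'},
                {'display_type': 'display touchscreen', 'led': 'No'},
            ]
        }
    }
}
print(answer_box_specs(sample_results))
```

Note that the row consumed as the header (display type: LED) is not recovered by this sketch.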

Disclaimer: I work for SerpApi.

