Unable to Scrape All Links from a Google Search Page with Web Scraping

Problem Description

I'm a beginner at web scraping. Recently I tried to scrape domain names from the search results on a Google SERP.

To do this, I use Requests, Beautiful Soup, and regex: fetch the page, parse the anchor tags, look at each href, and extract the domain with a regex match.

When I do this, some links are missing from the output. The problem seems to be that Requests is not fetching the full page: I compared the fetched text with the page source in Chrome, and the tags missing from my output are present in the Chrome source. I'm wondering what the reason could be!

import requests
from bs4 import BeautifulSoup
import re

url = "https://www.google.com/search?q=glass+beads+india"
r = requests.get(url)
page = r.text
soup = BeautifulSoup(page, 'lxml')

link_list = []
for tag in soup.find_all('a'):
    href = tag.get('href', '')  # some <a> tags have no href attribute
    if re.search('http', href):
        try:
            link = re.search(r'https://.+\.com', href).group(0)
            link_list.append(link)
        except AttributeError:  # re.search() found no match and returned None
            pass

link_list = list(set(link_list))  # deduplicate

link_list2 = []

for link in link_list:
    if not re.search('google.com', link):  # drop Google's own links
        link_list2.append(link)

print(link_list2)

Tags: python, web-scraping, beautifulsoup, python-requests, python-requests-html

Solution


This is probably because you didn't specify a user-agent, aka requests headers, so Google blocks the request and you get a page with an error message or something similar. Check what your user-agent is.
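For instance, you can print the default user-agent that requests sends; requests.utils.default_headers() is part of the library, though the exact version string depends on your installation:

import requests

# requests identifies itself as "python-requests/x.y.z" by default,
# which is trivial for Google to detect and block
print(requests.utils.default_headers()['User-Agent'])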

Pass a user-agent:

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

html = requests.get('YOUR URL', headers=headers)
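To confirm the header helps, here is a quick sanity check (a sketch; the status code and the "unusual traffic" phrase are assumptions about Google's block page, whose wording and behavior vary):

r = requests.get('https://www.google.com/search',
                 params={'q': 'glass beads india'}, headers=headers)
print(r.status_code)                # expect 200 rather than 429
print('unusual traffic' in r.text)  # True suggests you still hit a block page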

Use the SelectorGadget Chrome extension to find all the links by grabbing their CSS selectors (see the CSS selectors reference):

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text

Match the domain and subdomain, excluding the "www." part:

>>> re.findall(r'^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', link)
['etsy.com']
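As an alternative to the regex, the standard library's urllib.parse.urlparse yields the same result without a pattern (a sketch, not part of the original answer; str.removeprefix requires Python 3.9+):

from urllib.parse import urlparse

# parse the URL, take the network location, and strip a leading "www."
netloc = urlparse('https://www.etsy.com/market/india_glass_beads').netloc
print(netloc.removeprefix('www.'))  # etsy.com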

Code and full example in the online IDE:

import requests, lxml, re
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    'q': 'glass beads india',  # search query
    'hl': 'en',                # language
    'num': '100'               # number of results
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text

    # domain extraction regex: https://stackoverflow.com/a/25703406/15164646
    domain_name = ''.join(re.findall(r'^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', link))

    print(link)
    print(displayed_link)
    print(domain_name)
    print('---------------')


'''
https://www.etsy.com/market/india_glass_beads
https://www.etsy.com › market › india_glass_beads
etsy.com
---------------
https://www.etsy.com/market/indian_glass_beads
https://www.etsy.com › market › indian_glass_beads
etsy.com
---------------
https://www.amazon.com/glass-indian-beads/s?k=glass+indian+beads
https://www.amazon.com › glass-indian-beads › k=glass...
amazon.com
---------------
'''
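Note that class names such as .tF2Cxc, .yuRUbf, and .TbwUpd.NJjxre are machine-generated and change whenever Google updates its markup; when they stop matching, select_one returns None and the loop above raises TypeError. A defensive variant (a sketch reusing soup from the example above, not part of the original answer) skips partial results instead:

for result in soup.select('.tF2Cxc'):
    anchor = result.select_one('.yuRUbf a')
    displayed = result.select_one('.TbwUpd.NJjxre')
    if anchor is None or displayed is None:
        continue  # markup didn't match the expected layout; skip this result
    print(anchor['href'], displayed.text)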

Alternatively, you can achieve the same thing with the Google Organic Results API from SerpApi. It's a paid API with a free plan.

The main difference is that you only need to iterate over the structured JSON and extract the data.

Code to integrate:

from serpapi import GoogleSearch
import os
import re

params = {
    "api_key": os.getenv("API_KEY"),  # SerpApi key, read from an environment variable
    "engine": "google",
    "q": "glass beads india",
    "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    displayed_link = result['displayed_link']
    domain_name = ''.join(re.findall(r'^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', link))

    print(link)
    print(displayed_link)
    print(domain_name)
    print('---------------')


'''
https://www.etsy.com/market/india_glass_beads
https://www.etsy.com › market › india_glass_beads
etsy.com
---------------
https://www.etsy.com/market/indian_glass_beads
https://www.etsy.com › market › indian_glass_beads
etsy.com
---------------
https://www.amazon.com/glass-indian-beads/s?k=glass+indian+beads
https://www.amazon.com › glass-indian-beads › k=glass...
amazon.com
---------------
'''

Disclaimer: I work for SerpApi.

