python - Unable to scrape all links from a Google search page with web scraping
Problem Description
I'm a beginner at web scraping. Recently I tried to scrape domain names from the search results of a Google SERP.
To do this, I used Requests, Beautiful Soup, and regex: fetch the page, parse the anchor tags, look at each href, and extract the domain with a regex match.
When I do this, some links are missing from the output. The problem seems to be that Requests is not fetching the full page, because I compared the fetched text with the page source in Chrome (the missing tags are present in the Chrome source but absent from the fetched text). I'd like to know what the cause might be!
import requests
from bs4 import BeautifulSoup
import re

url = "https://www.google.com/search?q=glass+beads+india"
r = requests.get(url)
page = r.text
soup = BeautifulSoup(page, 'lxml')

i = 0
link_list = []
for tag in soup.find_all('a'):
    i += 1
    href = tag['href']
    if re.search('http', href):
        try:
            link = re.search(r'https://.+\.com', href).group(0)
            link_list.append(link)
        except:
            pass

link_list = list(set(link_list))
link_list2 = []
for link in link_list:
    if not re.search('google.com', link):
        link_list2.append(link)
print(link_list2)
Solution
This is most likely happening because no user-agent was specified in the requests headers, so Google blocks the request and you get back a page with an error message or something similar. Check what your user-agent is.
Pass a user-agent:
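As a quick check, you can print the user-agent that requests sends when you give it no headers (a minimal sketch; the exact version number depends on your installed requests release):

```python
import requests

# With no custom headers, requests identifies itself as
# "python-requests/<version>", which Google can easily detect and block.
default_ua = requests.utils.default_user_agent()
print(default_ua)  # e.g. python-requests/2.31.0
```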
headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}
html = requests.get('YOUR URL', headers=headers)
Use the SelectorGadget Chrome extension to find CSS selectors for all the links (CSS selectors reference):
# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text
To match the domain and subdomain while excluding the "www." part:
>>> re.findall(r'^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', link)
['etsy.com']
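If you'd rather avoid the regex, the standard library's urllib.parse can do the same extraction (a sketch, assuming the links carry a scheme, as Google result links do; the helper name domain_of is my own):

```python
from urllib.parse import urlparse

def domain_of(link):
    # netloc is the host part of the URL, e.g. "www.etsy.com"
    netloc = urlparse(link).netloc
    # strip a leading "www." to match the regex's behaviour
    return netloc[4:] if netloc.startswith('www.') else netloc

print(domain_of('https://www.etsy.com/market/india_glass_beads'))  # etsy.com
```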
Code and full example in the online IDE:
import requests, lxml, re
from bs4 import BeautifulSoup

headers = {
    "User-Agent":
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3538.102 Safari/537.36 Edge/18.19582"
}

params = {
    'q': 'glass beads india',  # search query
    'hl': 'en',                # language
    'num': '100'               # number of results
}

html = requests.get('https://www.google.com/search', headers=headers, params=params)
soup = BeautifulSoup(html.text, 'lxml')

# container with all needed data
for result in soup.select('.tF2Cxc'):
    link = result.select_one('.yuRUbf a')['href']
    displayed_link = result.select_one('.TbwUpd.NJjxre').text

    # https://stackoverflow.com/a/25703406/15164646
    domain_name = ''.join(re.findall(r'^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', link))

    print(link)
    print(displayed_link)
    print(domain_name)
    print('---------------')
'''
https://www.etsy.com/market/india_glass_beads
https://www.etsy.com › market › india_glass_beads
etsy.com
---------------
https://www.etsy.com/market/indian_glass_beads
https://www.etsy.com › market › indian_glass_beads
etsy.com
---------------
https://www.amazon.com/glass-indian-beads/s?k=glass+indian+beads
https://www.amazon.com › glass-indian-beads › k=glass...
amazon.com
---------------
'''
Alternatively, you can achieve the same thing with the Google Organic Results API from SerpApi. It's a paid API with a free plan.
The main difference is that you only need to iterate over structured JSON and extract the data, rather than parse HTML yourself.
Code to integrate:
from serpapi import GoogleSearch
import os, re

params = {
    "api_key": os.getenv("API_KEY"),  # environment variable holding your API key
    "engine": "google",
    "q": "glass beads india",
    "hl": "en",
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['organic_results']:
    link = result['link']
    displayed_link = result['displayed_link']
    domain_name = ''.join(re.findall(r'^(?:https?:\/\/)?(?:[^@\/\n]+@)?(?:www\.)?([^:\/?\n]+)', link))

    print(link)
    print(displayed_link)
    print(domain_name)
    print('---------------')
'''
https://www.etsy.com/market/india_glass_beads
https://www.etsy.com › market › india_glass_beads
etsy.com
---------------
https://www.etsy.com/market/indian_glass_beads
https://www.etsy.com › market › indian_glass_beads
etsy.com
---------------
https://www.amazon.com/glass-indian-beads/s?k=glass+indian+beads
https://www.amazon.com › glass-indian-beads › k=glass...
amazon.com
---------------
'''
Disclaimer: I work for SerpApi.