Python: scrape links from Google results

Problem description

Is there any way to scrape only those Google result links that contain a specific word in the link, using BeautifulSoup or Selenium?

import requests 
from bs4 import BeautifulSoup 
import csv 

URL = "https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups"
r = requests.get(URL) 

soup = BeautifulSoup(r.content, 'html5lib') 

I want to extract the links that contain group links.

Tags: python, beautifulsoup

Solution


I'm not sure exactly what you want to do, but if you want to extract the Facebook links from the returned content, you can check whether facebook.com appears in the URL:

import requests 
from bs4 import BeautifulSoup 
import csv 
URL = "https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups" 
r = requests.get(URL) 
soup = BeautifulSoup(r.text, 'html5lib')
for link in soup.find_all('a', href=True): 
    if 'facebook.com' in link.get('href'):
        print(link.get('href'))
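
Note that the no-JavaScript results page Google serves to a plain requests client often wraps outbound result links as /url?q=<target>&sa=..., so the raw href may not be the Facebook URL itself. Below is a minimal sketch of unwrapping such links with urllib.parse; the /url?q= wrapping is an observation of Google's markup and may vary:

from urllib.parse import urlparse, parse_qs

def unwrap_google_href(href):
    # Google's results page for simple HTTP clients often wraps outbound
    # links as /url?q=<target>&sa=...; pull the real target from the q param.
    parsed = urlparse(href)
    if parsed.path == '/url':
        params = parse_qs(parsed.query)
        if 'q' in params:
            return params['q'][0]
    return href

for link in soup.find_all('a', href=True):
    href = unwrap_google_href(link['href'])
    if 'facebook.com' in href:
        print(href)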

Update: There is another workaround. What you need to do is set a legitimate user agent, so add headers to emulate a real browser:

# This is a standard user-agent of a Chrome browser running on Windows 10
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'
}

Example:

from bs4 import BeautifulSoup 
import requests 
URL = 'https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups'
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36'}
resp = requests.get(URL, headers=headers).text 
soup = BeautifulSoup(resp, 'html.parser')
for link in soup.find_all('a', href=True): 
    if 'facebook.com' in link.get('href'):
        print(link.get('href'))
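
If you only want the group links the question asks about, you can filter the href further and write the matches to a CSV file. This is a minimal sketch; the '/groups/' check is an assumption about how Facebook structures group URLs:

import csv

group_links = []
for link in soup.find_all('a', href=True):
    href = link['href']
    # Keep only Facebook links that look like group pages; the '/groups/'
    # path segment is an assumption about Facebook's URL scheme.
    if 'facebook.com' in href and '/groups/' in href:
        group_links.append(href)

# Write one link per row, using the csv module the question already imports.
with open('facebook_group_links.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerows([href] for href in group_links)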

In addition, you can add another set of headers to pose as a legitimate browser. Add more headers like this:

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/71.0.3578.98 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en-US,en;q=0.5',
    'Accept-Encoding': 'gzip',
    'DNT': '1',  # Do Not Track request header
    'Connection': 'close'
}
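
The question also mentions Selenium. If the request-based approach keeps getting blocked, driving a real browser is another option. Below is a minimal sketch using Selenium 4, assuming Chrome and a matching ChromeDriver are installed; the filtering logic is the same as above:

from selenium import webdriver
from selenium.webdriver.common.by import By

URL = 'https://www.google.co.in/search?q=site%3Afacebook.com+friends+groups&oq=site%3Afacebook.com+friends+groups'

options = webdriver.ChromeOptions()
options.add_argument('--headless')  # run without opening a browser window

driver = webdriver.Chrome(options=options)
try:
    driver.get(URL)
    for a in driver.find_elements(By.CSS_SELECTOR, 'a[href]'):
        # get_attribute('href') returns the absolute URL as the browser resolves it.
        href = a.get_attribute('href')
        if href and 'facebook.com' in href:
            print(href)
finally:
    driver.quit()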
