Scraping Google search results page data with Python

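Problem description

I want to scrape email addresses from a search results query. But when I use the CSS selector method "select" to access the class and print the result, it always shows an empty list. How can I access the .r class or class="g"?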

    import requests
    from bs4 import BeautifulSoup
    
    url = "https://www.google.com/search?sxsrf=ACYBGNQA4leQETe0psVZPu7daLWbdsc9Ow%3A1579194494737&ei=fpggXpvRLMakwQKkqpSICg&q=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&oq=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&gs_l=psy-ab.12...0.0..7407...0.0..0.0.0.......0......gws-wiz.82okhpdJLYg&ved=0ahUKEwibiI_3zYjnAhVGUlAKHSQVBaEQ4dUDCAs"
    response = requests.get(url)
    soup = BeautifulSoup(response.text, "html.parser")
    test = soup.select('.r')
    print(test)

Tags: python, web-scraping, beautifulsoup, request, python-requests

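Solution


Your program is correct, but Google appears to serve different markup to the default python-requests User-Agent, so the .r elements are missing from the response. To get the expected response from Google, specify a browser-like User-Agent header: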

    import requests
    from bs4 import BeautifulSoup

    url = "https://www.google.com/search?sxsrf=ACYBGNQA4leQETe0psVZPu7daLWbdsc9Ow%3A1579194494737&ei=fpggXpvRLMakwQKkqpSICg&q=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&oq=%22computer+science+%22%22usa%22+%22%40yahoo.com%22&gs_l=psy-ab.12...0.0..7407...0.0..0.0.0.......0......gws-wiz.82okhpdJLYg&ved=0ahUKEwibiI_3zYjnAhVGUlAKHSQVBaEQ4dUDCAs"

    # Browser-like User-Agent so Google returns the same markup a browser would get
    headers = {'User-Agent':'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:72.0) Gecko/20100101 Firefox/72.0'}

    response = requests.get(url, headers=headers)  # <-- specify custom header
    soup = BeautifulSoup(response.text, "html.parser")
    test = soup.select('.r')  # all elements with class "r"
    print(test)

Prints:

    [<div class="r"><a href="https://www.yahoo.com/news/11-course-complete-computer-science-171322233.html" onmousedown="return rwt(this,'','','','1','AOvVaw2wM4TUxc_4V7s9GjeWTNAG','','2ahUKEwjt17Kk-YjnAhW2R0EAHcnsC3QQFjAAegQIAxAB','','',event)"><div class="TbwUpd"><img alt="https://...
    ...
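
Since the original goal was to collect email addresses, one possible next step is to run a regular expression over the text of the page. The following is only a minimal sketch, not part of the answer above: the helper name extract_emails, the pattern EMAIL_RE, and the sample text are all hypothetical, and the pattern covers common address shapes rather than every valid address. It uses only the standard library so it can be tried without a network request.

```python
import re

# Simple pattern for common email shapes; an assumption for illustration,
# not a full RFC 5322 validator.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails(text):
    """Return unique email addresses found in text, lowercased, in first-seen order."""
    seen = set()
    emails = []
    for match in EMAIL_RE.findall(text):
        email = match.lower()
        if email not in seen:
            seen.add(email)
            emails.append(email)
    return emails


# Hypothetical stand-in for the text of a results page; with the code above
# this would be something like soup.get_text(" ").
sample_text = (
    "Computer science, USA. Contact jane.doe@yahoo.com or "
    "John_Smith@yahoo.com. Alternate: jane.doe@yahoo.com"
)

if __name__ == "__main__":
    print(extract_emails(sample_text))
```

With the scraping code from the answer, the same helper could be called as extract_emails(soup.get_text(" ")), so that addresses appearing in result snippets are picked up regardless of which class names Google currently uses.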
