python - Parsing Google Scholar results with Python and BeautifulSoup
Problem description
Given a typical keyword search in Google Scholar (see screenshot), I want to get a dictionary containing the title and URL of each publication appearing on the page (e.g. results = {'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells', 'url': 'https://www.nature.com/articles/338427a0'}).
To retrieve the results page from Google Scholar, I am using the following code:
from urllib import FancyURLopener, quote_plus
from bs4 import BeautifulSoup
class AppURLOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'
openurl = AppURLOpener().open
query = "Vicia faba"
url = 'https://scholar.google.com/scholar?q=' + quote_plus(query) + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
#print url
content = openurl(url).read()
page = BeautifulSoup(content, 'lxml')
print page
This code correctly returns the results page as (very ugly) HTML. However, I have not been able to progress beyond this point, because I am not very familiar with BeautifulSoup and could not figure out how to use it to parse the results page and retrieve the data.
Note that the issue is with parsing the results page and extracting data from it, not with Google Scholar itself, since the above code retrieves the results page correctly.
Could anyone please give a few hints? Thanks in advance!
Solution
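Inspecting the page content shows that each search result is wrapped in an h3 tag with the attribute class="gs_rt". You can use BeautifulSoup to extract just these tags, then get the title and URL from the <a> tag inside each entry. Write each title/URL pair to a dictionary and store the dictionaries in a list: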
import requests
from bs4 import BeautifulSoup
query = "Vicia%20faba"
url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')
results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})
Output:
[{'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells',
'url': 'https://www.nature.com/articles/338427a0'},
{'title': 'Hydrogen peroxide is involved in abscisic acid-induced stomatal closure in Vicia faba',
'url': 'http://www.plantphysiol.org/content/126/4/1438.short'},
...]
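The loop above assumes every h3 with class="gs_rt" contains an <a> tag. If an entry has no link (assumed here to be possible, for example for citation-only results), entry.a is None and the loop raises an error. The following is a minimal sketch of a more defensive version. The function name extract_results and the SAMPLE_HTML fragment are hypothetical, made up for illustration. The fragment only imitates the structure described above and is not a real Google Scholar response. It uses the built-in html.parser so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment imitating the structure described above;
# it is NOT a real Google Scholar response.
SAMPLE_HTML = """
<div>
  <h3 class="gs_rt"><a href="https://example.org/paper-1">First paper</a></h3>
  <h3 class="gs_rt">Entry without a link</h3>
  <h3 class="other"><a href="https://example.org/ignored">Ignored</a></h3>
</div>
"""


def extract_results(html):
    """Return a list of {'title', 'url'} dicts for each linked gs_rt entry."""
    page = BeautifulSoup(html, "html.parser")
    results = []
    for entry in page.find_all("h3", class_="gs_rt"):
        # Look for an <a> that actually carries an href attribute.
        link = entry.find("a", href=True)
        if link is None:
            # Skip entries that have no link instead of raising an error.
            continue
        results.append({"title": link.get_text(strip=True), "url": link["href"]})
    return results


results = extract_results(SAMPLE_HTML)
print(results)
```

Skipping link-less entries is one possible choice; you could instead keep them with "url" set to None if you need every title.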
Note: I used requests instead of urllib, because my urllib would not load FancyURLopener. The BeautifulSoup syntax should be the same, however, regardless of how you obtain the page content.
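The answer hard-codes the query as "Vicia%20faba", whereas the question encoded it with quote_plus. As a sketch, assuming Python 3, the query string can be built with urllib.parse.urlencode, which escapes spaces and special characters for you. The helper name build_scholar_url is hypothetical, and only the q and hl parameters from the original URL are kept for brevity.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/scholar"


def build_scholar_url(query, language="en"):
    """Build a search URL with the query safely percent-encoded."""
    # urlencode uses quote_plus by default, so spaces become '+'.
    params = {"q": query, "hl": language}
    return BASE_URL + "?" + urlencode(params)


url = build_scholar_url("Vicia faba")
print(url)
```

The resulting URL can then be passed to requests.get(url) as in the answer. Alternatively, requests can encode the parameters itself via requests.get(BASE_URL, params={...}).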