首页 > 解决方案 > Parsing Google Scholar results with Python and BeautifulSoup

问题描述

Given a typical keyword search in Google Scholar (see screenshot), I want to get a dictionary containing the title and url of each publication appearing on the page (eg. results = {'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells', 'url': 'https://www.nature.com/articles/338427a0'}.

enter image description here

To retrieve the results page from Google Scholar, I am using the following code:

from urllib import FancyURLopener, quote_plus
from bs4 import BeautifulSoup

class AppURLOpener(FancyURLopener):
    version = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.152 Safari/537.36'

openurl = AppURLOpener().open
query = "Vicia faba"
url = 'https://scholar.google.com/scholar?q=' + quote_plus(query) + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
#print url
content = openurl(url).read()
page = BeautifulSoup(content, 'lxml')
print page

This code correctly returns the results page, in (very ugly) HTML format. However, I have not been been able to progress beyond this point, as I could not figure out how to use BeautifulSoup (to which I am not too much familiarized) to parse the results page and retrieve the data.

Notice that the issue is with the parsing of and extracting of data from the results page, not with Google Scholar itself, since the results page is correctly retrieved by the above code.

Could anyone please give a few hints? Thanks in advance!

标签: pythonbeautifulsoupgoogle-scholar

解决方案


检查页面内容显示搜索结果被包装在一个h3标签中,带有属性class="gs_rt"。您可以使用 BeautifulSoup 仅提取这些标签,然后从<a>每个条目内的标签中获取标题和 URL。将每个标题/ URL 写入字典,并存储在字典列表中:

import requests
from bs4 import BeautifulSoup

query = "Vicia%20faba"
url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'

content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')
results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})

输出:

[{'title': 'Cytosolic calcium regulates ion channels in the plasma membrane of Vicia faba guard cells',
  'url': 'https://www.nature.com/articles/338427a0'},
 {'title': 'Hydrogen peroxide is involved in abscisic acid-induced stomatal closure in Vicia faba',
  'url': 'http://www.plantphysiol.org/content/126/4/1438.short'},
 ...]

注意:我使用requests而不是urllib,因为我urllib不会加载FancyURLopener. 但是 BeautifulSoup 的语法应该是一样的,不管你如何获取页面内容。


推荐阅读