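python - Scraping and parsing citation info from Google Scholar search results
Question
I have a list of about 20,000 article titles, and I want to get their citation counts from Google Scholar. I am new to the BeautifulSoup library. I have this code: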
import requests
from bs4 import BeautifulSoup
query = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
         'Uncoupling conformational states from activity in an allosteric enzyme',
         'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
         'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
         'Primary Prevention of CVD',
         'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
         'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
         'We Know Who Likes Us, but Not Who Competes Against Us']
url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
content = requests.get(url).text
page = BeautifulSoup(content, 'lxml')
results = []
for entry in page.find_all("h3", attrs={"class": "gs_rt"}):
    results.append({"title": entry.a.text, "url": entry.a['href']})
But it only returns the title and the URL. I don't know how to get the citation information from another tag. Please help.
Answer
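You need to loop over the list. You can use a Session for efficiency. The code below is for bs4 4.7.1, which supports the :contains pseudo-class, used here to find the citation count. It looks like you can drop the h3 type selector from the CSS selector and just use the class before the a, i.e. .gs_rt a. If you don't have 4.7.1, you can use [title=Cite] + a to select the citation count instead.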
import requests
from bs4 import BeautifulSoup as bs
queries = ['Role for migratory wild birds in the global spread of avian influenza H5N8',
           'Uncoupling conformational states from activity in an allosteric enzyme',
           'Technological Analysis of the World’s Earliest Shamanic Costume: A Multi-Scalar, Experimental Study of a Red Deer Headdress from the Early Holocene Site of Star Carr, North Yorkshire, UK',
           'Oxidative potential of PM 2.5 during Atlanta rush hour: Measurements of in-vehicle dithiothreitol (DTT) activity',
           'Primary Prevention of CVD',
           'Growth and Deposition of Au Nanoclusters on Polymer-wrapped Graphene and Their Oxygen Reduction Activity',
           'Relations of Preschoolers Visual-Motor and Object Manipulation Skills With Executive Function and Social Behavior',
           'We Know Who Likes Us, but Not Who Competes Against Us']
with requests.Session() as s:
    for query in queries:
        url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
        r = s.get(url)
        soup = bs(r.content, 'lxml')  # or 'html.parser'
        title = soup.select_one('h3.gs_rt a').text if soup.select_one('h3.gs_rt a') is not None else 'No title'
        link = soup.select_one('h3.gs_rt a')['href'] if title != 'No title' else 'No link'
        citations = soup.select_one('a:contains("Cited by")').text if soup.select_one('a:contains("Cited by")') is not None else 'No citation count'
        print(title, link, citations)
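The loop above builds the URL by plain string concatenation, which can go wrong if a title contains characters such as & or #. A minimal sketch of one way to percent-encode the query using only the standard library follows; the helper name build_scholar_url is hypothetical and not part of the original answer, and the parameter set simply mirrors the URL used above.

```python
from urllib.parse import parse_qs, urlencode, urlsplit

BASE_URL = "https://scholar.google.com/scholar"


def build_scholar_url(title):
    """Return a search URL with the title percent-encoded (hypothetical helper)."""
    params = {"q": title, "ie": "UTF-8", "oe": "UTF-8", "hl": "en", "btnG": "Search"}
    return BASE_URL + "?" + urlencode(params)


# A made-up title containing characters that are unsafe in a raw query string.
example_title = "Growth & Deposition: a #1 example"
example_url = build_scholar_url(example_title)
print(example_url)
```

Alternatively, requests can do the encoding itself if the same dictionary is passed as s.get(BASE_URL, params=params).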
Alternative for bs4 versions older than 4.7.1, using the [title=Cite] + a selector:
with requests.Session() as s:
    for query in queries:
        url = 'https://scholar.google.com/scholar?q=' + query + '&ie=UTF-8&oe=UTF-8&hl=en&btnG=Search'
        r = s.get(url)
        soup = bs(r.content, 'lxml')  # or 'html.parser'
        # Call select_one once per selector, then test the result for None
        title = soup.select_one('.gs_rt a')
        if title is None:
            title = 'No title'
            link = 'No link'
        else:
            link = title['href']
            title = title.text
        citations = soup.select_one('[title=Cite] + a')
        if citations is None:
            citations = 'No citation count'
        else:
            citations = citations.text
        print(title, link, citations)
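To try the two selectors without sending requests to Google Scholar, they can be run against a static snippet. The sketch below is only an illustration: SAMPLE_HTML is a hypothetical, heavily simplified stand-in for one result block (the real markup may differ and can change), and parse_first_result is a made-up helper name. It also pulls the number out of the "Cited by N" text with a regular expression.

```python
import re

from bs4 import BeautifulSoup

# Hypothetical, simplified stand-in for a single Google Scholar result block.
SAMPLE_HTML = """
<div class="gs_ri">
  <h3 class="gs_rt"><a href="https://example.org/paper">Example paper title</a></h3>
  <div class="gs_fl">
    <a href="#" title="Cite">Cite</a>
    <a href="/scholar?cites=123">Cited by 42</a>
    <a href="/scholar?q=related">Related articles</a>
  </div>
</div>
"""


def parse_first_result(html):
    """Extract title, link and citation count from the first result, if present."""
    soup = BeautifulSoup(html, "html.parser")
    title_link = soup.select_one(".gs_rt a")
    cited_by = soup.select_one("[title=Cite] + a")  # element right after "Cite"
    count = None
    if cited_by is not None:
        match = re.search(r"\d+", cited_by.text)
        if match is not None:
            count = int(match.group())
    return {
        "title": None if title_link is None else title_link.text,
        "link": None if title_link is None else title_link["href"],
        "citations": count,
    }


result = parse_first_result(SAMPLE_HTML)
print(result)
```

The regular expression is a small safeguard: if the link that follows "Cite" is not a "Cited by" link, no digits are found and the count stays None instead of holding unrelated link text.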
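Thanks to comments from @facelessuser, the bottom version has been rewritten. The top version is left for comparison:
It would probably be more efficient not to call select_one twice in the single-line if statement. While the pattern building is cached, the returned tag is not. I personally would set the variable to whatever select_one returns and then, only if the variable is None, change it to No link or No title etc. It isn't as compact, but it will be more efficient.
[...] always check if tag is None: and not just if tag:. With selectors this isn't a big deal, as they will only return tags, but if you ever do something like for x in tag.descendants: you get text nodes (strings) as well as tags, and an empty string will evaluate as false even though it is a valid node. In that case, it is safest to check for None.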
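Both points from the comments can be illustrated with a short sketch. The helper name select_text is hypothetical; it calls select_one once and compares the result with None. The empty NavigableString is constructed by hand purely to show that a valid text node can be falsy.

```python
from bs4 import BeautifulSoup, NavigableString


def select_text(soup, selector, default):
    """Run the selector once and fall back to default only when nothing matched."""
    node = soup.select_one(selector)
    return default if node is None else node.text


soup = BeautifulSoup('<h3 class="gs_rt"><a href="#">A title</a></h3>', "html.parser")
found = select_text(soup, ".gs_rt a", "No title")
missing = select_text(soup, "[title=Cite] + a", "No citation count")

# An empty text node is a real node, yet it is falsy like any empty string.
empty_node = NavigableString("")
is_falsy = not empty_node
is_not_none = empty_node is not None

print(found, missing, is_falsy, is_not_none)
```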