Web Scraping Google Scholar Author Profiles

Problem Description

I used the scholarly package to search by author name and parse the author names generated in the 3 questions, in order to fetch each professor's profile, including all of their citation information. For professors without a Google Scholar profile, I was able to load NA values into the final data frame. However, there is one problem: the citation information for about 8 authors does not match what is shown on the Google Scholar site, because the scholarly package is retrieving the citation information of other authors with the same name. I believe I can fix this by using the search_author_id function, but the question is how to obtain the author_ids of all the professors in the first place.

Any help would be appreciated.

Cheers, Yash

Tags: web-scraping, web-crawler, google-scholar

Solution


This solution might not be achievable with the scholarly package; beautifulsoup will be used instead.

The author IDs are located in the href attribute of the <a> tag under the author name. Here is how we can get them:

# assumes that the request has already been sent and soup has been created

link = soup.select_one('.gs_ai_name a')['href']

# https://stackoverflow.com/a/6633693/15164646
_id = link

# the marker to split on: the ID is everything after "user=" in the href
id_identifier = 'user='

# splitting the string into 3 parts: before "user=", "user=" itself, and after it
before_keyword, keyword, after_keyword = _id.partition(id_identifier)

# after_keyword is everything AFTER "user=", i.e. the author ID
author_id = after_keyword

# RlANTZEAAAAJ
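
Since the original question mentions scholarly's search_author_id, here is a minimal sketch of feeding the scraped ID back into that function (assuming a recent version of the scholarly package where search_author_id and fill behave as documented):

from scholarly import scholarly

# minimal sketch, assuming a recent scholarly version:
# look up the profile by the scraped ID instead of by the (ambiguous) name
author = scholarly.search_author_id(author_id)   # e.g. 'RlANTZEAAAAJ'
author = scholarly.fill(author)                  # fill in the citation details
print(author.get('name'), author.get('citedby'))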

Code that goes beyond your question (the full example is in the online IDE under the bs4 folder -> get_profiles.py):

from bs4 import BeautifulSoup
import requests, lxml, os

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

proxies = {
  'http': os.getenv('HTTP_PROXY')
}

html = requests.get('https://scholar.google.com/citations?view_op=view_org&hl=en&org=9834965952280547731', headers=headers, proxies=proxies).text
soup = BeautifulSoup(html, 'lxml')

for result in soup.select('.gs_ai_chpr'):
  name = result.select_one('.gs_ai_name a').text
  link = result.select_one('.gs_ai_name a')['href']

  # https://stackoverflow.com/a/6633693/15164646
  # same "user=" partition trick as above
  _id = link
  id_identifier = 'user='
  before_keyword, keyword, after_keyword = _id.partition(id_identifier)
  author_id = after_keyword
  affiliations = result.select_one('.gs_ai_aff').text
  email = result.select_one('.gs_ai_eml').text

  try:
    interests = result.select_one('.gs_ai_one_int').text
  except AttributeError:
    # some profiles list no interests
    interests = None

  cited_by = result.select_one('.gs_ai_cby').text.split(' ')[2]
  
  print(f'{name}\nhttps://scholar.google.com{link}\n{author_id}\n{affiliations}\n{email}\n{interests}\n{cited_by}\n')

Output:

Jeong-Won Lee
https://scholar.google.com/citations?hl=en&user=D41VK7AAAAAJ
D41VK7AAAAAJ
Samsung Medical Center
Verified email at samsung.com
Gynecologic oncology
107516
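
Since the original question is about turning a list of professor names into author_ids, here is a minimal sketch that loops over such a list (professor_names is a hypothetical placeholder) and queries the profile search page directly, assuming the search results use the same .gs_ai_name markup as the organization listing above:

from bs4 import BeautifulSoup
import requests, lxml

headers = {
    'User-agent':
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19582"
}

professor_names = ['Jeong-Won Lee']  # hypothetical input list

for name in professor_names:
    params = {'view_op': 'search_authors', 'mauthors': name, 'hl': 'en'}
    html = requests.get('https://scholar.google.com/citations', params=params, headers=headers).text
    soup = BeautifulSoup(html, 'lxml')

    first_match = soup.select_one('.gs_ai_name a')
    if first_match is None:
        # no Google Scholar profile for this professor -> keep NA downstream
        print(f'{name}: no profile found')
        continue

    # same "user=" partition trick as above
    author_id = first_match['href'].partition('user=')[2]
    print(f'{name}: {author_id}')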

Alternatively, you can achieve the same thing with the Google Scholar Profiles API from SerpApi, without having to figure out how to bypass CAPTCHAs, find proxies, or maintain the parser over time.

It's a paid API with a free plan.

Code to integrate:

from serpapi import GoogleSearch
import os

params = {
    "api_key": os.getenv("API_KEY"),
    "engine": "google_scholar_profiles",
    "mauthors": "samsung"
}

search = GoogleSearch(params)
results = search.get_dict()

for result in results['profiles']:
  name = result['name']

  try:
    email = result['email']
  except KeyError:
    # not every profile exposes a verified email
    email = None

  author_id = result['author_id']
  affiliation = result['affiliations']
  cited_by = result['cited_by']
  interests = result['interests'][0]['title']
  interests_link = result['interests'][0]['link']

  print(f'{name}\n{email}\n{author_id}\n{affiliation}\n{cited_by}\n{interests}\n{interests_link}\n')

Partial output:

Jeong-Won Lee
Verified email at samsung.com
D41VK7AAAAAJ
Samsung Medical Center
107516
Gynecologic oncology
https://scholar.google.com/citations?hl=en&view_op=search_authors&mauthors=label:gynecologic_oncology
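
To map the whole list of professors with the API instead, here is a minimal sketch (professor_names is again a hypothetical placeholder) that queries each name via mauthors and keeps the first matching author_id, or None when there is no profile:

from serpapi import GoogleSearch
import os

professor_names = ['Jeong-Won Lee']  # hypothetical input list
author_ids = {}

for professor in professor_names:
    params = {
        "api_key": os.getenv("API_KEY"),
        "engine": "google_scholar_profiles",
        "mauthors": professor
    }
    results = GoogleSearch(params).get_dict()
    profiles = results.get('profiles', [])

    # first match, or None when the professor has no Google Scholar profile
    author_ids[professor] = profiles[0]['author_id'] if profiles else None

print(author_ids)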

Disclaimer: I work for SerpApi.

