How to scrape the citation count per paper per year from Google Scholar?

Problem description

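I want to plot a bar chart showing how a Google Scholar author's h-index changes from year to year. To compute this, I need the citation count of every paper for every year, from which I can calculate the h-index for each year.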

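I managed to scrape the chart on the author's profile page. Taking Einstein's Google Scholar profile as an example (https://scholar.google.com/citations?user=qc6CJjYAAAAJ&hl=en), I got the citations-per-year chart on the right, but that is not what I need. What I actually want is the chart of total citations broken down by year that appears when you click on a paper. I am using the BeautifulSoup and selenium packages in Python. My biggest difficulty right now: in an author's HTML source, the details of each paper are hidden, so how do I click each paper and access its citations-by-year chart?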

This is what I did for the chart on the right:

import urllib.request
import pandas as pd
from bs4 import BeautifulSoup as soup

def get_citation_by_year(url):
    # Parse the raw bytes directly; wrapping them in str() would produce a "b'...'" literal
    s = soup(urllib.request.urlopen(url).read(), 'lxml')
    #print(s.title.text) #whose google scholar is this?
    # Year labels and per-year counts of the profile-level bar chart
    years = [int(i.text) for i in s.find_all('span', {'class': 'gsc_g_t'})]
    citation_number = [int(i.text) for i in s.find_all('span', {'class': 'gsc_g_al'})]
    return pd.DataFrame({'Year': years, 'Cited_By': citation_number})

Clicking the "Show more" button to display the maximum number of articles:

import time
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By

def get_citation_byarticle_byyear(url):
    # url is the Google Scholar profile page of a specific author
    chrome_options = Options()
    chrome_options.add_argument("--headless")

    # Requires ChromeDriver: http://chromedriver.chromium.org/downloads
    # Selenium 4 style; older releases took chrome_options= and executable_path= directly
    service = Service(executable_path=r"/Users/upcrown/Desktop/chromedriver")
    driver = webdriver.Chrome(service=service, options=chrome_options)

    driver.implicitly_wait(30)
    driver.get(url)

    # Click "Show more" to load additional articles
    show_more = driver.find_element(By.ID, "gsc_bpf_more")
    show_more.click()

    time.sleep(5)
    # Selenium hands the page source to Beautiful Soup
    s = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()

    # Publication year of each paper; kept as strings because some are ''
    year = [i.text for i in s.find_all('span', {'class': 'gsc_a_h gsc_a_hc gs_ibl'})]
    # Paper titles
    paper = [i.text for i in s.find_all('a', {'class': 'gsc_a_at'})]
    # Total citation count of each paper
    citations = [i.text for i in s.find_all('a', {'class': 'gsc_a_ac gs_ibl'})]
    return paper, year, citations

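Other tools I have tried: the R "scholar" package (no citation count per paper per year, only citations per year); the Windows application Publish or Perish (same problem); the Scopus API (does not have the complete list of an author's articles that Google Scholar has).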

Tags: python, web-scraping, google-scholar

Solution


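To access the "pop-up" with the citations that appears when you click one of the articles, you can use a third-party solution such as SerpApi. It is a paid API with a free trial.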

Example Python code (libraries for other languages are also available):

from serpapi import GoogleSearch

params = {
  "api_key": "SECRET_API_KEY",
  "engine": "google_scholar_author",
  "hl": "en",
  "author_id": "qc6CJjYAAAAJ",
  "citation_id": "qc6CJjYAAAAJ:qyhmnyLat1gC",
  "view_op": "view_citation"
}

search = GoogleSearch(params)
results = search.get_dict()

Example JSON output:

"citation": {
  "title": "Rosen (1935)",
  "authors": "A Einstein, B Podolsky",
  "publication_date": "1964",
  "journal": "Physical Review",
  "volume": "47",
  "pages": "777",
  "total_citations": {
    "cited_by": {
      "total": 20216,
      "link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=8174092782678430881,4810886029029668500,2204829022686080230&as_sdt=5",
      "serpapi_link": "https://serpapi.com/search.json?cites=8174092782678430881%2C4810886029029668500%2C2204829022686080230&engine=google_scholar&hl=en",
      "cites_id": "8174092782678430881,4810886029029668500,2204829022686080230"
    },
    "table": [
      {
        "year": 1983,
        "citations": 68
      },
      {
        "year": 1984,
        "citations": 62
      },
      ...
    ]
  },
  "scholar_articles": [
    {
      "title": "Can quantum-mechanical description of physical reality be considered complete?",
      "link": "https://scholar.google.com/scholar?oi=bibs&cluster=8174092782678430881&btnI=1&hl=en",
      "authors": "A Einstein, B Podolsky, N Rosen - Physical review, 1935",
      "cited_by": {
        "total": 20195,
        "link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=8174092782678430881&as_sdt=5",
        "serpapi_link": "https://serpapi.com/search.json?cites=8174092782678430881&engine=google_scholar&hl=en",
        "cites_id": "8174092782678430881"
      },
      "related_pages_link": {
        "link": "https://scholar.google.com/scholar?oi=bibs&hl=en&q=related:odSh4BM2cHEJ:scholar.google.com/"
      },
      "versions": {
        "total": 96,
        "link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cluster=8174092782678430881",
        "serpapi_link": "https://serpapi.com/search.json?cluster=8174092782678430881&engine=google_scholar&hl=en",
        "cluster_id": "8174092782678430881"
      }
    },
    {
      "title": "Podolsky B Rosen N 1935",
      "link": "https://scholar.google.com/scholar?oi=bibs&cluster=4810886029029668500&btnI=1&hl=en",
      "authors": "A Einstein - Phys. Rev",
      "cited_by": {
        "total": 48,
        "link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=4810886029029668500&as_sdt=5",
        "serpapi_link": "https://serpapi.com/search.json?cites=4810886029029668500&engine=google_scholar&hl=en",
        "cites_id": "4810886029029668500"
      },
      "related_pages_link": {
        "link": "https://scholar.google.com/scholar?oi=bibs&hl=en&q=related:lPYzr1qzw0IJ:scholar.google.com/"
      }
    },
    {
      "title": "RosenN",
      "link": "https://scholar.google.com/scholar?oi=bibs&cluster=2204829022686080230&btnI=1&hl=en",
      "authors": "PB EinsteinA - Canquantum mechanical …",
      "cited_by": {
        "total": 16,
        "link": "https://scholar.google.com/scholar?oi=bibs&hl=en&cites=2204829022686080230&as_sdt=5",
        "serpapi_link": "https://serpapi.com/search.json?cites=2204829022686080230&engine=google_scholar&hl=en",
        "cites_id": "2204829022686080230"
      },
      "related_pages_link": {
        "link": "https://scholar.google.com/scholar?oi=bibs&hl=en&q=related:5uysf1AgmR4J:scholar.google.com/"
      }
    }
  ]
}
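
The per-year counts sit under `citation` → `total_citations` → `table` in the output above. A minimal sketch for pulling them into a `{year: citations}` dictionary; the helper name `yearly_citations` is hypothetical, and it assumes the response has the same shape as the sample:

```python
def yearly_citations(results):
    """Return {year: citations} from a view_citation response dict."""
    # .get() at each level so a response without a citation table yields {}.
    table = results.get("citation", {}).get("total_citations", {}).get("table", [])
    return {row["year"]: row["citations"] for row in table}


# Trimmed-down copy of the sample output above.
sample = {
    "citation": {
        "total_citations": {
            "table": [
                {"year": 1983, "citations": 68},
                {"year": 1984, "citations": 62},
            ]
        }
    }
}
print(yearly_citations(sample))
```

Calling this once per paper, with a different `citation_id` each time, would give one such dictionary for every article of the author.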

You can check the documentation for more details.
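
Once a per-year table has been collected for every paper, the h-index for each year follows from the cumulative counts: the h-index at the end of year Y is the largest h such that h papers each have at least h citations up to and including Y. A sketch in plain Python; `h_index` and `h_index_by_year` are hypothetical helper names, and the numbers are made up for illustration:

```python
def h_index(counts):
    """Largest h such that at least h papers have >= h citations each."""
    h = 0
    for rank, c in enumerate(sorted(counts, reverse=True), start=1):
        if c >= rank:
            h = rank
        else:
            break
    return h


def h_index_by_year(papers):
    """papers: {paper_id: {year: citations_in_that_year}} -> {year: h-index}."""
    years = sorted({y for table in papers.values() for y in table})
    result = {}
    for year in years:
        # Cumulative citations of each paper up to and including this year.
        cumulative = [
            sum(c for y, c in table.items() if y <= year)
            for table in papers.values()
        ]
        result[year] = h_index(cumulative)
    return result


# Made-up per-paper tables, in the {year: citations} form.
papers = {
    "paper_a": {2019: 1, 2020: 2, 2021: 3},
    "paper_b": {2020: 1, 2021: 2},
    "paper_c": {2021: 1},
}
print(h_index_by_year(papers))
```

Only years that appear in at least one table are reported, so a year in which no paper was cited is skipped rather than filled in. The resulting dictionary can be passed straight to a bar-chart function.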

Disclaimer: I work at SerpApi.

