Web Scraping BoardGameGeek with rvest

Question

I'm pretty much brand new to web scraping with rvest, and really new to most everything except Qlik coding.

I am attempting to scrape data from BoardGameGeek (see the link below). Using the browser's inspector, it certainly seems possible, yet rvest is not finding the tags. I first thought I had to go through the whole JavaScript process using V8 (JavaScript is called at the top of the HTML), but when I just use html_text on the whole document, all the information I need is in there.

UPDATE: It appears to be in JSON. I used a combination of Notepad++ and a web tool to clean it and load it into R. Any recommendations on tutorials/demos for how to do this systematically? I have all the links I need to loop through, but I am not sure how to go from the html_text output to a clean JSON input via code.

I have provided examples below, but I need to scrape the majority of the data elements available, so I am not looking for code to copy and paste but rather for the best method to pursue.

Link: https://boardgamegeek.com/boardgame/63888/innovation

Example of the HTML I am trying to pull from. The span returns nothing with html_nodes, so I couldn't even start there.

<span ng-if="min > 0" class="ng-binding ng-scope">45</span>

OR

<a title="Civilization" ng-href="/boardgamecategory/1015/civilization" class="ng-binding" href="/boardgamecategory/1015/civilization">Civilization</a>

There are JavaScript sections at the top of the page like this (about 8 of them):

<script type="text/javascript" src="https://cf.geekdo-static.com/static/geekcollection_master2_5e84926ab7e90.js"></script>

When I just use html_text on the whole object, I can see all the elements I am looking for, e.g.:

\"minplaytime\":\"30\" OR {\"name\":\"Deck, Bag, and Pool Building\"

I'm assuming this is JSON? Is there a way to parse the html_text output, or is there another method? Is it easier just to run the JavaScript at the top of the page using V8? Is there an easy guide for this?

Tags: r, web-scraping, rvest

Answer


Are you aware that BGG has an API? Documentation can be found here: URL

The data is provided as an XML file. For your example, take the ID of your game, 63888 (it's in the page URL). The XML file can then be found at: https://www.boardgamegeek.com/xmlapi2/thing?id=63888
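Since the question mentions a list of page links to loop through, here is a minimal sketch of how each page URL could be turned into the matching API URL. The helper name `bgg_api_url` is hypothetical, and the sketch assumes every link follows the `/boardgame/<id>/<slug>` pattern seen in the example link.

```r
# Hypothetical helper: turn a BGG game page URL into its XML API URL.
# Assumes the numeric ID is the path segment right after "/boardgame/".
bgg_api_url <- function(page_url) {
  # sub() is vectorised, so this also works on a whole vector of links
  id <- sub(".*/boardgame/([0-9]+).*", "\\1", page_url)
  paste0("https://www.boardgamegeek.com/xmlapi2/thing?id=", id)
}

page_urls <- c("https://boardgamegeek.com/boardgame/63888/innovation")
api_urls <- bgg_api_url(page_urls)
api_urls
#> [1] "https://www.boardgamegeek.com/xmlapi2/thing?id=63888"
```

Note that a link which does not match the pattern passes through `sub()` unchanged, so it may be worth checking the inputs before requesting anything.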

You can read the info with this code:

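library(dplyr)  # provides the %>% pipe
library(rvest)  # html_nodes(), html_attr()
library(xml2)   # read_xml(); newer rvest versions may not attach it for you

# Download the XML record for game ID 63888 from the BGG XML API
game_data <- read_xml("https://www.boardgamegeek.com/xmlapi2/thing?id=63888")

# Select the <name> element marked type="primary" and read its "value" attribute
game_data %>% 
  html_nodes("name[type=primary]") %>% 
  html_attr("value")
#> [1] "Innovation"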

By inspecting the XML file you can choose which nodes you want to extract.

Created on 2020-04-06 by the reprex package (v0.3.0)
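To illustrate choosing other nodes, here is a small sketch that runs offline against a hand-written XML snippet. The snippet is an assumption about the shape of the API response, trimmed to the fields mentioned in the question (minimum play time and category), so check the element and attribute names against the real file. It uses xml2's XPath functions rather than CSS selectors, which is a stylistic choice and not the only option.

```r
library(xml2)

# Hand-written sample, assumed to mirror the structure of the API response
sample_xml <- '<items>
  <item type="boardgame" id="63888">
    <name type="primary" value="Innovation"/>
    <minplaytime value="30"/>
    <link type="boardgamecategory" id="1015" value="Civilization"/>
    <link type="boardgamemechanic" value="Deck, Bag, and Pool Building"/>
  </item>
</items>'

doc <- read_xml(sample_xml)

# Single value: the "value" attribute of the first <minplaytime> element
min_playtime <- as.integer(xml_attr(xml_find_first(doc, "//minplaytime"), "value"))

# Repeated values: every <link> element whose type is "boardgamecategory"
categories <- xml_attr(xml_find_all(doc, "//link[@type='boardgamecategory']"), "value")

min_playtime
#> [1] 30
categories
#> [1] "Civilization"
```

Against the live API, `sample_xml` would be replaced by the URL for the game in question.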

