r - Web Scraping BoardGameGeek with rvest
Problem description
I'm pretty much brand new to web scraping with rvest, and new to most everything except Qlik coding.
I am attempting to scrape data from BoardGameGeek (see the link below). Using the browser inspector, it certainly seems possible, yet rvest is not finding the tags. I first thought I had to run the page's JavaScript using V8 (scripts are loaded at the top of the HTML), but when I just use html_text on the whole document, all the information I need is in there.
UPDATE: The data appears to be JSON. I used a combination of Notepad++ and a web tool to clean it and load it into R. Any recommendations on tutorials/demos for how to do this systematically? I have all the links I need to loop through, but I'm not sure how to go from the html_text output to clean JSON input via code.
I provided examples below, but I need to scrape the majority of the data elements available, so I'm not looking for code to copy and paste but rather the best method to pursue.
Link: https://boardgamegeek.com/boardgame/63888/innovation
HTML examples I am trying to pull from. html_nodes returns nothing for span, so I couldn't even start there.
<span ng-if="min > 0" class="ng-binding ng-scope">45</span>
OR
<a title="Civilization" ng-href="/boardgamecategory/1015/civilization" class="ng-binding" href="/boardgamecategory/1015/civilization">Civilization</a>
JavaScript sections at the top of the page like this (about 8 of them):
<script type="text/javascript" src="https://cf.geekdo-static.com/static/geekcollection_master2_5e84926ab7e90.js"></script>
When I just use html_text on the whole object, I can see all the elements I am looking for, e.g.:
\"minplaytime\":\"30\" OR {\"name\":\"Deck, Bag, and Pool Building\"
I'm assuming this is JSON? Is there a way to parse the html_text output, or another method? Is it easier just to run the JavaScript at the top of the page using V8? Is there an easy guide for this?
Solution
Are you aware that BGG has an API? Documentation can be found here: URL
The data is provided as an XML file. For your example, you can get the ID of your game from the page URL (here it is 63888), so the XML file can be found at: https://www.boardgamegeek.com/xmlapi2/thing?id=63888
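Since the question mentions a list of links to loop through, here is a minimal sketch of turning game page links into API URLs. The helper name bgg_api_url is hypothetical, and the pattern assumes every link has the form /boardgame/<id>/<name> like the example above.

```r
# Hypothetical helper: pull the numeric game ID out of a BGG game page link
# and build the matching XML API2 "thing" URL.
bgg_api_url <- function(page_url) {
  # Capture the digits that follow "/boardgame/" in the link
  id <- sub("^.*/boardgame/([0-9]+).*$", "\\1", page_url)
  paste0("https://www.boardgamegeek.com/xmlapi2/thing?id=", id)
}

# The link from the question; add further links to this vector as needed
links <- c("https://boardgamegeek.com/boardgame/63888/innovation")
api_urls <- vapply(links, bgg_api_url, character(1), USE.NAMES = FALSE)
api_urls
```

Note that sub() returns its input unchanged when the pattern does not match, so links in a different format would need to be checked before use.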
You can read the info with this code:
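library(dplyr)  # provides the %>% pipe
library(rvest)  # html_nodes(), html_attr()
library(xml2)   # read_xml(); rvest 1.0.0 and later no longer attach xml2
# Fetch the XML record for game ID 63888 from the BGG XML API
game_data <- read_xml("https://www.boardgamegeek.com/xmlapi2/thing?id=63888")
# Select the primary <name> node and read its "value" attribute
game_data %>%
  html_nodes("name[type=primary]") %>%
  html_attr("value") %>%
  as.character()
#> [1] "Innovation"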
By inspecting the XML file you can choose which nodes you want to extract.
Created on 2020-04-06 by the reprex package (v0.3.0)
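To illustrate choosing other nodes, here is a small sketch that runs offline on a hand-written XML sample. The sample is an assumption: it is abbreviated and only meant to mimic the shape of the API response, with values taken from the question. Check the real file for the exact node and attribute names before relying on these selectors.

```r
library(xml2)

# Hand-written, abbreviated sample assumed to mirror the API response shape
sample_xml <- '<items>
  <item type="boardgame" id="63888">
    <name type="primary" value="Innovation"/>
    <minplaytime value="30"/>
    <link type="boardgamecategory" value="Civilization"/>
    <link type="boardgamemechanic" value="Deck, Bag, and Pool Building"/>
  </item>
</items>'
doc <- read_xml(sample_xml)

# Single-valued field: take the first matching node's "value" attribute
min_playtime <- as.integer(xml_attr(xml_find_first(doc, "//minplaytime"), "value"))

# Multi-valued field: collect "value" from every <link> of the wanted type
categories <- xml_attr(xml_find_all(doc, "//link[@type='boardgamecategory']"), "value")

min_playtime
categories
```

For a real game, the same selectors could be applied to the document returned by read_xml() on the API URL instead of the sample string.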