javascript - Scraping the Nevada Assembly/Senate site with R + V8
Problem description
I am trying to scrape the Nevada state legislature webpages (i.e. the tables of assemblypersons and senators and their personal pages) and am slowly being driven mad. It looks straightforward: there are tables that show up in the HTML when you inspect the page. Except, as best I can tell, they are being created by JavaScript queries, and this is new to me.
I've tried some of the work-arounds, like this Stack Exchange question, but cannot seem to find one that works. I'm now trying to follow the directions here and here, to no avail. This is for a paid gig and I'm using up way too many billable hours. I'm about to just manually fill in the darn data, but I figure learning this now could save future headaches.
When I look for scripts, I find 11, but only the first and last have any text within them. When I attempt to call the last script with ct$eval(), I get the error message "Error in context_eval(join(src), private$context) : ReferenceError: $ is not defined"
> read_html(link) %>% html_nodes("script")
{xml_nodeset (11)}
 [1] <script type="text/javascript">\r\nvar SiteTitle = "Legislator I ...
 [2] <script src="/App/Legislator/A/Scripts/jquery-1.8.1.js"></script>
 [3] <script src="/App/Legislator/A/Scripts/DataTables/jquery.dataTables.js"> ...
 [4] <script src="/App/Legislator/A/Scripts/jquery.fancybox.pack.js"></script>
 [5] <script src="/App/Legislator/A/Scripts/jquery.fancybox-buttons.js"></scr ...
 [6] <script src="/App/Legislator/A/Scripts/jquery.fancybox-media.js"></script>
 [7] <script src="/App/Legislator/A/Scripts/jquery.fancybox-thumbs.js"></script>
 [8] <script src="/App/Legislator/A/Scripts/bootstrap.js"></script>
 [9] <script src="/App/Legislator/A/Scripts/DateFormat.js"></script>
[10] <script src="/App/Legislator/A/Scripts/LCB.js"></script>
[11] <script type="text/javascript">\r\n\t$(function() {\r\n\t\t//console ...
#Loading both the required libraries
library(rvest)
library(V8)
#URL with js-rendered content to be scraped
link <- "https://www.leg.state.nv.us/App/Legislator/A/Senate/Current/1"
#Read the html page content and extract all javascript codes that are inside a list
jscript <- read_html(link) %>% html_nodes('script') %>% html_text()
# Create a new v8 context
ct <- v8()
#parse the html content from the js output and print it as text
read_html(ct$eval(jscript[11])) %>%
  html_text()
I'm stuck! Suggestions appreciated!
Solution
The data for the landing page and the personal pages is loaded dynamically from API calls, which you can find in the browser's network tab.
The landing page gets its info from the following call, which returns everything you need as JSON:
https://www.leg.state.nv.us/App/Legislator/A/api/Current/Legislator?house=Assembly
If you loop through the list returned, you can extract the MemberID for each member and concatenate it into a second API call, which returns all the info from their personal page:
`https://www.leg.state.nv.us/App/Legislator/A/api/Current/Legislator?id={member["MemberID"]}`
You could use paste0 in R to do this.
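To make the URL-building step concrete, here is a minimal Python sketch of the same concatenation. The helper names roster_url and member_url are made up for illustration. Only house=Assembly appears in the call above; passing "Senate" is my guess by analogy with the page URL in the question, so check it in the network tab first.

```python
from urllib.parse import urlencode

# Base endpoint taken from the two API calls shown above.
BASE = "https://www.leg.state.nv.us/App/Legislator/A/api/Current/Legislator"


def roster_url(house):
    """URL for the landing-page call (hypothetical helper).

    Only house="Assembly" is confirmed above; "Senate" is an assumption.
    """
    return f"{BASE}?{urlencode({'house': house})}"


def member_url(member_id):
    """URL for one member's personal-page call (hypothetical helper)."""
    return f"{BASE}?{urlencode({'id': member_id})}"


print(roster_url("Assembly"))
print(member_url(123))  # 123 is a placeholder, not a real MemberID
```

Using urlencode rather than plain string concatenation just means any odd characters in the values get escaped for you.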
The script below generates a dictionary keyed by district number. The value stored under each key is that member's info from the landing page, together with the JSON returned for their 'personal page'.
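To show the shape of that dictionary without touching the network, here is a rough Python sketch that uses made-up records. The function index_by_district, the fake_fetch stand-in and the sample values are all hypothetical; only the field names MemberID, DistrictNbr and legislatorCareerInfo come from this answer.

```python
def index_by_district(members, fetch_personal):
    """Build {district: landing-page info + personal-page JSON}.

    fetch_personal is any callable taking a MemberID and returning the
    personal-page data, so the real HTTP call can be swapped in later.
    """
    results = {}
    for member in members:
        district = str(member["DistrictNbr"])  # string keys, e.g. "1"
        entry = dict(member)  # copy so the input list is left untouched
        entry["personal_page"] = fetch_personal(member["MemberID"])
        results[district] = entry
    return results


# Made-up records standing in for the landing-page JSON.
sample_members = [
    {"MemberID": 101, "DistrictNbr": "1"},
    {"MemberID": 102, "DistrictNbr": "2"},
]


def fake_fetch(member_id):
    # Stand-in for the second API call; returns placeholder data only.
    return {"legislatorCareerInfo": f"placeholder for member {member_id}"}


results = index_by_district(sample_members, fake_fetch)
print(sorted(results))  # the district keys
print(results["1"]["personal_page"])
```

Keeping the fetch step as a parameter is just a convenience: you can try the indexing logic offline and then pass in a function that does the real request.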
R version:
library(jsonlite)
library(collections)

# Landing-page call: a list with one entry per Assembly member
members <- jsonlite::read_json('https://www.leg.state.nv.us/App/Legislator/A/api/Current/Legislator?house=Assembly')
d <- dict() # dict() in current releases of collections; older releases spell it Dict()
for (member in members) {
  # Use the district number, as a string, for the key
  district <- as.character(member$DistrictNbr)
  personal_url <- paste0('https://www.leg.state.nv.us/App/Legislator/A/api/Current/Legislator?id=', member$MemberID)
  personal_data <- jsonlite::read_json(personal_url)
  # Store landing-page fields and personal-page fields together
  d$set(district, c(member, personal_data))
}
print(d$keys()) # dict keys
print(d$get("1")) # district 1
print(d$get("1")[["legislatorCareerInfo"]]) # example personal info
Python version:
import requests

results = {}

with requests.Session() as s:
    r = s.get('https://www.leg.state.nv.us/App/Legislator/A/api/Current/Legislator?house=Assembly').json()
    for member in r:
        district = member['DistrictNbr']
        results[district] = member
        r2 = s.get(f'https://www.leg.state.nv.us/App/Legislator/A/api/Current/Legislator?id={member["MemberID"]}').json()
        results[district]['personal_page'] = r2

print(results)
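Since the end goal in the question is a table, one possible follow-up is to flatten a results dictionary like the one above into CSV rows. This is only a sketch under assumptions: to_rows and to_csv are hypothetical helpers, the sample record is made up, and nested values (such as the personal-page JSON) are simply skipped rather than unpacked, because I have not listed the actual field names here.

```python
import csv
import io


def to_rows(results):
    """Flatten {district: member_dict} into flat row dicts.

    Keeps only scalar top-level fields; nested lists and dicts are skipped.
    """
    rows = []
    for district, member in results.items():
        row = {"district": district}
        for key, value in member.items():
            if value is None or isinstance(value, (str, int, float, bool)):
                row[key] = value
        rows.append(row)
    return rows


def to_csv(rows):
    """Render the rows as CSV text, using the union of all keys as columns."""
    fieldnames = []
    for row in rows:
        for key in row:
            if key not in fieldnames:
                fieldnames.append(key)
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buf.getvalue()


# Made-up record in the same shape as the results dictionary.
sample = {
    "1": {
        "DistrictNbr": "1",
        "MemberID": 101,
        "personal_page": {"legislatorCareerInfo": "placeholder"},
    }
}
rows = to_rows(sample)
print(to_csv(rows))
```

Once you know which nested personal-page fields you need, you would pull those out explicitly instead of dropping them.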