Scraping the Nevada Assembly/Senate site with R + V8

Problem description

I am trying to scrape the Nevada state legislature webpages (i.e. the tables of assemblypersons and senators and their personal pages) and am slowly being driven mad. It looks straightforward: the tables are present in the HTML when you examine the source code. But as best I can tell, they are created by JavaScript queries, and this is new to me.

I've tried some of the work-arounds, like this Stack Exchange question, but cannot seem to find one that works. I'm now trying to follow the directions here and here, to no avail. This is for a paid gig and I'm using up way too many billable hours. I'm about to just manually fill in the darn data, but I figure learning this now could save future headaches.

When I look for scripts, I find 11, but only the first and last have any text within them. When I attempt to call the last script with ct$eval(), I get the error message "Error in context_eval(join(src), private$context) : ReferenceError: $ is not defined"

> read_html(link) %>% html_nodes("script")
{xml_nodeset (11)}
 [1] <script type="text/javascript">\r\n var SiteTitle = "Legislator I ...
 [2] <script src="/App/Legislator/A/Scripts/jquery-1.8.1.js"></script>
 [3] <script src="/App/Legislator/A/Scripts/DataTables/jquery.dataTables.js"> ...
 [4] <script src="/App/Legislator/A/Scripts/jquery.fancybox.pack.js"></script>
 [5] <script src="/App/Legislator/A/Scripts/jquery.fancybox-buttons.js"></scr ...
 [6] <script src="/App/Legislator/A/Scripts/jquery.fancybox-media.js"></script>
 [7] <script src="/App/Legislator/A/Scripts/jquery.fancybox-thumbs.js"></script>
 [8] <script src="/App/Legislator/A/Scripts/bootstrap.js"></script>
 [9] <script src="/App/Legislator/A/Scripts/DateFormat.js"></script>
[10] <script src="/App/Legislator/A/Scripts/LCB.js"></script>
[11] <script type="text/javascript">\r\n\t$(function() {\r\n\t\t//console ...

#Loading both the required libraries
library(rvest)
library(V8)
#URL with js-rendered content to be scraped
link <- "https://www.leg.state.nv.us/App/Legislator/A/Senate/Current/1"
#Read the html page content and extract all javascript codes that are inside a list
jscript <- read_html(link) %>% html_nodes('script') %>% html_text()
# Create a new v8 context
ct <- v8()
#parse the html content from the js output and print it as text
read_html(ct$eval(jscript[11])) %>% 
  html_text()

I'm stuck! Suggestions appreciated!

Tags: javascript, r, web-scraping

Solution


The data for the landing page and the personal pages is loaded dynamically from API calls, which you can find in the network tab of your browser's developer tools.

The landing page gets its info from the following call, which returns everything you need as JSON:

https://www.leg.state.nv.us/App/Legislator/A/api/Current/Legislator?house=Assembly

If you loop through the returned list, you can extract the MemberID for each member and append it to a second API call, which returns all the info from their personal page:

`https://www.leg.state.nv.us/App/Legislator/A/api/Current/Legislator?id={member["MemberID"]}`

In R, you could build this URL with paste0.
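As a minimal sketch of the URL construction described above, the two calls can be built from one base endpoint. The helper names `list_url` and `member_url` are made up for illustration, and the ID `123` is a placeholder rather than a real MemberID; only the base URL and the `house` and `id` query parameters come from the calls quoted in this answer.

```python
from urllib.parse import urlencode

# Base endpoint copied from the URLs quoted in this answer.
API_BASE = "https://www.leg.state.nv.us/App/Legislator/A/api/Current/Legislator"


def list_url(house):
    """URL for the landing-page call; `house` is e.g. "Assembly"."""
    return f"{API_BASE}?{urlencode({'house': house})}"


def member_url(member_id):
    """URL for one member's personal-page call, keyed by MemberID."""
    return f"{API_BASE}?{urlencode({'id': member_id})}"


print(list_url("Assembly"))
print(member_url(123))  # 123 is a placeholder, not a real MemberID
```

Using `urlencode` rather than plain string concatenation keeps the query string valid if a value ever contains characters that need escaping.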

The script below builds a dictionary keyed by district number. The value for each key holds that member's info from the landing page together with the JSON returned for their personal page.


R version:

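library(jsonlite)
library(collections)

# Landing-page call: one record per Assembly member
members <- jsonlite::read_json('https://www.leg.state.nv.us/App/Legislator/A/api/Current/Legislator?house=Assembly')
# Newer releases of collections may export this constructor as dict() instead of Dict()
d <- Dict(items = NULL)

for(member in members){
  # Coerce to character so the key matches the d$get("1") lookups below
  district <- as.character(member$DistrictNbr)
  personal_url <- paste0('https://www.leg.state.nv.us/App/Legislator/A/api/Current/Legislator?id=', member$MemberID)
  personal_data <- jsonlite::read_json(personal_url)
  # Combine the landing-page fields with the personal-page JSON
  d$set(district, c(member, personal_data))
}

print(d$keys()) #dict keys
print(d$get("1")) #district 1
print(d$get("1")[["legislatorCareerInfo"]]) #example personal info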

Python version:

import requests

results = {}

with requests.Session() as s:
    r = s.get('https://www.leg.state.nv.us/App/Legislator/A/api/Current/Legislator?house=Assembly').json()
    for member in r:
        district = member['DistrictNbr']
        results[district] = member
        r2 = s.get(f'https://www.leg.state.nv.us/App/Legislator/A/api/Current/Legislator?id={member["MemberID"]}').json()
        results[district]['personal_page'] = r2
print(results)
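
Since the original goal was a table of legislators, here is one possible way to flatten the `results` dictionary from the Python script into CSV. This is only a sketch under assumptions: it assumes each personal-page response is a JSON object (a dict), and the sample records are invented so the example runs offline. Only `DistrictNbr` and `MemberID` are field names taken from this post; `Name` and `Email` are placeholders.

```python
import csv
import io


def flatten(record, parent_key="", sep="."):
    """Flatten nested dicts into one level with dotted keys.

    Lists and scalars are stored unchanged to keep the sketch short.
    """
    flat = {}
    for key, value in record.items():
        full_key = f"{parent_key}{sep}{key}" if parent_key else str(key)
        if isinstance(value, dict):
            flat.update(flatten(value, full_key, sep))
        else:
            flat[full_key] = value
    return flat


def to_csv(results):
    """Render {district: member_record} as CSV text, one row per district."""
    rows = [flatten(member) for member in results.values()]
    # Union of all column names in first-seen order, so rows with
    # missing fields still line up under the same header.
    fieldnames = list(dict.fromkeys(k for row in rows for k in row))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, restval="")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Invented records standing in for the `results` dict built above.
sample = {
    1: {"DistrictNbr": 1, "MemberID": 101, "Name": "Example A",
        "personal_page": {"Email": "a@example.org"}},
    2: {"DistrictNbr": 2, "MemberID": 102, "Name": "Example B",
        "personal_page": {}},
}
print(to_csv(sample))
```

With real data you would call `to_csv(results)` and write the returned text to a file. Nested lists in the API response would need their own handling.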

References:

  1. collections package
