r - Scraping questions and answers works fine unless a thread's answers span more than one page
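Problem description
The code below scrapes all questions and answers together with their authors and dates, but I can't figure out how to also scrape answers that span more than one page, for example the second thread listed here: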
https://www.healthboards.com/boards/aspergers-syndrome/index2.html
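"Aspergers and talking to yourself"
Its answers are spread over 2 pages: 15 on the first page and 3 on the second. I only get the answers from the first page.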
library(rvest)
library(dplyr)
library(stringr)
library(purrr)
library(tidyr)
library(RCurl)
library(xlsx)
#install.packages("xlsx")
# Scrape thread titles, thread links, authors and number of views
url <- "https://www.healthboards.com/boards/aspergers-syndrome/index2.html"
h <- read_html(url)
threads <- h %>%
  html_nodes("#threadslist .alt1 div > a") %>%
  html_text()
threads
thread_links <- h %>%
  html_nodes("#threadslist .alt1 div > a") %>%
  html_attr(name = "href")
thread_links
thread_starters <- h %>%
  html_nodes("#threadslist .alt1 div.smallfont") %>%
  html_text() %>%
  str_replace_all(pattern = "\t|\r|\n", replacement = "")
thread_starters
views <- h %>%
  html_nodes(".alt2:nth-child(6)") %>%
  html_text() %>%
  str_replace_all(pattern = ",", replacement = "") %>%
  as.numeric()
# Custom functions to scrape author IDs and posts
# Post bodies found on one thread page
scrape_posts <- function(link){
  read_html(link) %>%
    html_nodes(css = ".smallfont~ hr+ div") %>%
    html_text() %>%
    str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
    str_trim()
}
# Post dates found on one thread page
scrape_dates <- function(link){
  read_html(link) %>%
    html_nodes(css = "table[id^='post'] td.thead:first-child") %>%
    html_text() %>%
    str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
    str_trim()
}
# Author names: text of every div whose id contains "postmenu"
scrape_author_ids <- function(link){
  h <- read_html(link) %>%
    html_nodes("div")
  id_index <- h %>%
    html_attr("id") %>%
    str_which(pattern = "postmenu")
  h %>%
    `[`(id_index) %>%
    html_text() %>%
    str_replace_all(pattern = "\t|\r|\n", replacement = "") %>%
    str_trim()
}
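# Download each thread once; read_html() also accepts the raw HTML string
htmls <- map(thread_links, getURL)
# Create master dataset
master_data <-
  tibble(threads, thread_starters, thread_links) %>%
  mutate(
    post_author_id = map(htmls, scrape_author_ids),
    post = map(htmls, scrape_posts),
    fec = map(htmls, scrape_dates)
  ) %>%
  select(threads:post_author_id, post, thread_links, fec) %>%
  # tidyr >= 1.0 expects the list-columns to be named explicitly
  unnest(cols = c(post_author_id, post, fec))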
master_data$thread_starters
threads
master_data$post
titles <- master_data$threads
thread_starters <- master_data$thread_starters
#views <- master_data$views
post_author <- master_data$post_author_id
post <- master_data$post
fech <- master_data$fec
employ.data <- data.frame(titles, thread_starters, post_author, post, fech)
write.xlsx(employ.data, "C:/2.xlsx")
I can't figure out how to include the second page as well.
Solution
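From a quick look at your code and the site, there is a td with the class vbmenu_control that contains the page count (in your case, "Page 2 of 2").
You can extract the total with a simple regex, for example
a = "page 2 of 2"
b = as.numeric(gsub("page [0-9]+ of ","",a))
> b
[1] 2
and then add a condition if b > 1. If it is TRUE, you can loop over the links ...-talking-yourself-i.html, where i takes the values from 1 to the number of pages.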
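The idea above can be sketched roughly as follows. This is only a minimal sketch, not a tested scraper for this site: the helper names parse_page_count, thread_page_urls and scrape_all_pages are hypothetical, and the URL scheme (page i > 1 lives at the thread link with "-i" inserted before ".html") is an assumption taken from the link pattern mentioned above. The first page is fetched from the original thread link.

```r
# Extract the total page count from text such as "Page 2 of 2".
# Returns 1 when no such text is present (assumed single-page thread).
parse_page_count <- function(text) {
  pattern <- "[Pp]age[[:space:]]+[0-9]+[[:space:]]+of[[:space:]]+([0-9]+)"
  hits <- regmatches(text, regexec(pattern, text))
  totals <- vapply(
    hits,
    function(h) if (length(h) == 2) as.integer(h[2]) else NA_integer_,
    integer(1)
  )
  totals <- totals[!is.na(totals)]
  if (length(totals) == 0) 1L else max(totals)
}

# Build the URL of every page of a thread.
# ASSUMPTION: page i > 1 is "<thread link without .html>-i.html".
thread_page_urls <- function(link, n_pages) {
  if (n_pages <= 1) return(link)
  base <- sub("\\.html$", "", link)
  c(link, paste0(base, "-", seq(2, n_pages), ".html"))
}

# Apply one of the scrape_* functions to every page of a thread and
# concatenate the results. Requires the rvest package and network access.
scrape_all_pages <- function(link, scrape_fun) {
  first <- rvest::read_html(link)
  # ASSUMPTION: the page counter sits in a td with class vbmenu_control
  counter <- rvest::html_text(rvest::html_nodes(first, "td.vbmenu_control"))
  urls <- thread_page_urls(link, parse_page_count(counter))
  unlist(lapply(urls, scrape_fun))
}
```

With these helpers, something like post = map(thread_links, scrape_all_pages, scrape_fun = scrape_posts) inside mutate() should collect the posts from every page of each thread, and the same call with scrape_dates or scrape_author_ids should do so for the other columns. Each page is downloaded again for every column, so caching the pages first may be worthwhile.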