Scraping blog posts in R with the rvest package

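Problem description

For a university project, I want to scrape posts from the Instagram blog (https://about.instagram.com/blog/announcements/break-down-how-instagram-search-works). Getting the title, date, and author of a post works fine, but when I try to get the actual article text, it returns nothing. Does anyone know what the problem might be?

Here is my code: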

library (rvest)
library (stringr)
library (tidyverse)

### set variable to save url ###
url <- 'https://about.instagram.com/blog/announcements/break-down-how-instagram-search-works'

### scrape title of blog entry ###
titles <- read_html(url) %>% 
  html_nodes('h1') %>%
  html_text()

### scrape author and date into a vector ###
author_date <- read_html(url) %>% 
  html_nodes ('._8hlt') %>%
  html_nodes ('._8hlu') %>%
  html_text()

### separate author and date from vector into single character variables ###
author <- author_date [1]  
date <- author_date [2]

### scrape article text. does not work unfortunately. any idea why? ###
text <- read_html(url) %>%
  html_nodes ("._8ig0 _8g86") %>%
  html_nodes ("._8g86 _9g5w _8iq8 _8ipi") %>%
  html_text()

Tags: r, rvest

Solution


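Use this code to get the text of every paragraph on the page. The XPath expression selects each p element that contains at least one text node.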

text2 <- read_html(url) %>%
  html_nodes (xpath="//p[.//text()]") %>%
  html_text()
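
As for why the original attempt probably returned nothing: in a CSS selector, a space is the descendant combinator, and a name without a leading dot is read as a tag name. So "._8ig0 _8g86" looks for a <_8g86> element inside an element with class _8ig0, which does not exist. To match one element that carries several classes, chain them with dots and no spaces. The sketch below illustrates this on a small hand-written HTML snippet; the class names are taken from the question, but the nesting is an assumption and not the real page structure. Whether the class-based selector works on the live page also depends on those generated class names still being current.

```r
library(rvest)

# Hypothetical stand-in for the blog markup. The class names come from the
# question; the structure itself is assumed for illustration only.
page <- read_html('
  <div class="_8ig0 _8g86">
    <div class="_8g86 _9g5w _8iq8 _8ipi">
      <p>First paragraph.</p>
      <p>Second paragraph.</p>
    </div>
  </div>')

# Selector as written in the question: the space turns "_8g86" into a
# descendant tag name, so no node matches.
broken <- page %>% html_nodes("._8ig0 _8g86")
length(broken)

# Chained classes (a dot before each, no spaces) match a single element
# that has all of the listed classes.
fixed <- page %>%
  html_nodes("._8ig0._8g86") %>%
  html_nodes("._8g86._9g5w._8iq8._8ipi") %>%
  html_nodes("p") %>%
  html_text()
fixed
```

Compared with the XPath answer above, a class-based selector limits the result to paragraphs inside the article container, while the XPath version returns every non-empty paragraph on the page.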
