Web scraping with rvest loses text because of quotation marks in the content

Question

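I am trying to use the web-scraping package rvest to pull species descriptions from the eBird website. My problem is that the description text comes back truncated, and I think the quotation marks inside the content are the cause. Looking at the page source for the tag I am after, I see: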

<meta name="description" content="Small flycatcher with a big, peaked head and relatively long bill. Extremely similar to several other species, especially Alder and Willow Flycatchers. Greenish-olive above and pale whitish below. Thin white eyering. Dark wings with distinct white wingbars. Very long wingtips. Best distinguished from other flycatchers by habitat and voice. Birds near the northern end of range prefer shaded ravines with mix of hemlocks and deciduous trees; farther south, found in mature deciduous forests. Tends to stay high in the canopy. Song is an explosive "pwit-SIP;" call note is a sharp "pweek." ">

Here is the code I am using:

library(rvest)
library(dplyr)

url <- "https://ebird.org/species/acafly"

# Get the name attribute of every meta tag
metatags <- read_html(url) %>% 
  html_nodes('meta') %>% 
  html_attr('name')

# Get which row has the description
rownum <- which(metatags == "description")

# Get content from meta tags
content <- read_html(url) %>% 
  html_nodes('meta') %>% 
  html_attr('content') 

# Get description content
description <- content[rownum]

The description that this code extracts is:

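"Small flycatcher with a big, peaked head and relatively long bill. Extremely similar to several other species, especially Alder and Willow Flycatchers. Greenish-olive above and pale whitish below. Thin white eyering. Dark wings with distinct white wingbars. Very long wingtips. Best distinguished from other flycatchers by habitat and voice. Birds near the northern end of range prefer shaded ravines with mix of hemlocks and deciduous trees; farther south, found in mature deciduous forests. Tends to stay high in the canopy. Song is an explosive "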

However, what I actually want is:

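"Small flycatcher with a big, peaked head and relatively long bill. Extremely similar to several other species, especially Alder and Willow Flycatchers. Greenish-olive above and pale whitish below. Thin white eyering. Dark wings with distinct white wingbars. Very long wingtips. Best distinguished from other flycatchers by habitat and voice. Birds near the northern end of range prefer shaded ravines with mix of hemlocks and deciduous trees; farther south, found in mature deciduous forests. Tends to stay high in the canopy. Song is an explosive "pwit-SIP;" call note is a sharp "pweek." "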

How can I get the full description, including the quotation marks?

Tags: r, web-scraping, rvest

Answer


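The meta tag's content attribute contains unescaped double quotes, so the HTML parser ends the attribute value at the first inner quote. The same text appears in full in the first p tag with the class u-stack-sm, so you can read the complete description, quotation marks included, from there: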

library(rvest)
library(dplyr)

url <- "https://ebird.org/species/acafly"

# Get description content
description <- read_html(url) %>% 
  html_nodes('p.u-stack-sm') %>% 
  html_text() %>% 
  .[[1]]
description
#> [1] "Small flycatcher with a big, peaked head and relatively long bill. Extremely similar to several other species, especially Alder and Willow Flycatchers. Greenish-olive above and pale whitish below. Thin white eyering. Dark wings with distinct white wingbars. Very long wingtips. Best distinguished from other flycatchers by habitat and voice. Birds near the northern end of range prefer shaded ravines with mix of hemlocks and deciduous trees; farther south, found in mature deciduous forests. Tends to stay high in the canopy. Song is an explosive \"pwit-SIP;\" call note is a sharp \"pweek.\"\n\r\n"

url <- "https://ebird.org/species/siltea1/"

description <- read_html(url) %>% 
  html_nodes('p.u-stack-sm') %>% 
  html_text() %>% 
  .[[1]]
description
#> [1] "Distinctive, but rather local and uncommon in Chile (more common in Argentina and elsewhere) in grassy wetlands, reedy marshes, and on lakes. Associates with other waterfowl, but usually is not out on open water and is easily overlooked. Readily identified by small size, dark cap, pale cheeks, and blue bill with yellow patch at base. Range does not overlap with the larger and more boldly patterned Puna Teal.\n\r\n\r\n\r\n"
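
Both strings printed above end with leftover newline and carriage-return characters. As a minimal sketch, base R's trimws() can strip them. The raw string below is a shortened copy of the first result, used only for illustration and not fetched from eBird.

```r
# Shortened copy of the first result above, including its trailing "\n\r\n"
# (illustrative input, not fetched from the site)
raw <- "Song is an explosive \"pwit-SIP;\" call note is a sharp \"pweek.\"\n\r\n"

# trimws() removes leading and trailing spaces, tabs, newlines and carriage returns
clean <- trimws(raw)

# cat() prints the string without escaping the inner quotation marks
cat(clean, "\n")
```

Passing trim = TRUE to html_text() should have a similar effect at extraction time.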

Created on 2020-10-11 by the reprex package (v0.3.0)
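
To see why the meta tag approach from the question loses text, the sketch below parses a small hand-written page rather than the real eBird HTML. The markup is an assumption that mimics the situation in the question: the same sentence sits once in a content attribute with unescaped double quotes and once in a p element. The exact recovery behaviour depends on the underlying HTML parser, so treat this as an illustration and not as a guarantee.

```r
library(rvest)

# Hypothetical miniature page (not the real eBird source): the content
# attribute contains unescaped double quotes, as in the question
html <- paste0(
  '<html><head>',
  '<meta name="description" content="Song is an explosive "pwit-SIP;" call note">',
  '</head><body>',
  '<p class="u-stack-sm">Song is an explosive "pwit-SIP;" call note</p>',
  '</body></html>'
)

page <- read_html(html)

# Attribute value: the parser closes it at the first inner double quote
from_attr <- page %>%
  html_nodes('meta[name="description"]') %>%
  html_attr('content')

# Element text: double quotes are ordinary characters here
from_text <- page %>%
  html_nodes('p.u-stack-sm') %>%
  html_text()

from_attr
from_text
```

The attribute selector meta[name="description"] also picks the description tag directly, without matching row numbers between two separate vectors.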

