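r - Web scraping with rvest loses text because of quotes in the content
Question
I am trying to use the web scraping package rvest to get species descriptions from the eBird website. My problem is that the description text is cut short, I think because of the quotation marks inside the content. Looking at the page source for the tag I want, I see: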
<meta name="description" content="Small flycatcher with a big, peaked head and relatively long bill. Extremely similar to several other species, especially Alder and Willow Flycatchers. Greenish-olive above and pale whitish below. Thin white eyering. Dark wings with distinct white wingbars. Very long wingtips. Best distinguished from other flycatchers by habitat and voice. Birds near the northern end of range prefer shaded ravines with mix of hemlocks and deciduous trees; farther south, found in mature deciduous forests. Tends to stay high in the canopy. Song is an explosive "pwit-SIP;" call note is a sharp "pweek." ">
library(rvest)
library(dplyr)
url <- "https://ebird.org/species/acafly"
# Get list of metatag tags
metatags <- read_html(url) %>%
  html_nodes('meta') %>%
  html_attr('name')
# Get which row has the description
rownum <- which(metatags == "description")
# Get content from meta tags
content <- read_html(url) %>%
  html_nodes('meta') %>%
  html_attr('content')
# Get description content
description <- content[rownum]
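The description I extract with the code above gives me:
"Small flycatcher with a big, peaked head and relatively long bill. Extremely similar to several other species, especially Alder and Willow Flycatchers. Greenish-olive above and pale whitish below. Thin white eyering. Dark wings with distinct white wingbars. Very long wingtips. Best distinguished from other flycatchers by habitat and voice. Birds near the northern end of range prefer shaded ravines with mix of hemlocks and deciduous trees; farther south, found in mature deciduous forests. Tends to stay high in the canopy. Song is an explosive "
However, what I really want is:
"Small flycatcher with a big, peaked head and relatively long bill. Extremely similar to several other species, especially Alder and Willow Flycatchers. Greenish-olive above and pale whitish below. Thin white eyering. Dark wings with distinct white wingbars. Very long wingtips. Best distinguished from other flycatchers by habitat and voice. Birds near the northern end of range prefer shaded ravines with mix of hemlocks and deciduous trees; farther south, found in mature deciduous forests. Tends to stay high in the canopy. Song is an explosive "pwit-SIP;" call note is a sharp "pweek." "
How can I get the full description, including the quoted parts?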
Answer
You can get the full description, including the quotes, from the first p tag with the class u-stack-sm:
library(rvest)
library(dplyr)
url <- "https://ebird.org/species/acafly"
# Get description content
description <- read_html(url) %>%
  html_nodes('p.u-stack-sm') %>%
  html_text() %>%
  .[[1]]
description
#> [1] "Small flycatcher with a big, peaked head and relatively long bill. Extremely similar to several other species, especially Alder and Willow Flycatchers. Greenish-olive above and pale whitish below. Thin white eyering. Dark wings with distinct white wingbars. Very long wingtips. Best distinguished from other flycatchers by habitat and voice. Birds near the northern end of range prefer shaded ravines with mix of hemlocks and deciduous trees; farther south, found in mature deciduous forests. Tends to stay high in the canopy. Song is an explosive \"pwit-SIP;\" call note is a sharp \"pweek.\"\n\r\n"
url <- "https://ebird.org/species/siltea1/"
description <- read_html(url) %>%
  html_nodes('p.u-stack-sm') %>%
  html_text() %>%
  .[[1]]
description
#> [1] "Distinctive, but rather local and uncommon in Chile (more common in Argentina and elsewhere) in grassy wetlands, reedy marshes, and on lakes. Associates with other waterfowl, but usually is not out on open water and is easily overlooked. Readily identified by small size, dark cap, pale cheeks, and blue bill with yellow patch at base. Range does not overlap with the larger and more boldly patterned Puna Teal.\n\r\n\r\n\r\n"
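Both results above end in trailing line breaks ("\n\r\n"). If those are unwanted, base R's trimws() should be enough to strip them. The following is a minimal sketch that uses a shortened stand-in string (an assumption, not the full scraped text) in place of the value returned by html_text():

```r
# Shortened stand-in for the scraped text (assumption: the real value
# comes from html_text() as in the answer above).
description <- "Song is an explosive \"pwit-SIP;\" call note is a sharp \"pweek.\"\n\r\n"

# trimws() removes leading and trailing spaces, tabs, newlines and
# carriage returns; the quotes inside the text are left untouched.
clean <- trimws(description)
clean
```

The same call can be appended to the pipeline after .[[1]] if you prefer to keep everything in one chain.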
Created on 2020-10-11 by the reprex package (v0.3.0)
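As a rough illustration of why the meta tag approach cuts the text short while the p tag does not: inside an attribute value delimited by double quotes, an unescaped double quote ends the value, whereas quotes in element text are ordinary characters. The HTML string below is a made-up, shortened stand-in for the eBird page (an assumption), and how the leftover text is recovered depends on the HTML parser used under rvest, so treat this as a sketch rather than a guarantee:

```r
library(rvest)

# Hypothetical, shortened stand-in for the page: the meta tag's content
# attribute contains unescaped double quotes, as in the question's source.
html <- paste0(
  '<html><head>',
  '<meta name="description" content="Song is an explosive "pwit-SIP;" call note">',
  '</head><body>',
  '<p class="u-stack-sm">Song is an explosive "pwit-SIP;" call note</p>',
  '</body></html>'
)

page <- read_html(html)

# Attribute value: expected to stop at the first unescaped inner quote.
meta_text <- page %>%
  html_node('meta[name="description"]') %>%
  html_attr("content")

# Element text: quotes are plain characters here, so the text is kept whole.
p_text <- page %>%
  html_node("p.u-stack-sm") %>%
  html_text()

meta_text
p_text
```

The attribute selector meta[name="description"] is also a shorter way to pick out the tag than looking up its row number first, although it does not avoid the truncation.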