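r - How do I scrape selected list items from a web page?
Question
I am trying to scrape Marvel films together with their characters (featured, supporting, antagonists, other) from marvel.wikia.com. The characters sit in lists in the DOM, and I cannot get html_nodes() right so that it returns all the list items under each character type.
The code below extracts every link inside a list item, whereas I only want the ones that belong to Featured, Supporting, Antagonists, and Other Characters (the last group does not apply to X2).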
library(rvest)
library(tidyverse)
test_url <- "http://marvel.wikia.com/wiki/X2_(film)"
read_html(test_url) %>%
html_nodes("li > a") %>%
html_text()
Desired result:
# A tibble: 16 x 3
movie type character
<chr> <chr> <chr>
1 X2 Featured Characters Professor Charles Xavier
2 X2 Featured Characters Wolverine (Logan)
3 X2 Featured Characters Storm (Ororo Munroe)
4 X2 Featured Characters Dr. Jean Grey
5 X2 Featured Characters Cyclops (Scott Summers)
6 X2 Featured Characters Rogue (Marie)
7 X2 Featured Characters Iceman (Bobby Drake)
8 X2 Supporting Characters Nightcrawler (Kurt Wagner)
9 X2 Supporting Characters Pyro (John Allerdyce)
10 X2 Supporting Characters Mystique (Raven Darkholme)
11 X2 Supporting Characters Magneto (Erik Lehnsherr)
12 X2 Antagonists Col. William Stryker
13 X2 Antagonists Sgt. Lyman
14 X2 Antagonists Unnamed Soldiers
15 X2 Antagonists Deathstrike (Yuriko Oyama)
16 X2 Antagonists Mutant 143 (Jason Stryker)
Answer
You could start with something like this:
library(rvest)
library(tidyverse)

test_url <- "http://marvel.wikia.com/wiki/X2_(film)"

# Scrape the text of every top-level list in the article body
url_data <- read_html(test_url) %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/ul') %>%
  html_text()

# Reshape the scraped text into one row per character.
# Assumes the first four lists are the character lists, in this order.
df <- data.frame(movie = gsub(".*/", "", test_url),
                 type = c("Featured Characters", "Supporting Characters", "Antagonists", "Other Characters"),
                 characters = url_data[1:4],
                 stringsAsFactors = FALSE) %>%
  separate_rows(characters, sep = "\\n")
This gives:
> head(df)
movie type characters
1 X2_(film) Featured Characters X-Men
2 X2_(film) Featured Characters Professor Charles Xavier
3 X2_(film) Featured Characters Wolverine (Logan)
4 X2_(film) Featured Characters Storm (Ororo Munroe)
5 X2_(film) Featured Characters Dr. Jean Grey (Apparent death)
6 X2_(film) Featured Characters Cyclops (Scott Summers)
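The output above still contains group headings such as X-Men and notes such as (Apparent death), because html_text() flattens each list as a whole. One way to get closer to the desired result is to select the list that follows each character-type label and keep only the first link of each leaf item. The following is a minimal sketch, not a tested scraper for the live page: the inline HTML is a hypothetical stand-in for the wiki markup (a bold label in a <p> followed by a <ul>), and get_characters() is a helper name introduced here for illustration.

```r
library(rvest)
library(tidyverse)

test_url <- "http://marvel.wikia.com/wiki/X2_(film)"

# Hypothetical markup (an assumption, not copied from the real page):
# each character type is a bold label in a <p>, followed by its <ul>.
html <- '
<div id="mw-content-text">
  <p><b>Featured Characters:</b></p>
  <ul>
    <li><a href="/wiki/X-Men">X-Men</a>
      <ul>
        <li><a href="/wiki/Charles_Xavier">Professor Charles Xavier</a></li>
        <li><a href="/wiki/Logan">Wolverine (Logan)</a></li>
      </ul>
    </li>
  </ul>
  <p><b>Antagonists:</b></p>
  <ul>
    <li><a href="/wiki/William_Stryker">Col. William Stryker</a></li>
  </ul>
  <ul><li><a href="/wiki/Other">Unrelated link</a></li></ul>
</div>'

page <- read_html(html)  # for the live page this would be read_html(test_url)

# "X2_(film)" -> "X2"
movie <- sub("_\\(film\\)$", "", basename(test_url))

types <- c("Featured Characters", "Supporting Characters",
           "Antagonists", "Other Characters")

# Hypothetical helper: first link of every leaf <li> in the first <ul>
# that follows the label paragraph for the given type.
get_characters <- function(page, type) {
  xp <- sprintf(
    "//p[b[starts-with(normalize-space(.), '%s')]]/following-sibling::ul[1]//li[not(ul)]/a[1]",
    type
  )
  html_text(html_nodes(page, xpath = xp))
}

# One block of rows per type; types missing from the page contribute zero rows.
result <- map_dfr(types, function(type) {
  chars <- get_characters(page, type)
  tibble(movie     = rep(movie, length(chars)),
         type      = rep(type, length(chars)),
         character = chars)
})

print(result)
```

The li[not(ul)] step drops items that only act as group headings, and a[1] ignores any trailing notes inside an item. On the real page the XPath would probably need adjusting to whichever element actually carries the labels, so inspect the DOM before relying on it.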