How do I scrape selected list items from a web page?

Problem description

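I am trying to scrape the Marvel movies on marvel.wikia.com together with their characters (featured, supporting, antagonists, other). The characters sit in lists in the DOM, and I cannot get html_nodes() right so that it returns all list items under each character type.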

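The code below extracts every linked list item, whereas I only want those that belong to the featured, supporting, antagonist, and other character lists (the last one does not apply to X2).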

library(rvest)
library(tidyverse)

test_url <- "http://marvel.wikia.com/wiki/X2_(film)"

read_html(test_url) %>%
  html_nodes("li > a") %>%
  html_text() 

Desired result:

# A tibble: 16 x 3
   movie type                  character                  
   <chr> <chr>                 <chr>                      
 1 X2    Featured Characters   Professor Charles Xavier   
 2 X2    Featured Characters   Wolverine (Logan)          
 3 X2    Featured Characters   Storm (Ororo Munroe)       
 4 X2    Featured Characters   Dr. Jean Grey              
 5 X2    Featured Characters   Cyclops (Scott Summers)    
 6 X2    Featured Characters   Rogue (Marie)              
 7 X2    Featured Characters   Iceman (Bobby Drake)       
 8 X2    Supporting Characters Nightcrawler (Kurt Wagner) 
 9 X2    Supporting Characters Pyro (John Allerdyce)      
10 X2    Supporting Characters Mystique (Raven Darkholme) 
11 X2    Supporting Characters Magneto (Erik Lehnsherr)   
12 X2    Antagonists           Col. William Stryker       
13 X2    Antagonists           Sgt. Lyman                 
14 X2    Antagonists           Unnamed Soldiers           
15 X2    Antagonists           Deathstrike (Yuriko Oyama) 
16 X2    Antagonists           Mutant 143 (Jason Stryker)

Tags: r, web-scraping, rvest

Solution


You can start with something like this:

library(rvest)
library(tidyverse)

test_url <- "http://marvel.wikia.com/wiki/X2_(film)"

# scrape the text of each top-level <ul> inside the content div
url_data <- read_html(test_url) %>%
  html_nodes(xpath = '//*[@id="mw-content-text"]/ul') %>%
  html_text()

# format the scraped data: one row per line of text in each list
df <- data.frame(movie = gsub(".*/", "", test_url),
                 type = c("Featured Characters", "Supporting Characters", "Antagonists", "Other Characters"),
                 characters = url_data[1:4]) %>%
  separate_rows(characters, sep = "\\n")

This gives:

> head(df)
      movie                type                         characters
1 X2_(film) Featured Characters                             X-Men 
2 X2_(film) Featured Characters          Professor Charles Xavier 
3 X2_(film) Featured Characters                 Wolverine (Logan) 
4 X2_(film) Featured Characters              Storm (Ororo Munroe) 
5 X2_(film) Featured Characters   Dr. Jean Grey   (Apparent death)
6 X2_(film) Featured Characters           Cyclops (Scott Summers) 
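
The printed rows still carry trailing spaces, the "_(film)" suffix in the movie name, and annotations such as "(Apparent death)". Below is a minimal cleanup sketch, assuming a data frame shaped like df above. The `raw` tibble is a hand-typed stand-in rather than freshly scraped data, and the annotation pattern is an assumption based only on the row shown.

```r
library(tidyverse)

# Hypothetical stand-in for the scraped `df` (values copied from the
# printed output above, plus one empty row to show the filter).
raw <- tibble(
  movie = "X2_(film)",
  type = "Featured Characters",
  characters = c("X-Men ", "Professor Charles Xavier ", "",
                 "Dr. Jean Grey   (Apparent death)")
)

clean <- raw %>%
  mutate(
    # drop the "_(film)" suffix left over from the URL
    movie = str_remove(movie, "_\\(film\\)$"),
    # remove the one annotation seen in the output (assumed pattern),
    # then collapse repeated spaces and trim both ends
    characters = characters %>%
      str_remove("\\s*\\(Apparent death\\)$") %>%
      str_squish()
  ) %>%
  # separate_rows() can leave empty strings behind; drop them
  filter(characters != "")

print(clean)
```

Group headers such as "X-Men" cannot be told apart from character names by text alone, so they would still need to be filtered by some other rule.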

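The type labels in the answer are typed by hand and matched to the lists by position. If each list on the page is preceded by a label element, the label can be read from the DOM instead. The sketch below runs on an inline HTML fragment whose markup is hypothetical (a `<p>` label before each `<ul>`, one leading link per item); the real page's structure may differ, so the XPath expressions would need adjusting.

```r
library(rvest)
library(tidyverse)

# Hypothetical fragment; NOT copied from the real page.
page <- read_html('
<div id="mw-content-text">
  <p><b>Featured Characters:</b></p>
  <ul>
    <li><a href="/wiki/A">Professor Charles Xavier</a></li>
    <li><a href="/wiki/B">Wolverine (Logan)</a></li>
  </ul>
  <p><b>Antagonists:</b></p>
  <ul>
    <li><a href="/wiki/C">Col. William Stryker</a> (Apparent death)</li>
  </ul>
</div>')

# same selector as in the answer: top-level lists in the content div
lists <- html_nodes(page, xpath = '//*[@id="mw-content-text"]/ul')

result <- map_dfr(lists, function(ul) {
  # label = nearest preceding sibling <p> (assumed layout)
  label <- ul %>%
    html_node(xpath = "preceding-sibling::p[1]") %>%
    html_text() %>%
    str_squish() %>%
    str_remove(":$")
  tibble(
    type = label,
    # first link of each item only, so trailing annotations are skipped
    character = ul %>% html_nodes(xpath = "./li/a[1]") %>% html_text()
  )
})

print(result)
```

Taking only the link text keeps annotations like "(Apparent death)" out of the result, at the cost of skipping any list item that has no link.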