首页 > 解决方案 > R Scraping IMDB:处理缺失信息的更好方法?

问题描述

我正在关注此网站以从 IMDB 获取信息:https ://www.analyticsvidhya.com/blog/2017/03/beginners-guide-on-web-scraping-in-r-using-rvest-with-hands-on -知识/

但是,IMDB 中缺少一些数据。该网站建议进行目视检查并编写如下函数:

for (i in c(39,73,80,89)){

a<-metascore_data[1:(i-1)]

b<-metascore_data[i:length(metascore_data)]

metascore_data<-append(a,list("NA"))

metascore_data<-append(metascore_data,b)

}

我想知道是否有更好的方法来以编程方式处理这个问题?

标签: rweb-scrapingmissing-datarvestimdb

解决方案


以下对我有用:

library(rvest)
URL <- 'https://www.imdb.com/search/title/?title_type=feature&online_availability=US/IMDbTV&start=1251&ref_=adv_nxt'
webpage <- read_html(URL)
genres <- webpage %>%
  html_nodes('span.genre') %>%
  html_text() %>%
  trimws()

这将返回 50 个值:

genres
# [1] "Comedy, Romance"              "Action, Crime, Drama"        
# [3] "Action, Horror, Sci-Fi"       "Action, Adventure, Thriller" 
# [5] "Adventure, Comedy, Family"    "Comedy"                      
# [7] "Action, Adventure, Thriller"  "Comedy, Drama, Romance"      
# [9] "Comedy"                       "Comedy"                      
#[11] "Action, Adventure, Drama"     "Action, Thriller"            
#[13] "Action, Crime, Thriller"      "Mystery, Thriller"           
#[15] "Crime, Drama, Thriller"       "Drama, Horror"               
#[17] "Animation, Drama, War"        "Drama, Thriller"             
#[19] "Action, Crime, Drama"         "Drama, Sci-Fi"               
#[21] "Adventure, Comedy, Family"    "Crime, Drama"                
#[23] "Action, Adventure, Thriller"  "Action, Adventure, Sci-Fi"   
#[25] "Thriller"                     "Comedy, Crime"               
#[27] "Comedy, Romance"              "Action, Biography, Drama"    
#[29] "Adventure, Comedy"            "Crime, Drama, Thriller"      
#[31] "Drama, Sci-Fi, Thriller"      "Comedy, Romance"             
#[33] "Action, Drama, Thriller"      "Action, Adventure, Sci-Fi"   
#[35] "Action, Crime, Drama"         "Action, Adventure, Drama"    
#[37] "Action, Thriller"             "Action, Drama, War"          
#[39] "Drama, Sci-Fi, Thriller"      "Animation, Adventure, Family"
#[41] "Drama, Romance"               "Action, Drama, Fantasy"      
#[43] "Action, Adventure, Fantasy"   "Comedy, Crime, Drama"        
#[45] "Action, Crime, Drama"         "Action, Adventure, Sci-Fi"   
#[47] "Drama, Romance"               "Animation, Family, Fantasy"  
#[49] "Action, Adventure, Fantasy"   "Mystery, Thriller"           

推荐阅读