How do I change the URL partway through a function in rvest?

Problem description

I wrote this function to scrape a lyrics website using rvest:

library(rvest)
library(tidyverse)
library(magrittr)
library(scales)
library(lubridate)

songscrape <- function(x) {
  url <- paste0("https://www.azlyrics.com/", substring(x, 1, 1),"/",x, ".html")
  artist <- x
  
  SongsListScrapper <- function(x) { 
    page <- x
    songs <- page %>% 
      read_html() %>% 
      html_nodes(xpath = "/html/body/div[2]/div/div[2]/div[4]/div/a") %>% 
      html_text() %>% 
      as.data.frame()
    
    
    chart <- cbind(songs)
    names(chart) <- c("Songs")
    chart <- as_tibble(chart)
    return(chart)
  }
  
  SongsList <- map_df(url, SongsListScrapper)
  SongsList
  
  SongsList %<>%
    mutate(
      Songs = as.character(Songs) 
      ,Songs = gsub("[[:punct:]]", "", Songs) 
      ,Songs = tolower(Songs) 
      ,Songs = gsub(" ", "", Songs) 
    )
  
  SongsList$Songs
  
  #Scrape Lyrics 
  
  wipe_html <- function(str_html) { 
    gsub("<.*?>", "", str_html)
  }
  
  lyrics2 <- c()
  albums2 <- c()
  number <- 1
  
  for(i in seq_along(SongsList$Songs)) { 
    for_url_name <- SongsList$Songs[i]
    
    
    #clean name
    for_url_name <- tolower(gsub("[[:punct:]]\\s", "", for_url_name))
    #create url
    paste_url <- paste0("https://www.azlyrics.com/lyrics/", artist,"/", for_url_name, ".html")
    
    #open connection to url 
    for_html_code <-read_html(paste_url)
    for_lyrics <- html_node(for_html_code, xpath = "/html/body/div[2]/div/div[2]/div[5]")
    for_albums <- html_node(for_html_code, xpath = "/html/body/div[2]/div/div[2]/div[11]/div[1]/b")
    for_lyrics <- wipe_html(for_lyrics)
    for_albums <- wipe_html(for_albums)
    lyrics2[number] <- for_lyrics
    albums2[number] <- for_albums
    
    number <- number +1
    
    show(paste0(for_url_name, " scrape complete!"))
    
    Sys.sleep(10)
  }
  
  songs2 <- cbind(lyrics2, albums2) %>% as.data.frame()
  songs2$albums2 <- gsub("[[:punct:]]", "", songs2$albums2)
  
  return(songs2)
}

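The function takes an artist name as input, inserts it into the url variable, and then scrapes the corresponding pages.

However, I ran into a problem scraping the artist ironwine. Their usual link structure is https://www.azlyrics.com/lyrics/ironwine/songname.html, but as you can see on their song list page, after the album 'Years to Burn' the link structure changes to https://www.azlyrics.com/lyrics/calexico/songname.html

The only thing that changes is the artist name, which is the input to the function.

This happens for 6 songs, after which it goes back to ironwine.

How can I accommodate this change so that the function doesn't stop partway through?

The XPath for that div is /html/body/div[2]/div/div[2]/div[4]/div[180], so I tried inserting this: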

#number is updated at the end of the loop, so number = 180 should be this song. 
if (number >= 180 && number <= 186) {
  artist <- "calexico"
}

In theory this should change the URL. However, that doesn't happen. Can someone tell me how to fix this?

Tags: r, web-scraping, rvest

Solution
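
One way to avoid hard-coding the artist is to take each song's link from the list page itself rather than rebuilding it from the artist name. The sketch below is a minimal illustration, not a tested fix against the live site: `song_links` is a hypothetical helper, the inline HTML is a made-up stand-in for the song list page, and the XPath assumes that song links contain `/lyrics/` in their `href`.

```r
library(rvest)

# Made-up stand-in for the artist's song list page (the real markup is an assumption).
listing_html <- '<html><body><div>
  <div><a href="/lyrics/ironwine/songone.html">Song One</a></div>
  <div><a href="/lyrics/calexico/songtwo.html">Song Two</a></div>
  <div><a href="/lyrics/ironwine/songthree.html">Song Three</a></div>
</div></body></html>'

# Hypothetical helper: collect each song title together with the absolute URL
# taken from the link's href, so the artist directory is never guessed.
song_links <- function(page, page_url) {
  anchors <- html_nodes(page, xpath = "//a[contains(@href, '/lyrics/')]")
  data.frame(
    Songs = html_text(anchors),
    url = xml2::url_absolute(html_attr(anchors, "href"), page_url),
    stringsAsFactors = FALSE
  )
}

links <- song_links(
  read_html(listing_html),
  page_url = "https://www.azlyrics.com/i/ironwine.html"
)
print(links)
```

Looping over `links$url` in place of `paste_url` would then follow whichever artist directory each link points to, so the check on `number` becomes unnecessary.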

