rvest jump_to with special characters

Problem description

I'm having trouble getting rvest's jump_to to follow a URL that contains special characters. The link works when I type it into Chrome, but in R / rvest I get an error:

Error in curl::curl_fetch_memory(url, handle = handle) :
  Could not resolve host: NA

URLs that fail:

http://incrediblewinestore.com/ProductDetail.asp?title=<i>-You-Had-Me-At-Merlot-<i>-Napkins&UPCCode=876718049392

http://incrediblewinestore.com/ProductDetail.asp?title=10-BARREL-RASPBERRY-CRUSH-6PK&UPCCode=`851538002611

http://incrediblewinestore.com/ProductDetail.asp?title=14-HANDS-CABERNET-SAUVIGNON&UPCCode= \088586001895

A URL that works:

http://incrediblewinestore.com/ProductDetail.asp?title=Cuarenta-y-Tres-Liqueur-43&UPCCode=029929115411

The code I tried:

library(stringr)
library(rvest)
# Load first page, try to go to search, but expect age-check
iws_ac_url <- "http://incrediblewinestore.com"
iws_session <- html_session(iws_ac_url)

age_gate <- iws_session %>% 
  html_node("form[name='AgeGate']")

age_gate <- html_form(age_gate)

age_gate <- set_values(age_gate, PageAction = 'Yes21')

# Submit form and enter the rest of the site
iws_site <- submit_form(iws_session,age_gate)

# Links that fail
temp_link <- paste0("http://incrediblewinestore.com","/ProductDetail.asp?title=<i>-You-Had-Me-At-Merlot-<i>-Napkins&UPCCode=876718049392")
iws_site %>% jump_to(temp_link)

temp_link <- paste0("http://incrediblewinestore.com","/ProductDetail.asp?title=10-BARREL-RASPBERRY-CRUSH-6PK&UPCCode=`851538002611")
iws_site %>% jump_to(temp_link)

# Working link
temp_link <- paste0("http://incrediblewinestore.com","/ProductDetail.asp?title=Cuarenta-y-Tres-Liqueur-43&UPCCode=029929115411")
iws_site %>% jump_to(temp_link)

Tags: r, rvest

Solution


As usual, once I found the answer I was struck by how simple it is. All it takes is one function: URLencode(url, reserved = FALSE)

temp_link <- paste0("http://incrediblewinestore.com",URLencode("/ProductDetail.asp?title=10-BARREL-RASPBERRY-CRUSH-6PK&UPCCode=`851538002611",reserved = FALSE))

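The key was that I needed a function that leaves reserved characters such as =, ? and & unencoded, so the query string keeps its structure while the problem characters are percent-encoded. The other function I tried converted every character.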
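To make the difference concrete, here is a minimal sketch comparing the two modes of base R's URLencode on one of the failing paths. The variable names (path, encoded_path, fully_encoded, full_url) are only illustrative, and the sketch runs offline without contacting the site.

```r
# URLencode() and URLdecode() come from the utils package, which is
# attached by default in an interactive R session.
library(utils)

# One of the failing paths from the question; it contains a backtick.
path <- "/ProductDetail.asp?title=10-BARREL-RASPBERRY-CRUSH-6PK&UPCCode=`851538002611"

# reserved = FALSE leaves reserved characters (/, ?, =, &) untouched and
# percent-encodes only the characters that are not allowed in a URL.
encoded_path <- URLencode(path, reserved = FALSE)

# reserved = TRUE also encodes /, ?, = and &, which destroys the
# path and query-string structure.
fully_encoded <- URLencode(path, reserved = TRUE)

# Build the full link the same way the question does.
full_url <- paste0("http://incrediblewinestore.com", encoded_path)

print(encoded_path)
print(fully_encoded)
```

With reserved = FALSE the backtick becomes %60 and everything else is left alone, so full_url can then be passed to jump_to as in the original code.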

