What syntax error is causing this particular error message?

Question

I'm working in R using RStudio, and I have an R script that performs web scraping. When I run these particular lines, I get an error message:

      review<-ta1 %>%
              html_node("body") %>%
              xml_find_all("//div[contains@class,'location-review-review']")

The error message is as follows:

xmlXPathEval: evaluation failed
Error in `*tmp*` - review : non-numeric argument to binary operator
In addition: Warning message:
In xpath_search(x$node, x$doc, xpath = xpath, nsMap = ns, num_results = Inf) :
  Invalid predicate [1206]

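Note: I have the dplyr and rvest libraries loaded in my R script.

I looked at the answers to the following StackOverflow question: Non-numeric Argument to Binary Operator Error

I think my solution is related to the answer Richard Border gave to the question linked above. However, I'm having trouble working out how to correct my R syntax based on that answer.

Thanks for looking into my question.

Added a sample of ta1: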

{xml_document}
<html lang="en" xmlns:og="http://opengraphprotocol.org/schema/">
[1] <head>\n<meta http-equiv="content-type" content="text/html; charset=utf-8">\n<link rel="icon" id="favicon"  ...
[2] <body class="rebrand_2017 desktop_web Hotel_Review  js_logging" id="BODY_BLOCK_JQUERY_REFLOW" data-tab="TAB ...

Tags: r, web-scraping, dplyr, rvest

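Answer

I'm going to make a few assumptions here, because your post doesn't contain enough information to produce a reproducible example.

First, I'm guessing you're trying to scrape TripAdvisor, since the id and class fields match that site and your variable is called ta1.

Second, I'm assuming you're trying to get the text of each review and the number of stars for each review, since that is the relevant scrapable information in each of the classes you seem to be looking for.

I need to get my own version of the ta1 variable first, because it can't be reproduced from the sample you edited in.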

library(httr)
library(rvest)
library(xml2)
library(magrittr)
library(tibble)

"https://www.tripadvisor.co.uk/"                          %>% 
paste0("Hotel_Review-g186534-d192422-Reviews-")           %>%
paste0("Glasgow_Marriott_Hotel-Glasgow_Scotland.html") -> url

ta1 <- url %>% GET %>% read_html

Now write the correct XPath expressions for the data of interest:

# xpath for elements whose text contains reviews
xpath1 <- "//div[contains(@class, 'location-review-review-list-parts-Expand')]"

# xpath for the elements whose class indicate the ratings
xpath2 <- "//div[contains(@class, 'location-review-review-')]"
xpath3 <- "/span[contains(@class, 'ui_bubble_rating bubble_')]"
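
The difference from the XPath in the question is that contains() is an XPath function, so its arguments go inside parentheses: contains(@class, '...') rather than contains@class,'...'. As a minimal sketch of that syntax in isolation, the snippet below runs the corrected predicate against a small inline document. The HTML and its class names are made up for illustration and are not taken from the real page.

```r
library(xml2)

# Hypothetical stand-in for the scraped page (not real TripAdvisor markup)
doc <- read_html('<html><body>
  <div class="location-review-review-list-parts-Expand">Great stay</div>
  <div class="something-else">Not a review</div>
</body></html>')

# contains() takes its two arguments in parentheses, separated by a comma
good_xpath <- "//div[contains(@class, 'location-review-review')]"

matches <- xml_find_all(doc, good_xpath)
xml_text(matches)

# The form from the question, "//div[contains@class,'location-review-review']",
# has no parentheses around the arguments; that is the expression for which
# the question reports "Invalid predicate".
```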

We can get the text of the reviews like this:

ta1                                             %>% 
xml_find_all(xpath1)                            %>% # run first query
html_text()                                     %>% # extract text
extract(!equals(., "Read more")) -> reviews         # remove "blank" reviews
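
The last step of that pipeline uses the magrittr aliases extract() and equals(), which can be hard to read at first. A rough base R equivalent, shown here on a made-up character vector rather than the scraped text, might look like this:

```r
# Hypothetical scraped strings; "Read more" stands for the placeholder entries
texts <- c("Lovely hotel", "Read more", "Friendly staff", "Read more")

# Logical subsetting keeps every element that is not exactly "Read more"
kept <- texts[texts != "Read more"]
kept
```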

And the associated star ratings like this:

ta1 %>% 
xml_find_all(paste0(xpath2, xpath3)) %>% 
xml_attr("class")                    %>% 
strsplit("_")                        %>%
lapply(function(x) x[length(x)])     %>% 
as.numeric                           %>% 
divide_by(10)                         -> stars
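
To make the string handling in that pipeline easier to follow, here is a small sketch of the same idea in base R. It assumes the class attribute ends in a two-digit score such as bubble_40, as the xpath3 pattern suggests; the example values are hypothetical.

```r
# Hypothetical class attribute values of the shape matched by xpath3
classes <- c("ui_bubble_rating bubble_40", "ui_bubble_rating bubble_10")

# Split each string on underscores; the final piece is the score, e.g. "40"
parts <- strsplit(classes, "_")
last_piece <- vapply(parts, function(x) x[length(x)], character(1))

# Convert to numeric and scale so that "40" becomes 4 stars
star_values <- as.numeric(last_piece) / 10
star_values
```

Using vapply() instead of lapply() returns a character vector directly, which avoids relying on as.numeric() to flatten a list.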

Our result looks like this:

tibble(rating = stars, review = reviews)
## A tibble: 5 x 2
#  rating review                                                                                             
#   <dbl> <chr>                                                                                              
#1      1 7 of us attended the Christmas Party on Satu~
#2      4 "We stayed 2 nights over last weekend to att~
#3      3 Had a good stay, but had no provision to kee~
#4      3 Booked an overnight for a Christmas shopping~
#5      4 Attended a charity lunch here on Friday and ~
