r - Scraping JavaScript-rendered content with R
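Question
I'm trying to use R to catalog and track all types of coupons (title, image, description, expiration date, and the filters each coupon belongs to). I believe the page is rendered with JavaScript, so basic scraping tools don't work.
Is there a way to stay in R and do this? (I'm not proficient in other systems.)
I tried following the article below, but can't get it to work: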
https://datascienceplus.com/scraping-javascript-rendered-web-content-using-r/
Edit
library(rvest)
# read_html() needs the full URL, including the scheme
coupon <- read_html("https://www.kroger.com/cl/coupons/")
coupon <- coupon %>%
  html_nodes(".Text--bold") %>%
  html_text()
coupon
I also tried this:
# Load both required libraries
library(rvest)
library(V8)
# URL with JavaScript-rendered content to be scraped
link <- "https://www.kroger.com/cl/coupons/"
# Read the HTML page and extract the JavaScript code inside the list items
emailjs <- read_html(link) %>%
  html_nodes("li") %>%
  html_nodes("script") %>%
  html_text()
# Create a new V8 context
ct <- v8()
# Evaluate the JavaScript with document.write stripped out,
# then parse the resulting HTML and print it as text
read_html(ct$eval(gsub("document.write", "", emailjs))) %>% html_text()
Solution
Although the page uses JavaScript, the data it loads is sent as JSON. You can avoid dealing with JavaScript altogether by calling the hidden API directly:
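library(rvest)
library(jsonlite)
library(dplyr)  # provides glimpse()
# Hidden API that the coupons page calls behind the scenes
my_url <- "https://www.kroger.com/cl/api/coupons?couponsCountPerLoad=418&sortType=relevance&newCoupons=false"
pagesource <- read_html(my_url)
# read_html() wraps the raw JSON text in a <p> element, so pull the text back out
content <- pagesource %>% html_node("p") %>% html_text()
data <- fromJSON(content)
# The coupons come back as a data frame nested under data$coupons
mydata <- data$data$coupons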
> glimpse(mydata)
Observations: 418
Variables: 19
$ id <int> 2149194, 2149191, 2127870, 2129277, 2128587, 2126349, 2121480, 2128278, 2157633, 2169615, 2159613, 2140047, 2159769, 2167485, 2141526...
$ brandName <chr> "Other", "Other", "Store Brand", "Store Brand", "Store Brand", "Store Brand", "Sargento", "Hallmark", "Colgate", "Oscar Mayer", "Kett...
$ longDescription <chr> "Selling or purchasing fuel points is prohibited. Fuel redemption offer cannot be combined with any other discounts. No discounts to ...
$ shortDescription <chr> "Get 4x FUEL Points on FRI - SAT - SUN Only", "Get 4x FUEL Points on FRI - SAT - SUN Only", "2x Fuel Points", "Save $0.50 on 2 Kroger...
$ requirementDescription <chr> "when you buy a participating gift card. *Restrictions apply, see store for details.", "when you buy a $25, $50 or $100 Mastercard® o...
$ categories <list> ["Gift Cards", "Gift Cards", "General", "Snacks", <"Promotions", "Frozen">, "General", "Dairy", "General", <"Baking Goods", "Health ...
$ expirationDate <chr> "2018-05-13T04:00:00Z", "2018-05-13T04:00:00Z", "2018-07-29T04:00:00Z", "2018-05-26T04:00:00Z", "2018-05-26T04:00:00Z", "2018-05-29T0...
$ lastRedemptionDate <chr> "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "", "...
$ displayStartDate <chr> "2018-05-07T04:00:00Z", "2018-05-07T04:00:00Z", "2018-04-30T04:00:00Z", "2018-04-18T04:00:00Z", "2018-04-18T04:00:00Z", "2018-05-02T0...
$ imageUrl <chr> "https://cdnws.softcoin.com/mediaCache/ecoupon_1585374.png", "https://cdnws.softcoin.com/mediaCache/ecoupon_1585365.png", "https://cd...
$ krogerCouponNumber <chr> "800000013010", "800000013711", "10000008220", "800000012111", "800000012554", "800000014782", "800000015150", "800000022503", "80000...
$ addedToCard <lgl> FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, FALSE, ...
$ canBeAddedToCard <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T...
$ canBeRemoved <lgl> TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, TRUE, T...
$ filterTags <list> [<"FT4XGRAD", "FTBL4XGRADFM", "FTBL4XGRAD", "FTBL4XMOMGC", "FTBL4XMOM1", "4XGCWEEKEND", "FTBL4XGRAD2", "KPF">, <"FTBL4XGRAD1", "4XGC...
$ title <chr> "Get 4x FUEL Points on FRI - SAT - SUN Only", "Get 4x FUEL Points on FRI - SAT - SUN Only", "2x Fuel Points", "Save 50¢", "Save 50¢",...
$ displayDescription <chr> "", "", "", "on 2 Kroger Potato Chips", "on 2 Kroger Deluxe Ice Cream", "", "on Sargento® Blends™ Slices", "on 2 Hallmark Cards", "on...
$ redemptionsAllowed <int> -1, -1, -1, -1, -1, 1, 1, 1, 1, 1, 1, 1, 5, 1, 1, 1, 1, 1, -1, 1, -1, 1, 1, 1, 1, 1, -1, 1, 5, 1, 1, -1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
$ value <dbl> 1.00, 1.00, 1.00, 0.50, 0.50, 20.00, 0.75, 1.00, 0.50, 1.25, 1.00, 0.50, 1.49, 1.00, 1.00, 1.00, 0.75, 2.00, 0.50, 0.50, 1.00, 1.00, ...
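Since the question asks for title, image, description, expiration, and filters, the result above can be trimmed down to those fields. The following is only a minimal sketch, not part of the original answer: `parse_coupons()` is a hypothetical helper, and `sample_json` is made-up data that mimics the structure shown in the `glimpse()` output so the code runs without network access. It assumes the API still returns the coupons under `data$coupons` with the column names listed above, which may no longer hold.

```r
library(jsonlite)

# Hypothetical helper: turn the API's JSON text into a flat data frame
# holding only the fields mentioned in the question.
parse_coupons <- function(json_text) {
  coupons <- fromJSON(json_text)$data$coupons
  # Timestamps look like "2018-05-13T04:00:00Z"; parse them as UTC date-times
  coupons$expirationDate <- as.POSIXct(
    coupons$expirationDate,
    format = "%Y-%m-%dT%H:%M:%SZ", tz = "UTC"
  )
  # filterTags is a list column; collapse each entry into one string
  coupons$filterTags <- vapply(
    coupons$filterTags,
    function(tags) paste(tags, collapse = ";"),
    character(1)
  )
  coupons[, c("id", "title", "shortDescription", "imageUrl",
              "expirationDate", "filterTags")]
}

# Made-up sample with the same shape as the real response (an assumption)
sample_json <- '{"data": {"coupons": [
  {"id": 1, "title": "2x Fuel Points", "shortDescription": "2x Fuel Points",
   "imageUrl": "https://example.com/a.png",
   "expirationDate": "2018-05-13T04:00:00Z", "filterTags": ["KPF"]},
  {"id": 2, "title": "Save 50 cents", "shortDescription": "Save $0.50 on chips",
   "imageUrl": "https://example.com/b.png",
   "expirationDate": "2018-05-26T04:00:00Z", "filterTags": ["B", "C"]}
]}}'

result <- parse_coupons(sample_json)
print(result)

# Against the live endpoint, the JSON text could be fetched with, for example:
# json_text <- paste(readLines(my_url, warn = FALSE), collapse = "\n")
```

Collapsing `filterTags` into a single string makes the data frame easy to write out with `write.csv()`, at the cost of having to split the tags again later. Keeping the list column is the better choice if the analysis stays in R.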