html - HTML table scraping in R
问题描述
I'm trying to obtain the table on the following url: https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-31%20v2019%2003%2001_01%2000%2001.html
The problem is that the table is not an html table, so html_table() doesn´t work.
So far, I've tried extracting the node from table but it returns nothing
url = "https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html"
webpage <- read_html(url)
table_html <- html_nodes(webpage, 'table#Tabc')
table <- html_table(table_html)
解决方案
所以这里的问题是页面是通过javascript呈现的。因此rvest
,单独是行不通的。最简单的方法之一是使用无头网络浏览器。我们可以使用PhantomJS
.
首先,下载适当的版本PhantomJS
并将可执行文件(假设是 Windows)放在您的工作目录中。也就是说,它位于脚本phantomjs.exe
的工作目录中。R
创建一个scrape.js
文件:
// scrape.js
var webPage = require('webpage');
var page = webPage.create();
var fs = require('fs');
var path = 'page.html';
page.open('https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html', function (status) {
var content = page.content;
fs.write(path,content,'w');
phantom.exit();
});
此scrape.js
文件一旦运行,将page.html
在您的工作目录中创建一个文件。返回R
或RStudio
您可以执行以下操作:
library(tidyverse)
library(rvest)
# Run scrape.js with PhantomJS to create the file page.html
system("./phantomjs scrape.js")
# Now we should be in business as usual:
read_html('page.html') %>%
html_nodes("table#Tabc") %>%
html_table(header = TRUE) %>%
.[[1]] %>%
as_tibble()
# A tibble: 504 x 38
Codigo `Estatus asigna~ Hora `Limite de desp~ `Limite de desp~ `Costo de Opera~ `Bloque de Pote~ `Costo Incremen~ `Bloque de Pote~
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BTY5W~ ECO 1 35 20 43212. 1.5 1762. 1.5
2 BTY5W~ ECO 2 35 20 43212. 1.5 1762. 1.5
3 BTY5W~ ECO 3 35 20 43212. 1.5 1762. 1.5
4 BTY5W~ ECO 4 35 20 43212. 1.5 1762. 1.5
5 BTY5W~ ECO 5 35 20 43212. 1.5 1762. 1.5
6 BTY5W~ ECO 6 35 20 43212. 1.5 1762. 1.5
7 BTY5W~ ECO 7 35 20 43212. 1.5 1762. 1.5
8 BTY5W~ ECO 8 35 20 43212. 1.5 1762. 1.5
9 BTY5W~ ECO 9 35 20 43212. 1.5 1762. 1.5
10 BTY5W~ ECO 10 35 20 43212. 1.5 1762. 1.5
# ... with 494 more rows, and 29 more variables: `Costo Incremental de generacion Bloque 02 ($/MWh)` <dbl>, `Bloque de Potencia 03 (MW)` <dbl>,
# `Costo Incremental de generacion Bloque 03 ($/MWh)` <dbl>, `Bloque de Potencia 04 (MW)` <dbl>, `Costo Incremental de generacion Bloque 04
# ($/MWh)` <dbl>, `Bloque de Potencia 05 (MW)` <dbl>, `Costo Incremental de generacion Bloque 05 ($/MWh)` <dbl>, `Bloque de Potencia 06
# (MW)` <dbl>, `Costo Incremental de generacion Bloque 06 ($/MWh)` <dbl>, `Bloque de Potencia 07 (MW)` <dbl>, `Costo Incremental de generacion
# Bloque 07 ($/MWh)` <dbl>, `Bloque de Potencia 08 (MW)` <dbl>, `Costo Incremental de generacion Bloque 08 ($/MWh)` <dbl>, `Bloque de Potencia
# 09 (MW)` <dbl>, `Costo Incremental de generacion Bloque 09 ($/MWh)` <dbl>, `Bloque de Potencia 10 (MW)` <dbl>, `Costo Incremental de
# generacion Bloque 10 ($/MWh)` <dbl>, `Bloque de Potencia 11 (MW)` <dbl>, `Costo Incremental de generacion Bloque 11 ($/MWh)` <dbl>, `Reserva
# rodante 10 min (MW)` <dbl>, `Costo Reserva rodante 10 min ($/MW)` <dbl>, `Reserva no rodante 10 min (MW)` <dbl>, `Costo Reserva no rodante 10
# min ($/MW)` <dbl>, `Reserva rodante suplementaria (MW)` <dbl>, `Costo Reserva rodante suplementaria ($/MW)` <dbl>, `Reserva no rodante
# suplementaria (MW)` <dbl>, `Costo Reserva no rodante suplementaria ($/MW)` <dbl>, `Reserva regulacion secundaria (MW)` <dbl>, `Costo Reserva
# regulacion secundaria ($/MW` <dbl>
更新为扩展到多个 URL
首先,更改scrape.js
文件以接受参数:
// scrape2.js
var webPage = require('webpage');
var page = webPage.create();
var system = require('system');
var args = system.args;
var fs = require('fs');
var path = args[2];
page.open(args[1], function (status) {
var content = page.content;
fs.write(path,content,'w');
phantom.exit();
});
接下来,创建列表以循环/遍历/映射(显然这可以被清理/抽象以更易于维护并且需要更少的输入):
urls <- list(
'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20BCS%20MDA%20Hor%202018-12-26%20v2019%2002%2024_01%2000%2001.html',
'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-29%20v2019%2002%2027_01%2000%2001.html',
'https://www.cenace.gob.mx/DocsMEM/OpeMdo/OfertaCompVent/OferVenta/MDA/Termicas/OfeVtaTermicaHor%20SIN%20MDA%20Hor%202018-12-29%20v2019%2002%2027_01%2000%2001.html'
)
paths <- list(
'page1.html',
'page2.html',
'page3.html'
)
args_list <- map2(urls, paths, paste)
# We are only using this function for the file creation side-effects,
# so we can use walk instead of map.
# This creates the files: page1.html, page2.html, and page3.html
walk(args_list, ~ system(paste("./phantomjs scrape2.js", .)))
此时,您可能希望将抓取的内容放入一个函数中:
read_page <- function(page) {
read_html(page) %>%
html_nodes("table#Tabc") %>%
html_table(header = TRUE) %>%
.[[1]] %>%
as_tibble()
}
从那里您可以重用路径列表来映射您的新功能:
paths %>%
map(~ read_page(.)) %>%
bind_rows()
# A tibble: 9,000 x 38
Codigo `Estatus asigna~ Hora `Limite de desp~ `Limite de desp~ `Costo de Opera~ `Bloque de Pote~ `Costo Incremen~ `Bloque de Pote~
<chr> <chr> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 BTY5W~ ECO 1 35 20 43212. 1.5 1762. 1.5
2 BTY5W~ ECO 2 35 20 43212. 1.5 1762. 1.5
3 BTY5W~ ECO 3 35 20 43212. 1.5 1762. 1.5
4 BTY5W~ ECO 4 35 20 43212. 1.5 1762. 1.5
5 BTY5W~ ECO 5 35 20 43212. 1.5 1762. 1.5
6 BTY5W~ ECO 6 35 20 43212. 1.5 1762. 1.5
7 BTY5W~ ECO 7 35 20 43212. 1.5 1762. 1.5
8 BTY5W~ ECO 8 35 20 43212. 1.5 1762. 1.5
9 BTY5W~ ECO 9 35 20 43212. 1.5 1762. 1.5
10 BTY5W~ ECO 10 35 20 43212. 1.5 1762. 1.5
# ... with 8,990 more rows, and 29 more variables: `Costo Incremental de generacion Bloque 02 ($/MWh)` <dbl>, `Bloque de Potencia 03 (MW)` <dbl>,
# `Costo Incremental de generacion Bloque 03 ($/MWh)` <dbl>, `Bloque de Potencia 04 (MW)` <dbl>, `Costo Incremental de generacion Bloque 04
# ($/MWh)` <dbl>, `Bloque de Potencia 05 (MW)` <dbl>, `Costo Incremental de generacion Bloque 05 ($/MWh)` <dbl>, `Bloque de Potencia 06
# (MW)` <dbl>, `Costo Incremental de generacion Bloque 06 ($/MWh)` <dbl>, `Bloque de Potencia 07 (MW)` <dbl>, `Costo Incremental de generacion
# Bloque 07 ($/MWh)` <dbl>, `Bloque de Potencia 08 (MW)` <dbl>, `Costo Incremental de generacion Bloque 08 ($/MWh)` <dbl>, `Bloque de Potencia
# 09 (MW)` <dbl>, `Costo Incremental de generacion Bloque 09 ($/MWh)` <dbl>, `Bloque de Potencia 10 (MW)` <dbl>, `Costo Incremental de
# generacion Bloque 10 ($/MWh)` <dbl>, `Bloque de Potencia 11 (MW)` <dbl>, `Costo Incremental de generacion Bloque 11 ($/MWh)` <dbl>, `Reserva
# rodante 10 min (MW)` <dbl>, `Costo Reserva rodante 10 min ($/MW)` <dbl>, `Reserva no rodante 10 min (MW)` <dbl>, `Costo Reserva no rodante 10
# min ($/MW)` <dbl>, `Reserva rodante suplementaria (MW)` <dbl>, `Costo Reserva rodante suplementaria ($/MW)` <dbl>, `Reserva no rodante
# suplementaria (MW)` <dbl>, `Costo Reserva no rodante suplementaria ($/MW)` <dbl>, `Reserva regulacion secundaria (MW)` <dbl>, `Costo Reserva
# regulacion secundaria ($/MW` <dbl>
推荐阅读
- java - 评估排序算法的算法复杂度
- python - 如何在 Flask 中创建服务?
- python - 无法使用 sounddevice 录制声音
- ms-access - RecordSource 更新后 MS Access 导航中断
- angular - Angular 6,如何在循环时更新角度输入字段?
- javascript - 将来自 JAX-RS 的 Exel 响应发送到 AngularJS
- php - 如何从 laravel 中的两个不同页面更新一个表
- content-management-system - 在没有 WordPress 或其他 CMS 的情况下从客户端创建具有博客功能的低权重网站
- dart - Flutter 监听来自其他小部件的 textField 值变化
- php - 从 mysqli 数据库中删除字符串字符并利用从数据库派生的两个变量