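r - How do I extract specific data from a web page to add it to a table scraped with R?
Problem description
I have built a script that extracts data from tables hosted on the web, and I can already view those tables. To complete it, I need to add the supplier data as columns of the table I have already built. How can I extract the supplier data so that I can append it to my table?
R script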
library(rvest)
urls.colombia.compra.microsoft <- paste0(
  "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
  0:11,
  "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_="
)
base.colombia.compra.microsft <- purrr::map_df(urls.colombia.compra.microsoft, ~ .x %>% read_html() %>% html_table())
base.colombia.compra.microsft
urls.colombia.compra.google <- paste0(
  "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
  0:11,
  "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Google&date_to_=%20&date_from_="
)
base.colombia.compra.google <- purrr::map_df(urls.colombia.compra.google, ~ .x %>% read_html() %>% html_table())
base.colombia.compra.google
urls.colombia.compra.nube <- paste0(
  "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
  0:11,
  "&number_order=&state=&entity=&tool=Nube%20Pública%20III&date_to_=%20&date_from_="
)
base.colombia.compra.nube <- purrr::map_df(urls.colombia.compra.nube, ~ .x %>% read_html() %>% html_table())
base.colombia.compra.nube
base.consolidada.colombia.compra <- data.table::rbindlist(list(
  base.colombia.compra.microsft,
  base.colombia.compra.google,
  base.colombia.compra.nube
), idcol = "id")
base.consolidada.colombia.compra
all_urls <- paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra/", base.consolidada.colombia.compra$`Orden de Compra`)
new_res <- purrr::map_df(all_urls, ~ .x %>%
  read_html() %>%
  html_table() %>%
  .[[1]] %>%
  dplyr::mutate(order_number = basename(.x), .before = 1))
new_res
library(dplyr)
Base.articulos.colombia.compra <- new_res %>% filter(!is.na(No))
The supplier data appears on each purchase order page.
Solution
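Your current structure has repeated order_number values, each with different artículos. Adding extra columns for the requested data therefore means repeating the same new data on every row of a given order_number. That seems the simplest condition to accept in order to meet your requirement.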
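As a minimal sketch of that repetition, the snippet below adds a single supplier value to a small made-up item table. The tibble contents and the supplier name are hypothetical; only the column names order_number, No and Nombre come from the scripts in this thread.

```r
library(dplyr)

# Hypothetical stand-in for the item table of one purchase order
items <- tibble::tibble(
  No = 1:3,
  `Artículo` = c("A", "B", "C")
)

# A length-one value passed to mutate() is recycled to every row,
# so each item row carries the same order number and supplier name
with_supplier <- items %>%
  mutate(
    order_number = "12345",
    Nombre = "Example Supplier",
    .before = 1
  )

with_supplier
```

Every row of the result holds the same order_number and Nombre, which is the duplication described above.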
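If that is acceptable, you could simply extend the mutate() call inside your anonymous function to add these extra columns. Personally, I chose to extend the mutate() call but replace the anonymous function with an explicit, named function, as shown below. I then pass that function to map_df.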
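The swap from an anonymous function to a named one can be sketched on toy data. The helper describe_input and its inputs are hypothetical and exist only to show that map_df accepts either form and returns the same result.

```r
library(purrr)
library(tibble)

# Hypothetical helper: builds a one-row tibble for each input
describe_input <- function(x) {
  tibble(input = x, n_chars = nchar(x))
}

inputs <- c("a", "bb", "ccc")

# Anonymous (formula) version, in the style of the original script
res_anon <- map_df(inputs, ~ tibble(input = .x, n_chars = nchar(.x)))

# Named-function version: same rows, but the logic can grow and be tested alone
res_named <- map_df(inputs, describe_input)

identical(res_anon, res_named)
```

A named function is easier to extend with several steps, which is what the scraping function below needs.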
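I also stopped scraping every table and then indexing into the result. Instead, I select the single table directly with a class selector that returns one node, which is more efficient.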
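The difference between the two approaches can be sketched on an inline HTML fragment. The markup here is made up for illustration; only the class name sticky-enabled is taken from the code in this answer.

```r
library(rvest)

# Hypothetical page with two tables; only the second has the target class
page <- minimal_html('
  <table class="other"><tr><th>x</th></tr><tr><td>1</td></tr></table>
  <table class="sticky-enabled">
    <tr><th>No</th><th>Item</th></tr>
    <tr><td>1</td><td>A</td></tr>
  </table>
')

# Parsing the whole document converts every table and returns a list
all_tables <- page %>% html_table()

# Selecting by class returns one node, so only that table is parsed
one_table <- page %>%
  html_element(".sticky-enabled") %>%
  html_table()

length(all_tables)
names(one_table)
```

Selecting by class also avoids depending on the position of the table in the page, which positional indexing with .[[1]] relies on.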
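Lastly, I added a function, tidy_node, adapted from an answer given by @hrbrmstr, which preserves the separation between words where br elements are present in the html. The function inserts a ", " in place of each br to keep the text readable. This requires an additional library reference (xml2).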
library(xml2)
# Replace each <br> inside the node with a ", " separator so that
# html_text() does not run the surrounding words together
tidy_node <- function(node) {
  xml_find_all(node, ".//br") %>% xml_add_sibling("p", ", ")
  xml_find_all(node, ".//br") %>% xml_remove()
  return(node)
}
# Scrape one purchase order page: the item table plus the supplier fields
get_order_details <- function(url) {
  page <- url %>%
    read_html()
  # Supplier fields, in page order: name, address, work phone, mobile phone
  additional_columns <- page %>% html_elements("#supplier .oc-span")
  page %>%
    html_element(".sticky-enabled") %>%
    html_table() %>%
    dplyr::mutate(
      order_number = basename(url),
      Nombre = additional_columns[1] %>% html_text(),
      `Dirección Principal` = tidy_node(additional_columns[2]) %>% html_text(trim = TRUE),
      `Teléfono (Del Trabajo)` = additional_columns[3] %>% html_text(),
      `Teléfono (Celular)` = additional_columns[4] %>% html_text(),
      .before = 1
    )
}
new_res <- purrr::map_df(all_urls, get_order_details)
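To see what tidy_node does in isolation, here is a minimal sketch on a made-up HTML fragment. The address text is hypothetical; the #supplier .oc-span selector is the one used above. Note that xml2 modifies nodes in place, so the raw text is read before tidy_node is called.

```r
library(rvest)
library(xml2)

tidy_node <- function(node) {
  xml_find_all(node, ".//br") %>% xml_add_sibling("p", ", ")
  xml_find_all(node, ".//br") %>% xml_remove()
  return(node)
}

# Hypothetical supplier block with a line break inside the address
page <- minimal_html(
  '<div id="supplier"><span class="oc-span">Calle 1<br>Bogota</span></div>'
)
span <- page %>% html_elements("#supplier .oc-span")

# Without tidying, the text on either side of <br> is joined directly
raw_text <- html_text(span)

# After tidying, a ", " separator sits where the <br> used to be
tidy_text <- html_text(tidy_node(span), trim = TRUE)

raw_text
tidy_text
```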
Reprex:
library(rvest)
library(purrr)
library(dplyr)
library(xml2)
# Replace each <br> inside the node with a ", " separator
tidy_node <- function(node) {
  xml_find_all(node, ".//br") %>% xml_add_sibling("p", ", ")
  xml_find_all(node, ".//br") %>% xml_remove()
  return(node)
}
# Scrape one purchase order page: the item table plus the supplier fields
get_order_details <- function(url) {
  page <- url %>%
    read_html()
  additional_columns <- page %>% html_elements("#supplier .oc-span")
  page %>%
    html_element(".sticky-enabled") %>%
    html_table() %>%
    dplyr::mutate(
      order_number = basename(url),
      Nombre = additional_columns[1] %>% html_text(),
      `Dirección Principal` = tidy_node(additional_columns[2]) %>% html_text(trim = TRUE),
      `Teléfono (Del Trabajo)` = additional_columns[3] %>% html_text(),
      `Teléfono (Celular)` = additional_columns[4] %>% html_text(),
      .before = 1
    )
}
urls.colombia.compra.microsoft <- paste0(
  "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
  0:11,
  "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_="
)
base.colombia.compra.microsft <- purrr::map_df(urls.colombia.compra.microsoft, ~ .x %>%
  read_html() %>%
  html_table())
base.colombia.compra.microsft
urls.colombia.compra.google <- paste0(
  "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
  0:11,
  "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Google&date_to_=%20&date_from_="
)
base.colombia.compra.google <- purrr::map_df(urls.colombia.compra.google, ~ .x %>%
  read_html() %>%
  html_table())
base.colombia.compra.google
urls.colombia.compra.nube <- paste0(
  "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
  0:11,
  "&number_order=&state=&entity=&tool=Nube%20Pública%20III&date_to_=%20&date_from_="
)
base.colombia.compra.nube <- purrr::map_df(urls.colombia.compra.nube, ~ .x %>%
  read_html() %>%
  html_table())
base.colombia.compra.nube
base.consolidada.colombia.compra <- data.table::rbindlist(list(
  base.colombia.compra.microsft,
  base.colombia.compra.google,
  base.colombia.compra.nube
), idcol = "id")
base.consolidada.colombia.compra
all_urls <- paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra/", base.consolidada.colombia.compra$`Orden de Compra`)
new_res <- purrr::map_df(all_urls, get_order_details)
new_res
Base.articulos.colombia.compra <- new_res %>% filter(!is.na(No))