How can I extract specific data from a web page to add it to a table scraped with R?

Problem description

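I have built a script that extracts data from tables hosted on the web, and I can already view those tables. To complement them, I need to add the supplier data as columns of the table I have already built. How can I extract the supplier data and append it to my table?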

R script

library(rvest)

urls.colombia.compra.microsoft <- paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=", 
                               0:11, 
                               "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_=")


base.colombia.compra.microsft <- purrr::map_df(urls.colombia.compra.microsoft, ~.x %>% read_html() %>% html_table())

base.colombia.compra.microsft

urls.colombia.compra.google <- paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=", 
                               0:11, 
                               "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Google&date_to_=%20&date_from_=")

base.colombia.compra.google <- purrr::map_df(urls.colombia.compra.google, ~.x %>% read_html() %>% html_table())

base.colombia.compra.google

urls.colombia.compra.nube <- paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=", 
                                      0:11, 
                                      "&number_order=&state=&entity=&tool=Nube%20Pública%20III&date_to_=%20&date_from_=")

base.colombia.compra.nube <- purrr::map_df(urls.colombia.compra.nube, ~.x %>% read_html() %>% html_table())

base.colombia.compra.nube

base.consolidada.colombia.compra <- data.table::rbindlist(list(base.colombia.compra.microsft, 
                        base.colombia.compra.google, 
                        base.colombia.compra.nube), idcol = 'id')

base.consolidada.colombia.compra

all_urls <- paste0('https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra/', base.consolidada.colombia.compra$`Orden de Compra`)

new_res <- purrr::map_df(all_urls, ~.x %>% read_html() %>% html_table() %>% .[[1]] %>% dplyr::mutate(order_number = basename(.x), .before = 1))

new_res

library(dplyr)

Base.articulos.colombia.compra <- new_res %>% filter(!is.na(No))

The supplier data looks like this:

[Image: supplier data]

Tags: r, web-scraping, rvest

Solution


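Your current structure has repeated order_number values, each with several different artículos. Adding extra columns for the requested data therefore means repeating the same new data on every row of a given order_number. Accepting that seems the simplest way to meet your requirement.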
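As a minimal sketch of that repetition, assuming a made-up two-row items table (the order number, the item names and the supplier name here are hypothetical and do not come from the site): a length-1 value passed to mutate() is recycled onto every row.

```r
library(dplyr)

# Hypothetical items table for a single order: two rows, same order_number
items <- tibble::tibble(
  order_number = c("1001", "1001"),
  articulo = c("License A", "License B")
)

# A length-1 value in mutate() is recycled, so both rows get the same supplier
items_with_supplier <- items %>%
  mutate(Nombre = "Example Supplier")

items_with_supplier
```

The supplier name ends up duplicated once per item row, which is the trade-off described above.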

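If that is acceptable, you could simply extend the mutate() call inside your anonymous function to add these extra columns. Personally, I chose to extend the mutate() but replace the anonymous function with an explicit, named function, as shown below. I then pass that function to map_df.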
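To illustrate only the refactoring step (anonymous function to named function), here is a small sketch with a toy helper; `make_row` is a hypothetical name and has nothing to do with the scraping itself.

```r
library(purrr)
library(dplyr)

# Hypothetical helper: builds a one-row tibble from a single input value
make_row <- function(x) {
  tibble::tibble(x = x, squared = x^2)
}

# Anonymous (formula) function, as in the question
res_anon <- map_df(1:3, ~ tibble::tibble(x = .x, squared = .x^2))

# Named function passed directly, as in this answer
res_named <- map_df(1:3, make_row)
```

Both calls build the same three-row table; the named version is easier to extend and to test on a single input.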

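I also dropped the approach of grabbing all tables and then indexing into the result. Instead, I select the single table directly with a class selector that returns one node, which is more efficient.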
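The difference between the two approaches can be sketched on a small inline page. The HTML below is invented for illustration; only the class name `sticky-enabled` is taken from this answer.

```r
library(rvest)

# Hypothetical page with two tables; only one carries the class of interest
page <- minimal_html('
  <table><tr><th>x</th></tr><tr><td>1</td></tr></table>
  <table class="sticky-enabled">
    <tr><th>No</th><th>Item</th></tr>
    <tr><td>1</td><td>A</td></tr>
  </table>
')

# Parse every table on the page, then index into the resulting list
all_tables <- page %>% html_table()
by_index <- all_tables[[2]]

# Select one node by class and parse only that table
by_class <- page %>%
  html_element(".sticky-enabled") %>%
  html_table()
```

Selecting by class also avoids depending on the position of the table within the page.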

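Lastly, I added a function, tidy_node, adapted from an answer given by @hrbrmstr, which preserves the separation between words where br elements are present in the HTML. The function adds a ", " in place of each br to keep the text readable. This requires an additional library reference.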

library(xml2)

tidy_node <- function(node){
  xml_find_all(node, ".//br") %>% xml_add_sibling("p", ", ")
  xml_find_all(node, ".//br") %>% xml_remove()
  return(node)
}

get_order_details <- function(url) {
  page <- url %>%
    read_html()
  # Supplier fields are the elements with class oc-span inside #supplier
  additional_columns <- page %>% html_elements("#supplier .oc-span")
  # Select the single items table by class, then append the supplier columns;
  # each length-1 value is repeated on every row of the order
  page %>%
    html_element(".sticky-enabled") %>%
    html_table() %>%
    dplyr::mutate(
      order_number = basename(url), .before = 1,
      Nombre = additional_columns[1] %>% html_text(),
      `Dirección Principal` = tidy_node(additional_columns[2]) %>% html_text(trim = TRUE),
      `Teléfono (Del Trabajo)` = additional_columns[3] %>% html_text(),
      `Teléfono (Celular)` = additional_columns[4] %>% html_text()
    )
}

new_res <- purrr::map_df(all_urls, get_order_details)
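To see what tidy_node does in isolation, here is a minimal sketch on an invented address fragment. The HTML and the address are hypothetical; the function body is the one from this answer, repeated so the example is self-contained.

```r
library(rvest)
library(xml2)

# Same helper as in the answer: put a ", " after each <br>, then drop the <br>
tidy_node <- function(node) {
  xml_find_all(node, ".//br") %>% xml_add_sibling("p", ", ")
  xml_find_all(node, ".//br") %>% xml_remove()
  node
}

# Hypothetical supplier address split across two lines with <br>
page <- minimal_html('<span class="oc-span">Calle 10<br>Bogota</span>')
span <- html_element(page, ".oc-span")

# Without tidying, the two lines are glued together
raw_text <- html_text(span)

# With tidying, the line break is replaced by ", "
tidy_text <- tidy_node(span) %>% html_text(trim = TRUE)
```

Note that tidy_node modifies the parsed document in place, so raw_text has to be read before the node is tidied.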

Complete R script:

library(rvest)
library(purrr)
library(dplyr)
library(xml2)

tidy_node <- function(node){
  xml_find_all(node, ".//br") %>% xml_add_sibling("p", ", ")
  xml_find_all(node, ".//br") %>% xml_remove()
  return(node)
}

get_order_details <- function(url) {
  page <- url %>%
    read_html()
  # Supplier fields are the elements with class oc-span inside #supplier
  additional_columns <- page %>% html_elements("#supplier .oc-span")
  # Select the single items table by class, then append the supplier columns;
  # each length-1 value is repeated on every row of the order
  page %>%
    html_element(".sticky-enabled") %>%
    html_table() %>%
    dplyr::mutate(
      order_number = basename(url), .before = 1,
      Nombre = additional_columns[1] %>% html_text(),
      `Dirección Principal` = tidy_node(additional_columns[2]) %>% html_text(trim = TRUE),
      `Teléfono (Del Trabajo)` = additional_columns[3] %>% html_text(),
      `Teléfono (Celular)` = additional_columns[4] %>% html_text()
    )
}


urls.colombia.compra.microsoft <- paste0(
  "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
  0:11,
  "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Microsoft&date_to_=%20&date_from_="
)

base.colombia.compra.microsft <- purrr::map_df(urls.colombia.compra.microsoft, ~ .x %>%
  read_html() %>%
  html_table())

base.colombia.compra.microsft

urls.colombia.compra.google <- paste0(
  "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
  0:11,
  "&number_order=&state=&entity=&tool=IAD%20Software%20I%20-%20Google&date_to_=%20&date_from_="
)

base.colombia.compra.google <- purrr::map_df(urls.colombia.compra.google, ~ .x %>%
  read_html() %>%
  html_table())

base.colombia.compra.google

urls.colombia.compra.nube <- paste0(
  "https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra?page=",
  0:11,
  "&number_order=&state=&entity=&tool=Nube%20Pública%20III&date_to_=%20&date_from_="
)

base.colombia.compra.nube <- purrr::map_df(urls.colombia.compra.nube, ~ .x %>%
  read_html() %>%
  html_table())

base.colombia.compra.nube

base.consolidada.colombia.compra <- data.table::rbindlist(list(
  base.colombia.compra.microsft,
  base.colombia.compra.google,
  base.colombia.compra.nube
), idcol = "id")

base.consolidada.colombia.compra

all_urls <- paste0("https://colombiacompra.gov.co/tienda-virtual-del-estado-colombiano/ordenes-compra/", base.consolidada.colombia.compra$`Orden de Compra`)

new_res <- purrr::map_df(all_urls, get_order_details)

new_res

Base.articulos.colombia.compra <- new_res %>% filter(!is.na(No))
