Web scraping a page that requires scrolling down, using R

Problem description

I would like to download the first two columns ("GAS DAY STARTED ON" and "GAS IN STORAGE") from the following URL:

https://agsi.gie.eu/#/historical/eu

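The default period is set to "last month", but I need "all".

Can anyone tell me which package I could use for this kind of task? There is also a free API, but I had no success with that either.

Thanks for any input, and thanks in advance!

Tags: r, web-scraping

Solution

Let's try to steer you toward the API route. If you have an API key, you can (but should not) pass it directly to the function below. Instead, put it in your ~/.Renviron as: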

AGSI_KEY=thekeytheygaveyou

and restart your R session. The key will then be picked up automatically.
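To confirm that the key is actually visible to R after the restart, a small check along these lines may help. This is only a sketch: `check_agsi_key` is a hypothetical helper name, not part of the original answer, and it relies on nothing beyond base R's `Sys.getenv()` and `nzchar()`.

```r
# Hypothetical helper (not from the original answer): fail early with a
# readable message when the AGSI_KEY environment variable is missing.
check_agsi_key <- function(key = Sys.getenv("AGSI_KEY")) {
  if (!nzchar(key)) {
    stop("AGSI_KEY is not set; add it to ~/.Renviron and restart R.", call. = FALSE)
  }
  invisible(key) # return the key invisibly so the secret is not printed
}
```

Calling `check_agsi_key()` before the first request turns an empty key into a clear local error instead of a rejected API call.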

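The following function takes start/end dates. Note that, as written, it normalizes them but never sends them to the API, so the result shown after the function is not restricted to that range: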

get_agsi_data <- function(start, end, agsi_api_key = Sys.getenv("AGSI_KEY")) {

  start[1] <- as.character(as.Date(start[1]))
  end[1] <- as.character(as.Date(end)[1])

  httr::GET(
    url = "https://agsi.gie.eu/api/data/eu", # NOTE THE HARDCODING FOR eu
    httr::add_headers(`x-key` = agsi_api_key),
    httr::user_agent("user@example.com") # REPLACE THIS WITH YOUR EMAIL ADDRESS
  ) -> res

  httr::stop_for_status(res) # throws an error if the API returned an HTTP error status

  out <- httr::content(res, as = "text", encoding = "UTF-8")

  out <- jsonlite::fromJSON(out)

  # The info element is an untidy list column; collapse each entry to one string
  sapply(out$info, function(x) {
    if (length(x)) {
      paste0(x, collapse = "; ")
    } else {
      NA_character_
    }
  }) -> info

  out$info <- info

  # Namespace every readr helper so the function works without library(readr)
  readr::type_convert(
    df = out,
    col_types = readr::cols(
      status = readr::col_character(),
      gasDayStartedOn = readr::col_date(format = ""),
      gasInStorage = readr::col_double(),
      full = readr::col_double(),
      trend = readr::col_double(),
      injection = readr::col_double(),
      withdrawal = readr::col_double(),
      workingGasVolume = readr::col_double(),
      injectionCapacity = readr::col_double(),
      withdrawalCapacity = readr::col_double()
    )
  ) -> out

  class(out) <- c("tbl_df", "tbl", "data.frame")

  out

}

xdf <- get_agsi_data("2018-06-01", "2018-10-01")

xdf
## # A tibble: 2,880 x 11
##    status gasDayStartedOn gasInStorage  full trend injection withdrawal workingGasVolume injectionCapacity
##  * <chr>  <date>                 <dbl> <dbl> <dbl>     <dbl>      <dbl>            <dbl>             <dbl>
##  1 E      2018-11-19              918.  86.1 -0.41      343.      4762.            1067.            11469.
##  2 E      2018-11-18              923.  86.5 -0.22      534.      2841.            1067.            11469.
##  3 E      2018-11-17              925.  86.7 -0.2       649.      2796.            1067.            11469.
##  4 E      2018-11-16              927.  86.9 -0.24      492.      3014.            1067.            11469.
##  5 E      2018-11-15              930.  87.1 -0.16      503.      2210.            1067.            11469.
##  6 E      2018-11-14              931.  87.3 -0.1       605.      1682.            1067.            11469.
##  7 E      2018-11-13              933.  87.4 -0.07      651.      1438.            1067.            11469.
##  8 E      2018-11-12              933.  87.5 -0.05      833.      1391.            1067.            11468.
##  9 E      2018-11-11              934.  87.5  0.09     1607.       659.            1067.            11478.
## 10 E      2018-11-10              933.  87.4  0.06     1458.       796.            1067.            11478.
## # ... with 2,870 more rows, and 2 more variables: withdrawalCapacity <dbl>, info <chr>
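Since the question only asks for two columns, and the printed rows fall outside the requested dates, one option is to trim the result on the client side. The sketch below assumes the data frame has the `gasDayStartedOn` and `gasInStorage` columns shown in the output; `filter_gas_days` is a hypothetical helper name, and the `demo` values are made-up placeholders, not real storage figures.

```r
# Hypothetical helper: keep only rows inside [start, end] and the two
# columns the question asks about.
filter_gas_days <- function(df, start, end) {
  start <- as.Date(start)
  end <- as.Date(end)
  days <- as.Date(df$gasDayStartedOn)
  keep <- !is.na(days) & days >= start & days <= end
  df[keep, c("gasDayStartedOn", "gasInStorage"), drop = FALSE]
}

# Toy stand-in for the API result (placeholder values, not real data)
demo <- data.frame(
  gasDayStartedOn = as.Date(c("2018-05-31", "2018-06-01", "2018-10-01", "2018-10-02")),
  gasInStorage = c(1, 2, 3, 4),
  full = c(10, 20, 30, 40)
)

filter_gas_days(demo, "2018-06-01", "2018-10-01")
```

With the real result this would be called as `filter_gas_days(xdf, "2018-06-01", "2018-10-01")`. Both bounds are inclusive.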

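eu is hard-coded, but adapting the function to the other API endpoints should be straightforward (the original answer illustrated this with an image that is not reproduced here).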
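As a rough sketch of that generalization, the request URL could be built from a country code instead of being fixed. The helper name `agsi_url` is hypothetical, and the idea that other endpoints follow the same `/api/data/<code>` pattern as `eu` is an assumption that should be checked against the API documentation.

```r
# Hypothetical helper: build the request URL for a given country code.
# ASSUMPTION: other endpoints share the ".../api/data/<code>" pattern seen
# for "eu"; whether a given code (e.g. "de") exists is not verified here.
agsi_url <- function(country = "eu") {
  stopifnot(is.character(country), length(country) == 1, nzchar(country))
  paste0("https://agsi.gie.eu/api/data/", tolower(country))
}

agsi_url()     # URL used in the function above
agsi_url("DE") # lower-cased before being appended
```

Inside `get_agsi_data()`, the hard-coded `url =` argument could then be replaced by `agsi_url(country)` with an extra `country` parameter.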

