Problem description

Question

I am trying to download the tables from the following web page: https://www.ato.gov.au/Rates/Individual-income-tax-for-prior-years/

My attempt

read_html('https://www.ato.gov.au/Rates/Individual-income-tax-for-prior-years/') %>% 
  html_nodes(xpath = '//tr//*[(((count(preceding-sibling::*) + 1) = 1) and parent::*)]') %>% 
  html_text()

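The problem is that this code returns 639 rows of data. I would like the imported data to keep the table structure it has on the web page (for example, a list of tables, or even one large data frame).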

Tags: r, rvest

Solution


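I suggest keeping them as a list of data frames, one per table, and naming each one with the text of its caption element so the tables can be told apart.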

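library(dplyr)
library(rvest)

url <- "https://www.ato.gov.au/Rates/Individual-income-tax-for-prior-years/"

# Download and parse the page once, then reuse the parsed document
page <- read_html(url)

# Text of every <caption> element, used to name the tables
captions <- page %>%
  html_nodes("caption") %>%
  html_text()

# html_table() returns one data frame per <table>; name each by its caption
page %>%
  html_table() %>%
  setNames(captions)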


#$`Resident tax rates for 2016-17`
#      Taxable income                         Tax on this income
#1        0 – $18,200                                        Nil
#2  $18,201 – $37,000               19c for each $1 over $18,200
#3  $37,001 – $87,000 $3,572 plus 32.5c for each $1 over $37,000
#4 $87,001 – $180,000  $19,822 plus 37c for each $1 over $87,000
#5  $180,001 and over $54,232 plus 45c for each $1 over $180,000

#$`Resident tax rates for 2015-16`
#      Taxable income                         Tax on this income
#1        0 – $18,200                                        Nil
#2  $18,201 – $37,000               19c for each $1 over $18,200
#3  $37,001 – $80,000 $3,572 plus 32.5c for each $1 over $37,000
#4 $80,001 – $180,000  $17,547 plus 37c for each $1 over $80,000
#5  $180,001 and over $54,547 plus 45c for each $1 over $180,000
#......

If you want a single data frame instead, we can use bind_rows with the .id argument, which stores each table's name in an id column.

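# Parse the page once, name each table by its caption,
# then stack the tables; the caption goes into the "id" column
page <- read_html(url)
page %>%
  html_table() %>%
  setNames(page %>%
             html_nodes("caption") %>%
             html_text()) %>%
  bind_rows(.id = "id")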
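
The approach above relies on the page having exactly one caption per table. As a minimal sketch of a variant that pairs each table with its own caption, the example below parses a small inline HTML string instead of the live page. The two tables, the caption "Rates A", and the fallback name "table_2" are made up for illustration and are not taken from the ATO site.

```r
library(dplyr)
library(rvest)

# Hypothetical stand-in for the real page: the second table has no caption
html <- read_html('
  <html><body>
  <table>
    <caption>Rates A</caption>
    <tr><th>Taxable income</th><th>Tax on this income</th></tr>
    <tr><td>0 - $18,200</td><td>Nil</td></tr>
  </table>
  <table>
    <tr><th>Taxable income</th><th>Tax on this income</th></tr>
    <tr><td>$18,201 - $37,000</td><td>19c for each $1 over $18,200</td></tr>
  </table>
  </body></html>')

# Select the table nodes first, so captions are looked up per table
tables <- html_nodes(html, "table")

# html_node() gives one result per table; a missing caption becomes NA
captions <- tables %>%
  html_node("caption") %>%
  html_text()

# Replace missing captions with a positional fallback name
missing <- is.na(captions)
captions[missing] <- paste0("table_", which(missing))

# One data frame per table, named by its own caption
named <- tables %>%
  html_table() %>%
  setNames(captions)

# Optional: stack into one data frame, keeping the table name in "id"
combined <- bind_rows(named, .id = "id")
```

Because the caption is looked up inside each table node, names stay aligned with their tables even when a caption is absent, whereas calling setNames with a shorter caption vector would leave some list elements with missing names.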
