rvest read_html for a specific table

Problem description

I am trying to scrape a web page in R. In the table of contents here:

https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm#du42901a_main_toc

I am interested in:

Consolidated Statement of Earnings - Page 50
Consolidated Statement of Cash Flows - Page 51
Consolidated Balance Sheet - Page 52

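Depending on the document, the page numbers where these statements appear can vary.

I am trying to locate these statements with html_nodes(), but I cannot get it to work. When I inspect the page, I find the table under <div align="CENTER"> == $0, but I cannot find an ID for the table.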

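    library(rvest)  # provides read_html(), html_table() and %>%

    url <- "https://www.sec.gov/Archives/edgar/data/1800/000104746911001056/a2201962z10-k.htm"

    # Parse the filing and extract every <table> on the page as a list
    dat <- url %>%
      read_html() %>%
      html_table(fill = TRUE)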

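Any push in the right direction would be great!

Edit: I know about the finreportr and finstr packages, but they work from XML documents, and not every .HTML page has an XML document. I would also like to do this with the rvest package.

Edit:

Something like the following works: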

    url <- "https://www.sec.gov/Archives/edgar/data/936340/000093634015000014/dteenergy2014123110k.htm"
    # Absolute XPath to the table; the div index is specific to this filing
    population <- url %>%
      read_html() %>%
      html_nodes(xpath='/html/body/document/type/sequence/filename/description/text/div[623]/div/table') %>%
      html_table()
    x <- population[[1]]

It is very messy, but it does get the cash flow statement. The XPath changes depending on the web page.

For example, this one is different:

    url <- "https://www.sec.gov/Archives/edgar/data/80661/000095015205001650/l12357ae10vk.htm"

    # Same approach, but here the table sits under div[30] instead of div[623]
    population <- url %>%
      read_html() %>%
      html_nodes(xpath='/html/body/document/type/sequence/filename/description/text/div[30]/div/table') %>%
      html_table()

    x <- population[[1]]

Is there a way to "search" for the "cash flow" table and somehow extract its XPath?

Here are some more links to try.

 [1] "https://www.sec.gov/Archives/edgar/data/1281761/000095014405002476/g93593e10vk.htm"   
 [2] "https://www.sec.gov/Archives/edgar/data/721683/000095014407001713/g05204e10vk.htm"    
 [3] "https://www.sec.gov/Archives/edgar/data/72333/000007233318000049/jwn-232018x10k.htm"  
 [4] "https://www.sec.gov/Archives/edgar/data/1001082/000095013406005091/d33908e10vk.htm"   
 [5] "https://www.sec.gov/Archives/edgar/data/7084/000000708403000065/adm10ka2003.htm"      
 [6] "https://www.sec.gov/Archives/edgar/data/78239/000007823910000015/tenkjan312010.htm"   
 [7] "https://www.sec.gov/Archives/edgar/data/1156039/000119312508035367/d10k.htm"          
 [8] "https://www.sec.gov/Archives/edgar/data/909832/000090983214000021/cost10k2014.htm"    
 [9] "https://www.sec.gov/Archives/edgar/data/91419/000095015205005873/l13520ae10vk.htm"    
[10] "https://www.sec.gov/Archives/edgar/data/4515/000000620114000004/aagaa10k-20131231.htm"

Tags: r

Solution
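
One way to avoid hard-coding the XPath is to search the tables by their content and then ask xml2 for the path of whatever matches. The following is a minimal sketch under stated assumptions, not a tested answer for every filing: `find_tables_by_keywords()` is a hypothetical helper (it is not part of rvest), the two keywords are an assumption about the row labels a cash flow statement contains, and the inline HTML is a made-up stand-in for a real filing so that the example runs offline.

```r
library(rvest)
library(xml2)

# Hypothetical helper (not part of rvest): keep the innermost tables whose
# text contains every keyword, ignoring case, and report each table's XPath.
find_tables_by_keywords <- function(doc, keywords) {
  # Innermost tables only, so a layout table that merely wraps the
  # statement table is not returned as well.
  tables <- xml_find_all(doc, "//table[not(.//table)]")
  text <- tolower(xml_text(tables))
  keywords <- tolower(keywords)

  # TRUE for a table only if all keywords occur somewhere in its text
  hit <- vapply(
    text,
    function(t) all(vapply(keywords, grepl, logical(1), x = t, fixed = TRUE)),
    logical(1),
    USE.NAMES = FALSE
  )

  matched <- tables[hit]
  list(
    xpath  = xml_path(matched),   # absolute XPath of each matching table
    tables = html_table(matched)  # the matching tables, parsed
  )
}

# Made-up stand-in for a filing: a table of contents followed by a statement.
html <- '
<html><body>
  <table>
    <tr><td>Consolidated Statement of Cash Flows</td><td>51</td></tr>
  </table>
  <div align="CENTER">
    <table>
      <tr><td>Cash Flow From Operating Activities</td><td>100</td></tr>
      <tr><td>Cash Flow From Investing Activities</td><td>-40</td></tr>
    </table>
  </div>
</body></html>'

doc <- read_html(html)
res <- find_tables_by_keywords(
  doc,
  c("operating activities", "investing activities")  # assumed row labels
)

res$xpath        # XPath of the table that matched
res$tables[[1]]  # the matching table
```

With a real filing, `doc <- read_html(url)` would take the place of the inline HTML. Matching on row labels rather than on the statement title is a deliberate choice here, since the title also appears in the table of contents and is often outside the table itself. Filings that word their row labels differently will need different keywords, and the earnings statement and balance sheet need keyword sets of their own.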

