How to download multiple Excel sheets from a website into a Pandas DataFrame using Python

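Problem description

I am trying to create a time series from historical data stored in Excel sheets on a website. The site has Excel spreadsheets organized by year (i.e. financial futures positions for 2009, 2010, 2011, ...).

Is there a way to pull all the relevant files at once for use in a DataFrame?

I am very new to Python. My first thought was to download each file manually as an Excel document and then read them into a DataFrame with Python. I am trying to find a more elegant solution for this process.

Website URL: https://www.cftc.gov/MarketReports/CommitmentsofTraders/HistoricalCompressed/index.htm

The page has several groups of files. I am trying to find a way to select a specific file or group of files. I am currently googling solutions that involve parsing the site's HTML with Beautiful Soup or something similar.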

Tags: python, excel, dataframe, web-scraping

Solution


There's probably a more elegant way to find the <p> tag with the associated table of zip files/links you want, but this seemed to get the job done.

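You may also want to double-check that everything is there. For some of the files it emits a warning, "WARNING *** OLE2 inconsistency: SSCS size is 0 but SSAT size is non-zero", but the data still appears to be there.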
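One possible way to do that double-check, once the combined DataFrame exists, is to count rows per report year and look for missing or unusually small years. This is only a sketch: `rows_per_year` is a hypothetical helper, the column name `Report_Date_as_MM_DD_YYYY` is taken from the output shown further down, and the small `demo` frame is made-up data standing in for the real results.

```python
import pandas as pd

def rows_per_year(df, date_col="Report_Date_as_MM_DD_YYYY"):
    """Count rows per calendar year, based on the report-date column."""
    dates = pd.to_datetime(df[date_col])
    return dates.dt.year.value_counts().sort_index()

# Made-up frame standing in for the combined results (assumption)
demo = pd.DataFrame({
    "Report_Date_as_MM_DD_YYYY": ["2018-12-25", "2019-08-13", "2019-08-20"],
    "Open_Interest_All": [1, 2, 3],
})

counts = rows_per_year(demo)
print(counts)
```

With the real data you would call `rows_per_year(results)` and check that every expected year appears with a plausible number of rows.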

Code:

import pandas as pd
import requests
from bs4 import BeautifulSoup
from zipfile import ZipFile
from io import BytesIO

url = 'https://www.cftc.gov/MarketReports/CommitmentsofTraders/HistoricalCompressed/index.htm'

response = requests.get(url)
response.raise_for_status()
soup = BeautifulSoup(response.text, 'html.parser')

# Find the <p> tag that introduces the specific table you wanted
marker = 'The complete Commitments of Traders Futures Only reports'
p = next((tag for tag in soup.find_all('p') if marker in tag.text), None)
if p is None:
    raise ValueError('Could not find the paragraph that precedes the table')

# Get the table that follows it, which holds the zip links
table = p.find_next_sibling('table')
a_tags = table.find_all('a', string='Excel')  # `string` replaces the deprecated `text` argument

# Create a list of absolute URLs from those links
files_list = ['https://www.cftc.gov' + a['href'] for a in a_tags]

# Iterate through those links, read the Excel file inside each zip, and collect the frames
frames = []
for file_name in files_list[:-1]:  # [:-1] skips the last link in the table
    year = file_name.split('_')[-1].split('.')[0]
    content = requests.get(file_name)
    zf = ZipFile(BytesIO(content.content))
    excel_file = zf.namelist()[0]

    temp_df = pd.read_excel(zf.open(excel_file))
    frames.append(temp_df)
    print('Received: %s' % year)

# DataFrame.append was removed in pandas 2.0, so concatenate once at the end
results = pd.concat(frames, sort=True).reset_index(drop=True)

Output:

print (results.head(5).to_string())
   As_of_Date_In_Form_YYMMDD  CFTC_Commodity_Code CFTC_Contract_Market_Code CFTC_Market_Code  CFTC_Region_Code  Change_in_Comm_Long_All  Change_in_Comm_Short_All  Change_in_NonComm_Long_All  Change_in_NonComm_Short_All  Change_in_NonComm_Spead_All  Change_in_NonRept_Long_All  Change_in_NonRept_Short_All  Change_in_Open_Interest_All  Change_in_Tot_Rept_Long_All  Change_in_Tot_Rept_Short_All  Comm_Positions_Long_All  Comm_Positions_Long_Old  Comm_Positions_Long_Other  Comm_Positions_Short_All  Comm_Positions_Short_Old  Comm_Positions_Short_Other  Conc_Gross_LE_4_TDR_Long_All  Conc_Gross_LE_4_TDR_Long_Old  Conc_Gross_LE_4_TDR_Long_Other  Conc_Gross_LE_4_TDR_Short_All  Conc_Gross_LE_4_TDR_Short_Old  Conc_Gross_LE_4_TDR_Short_Other  Conc_Gross_LE_8_TDR_Long_All  Conc_Gross_LE_8_TDR_Long_Old  Conc_Gross_LE_8_TDR_Long_Other  Conc_Gross_LE_8_TDR_Short_All  Conc_Gross_LE_8_TDR_Short_Old  Conc_Gross_LE_8_TDR_Short_Other  Conc_Net_LE_4_TDR_Long_All  Conc_Net_LE_4_TDR_Long_Old  Conc_Net_LE_4_TDR_Long_Other  Conc_Net_LE_4_TDR_Short_All  Conc_Net_LE_4_TDR_Short_Old  Conc_Net_LE_4_TDR_Short_Other  Conc_Net_LE_8_TDR_Long_All  Conc_Net_LE_8_TDR_Long_Old  Conc_Net_LE_8_TDR_Long_Other  Conc_Net_LE_8_TDR_Short_All  Conc_Net_LE_8_TDR_Short_Old  Conc_Net_LE_8_TDR_Short_Other                Contract_Units           Market_and_Exchange_Names  NonComm_Positions_Long_All  NonComm_Positions_Long_Old  NonComm_Positions_Long_Other  NonComm_Positions_Short_All  NonComm_Positions_Short_Old  NonComm_Positions_Short_Other  NonComm_Positions_Spread_Old  NonComm_Positions_Spread_Other  NonComm_Postions_Spread_All  NonRept_Positions_Long_All  NonRept_Positions_Long_Old  NonRept_Positions_Long_Other  NonRept_Positions_Short_All  NonRept_Positions_Short_Old  NonRept_Positions_Short_Other  Open_Interest_All  Open_Interest_Old  Open_Interest_Other  Pct_of_OI_Comm_Long_All  Pct_of_OI_Comm_Long_Old  Pct_of_OI_Comm_Long_Other  Pct_of_OI_Comm_Short_All  Pct_of_OI_Comm_Short_Old  Pct_of_OI_Comm_Short_Other  
Pct_of_OI_NonComm_Long_All  Pct_of_OI_NonComm_Long_Old  Pct_of_OI_NonComm_Long_Other  Pct_of_OI_NonComm_Short_All  Pct_of_OI_NonComm_Short_Old  Pct_of_OI_NonComm_Short_Other  Pct_of_OI_NonComm_Spread_All  Pct_of_OI_NonComm_Spread_Old  Pct_of_OI_NonComm_Spread_Other  Pct_of_OI_NonRept_Long_All  Pct_of_OI_NonRept_Long_Old  Pct_of_OI_NonRept_Long_Other  Pct_of_OI_NonRept_Short_All  Pct_of_OI_NonRept_Short_Old  Pct_of_OI_NonRept_Short_Other  Pct_of_OI_Tot_Rept_Long_All  Pct_of_OI_Tot_Rept_Long_Old  Pct_of_OI_Tot_Rept_Long_Other  Pct_of_OI_Tot_Rept_Short_All  Pct_of_OI_Tot_Rept_Short_Old  Pct_of_OI_Tot_Rept_Short_Other  Pct_of_Open_Interest_All  Pct_of_Open_Interest_Old  Pct_of_Open_Interest_Other Report_Date_as_MM_DD_YYYY  Tot_Rept_Positions_Long_All  Tot_Rept_Positions_Long_Old  Tot_Rept_Positions_Long_Other  Tot_Rept_Positions_Short_All  Tot_Rept_Positions_Short_Old  Tot_Rept_Positions_Short_Other  Traders_Comm_Long_All  Traders_Comm_Long_Old  Traders_Comm_Long_Other  Traders_Comm_Short_All  Traders_Comm_Short_Old  Traders_Comm_Short_Other  Traders_NonComm_Long_All  Traders_NonComm_Long_Old  Traders_NonComm_Long_Other  Traders_NonComm_Short_All  Traders_NonComm_Short_Old  Traders_NonComm_Short_Other  Traders_NonComm_Spead_Old  Traders_NonComm_Spread_All  Traders_NonComm_Spread_Other  Traders_Tot_All  Traders_Tot_Old  Traders_Tot_Other  Traders_Tot_Rept_Long_All  Traders_Tot_Rept_Long_Old  Traders_Tot_Rept_Long_Other  Traders_Tot_Rept_Short_All  Traders_Tot_Rept_Short_Old  Traders_Tot_Rept_Short_Other
0                     190910                    1                    001602             CBT                  0                  -4068.0                    4892.0                      6487.0                      -2906.0                       8280.0                      -944.0                       -511.0                       9755.0                      10699.0                       10266.0                   107772                    92050                      15722                    102893                     86995                       15898                          11.9                          12.6                            31.6                           11.6                           12.6                             26.7                          21.1                          22.3                            46.5                           20.0                           21.9                             39.6                        10.8                        11.1                          31.5                         10.0                         10.4                           23.4                        17.4                        18.6                          45.4                         15.5                         16.7                           36.1  (CONTRACTS OF 5,000 BUSHELS)  WHEAT-SRW - CHICAGO BOARD OF TRADE                      121056                      112747                         22502                       111727                       105822                          20098                         88722                            6159                       109074                       25816                       20362                          5454                        40024                        32342                           7682             363718             313881                49837                     29.6                     29.3                       31.5                      28.3                      27.7                        31.9    
                    33.3                        35.9                          45.2                         30.7                         33.7                           40.3                          30.0                          28.3                            12.4                         7.1                         6.5                          10.9                         11.0                         10.3                           15.4                         92.9                         93.5                           89.1                          89.0                          89.7                            84.6                       100                       100                         100                2019-09-10                       337902                       293519                          44383                        323694                        281539                           42155                     80                     72                       34                      92                      87                        49                       101                       104                          35                        115                        109                           41                        103                         118                            20              346              338                145                        252                        232                           83                         262                         247                            99
1                     190903                    1                    001602             CBT                  0                   -703.0                  -15572.0                     -3482.0                      13336.0                       -702.0                       337.0                      -1612.0                      -4550.0                      -4887.0                       -2938.0                   111840                    97821                      14019                     98001                     82374                       15627                          13.2                          13.7                            32.6                           11.9                           13.0                             25.9                          21.9                          22.9                            45.3                           20.1                           22.2                             37.7                        12.1                        12.2                          32.6                          9.9                         10.4                           22.2                        18.5                        19.0                          44.3                         16.1                         17.7                           33.7  (CONTRACTS OF 5,000 BUSHELS)  WHEAT-SRW - CHICAGO BOARD OF TRADE                      114569                      103964                         22404                       114633                       108323                          18109                         83004                            5991                       100794                       26760                       21199                          5561                        40535                        32287                           8248             353963             305988                47975                     31.6                     32.0                       29.2                      27.7                      26.9                        32.6    
                    32.4                        34.0                          46.7                         32.4                         35.4                           37.7                          28.5                          27.1                            12.5                         7.6                         6.9                          11.6                         11.5                         10.6                           17.2                         92.4                         93.1                           88.4                          88.5                          89.4                            82.8                       100                       100                         100                2019-09-03                       327203                       284789                          42414                        313428                        273701                           39727                     81                     74                       35                      88                      84                        50                        90                        94                          36                        128                        123                           39                         95                         110                            20              345              338                143                        243                        222                           83                         261                         250                            98
2                     190827                    1                    001602             CBT                  0                 -18756.0                  -10204.0                      5094.0                      -3903.0                     -13782.0                     -1379.0                       -934.0                     -28823.0                     -27444.0                      -27889.0                   112543                   101309                      11234                    113573                     98886                       14687                          12.8                          13.1                            32.9                           12.5                           14.3                             25.9                          20.6                          21.8                            47.3                           21.1                           23.2                             37.5                        11.4                        11.6                          32.1                          9.9                         11.2                           22.1                        17.8                        18.7                          45.1                         16.5                         18.4                           31.3  (CONTRACTS OF 5,000 BUSHELS)  WHEAT-SRW - CHICAGO BOARD OF TRADE                      118051                      108685                         22347                       101297                        97801                          16477                         81990                            6525                       101496                       26423                       20736                          5687                        42147                        34043                           8104             358513             312720                45793                     31.4                     32.4                       24.5                      31.7                      31.6                        32.1    
                    32.9                        34.8                          48.8                         28.3                         31.3                           36.0                          28.3                          26.2                            14.2                         7.4                         6.6                          12.4                         11.8                         10.9                           17.7                         92.6                         93.4                           87.6                          88.2                          89.1                            82.3                       100                       100                         100                2019-08-27                       332090                       291984                          40106                        316366                        278677                           37689                     85                     81                       30                      96                      94                        51                        99                       104                          35                        110                        106                           38                        103                         116                            20              341              336                139                        252                        238                           76                         264                         255                            99
3                     190820                    1                    001602             CBT                  0                   8679.0                   -1358.0                     -5449.0                       3109.0                       -361.0                     -1090.0                        389.0                       1779.0                       2869.0                        1390.0                   131299                   119922                      11377                    123777                    107310                       16467                          12.2                          12.5                            30.0                           11.0                           12.1                             26.1                          19.9                          20.6                            46.2                           19.7                           21.2                             37.7                        10.0                         9.8                          29.7                          8.4                          9.4                           22.3                        15.8                        16.4                          43.7                         13.9                         15.5                           32.0  (CONTRACTS OF 5,000 BUSHELS)  WHEAT-SRW - CHICAGO BOARD OF TRADE                      112957                      104967                         21051                       105200                       104347                          13914                         96015                            6202                       115278                       27802                       22317                          5485                        43081                        35549                           7532             387336             343221                44115                     33.9                     34.9                       25.8                      32.0                      31.3                        37.3    
                    29.2                        30.6                          47.7                         27.2                         30.4                           31.5                          29.8                          28.0                            14.1                         7.2                         6.5                          12.4                         11.1                         10.4                           17.1                         92.8                         93.5                           87.6                          88.9                          89.6                            82.9                       100                       100                         100                2019-08-20                       359534                       320904                          38630                        344255                        307672                           36583                     95                     93                       31                      98                      98                        55                        98                       102                          35                        113                        113                           40                        118                         127                            19              350              348                144                        271                        261                           78                         273                         270                           102
4                     190813                    1                    001602             CBT                  0                 -13926.0                  -18764.0                     -2482.0                       2663.0                       4055.0                     -1910.0                      -2217.0                     -14263.0                     -12353.0                      -12046.0                   122620                   112079                      10541                    125135                    107483                       17652                          11.2                          11.2                            30.6                           11.7                           12.5                             25.1                          19.1                          19.6                            44.8                           19.5                           21.0                             38.1                        10.4                        10.3                          30.0                          8.5                         10.1                           22.9                        16.0                        16.7                          42.2                         14.4                         16.2                           33.3  (CONTRACTS OF 5,000 BUSHELS)  WHEAT-SRW - CHICAGO BOARD OF TRADE                      118406                      110864                         21133                       102091                       103554                          12128                         96048                            6000                       115639                       28892                       23513                          5379                        42692                        35419                           7273             385557             342504                43053                     31.8                     32.7                       24.5                      32.5                      31.4                        41.0    
                    30.7                        32.4                          49.1                         26.5                         30.2                           28.2                          30.0                          28.0                            13.9                         7.5                         6.9                          12.5                         11.1                         10.3                           16.9                         92.5                         93.1                           87.5                          88.9                          89.7                            83.1                       100                       100                         100                2019-08-13                       356665                       318991                          37674                        342865                        307085                           35780                     85                     83                       31                     108                     106                        56                       102                       107                          33                        111                        108                           35                        119                         127                            19              355              352                139                        261                        252                           75                         285                         279                            99
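
Since the goal in the question is a time series, the combined frame can then be narrowed to one market and indexed by report date. The following is a minimal sketch under stated assumptions, not part of the original answer: `market_time_series` is a hypothetical helper, the column names come from the output above, and the `demo` frame reuses three WHEAT-SRW rows from that output plus one made-up row for another market.

```python
import pandas as pd

def market_time_series(df, market, value_col="Open_Interest_All",
                       date_col="Report_Date_as_MM_DD_YYYY",
                       name_col="Market_and_Exchange_Names"):
    """Return value_col for a single market as a Series indexed by report date."""
    subset = df.loc[df[name_col] == market, [date_col, value_col]].copy()
    subset[date_col] = pd.to_datetime(subset[date_col])
    return subset.set_index(date_col)[value_col].sort_index()

wheat = "WHEAT-SRW - CHICAGO BOARD OF TRADE"
demo = pd.DataFrame({
    "Market_and_Exchange_Names": [wheat, wheat, wheat, "SOME OTHER MARKET"],  # last row is made up
    "Report_Date_as_MM_DD_YYYY": ["2019-09-10", "2019-09-03", "2019-08-27", "2019-09-10"],
    "Open_Interest_All": [363718, 353963, 358513, 1],
})

ts = market_time_series(demo, wheat)
print(ts)
```

With the real data, `market_time_series(results, wheat)` would give the open interest for that contract in date order, ready for plotting or resampling.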
