Pandas read_excel() fails to load all rows of a downloaded xlsx file

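Problem description

The file should contain thousands of rows, but reading it as shown below returns only the first few rows in the DataFrame.

File: https://www.hkex.com.hk/eng/services/trading/securities/securitieslists/ListOfSecurities.xlsx

Failing example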

import pandas as pd

url = 'https://www.hkex.com.hk/eng/services/trading/securities/securitieslists/ListOfSecurities.xlsx'
df = pd.read_excel(url, engine='openpyxl', header=2, usecols='A:D', verbose=True)
print(df.shape)
# Output: only 5 rows
# Reading sheet 0
# (5, 4)

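Working example

Same file. First download it, open it in Excel, edit some text, and save it (without changing the formatting and keeping the xlsx format). Then open the saved file with read_excel().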

import os

import pandas as pd
import wget

url = 'https://www.hkex.com.hk/eng/services/trading/securities/securitieslists/ListOfSecurities.xlsx'
path = os.path.join(os.path.dirname(__file__), 'download')
os.makedirs(path, exist_ok=True)  # make sure the download directory exists
wget.download(url, out=path)
file = os.path.join(path, 'ListOfSecurities.xlsx')

# Open the downloaded file in Excel, edit some text, and save it (still as xlsx)

df = pd.read_excel(file, engine='openpyxl', header=2, usecols='A:D', verbose=True)
print(df.shape)
# Output:
# Reading sheet 0
# (17490, 4)
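Since re-saving the file in Excel is enough to fix the row count, the difference is probably in the file's metadata rather than in its data. One thing that may be worth checking (this is an assumption, not something confirmed in the question) is the dimension element that each worksheet declares. As far as I know, pandas opens xlsx files with openpyxl in read-only mode, and that mode relies on the declared dimensions being correct. The sketch below uses only the standard library; declared_dimensions is a hypothetical helper name, and it assumes the worksheet parts sit under xl/worksheets/ as they conventionally do.

```python
import re
import zipfile


def declared_dimensions(xlsx_path):
    """Return {worksheet part name: declared dimension ref, or None if absent}."""
    result = {}
    with zipfile.ZipFile(xlsx_path) as archive:
        for name in archive.namelist():
            # Assumption: worksheet XML parts live under xl/worksheets/
            if name.startswith("xl/worksheets/") and name.endswith(".xml"):
                # The dimension element sits near the top of the part,
                # so only the first few KiB are read
                with archive.open(name) as part:
                    head = part.read(4096).decode("utf-8", errors="replace")
                match = re.search(r'<(?:\w+:)?dimension\b[^>]*\bref="([^"]+)"', head)
                result[name] = match.group(1) if match else None
    return result
```

Calling declared_dimensions(file) on the freshly downloaded copy and on the copy re-saved in Excel would show whether the two declare different ranges. If the downloaded copy declares a range of only a few rows, openpyxl's read-only worksheets offer reset_dimensions() to ignore it.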

Tags: python, excel, pandas, openpyxl

Solution

Update: the code was changed because xlrd cannot be used in this context.

import os

import pandas as pd
import wget
from openpyxl import load_workbook

url = 'https://www.hkex.com.hk/eng/services/trading/securities/securitieslists/ListOfSecurities.xlsx'
path = os.path.join(os.path.dirname(__file__), 'download')
os.makedirs(path, exist_ok=True)  # make sure the download directory exists
wget.download(url, out=path)
filename = os.path.join(path, 'ListOfSecurities.xlsx')

# Load the workbook in openpyxl's default (non-read-only) mode
excel_file = load_workbook(filename)
sheet = excel_file["ListOfSecurities"]
sheet.delete_cols(5, 21)  # delete 21 columns starting at column E, keeping A:D

data = sheet.values  # generator yielding one tuple of cell values per row
next(data)  # skip row 0
next(data)  # skip row 1
cols = next(data)[0:4]  # row 2 holds the header; keep columns A:D

df = pd.DataFrame(data, columns=cols)

print(df.shape)
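The row-skipping and column-slicing steps can also be written as a small helper that works on any iterable of row tuples, which avoids deleting columns from the worksheet first. This is only a sketch: split_header_and_rows and its parameters are hypothetical names, and the defaults mirror header=2 and usecols='A:D' from the question.

```python
from itertools import islice


def split_header_and_rows(rows, header_row=2, first_col=0, last_col=4):
    """Skip `header_row` leading rows, then return (column names, data rows).

    Every row is sliced to [first_col:last_col], so no columns have to be
    deleted from the worksheet beforehand.
    """
    iterator = iter(rows)
    # Consume and discard the rows above the header (rows 0 and 1 by default)
    for _ in islice(iterator, header_row):
        pass
    header = next(iterator)[first_col:last_col]
    data = [row[first_col:last_col] for row in iterator]
    return list(header), data
```

With the workbook loaded as above, cols, data = split_header_and_rows(sheet.values) followed by pd.DataFrame(data, columns=cols) should give the same frame without calling delete_cols.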

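I changed the Excel engine to xlrd, which was the default engine in pandas, and the following code works. Note that xlrd 2.0 and later can only read .xls files.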

import os

import pandas as pd
import wget

url = 'https://www.hkex.com.hk/eng/services/trading/securities/securitieslists/ListOfSecurities.xlsx'
path = os.path.join(os.path.dirname(__file__), 'download')
os.makedirs(path, exist_ok=True)  # make sure the download directory exists
wget.download(url, out=path)
filename = os.path.join(path, 'ListOfSecurities.xlsx')

# No engine argument: pandas fell back to its default engine (xlrd when this was written)
df = pd.read_excel(filename, header=2, usecols='A:D', verbose=True)
print(df.shape)

One inconsistency in the output is that it reports 4 fewer rows (17486 instead of 17490):

Reading sheet 0
(17486, 4)
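The answer does not explain the 4-row difference. One unverified guess is that some rows are entirely empty and the two readers treat them differently. A quick way to test that guess is to count such rows in whatever row tuples a reader returns; count_blank_rows below is a hypothetical helper, not part of pandas or openpyxl.

```python
def count_blank_rows(rows):
    """Count rows whose cells are all None or whitespace-only strings."""

    def is_blank(cell):
        return cell is None or (isinstance(cell, str) and not cell.strip())

    return sum(1 for row in rows if all(is_blank(cell) for cell in row))
```

For example, count_blank_rows(sheet.values) on the openpyxl worksheet would show whether blank rows could account for the gap. If it returns 0, the cause lies elsewhere.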
