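python-3.x - Concatenating multiple DataFrames produced by pd.read_html
Problem description
My title may not make much sense, so let me briefly describe the situation.
I am scraping data from a site that is essentially one table, but in this case each row is its own table element, and every odd-indexed table element is useless, so I am discarding those.
What I want is to concatenate the individual DataFrames that read_html() produces from each even-indexed table element.
Here is my code: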
import pandas as pd
all_table = ["""<table cellpadding="0" cellspacing="0" cols="8" width="100%">
<tbody><tr height="10px">
<td align="right" colspan="9">
<font color="#D5D5D5">.</font>
</td>
</tr>
<tr height="30px" valign="middle" width="100%">
<td class="size-12" colspan="8" width="100%">
<strong>Shipment Status</strong>
</td>
</tr>
<tr valign="bottom" width="100%">
<td align="center" class="size-10" width="10%">
<strong>Station</strong>
</td>
<td align="center" class="size-10" width="10%">
<strong>Flight No.</strong>
</td>
<td align="center" class="size-10" width="25%">
<strong>Status</strong>
</td>
<td align="center" class="size-10" width="15%">
<strong>Date</strong>
</td>
<td align="center" class="size-10" width="9%">
<strong>Time</strong>
</td>
<td align="center" class="size-10" width="8%">
<strong>Pcs</strong>
</td>
<td align="center" class="size-10" width="8%">
<strong>Wgt</strong>
</td>
<td align="center" class="size-10" width="15%">
<strong>ULD - Battery - Temp</strong>
</td>
</tr>
<tr bgcolor="#F0F0F0" class="result-row">
<td align="center" class="size-10" width="10%">KIX</td>
<td align="center" class="size-10" width="10%">
<center>-</center>
</td>
<td align="center" class="size-10" width="25%">Shipment Received</td>
<td align="center" class="size-10" width="15%">11 Oct 2019</td>
<td align="center" class="size-10" width="9%"> 22:45</td>
<td align="center" class="size-10" width="8%">34</td>
<td align="center" class="size-10" width="8%">411.3</td>
<td align="center" class="size-10" width="15%"></td>
</tr>
</tbody></table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
</table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
<tbody><tr bgcolor="#FFFFFF" class="result-row">
<td align="center" class="size-10" width="10%">KIX</td>
<td align="center" class="size-10" width="10%">
<center>-</center>
</td>
<td align="center" class="size-10" width="25%">Freight On Hand</td>
<td align="center" class="size-10" width="15%">11 Oct 2019</td>
<td align="center" class="size-10" width="9%"> 22:45</td>
<td align="center" class="size-10" width="8%">34</td>
<td align="center" class="size-10" width="8%">411.3</td>
<td align="center" class="size-10" width="15%"></td>
</tr>
</tbody></table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
</table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
<tbody><tr bgcolor="#F0F0F0" class="result-row">
<td align="center" class="size-10" width="10%">KIX</td>
<td align="center" class="size-10" width="10%">SQ0621</td>
<td align="center" class="size-10" width="25%">Flight Departed</td>
<td align="center" class="size-10" width="15%">13 Oct 2019</td>
<td align="center" class="size-10" width="9%"> 17:18</td>
<td align="center" class="size-10" width="8%">34</td>
<td align="center" class="size-10" width="8%">411.3</td>
<td align="center" class="size-10" width="15%"></td>
</tr>
</tbody></table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
</table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
<tbody><tr bgcolor="#FFFFFF" class="result-row">
<td align="center" class="size-10" width="10%">SIN</td>
<td align="center" class="size-10" width="10%">SQ0621</td>
<td align="center" class="size-10" width="25%">Flight Arrived</td>
<td align="center" class="size-10" width="15%">13 Oct 2019</td>
<td align="center" class="size-10" width="9%"> 23:02</td>
<td align="center" class="size-10" width="8%">34</td>
<td align="center" class="size-10" width="8%">411.3</td>
<td align="center" class="size-10" width="15%"></td>
</tr>
</tbody></table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
</table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
<tbody><tr bgcolor="#F0F0F0" class="result-row">
<td align="center" class="size-10" width="10%">SIN</td>
<td align="center" class="size-10" width="10%">SQ0621</td>
<td align="center" class="size-10" width="25%">Flight Arrived</td>
<td align="center" class="size-10" width="15%">13 Oct 2019</td>
<td align="center" class="size-10" width="9%"> 23:02</td>
<td align="center" class="size-10" width="8%">34</td>
<td align="center" class="size-10" width="8%">411.3</td>
<td align="center" class="size-10" width="15%"></td>
</tr>
</tbody></table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
</table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
<tbody><tr bgcolor="#FFFFFF" class="result-row">
<td align="center" class="size-10" width="10%">SIN</td>
<td align="center" class="size-10" width="10%">SQ0621</td>
<td align="center" class="size-10" width="25%">Shipment Checked Into Warehouse</td>
<td align="center" class="size-10" width="15%">14 Oct 2019</td>
<td align="center" class="size-10" width="9%"> 02:57</td>
<td align="center" class="size-10" width="8%">34</td>
<td align="center" class="size-10" width="8%">411.3</td>
<td align="center" class="size-10" width="15%"></td>
</tr>
</tbody></table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
</table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
<tbody><tr bgcolor="#F0F0F0" class="result-row">
<td align="center" class="size-10" width="10%">SIN</td>
<td align="center" class="size-10" width="10%">SQ0422</td>
<td align="center" class="size-10" width="25%">Flight Departed</td>
<td align="center" class="size-10" width="15%">14 Oct 2019</td>
<td align="center" class="size-10" width="9%"> 07:39</td>
<td align="center" class="size-10" width="8%">34</td>
<td align="center" class="size-10" width="8%">411.3</td>
<td align="center" class="size-10" width="15%"></td>
</tr>
</tbody></table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
</table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
<tbody><tr bgcolor="#FFFFFF" class="result-row">
<td align="center" class="size-10" width="10%">BOM</td>
<td align="center" class="size-10" width="10%">SQ0422</td>
<td align="center" class="size-10" width="25%">Flight Arrived</td>
<td align="center" class="size-10" width="15%">14 Oct 2019</td>
<td align="center" class="size-10" width="9%"> 10:12</td>
<td align="center" class="size-10" width="8%">34</td>
<td align="center" class="size-10" width="8%">411.3</td>
<td align="center" class="size-10" width="15%"></td>
</tr>
</tbody></table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
</table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
<tbody><tr bgcolor="#F0F0F0" class="result-row">
<td align="center" class="size-10" width="10%">BOM</td>
<td align="center" class="size-10" width="10%">SQ0422</td>
<td align="center" class="size-10" width="25%">Flight Arrived</td>
<td align="center" class="size-10" width="15%">14 Oct 2019</td>
<td align="center" class="size-10" width="9%"> 10:30</td>
<td align="center" class="size-10" width="8%">34</td>
<td align="center" class="size-10" width="8%">411.3</td>
<td align="center" class="size-10" width="15%"></td>
</tr>
</tbody></table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
</table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
<tbody><tr bgcolor="#FFFFFF" class="result-row">
<td align="center" class="size-10" width="10%">BOM</td>
<td align="center" class="size-10" width="10%">SQ0422</td>
<td align="center" class="size-10" width="25%">Shipment Checked Into Warehouse</td>
<td align="center" class="size-10" width="15%">14 Oct 2019</td>
<td align="center" class="size-10" width="9%"> 14:10</td>
<td align="center" class="size-10" width="8%">34</td>
<td align="center" class="size-10" width="8%">411.3</td>
<td align="center" class="size-10" width="15%"></td>
</tr>
</tbody></table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
</table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
<tbody><tr bgcolor="#F0F0F0" class="result-row">
<td align="center" class="size-10" width="10%">BOM</td>
<td align="center" class="size-10" width="10%">
<center>-</center>
</td>
<td align="center" class="size-10" width="25%">Shipment Ready for Pick-up</td>
<td align="center" class="size-10" width="15%">14 Oct 2019</td>
<td align="center" class="size-10" width="9%"> 14:21</td>
<td align="center" class="size-10" width="8%">34</td>
<td align="center" class="size-10" width="8%">411.3</td>
<td align="center" class="size-10" width="15%"></td>
</tr>
</tbody></table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
</table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
<tbody><tr bgcolor="#FFFFFF" class="result-row">
<td align="center" class="size-10" width="10%">BOM</td>
<td align="center" class="size-10" width="10%">
<center>-</center>
</td>
<td align="center" class="size-10" width="25%">Document Delivered</td>
<td align="center" class="size-10" width="15%">14 Oct 2019</td>
<td align="center" class="size-10" width="9%"> 17:15</td>
<td align="center" class="size-10" width="8%">34</td>
<td align="center" class="size-10" width="8%">411.3</td>
<td align="center" class="size-10" width="15%"></td>
</tr>
</tbody></table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
</table>, <table cellpadding="0" cellspacing="0" cols="8" width="100%">
<tbody><tr bgcolor="#F0F0F0" class="result-row">
<td align="center" class="size-10" width="10%">BOM</td>
<td align="center" class="size-10" width="10%">
<center>-</center>
</td>
<td align="center" class="size-10" width="25%">Shipment Delivered</td>
<td align="center" class="size-10" width="15%">14 Oct 2019</td>
<td align="center" class="size-10" width="9%"> 17:15</td>
<td align="center" class="size-10" width="8%">34</td>
<td align="center" class="size-10" width="8%">411.3</td>
<td align="center" class="size-10" width="15%"></td>
</tr>
</tbody></table>"""]
final_delivary = pd.DataFrame()
a = 0
for i in range(len(all_table)):
    print("-"*150)
    if a % 2 == 0:
        print(a)
        # print(all_table[a])
        tmp_table = all_table[a]
        tmp_df = pd.read_html(str(tmp_table))
        print("tmp_df = \n", tmp_df)
        print("type of tmp_df = ", type(tmp_df))
        print("#"*75)
        tmp_df2 = pd.DataFrame(tmp_df[0])
        print("tmp_df2 = \n", tmp_df2)
        print("type of tmp_df2 = ", type(tmp_df2))
        print("@"*75)
        print("final_delivary = \n", final_delivary)
        print("type of final_delivary = ", type(final_delivary))
        pd.concat([final_delivary, tmp_df2], axis=0)
    else:
        print("nope")
    a+=1
print("final_delivary = ", final_delivary)
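So I am having trouble concatenating the individual DataFrames into the main DataFrame: the result is an empty DataFrame. Please help.
Solution
Try this. pd.concat returns a new DataFrame and does not modify its inputs in place, so the result has to be assigned back to the accumulator.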
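from io import StringIO

from bs4 import BeautifulSoup as bs
import pandas as pd

all_table = '''
html content
'''

finalDf = pd.DataFrame()
soup = bs(all_table, "html.parser")  # name the parser explicitly
tables = soup.find_all("table")
for i, table in enumerate(tables):
    if i % 2 == 0:  # only the even-indexed tables contain data rows
        # read_html returns a list of DataFrames; newer pandas versions
        # expect literal HTML to be wrapped in StringIO
        df = pd.read_html(StringIO(str(table)))
        # concat returns a new DataFrame, so assign it back
        finalDf = pd.concat([finalDf, df[0]])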
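To make the difference concrete, here is a minimal sketch that reproduces the pitfall without any HTML parsing. The frames in row_frames and the column names are hypothetical stand-ins for the one-row DataFrames that pd.read_html would return; they are not taken from the real scrape. It also shows one possible alternative to growing a DataFrame inside the loop: collect the pieces in a list and call pd.concat once.

```python
import pandas as pd

# Hypothetical stand-ins for the one-row frames that pd.read_html
# would return for each even-indexed table.
row_frames = [
    pd.DataFrame([["KIX", "-", "Shipment Received"]]),
    pd.DataFrame([["KIX", "SQ0621", "Flight Departed"]]),
    pd.DataFrame([["SIN", "SQ0621", "Flight Arrived"]]),
]

# Pitfall from the question: the return value of pd.concat is discarded,
# so the accumulator is never updated and stays empty.
discarded = pd.DataFrame()
for frame in row_frames:
    pd.concat([discarded, frame], axis=0)

# Alternative: concatenate the whole list once and renumber the rows.
combined = pd.concat(row_frames, ignore_index=True)
combined.columns = ["Station", "Flight No.", "Status"]  # assumed column names

print(discarded.empty)  # True
print(combined.shape)   # (3, 3)
```

Concatenating once at the end avoids copying the accumulated rows on every iteration, which may matter when there are many tables.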