Pandas read_html does not read the text correctly

Problem description

I have the following text:

text = """<table class="table table-striped">\n <thead>\n <tr>\n <th data-field="placement">Placement</th>\n <th data-field="production">Production</th>\n <th data-field="application">Eng.Vol.</th>\n <th data-field="body">Body No</th>\n <th data-field="eng">Eng No</th>\n <th data-field="eng">Notes</th>\n </tr>\n <tr>\n <td data-field="placement">Front Stabilizer</td>\n <td data-field="production">Oct 16~</td>\n <td data-field="application">1.5 L</td>\n <td data-field="body">HRW18</td>\n <td data-field="eng">L15BY</td>\n <td data-field="note" class="">\n Pos:Left/Right </td>\n </tr>\n <tr>\n <td data-field="placement">Front Stabilizer</td>\n <td data-field="production">Oct 16~</td>\n <td data-field="application">1.5 L</td>\n <td data-field="body">HRW18 LHD</td>\n <td data-field="eng">L15BY</td>\n <td data-field="note" class="">\n Pos:Left/Right </td>\n </tr>\n <tr>\n <td data-field="placement">Front Stabilizer</td>\n <td data-field="production">Oct 16~</td>\n <td data-field="application">1.5 L</td>\n <td data-field="body">HRW28</td>\n <td data-field="eng">L15BY</td>\n <td data-field="note" class="">\n Pos:Left/Right </td>\n </tr>\n <tr>\n <td data-field="placement">Front Stabilizer</td>\n <td data-field="production">Oct 16~</td>\n <td data-field="application">2.0 L</td>\n <td data-field="body">HRW38 RHD</td>\n <td data-field="eng">R20A9</td>\n <td data-field="note" class="">\n Pos:Left/Right </td>\n </tr>\n </thead>\n </table>"""

This HTML text is properly closed with the table tag and contains all the required tags, yet pandas still does not read it as a table with data rows.

Code:

pd.read_html(text)

Output:

[Empty DataFrame
 Columns: [(Placement, Front Stabilizer, Front Stabilizer, Front Stabilizer, Front Stabilizer), (Production, Oct 16~, Oct 16~, Oct 16~, Oct 16~), (Eng.Vol., 1.5 L, 1.5 L, 1.5 L, 2.0 L), (Body No, HRW18, HRW18 LHD, HRW28, HRW38 RHD), (Eng No, L15BY, L15BY, L15BY, R20A9), (Notes, Pos:Left/Right, Pos:Left/Right, Pos:Left/Right, Pos:Left/Right)]
 Index: []]



Solution


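Your whole table, data rows included, is wrapped in <thead></thead>. It is therefore understandable that pandas interprets every row as a column header level and finds no data rows. Since the values end up in the column labels, you can pull them back out of there. Let's try: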

tmp=pd.read_html(text)[0]

pd.DataFrame(tmp.columns.to_frame().values)

Output:

    0           1                 2                 3                 4
--  ----------  ----------------  ----------------  ----------------  ----------------
 0  Placement   Front Stabilizer  Front Stabilizer  Front Stabilizer  Front Stabilizer
 1  Production  Oct 16~           Oct 16~           Oct 16~           Oct 16~
 2  Eng.Vol.    1.5 L             1.5 L             1.5 L             2.0 L
 3  Body No     HRW18             HRW18 LHD         HRW28             HRW38 RHD
 4  Eng No      L15BY             L15BY             L15BY             R20A9
 5  Notes       Pos:Left/Right    Pos:Left/Right    Pos:Left/Right    Pos:Left/Right
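Note that the frame above is transposed relative to the HTML table: each row holds one original column. If the original orientation is wanted, one possible follow-up is to transpose and promote the first row to column names. The sketch below is only an illustration under some assumptions: the helper name header_levels_to_rows is made up here, and the input frame is a shortened, hand-built stand-in for pd.read_html(text)[0] so that the snippet runs without an HTML parser installed.

```python
import pandas as pd


def header_levels_to_rows(frame: pd.DataFrame) -> pd.DataFrame:
    """Rebuild a table whose rows were all parsed as column header levels.

    Hypothetical helper for illustration; not part of pandas.
    """
    # One row per original column, one column per header level.
    levels = pd.DataFrame(frame.columns.to_frame().values)
    # Transpose so that each header level becomes a table row again.
    rows = levels.T
    # Promote the first level to column names; the remaining levels are data.
    result = rows.iloc[1:].reset_index(drop=True)
    result.columns = rows.iloc[0].tolist()
    return result


# Hand-built stand-in for pd.read_html(text)[0]: an empty frame whose
# MultiIndex column labels carry the header plus two data rows.
columns = pd.MultiIndex.from_tuples(
    [
        ("Placement", "Front Stabilizer", "Front Stabilizer"),
        ("Body No", "HRW18", "HRW28"),
    ]
)
tmp = pd.DataFrame(columns=columns)

df = header_levels_to_rows(tmp)
print(df)
```

With the real result of pd.read_html(text)[0] passed in place of the stand-in, the same steps should give one row per <tr> after the header row. All values come back as strings, so numeric columns would still need converting.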
