Is there a way to programmatically edit nested tables in an HTML file using BeautifulSoup?

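Problem description

I am scraping a table from a web page with BeautifulSoup. I managed to get the text into a txt file.

However, some cells contain several tables of their own. I guess the developers had some layout requirements that they could not meet by editing the cells in any other way. I ran into a lot of problems scraping the table as it is, so I would like to know whether there is a way to programmatically edit the HTML so that the text of these nested tables is pulled up into the original cell.

Here is an example of what I mean.

From a nested table like this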

<tr class="table">
             <td class="table" valign="top">
                <p class="tbl-cod">0403</p>
             </td>
             <td class="table" valign="top">
                <p class="tbl-txt">Buttermilk, curdled milk and&nbsp;cream, yoghurt, kephir and other fermented or acidified milk and&nbsp;cream, whether or not concentrated or&nbsp;containing added sugar or other sweetening matter or flavoured or&nbsp;containing added fruit, nuts or&nbsp;cocoa</p>
             </td>
             <td class="table" valign="top">
                <p class="tbl-txt">Manufacture in which:</p>
                <table width="100%" cellspacing="0" cellpadding="0" border="0">
                   <colgroup><col width="4%">
                   <col width="96%">
                   </colgroup><tbody>
                      <tr>
                         <td valign="top">
                            <p class="normal">—&lt;/p>
                         </td>
                         <td valign="top">
                            <p class="normal">all the materials of Chapter&nbsp;4 used are wholly obtained,</p>
                         </td>
                      </tr>
                   </tbody>
                </table>
                <table width="100%" cellspacing="0" cellpadding="0" border="0">
                   <colgroup><col width="4%">
                   <col width="96%">
                   </colgroup><tbody>
                      <tr>
                         <td valign="top">
                            <p class="normal">—&lt;/p>
                         </td>
                         <td valign="top">
                            <p class="normal">all the fruit juice (except that of pineapple, lime or&nbsp;grapefruit) of heading&nbsp;2009 used is originating,</p>
                            <p class="normal">and</p>
                         </td>
                      </tr>
                   </tbody>
                </table>
                <table width="100%" cellspacing="0" cellpadding="0" border="0">
                   <colgroup><col width="4%">
                   <col width="96%">
                   </colgroup><tbody>
                      <tr>
                         <td valign="top">
                            <p class="normal">—&lt;/p>
                         </td>
                         <td valign="top">
                            <p class="normal">the value of all the materials of Chapter&nbsp;17 used does not exceed 30&nbsp;% of the ex-works price of the product</p>
                         </td>
                      </tr>
                   </tbody>
                </table>
             </td>
             <td class="table" valign="top">
                <p class="normal">&nbsp;</p>
             </td>
          </tr>

I would like to edit the HTML file to obtain

<tr class="table">
             <td class="table" valign="top">
                <p class="tbl-cod">0403</p>
             </td>
             <td class="table" valign="top">
                <p class="tbl-txt">Buttermilk, curdled milk and&nbsp;cream, yoghurt, kephir and other fermented or acidified milk and&nbsp;cream, whether or not concentrated or&nbsp;containing added sugar or other sweetening matter or flavoured or&nbsp;containing added fruit, nuts or&nbsp;cocoa</p>
             </td>
             <td class="table" valign="top">
                <p class="tbl-txt">Manufacture in which: all the materials of Chapter&nbsp;4 used are wholly obtained, — all the fruit juice (except that of pineapple, lime or&nbsp;grapefruit) of heading&nbsp;2009 used is originating, — the value of all the materials of Chapter&nbsp;17 used does not exceed 30&nbsp;% of the ex-works price of the product</p>
             </td>
             <td class="table" valign="top">
                <p class="normal">&nbsp;</p>
             </td>
          </tr>

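for every cell that contains nested tables.

Tags: python, html, web-scraping, html-table, beautifulsoup

Solution

Yes, if your HTML always has this structure, you can do it. Find all the columns (td) inside each row (tr), then check whether the column has a child table. If it does, collect the text of all the p tags in that column and use it to replace the text of the first p tag. Finally, decompose() all the table tags in the column.

Code: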

html='''<tr class="table">
             <td class="table" valign="top">
                <p class="tbl-cod">0403</p>
             </td>
             <td class="table" valign="top">
                <p class="tbl-txt">Buttermilk, curdled milk and&nbsp;cream, yoghurt, kephir and other fermented or acidified milk and&nbsp;cream, whether or not concentrated or&nbsp;containing added sugar or other sweetening matter or flavoured or&nbsp;containing added fruit, nuts or&nbsp;cocoa</p>
             </td>
             <td class="table" valign="top">
                <p class="tbl-txt">Manufacture in which:</p>
                <table width="100%" cellspacing="0" cellpadding="0" border="0">
                   <colgroup><col width="4%">
                   <col width="96%">
                   </colgroup><tbody>
                      <tr>
                         <td valign="top">
                            <p class="normal">—&lt;/p>
                         </td>
                         <td valign="top">
                            <p class="normal">all the materials of Chapter&nbsp;4 used are wholly obtained,</p>
                         </td>
                      </tr>
                   </tbody>
                </table>
                <table width="100%" cellspacing="0" cellpadding="0" border="0">
                   <colgroup><col width="4%">
                   <col width="96%">
                   </colgroup><tbody>
                      <tr>
                         <td valign="top">
                            <p class="normal">—&lt;/p>
                         </td>
                         <td valign="top">
                            <p class="normal">all the fruit juice (except that of pineapple, lime or&nbsp;grapefruit) of heading&nbsp;2009 used is originating,</p>
                            <p class="normal">and</p>
                         </td>
                      </tr>
                   </tbody>
                </table>
                <table width="100%" cellspacing="0" cellpadding="0" border="0">
                   <colgroup><col width="4%">
                   <col width="96%">
                   </colgroup><tbody>
                      <tr>
                         <td valign="top">
                            <p class="normal">—&lt;/p>
                         </td>
                         <td valign="top">
                            <p class="normal">the value of all the materials of Chapter&nbsp;17 used does not exceed 30&nbsp;% of the ex-works price of the product</p>
                         </td>
                      </tr>
                   </tbody>
                </table>
             </td>
             <td class="table" valign="top">
                <p class="normal">&nbsp;</p>
             </td>
          </tr>'''

from bs4 import BeautifulSoup

soup = BeautifulSoup(html, 'lxml')
for row in soup.find_all('tr', class_='table'):
    # recursive=False: visit only the row's own cells, not cells of nested tables
    for col in row.find_all('td', recursive=False):
        nested_tables = col.find_all('table')
        if nested_tables:
            # Collect the text of every p tag in the column that contains a table
            ptag_text = ''.join(p.text for p in col.find_all('p'))
            # Replace the text of the first p tag with the combined text
            col.find('p').next_element.replace_with(ptag_text)
            # Remove the nested tables from the column
            for item in nested_tables:
                item.decompose()

print(soup)

Output

<html><body><tr class="table">
<td class="table" valign="top">
<p class="tbl-cod">0403</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa</p>
</td>
<td class="table" valign="top">
<p class="tbl-txt">Manufacture in which:—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product</p>



</td>
<td class="table" valign="top">
<p class="normal"> </p>
</td>
</tr></body></html>
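
In the output above the fragments run together ("which:—all"), whereas the question asked for them to be separated by spaces. The following is a minimal sketch of one way to get that, not part of the original answer. It assumes a hypothetical helper name `flatten_nested_tables`, the standard-library `html.parser` backend instead of `lxml`, a shortened sample row, and that each affected cell has a p tag as a direct child.

```python
from bs4 import BeautifulSoup


def flatten_nested_tables(html, sep=" "):
    """Hypothetical helper: merge nested-table text into each cell's first <p>."""
    soup = BeautifulSoup(html, "html.parser")
    for row in soup.find_all("tr", class_="table"):
        # recursive=False: only the row's own cells, not cells of nested tables
        for cell in row.find_all("td", recursive=False):
            first_p = cell.find("p", recursive=False)
            if first_p is None or cell.find("table") is None:
                continue
            # Text of every <p> in the cell, in document order, trimmed
            parts = [p.get_text(strip=True) for p in cell.find_all("p")]
            first_p.string = sep.join(part for part in parts if part)
            # Remove nested tables one at a time until none are left
            while (table := cell.find("table")) is not None:
                table.decompose()
    return soup


# Shortened sample row (an assumption, modelled on the question's markup)
sample = """
<table><tr class="table">
  <td class="table"><p class="tbl-txt">Manufacture in which:</p>
    <table><tbody><tr><td><p>—</p></td>
      <td><p>all materials are wholly obtained,</p></td></tr></tbody></table>
    <table><tbody><tr><td><p>—</p></td>
      <td><p>the value does not exceed 30 %</p></td></tr></tbody></table>
  </td>
</tr></table>
"""

result = flatten_nested_tables(sample)
flattened_text = result.find("td").p.get_text()
print(flattened_text)
# Manufacture in which: — all materials are wholly obtained, — the value does not exceed 30 %
```

Passing a different `sep` (for example `"; "`) changes how the fragments are joined. Empty fragments, such as a p tag holding only a non-breaking space, are skipped.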

If you don't want these newlines, remove them all with .replace, as shown below.

finalhtml=str(soup).replace('\n','')
print(finalhtml)

Output

<html><body><tr class="table"><td class="table" valign="top"><p class="tbl-cod">0403</p></td><td class="table" valign="top"><p class="tbl-txt">Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa</p></td><td class="table" valign="top"><p class="tbl-txt">Manufacture in which:—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product</p></td><td class="table" valign="top"><p class="normal"> </p></td></tr></body></html>
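
Note that replacing '\n' in the serialized string also removes any newline that sits inside real text content. A possible alternative, sketched here under assumptions, is to drop only the text nodes that consist entirely of ASCII whitespace. The helper name `drop_blank_text_nodes`, the `html.parser` backend, and the sample markup are illustrative and not part of the original answer.

```python
from bs4 import BeautifulSoup


def drop_blank_text_nodes(soup):
    """Hypothetical helper: remove text nodes made only of ASCII whitespace."""
    for node in soup.find_all(string=True):
        # Strip only ASCII whitespace so a lone non-breaking space is kept
        if node.strip(" \t\r\n") == "":
            node.extract()
    return soup


doc = BeautifulSoup(
    "<table><tr>\n<td>\n<p>line one\nline two</p>\n\n</td>\n"
    "<td><p>&nbsp;</p></td></tr></table>",
    "html.parser",
)
drop_blank_text_nodes(doc)
compact = str(doc)
print(compact)
```

The newline between "line one" and "line two" survives because it belongs to a text node that also holds visible text. Only the whitespace between tags is removed.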

Now, if you want to format it again, try this

finalhtml=str(soup).replace('\n','')
soup=BeautifulSoup(finalhtml,'lxml')
print(soup.prettify(formatter=None))

Output

<html>
 <body>
  <tr class="table">
   <td class="table" valign="top">
    <p class="tbl-cod">
     0403
    </p>
   </td>
   <td class="table" valign="top">
    <p class="tbl-txt">
     Buttermilk, curdled milk and cream, yoghurt, kephir and other fermented or acidified milk and cream, whether or not concentrated or containing added sugar or other sweetening matter or flavoured or containing added fruit, nuts or cocoa
    </p>
   </td>
   <td class="table" valign="top">
    <p class="tbl-txt">
     Manufacture in which:—all the materials of Chapter 4 used are wholly obtained,—all the fruit juice (except that of pineapple, lime or grapefruit) of heading 2009 used is originating,and—the value of all the materials of Chapter 17 used does not exceed 30 % of the ex-works price of the product
    </p>
   </td>
   <td class="table" valign="top">
    <p class="normal">
    </p>
   </td>
  </tr>
 </body>
</html>
