Remove `<br>` and `&nbsp;` from an HTML table using Python and Beautiful Soup

Problem description

I want to remove <br> and &nbsp; from a scraped HTML table using Python and bs4.

HTML table:

<tr>
    <td style="width: 15; BORDER-BOTTOM: 1px solid">col1</td>
    <td colspan="2" style="width: 120; BORDER-BOTTOM: 1px solid">&nbsp;col2</td>
    <td style="width: 50; BORDER-BOTTOM: 1px solid">col3</td>
    <td style="width: 50; BORDER-BOTTOM: 1px solid">col5</td>
    <td style="width: 50; BORDER-BOTTOM: 1px solid">col6</td>
    <td style="width: 90; BORDER-BOTTOM: 1px solid" align="center">col7</td>
    <td style="width: 90; BORDER-BOTTOM: 1px solid" align="center">col8</td>
    <td style="width: 10; BORDER-BOTTOM: 1px solid">col9</td>
    <td style="width: 10; BORDER-BOTTOM: 1px solid">col
        <br>&nbsp;1
        <br>0</td>
    <td style="width: 10; BORDER-BOTTOM: 1px solid">col11</td>
    <td style="width: 10; BORDER-BOTTOM: 1px solid" >col12</td>
    <td style="width: 10; BORDER-BOTTOM: 1px solid">col13</td>
    <td style="width: 10; BORDER-BOTTOM: 1px solid">col14</td>
    <td style="width:10;BORDER-BOTTOM: 1px solid;" >col15</td>
</tr>
<tr bordercolor="#000000" class="rows1">
    <td align="left">&nbsp;1</td>
    <td colspan="2" style="BORDER-LEFT: 1px solid" align="left">&nbsp;123456789</td>
    <td style="BORDER-LEFT: 1px solid" align="left">&nbsp;John </td>
    <td style="BORDER-LEFT: 1px solid" align="left">&nbsp;Doe </td>
    <td style="BORDER-LEFT: 1px solid" align="left">&nbsp; </td>
    <td style="BORDER-LEFT: 1px solid" align="right">&nbsp;3.000</td>
    <td style="BORDER-LEFT: 1px solid" align="right">&nbsp;0,00</td>
    <td style="BORDER-LEFT: 1px solid" align="right">&nbsp;30</td>
    <td style="BORDER-LEFT: 1px solid" align="right">&nbsp;0</td>
    <td style="BORDER-LEFT: 1px solid" align="right">&nbsp;</td>
    <td style="BORDER-LEFT: 1px solid" align="right">&nbsp;</td>
    <td style="BORDER-LEFT: 1px solid; BORDER-RIGHT: 1px solid" align="right">&nbsp;</td>
    <td style="BORDER-LEFT: 1px solid; BORDER-RIGHT: 1px solid" align="right">&nbsp;</td>
    <td style="BORDER-LEFT: 1px solid;BORDER-RIGHT: 1px solid;" align="right">&nbsp;5000</td>
</tr>
<tr bordercolor="#000000" class="rows0">
    <td align="left">&nbsp;2</td>
    <td colspan="2" style="BORDER-LEFT: 1px solid" align="left">&nbsp;123456789</td>
    <td style="BORDER-LEFT: 1px solid" align="left">&nbsp;Jane </td>
    <td style="BORDER-LEFT: 1px solid" align="left">&nbsp;Doe </td>
    <td style="BORDER-LEFT: 1px solid" align="left">&nbsp; </td>
    <td style="BORDER-LEFT: 1px solid" align="right">&nbsp;3.000</td>
    <td style="BORDER-LEFT: 1px solid" align="right">&nbsp;0,00</td>
    <td style="BORDER-LEFT: 1px solid" align="right">&nbsp;30</td>
    <td style="BORDER-LEFT: 1px solid" align="right">&nbsp;0</td>
    <td style="BORDER-LEFT: 1px solid" align="right">&nbsp;3</td>
    <td style="BORDER-LEFT: 1px solid" align="right">&nbsp;</td>
    <td style="BORDER-LEFT: 1px solid; BORDER-RIGHT: 1px solid" align="right">&nbsp;</td>
    <td style="BORDER-LEFT: 1px solid; BORDER-RIGHT: 1px solid" align="right">&nbsp;</td>
    <td style="BORDER-LEFT: 1px solid;BORDER-RIGHT: 1px solid;" align="right">&nbsp;5000</td>
</tr>

Python code:

import requests
import bs4

url = "http://www.example.com/test.html"
r = requests.get(url)

html = r.text
soup = bs4.BeautifulSoup(html, 'html.parser')
# Pick the second table on the page
table = soup.find_all('table')[1]

# Print the cells of the first three rows, one row per line
for tr in table.find_all('tr')[0:3]:
    cols = tr.find_all('td')
    for td in cols:
        print('{:5}'.format(td.text), end="")
    print()

Tags: python, web-scraping, html-table, beautifulsoup

Solution


You can use the string `replace` method on the HTML text.

TEXT = '<br>test&nbsp;&nbsp;test'
TEXT = TEXT.replace('<br>', '').replace('&nbsp;', '')
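The `replace` call above works on raw markup. Text read from a parsed cell (such as `tds.text` in the question) normally no longer contains `<br>` tags, and the parser is expected to have decoded `&nbsp;` into the non-breaking space character `'\xa0'`, so a slightly different cleanup may be needed there. The sketch below covers both cases; `clean_markup` and `clean_cell` are hypothetical helper names, not part of bs4, and joining the remaining pieces with a single space is an assumption about the desired output.

```python
import re


def clean_markup(raw: str) -> str:
    """Remove <br> tags and &nbsp; entities from raw, unparsed HTML text."""
    # Match <br>, <br/> and <br /> in any letter case.
    without_breaks = re.sub(r'<br\s*/?>', '', raw, flags=re.IGNORECASE)
    return without_breaks.replace('&nbsp;', '')


def clean_cell(text: str) -> str:
    """Clean text that was already extracted from a parsed cell."""
    # Assumption: the parser decoded &nbsp; into '\xa0'.
    no_nbsp = text.replace('\xa0', '')
    # Collapse the newlines and indentation left where <br> tags were.
    return ' '.join(no_nbsp.split())


print(clean_markup('<br>test&nbsp;&nbsp;test'))
print(clean_cell('\xa0John '))
```

In the question's loop this would be used as `print('{:5}'.format(clean_cell(tds.text)), end="")`.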
