Unable to scrape some data from a table in a customized manner

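Question

I'm trying to parse tabular content from some HTML elements and arrange it in a customized manner so that I can later write it to a CSV file accordingly.

The table looks almost exactly like this.

The HTML elements look like this (truncated):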

<tr>
    <td align="center" colspan="4" class="header">ATLANTIC</td>
</tr>
<tr>
    <td class="black10bold">Facility</td>
    <td class="black10bold">Type</td>
    <td class="black10bold">Funding</td>
</tr>
<tr>
    <td style="width: 55%">
        <a href="fsFacilityDetails.aspx?item=NJ60104"> Complete Care at Linwood, LLC </a>
    </td>
</tr>
<tr>
    <td style="width: 55%">
        <a href="fsFacilityDetails.aspx?item=NJ60102">The Health Center At Galloway</a>
    </td>
</tr>

<tr>
    <td align="center" colspan="4" class="header">BERGEN</td>
</tr>

<tr>
    <td class="black10bold">Facility</td>
    <td class="black10bold">Type</td>
    <td class="black10bold">Funding</td>
</tr>

<tr>
    <td style="width: 55%">
        <a href="fsFacilityDetails.aspx?item=30201">The Actors Fund Homes</a>
    </td>
</tr>
<tr>
    <td style="width: 55%">
        <a href="fsFacilityDetails.aspx?item=NJAL02007"> Actors Fund Home, The </a>
    </td>
</tr>

What I've tried so far:

for item in soup.select("tr"):
    try:
        header = item.select_one("td.header").text
    except AttributeError:
        header = ""
    try:
        item_name = item.select_one("td > a").text
    except AttributeError:
        item_name = ""
    print(item_name,header)

The output it produces:

ATLANTIC
 
Complete Care at Linwood, LLC  
The Health Center At Galloway 

 BERGEN
 
The Actors' Fund Homes
Actors Fund Home, The 

The output I would like to have:

Complete Care at Linwood, LLC  ATLANTIC
The Health Center At Galloway  ATLANTIC
The Actors' Fund Homes         BERGEN
Actors Fund Home, The          BERGEN

Tags: python, python-3.x, web-scraping, beautifulsoup

Answer


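This should produce the output the way you want it. The header is reassigned only when a header row is encountered, so every facility row that follows is printed with the most recent header.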

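header = ""  # avoids a NameError if a link row appears before any header row
for item in soup.select("tr"):
    header_cell = item.select_one("td.header")
    link = item.select_one("td > a")
    if header_cell:
        # Header row: remember it for the facility rows that follow.
        header = header_cell.text
    elif link:
        # Facility row: pair the name with the most recent header.
        item_name = link.text
        print(item_name, header)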
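
Since the goal in the question is to write the pairs to a CSV file later, here is a minimal, self-contained sketch of the same loop that collects rows instead of printing them. It assumes a trimmed-down copy of the markup from the question wrapped in a `<table>` tag. The function names `extract_rows` and `rows_to_csv` and the column names `facility` and `header` are hypothetical, chosen only for illustration. It also swaps `.text` for `get_text(strip=True)` so that the stray spaces around the link text do not end up in the file.

```python
import csv
import io

from bs4 import BeautifulSoup

# Sample markup modeled on the truncated HTML in the question (an assumption,
# not the real page).
SAMPLE_HTML = """
<table>
<tr><td align="center" colspan="4" class="header">ATLANTIC</td></tr>
<tr><td class="black10bold">Facility</td><td class="black10bold">Type</td></tr>
<tr><td style="width: 55%"><a href="fsFacilityDetails.aspx?item=NJ60104"> Complete Care at Linwood, LLC </a></td></tr>
<tr><td style="width: 55%"><a href="fsFacilityDetails.aspx?item=NJ60102">The Health Center At Galloway</a></td></tr>
<tr><td align="center" colspan="4" class="header">BERGEN</td></tr>
<tr><td style="width: 55%"><a href="fsFacilityDetails.aspx?item=NJAL02007"> Actors Fund Home, The </a></td></tr>
</table>
"""


def extract_rows(html):
    """Return (facility, header) pairs, pairing each link with the latest header."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    header = ""  # stays empty until the first header row is seen
    for item in soup.select("tr"):
        header_cell = item.select_one("td.header")
        link = item.select_one("td > a")
        if header_cell:
            header = header_cell.get_text(strip=True)
        elif link:
            rows.append((link.get_text(strip=True), header))
    return rows


def rows_to_csv(rows):
    """Serialize the pairs as CSV text; column names are illustrative only."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["facility", "header"])
    writer.writerows(rows)
    return buffer.getvalue()


rows = extract_rows(SAMPLE_HTML)
print(rows_to_csv(rows))
```

Names that contain a comma, such as "Complete Care at Linwood, LLC", are quoted automatically by `csv.writer`, which is one reason to prefer it over joining the fields by hand. To write to disk instead of a string, the same writer can be pointed at a file opened with `newline=""`.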
