Problem description

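I have an HTML file that contains one large table. As you can see below, the first row holds the column headers, and the remaining rows hold information about movies.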

          <tr class="row0">
            <td class="column0 style0 s">show_id</td>
            <td class="column1 style0 s">type</td>
            <td class="column2 style0 s">title</td>
            <td class="column3 style0 s">director</td>
            <td class="column4 style0 s">cast</td>
            <td class="column5 style0 s">country</td>
            <td class="column6 style0 s">date_added</td>
            <td class="column7 style0 s">release_year</td>
            <td class="column8 style0 s">rating</td>
            <td class="column9 style0 s">duration</td>
            <td class="column10 style0 s">listed_in</td>
            <td class="column11 style0 s">description</td>
          </tr>
          <tr class="row1">
            <td class="column0 style0 n">81145628</td>
            <td class="column1 style0 s">Movie</td>
            <td class="column2 style0 s">Norm of the North: King Sized Adventure</td>
            <td class="column3 style0 s">Richard Finn, Tim Maltby</td>
            <td class="column4 style0 s">Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson</td>
            <td class="column5 style0 s">United States, India, South Korea, China</td>
            <td class="column6 style0 s">September 9, 2019</td>
            <td class="column7 style0 n">2019</td>
            <td class="column8 style0 s">TV-PG</td>
            <td class="column9 style0 s">90 min</td>
            <td class="column10 style0 s">Children &amp; Family Movies, Comedies</td>
            <td class="column11 style0 s">Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from an evil archaeologist first.</td>
#... continue to row100

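I am trying to write a function that returns a list of lists or a list of dictionaries, so that I can answer some questions about the data. I know about the get_text() function for extracting text, but I'm not sure how to implement the rest. I'm very new to Python, so any help is much appreciated.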

Tags: python, html, parsing

Solution


You can use:

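from bs4 import BeautifulSoup as bs

# Read the saved HTML file into a string
with open("nf_shows.html", encoding="utf-8") as f:
    html = f.read()

# The "html5lib" parser must be installed separately (pip install html5lib)
soup = bs(html, "html5lib")
# Collect every <tr> row inside the table body
table = soup.find("tbody").find_all("tr")
# The first row holds the column names
headers = [x.text.strip() for x in table[0].find_all("td")]

tv_shows = []
for tv_show in table[1:]:
    # Pair each cell's text with its column name to build one dict per row
    vals = [x.text.strip() for x in tv_show.find_all("td")]
    tv_dict = dict(zip(headers, vals))
    tv_shows.append(tv_dict)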

{'show_id': '81145628', 'type': 'Movie', 'title': 'Norm of the North: King Sized Adventure', 'director': 'Richard Finn, Tim Maltby', 'cast': 'Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson', 'country': 'United States, India, South Korea, China', 'date_added': 'September 9, 2019', 'release_year': '2019', 'rating': 'TV-PG', 'duration': '90 min', 'listed_in': 'Children & Family Movies, Comedies', 'description': 'Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from an evil archaeologist first.'},
...
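
Once the rows are in a list of dictionaries like the one above, answering questions about the data is mostly a matter of sorting, filtering, and counting. The following is a minimal sketch under a few assumptions: the helper names (split_field, sort_by_year, filter_by_type, count_countries, as_list_of_lists) are hypothetical and not part of the original answer, the records are abridged to a few columns, and the second record is a made-up placeholder added only so that the sort has more than one item to order.

```python
from collections import Counter

# Records shaped like the dicts in tv_shows. The first is abridged from the
# sample row; the second is a made-up placeholder used only for illustration.
tv_shows = [
    {"show_id": "81145628", "type": "Movie",
     "title": "Norm of the North: King Sized Adventure",
     "country": "United States, India, South Korea, China",
     "release_year": "2019", "duration": "90 min"},
    {"show_id": "0", "type": "TV Show", "title": "Placeholder Show",
     "country": "India", "release_year": "2016", "duration": "1 Season"},
]


def split_field(value):
    """Split a comma-separated cell such as 'country' into a clean list."""
    return [part.strip() for part in value.split(",") if part.strip()]


def sort_by_year(shows, newest_first=False):
    """Return a new list sorted by release_year, compared as integers."""
    return sorted(shows, key=lambda s: int(s["release_year"]), reverse=newest_first)


def filter_by_type(shows, kind):
    """Keep only the records whose 'type' equals kind, e.g. 'Movie'."""
    return [s for s in shows if s["type"] == kind]


def count_countries(shows):
    """Count how many records list each country."""
    counts = Counter()
    for show in shows:
        counts.update(split_field(show["country"]))
    return counts


def as_list_of_lists(shows):
    """Convert the dicts to a header row followed by one list per record."""
    if not shows:
        return []
    headers = list(shows[0].keys())
    return [headers] + [[show[h] for h in headers] for show in shows]


oldest_first = sort_by_year(tv_shows)
movies = filter_by_type(tv_shows, "Movie")
country_counts = count_countries(tv_shows)
rows = as_list_of_lists(tv_shows)
```

All cell values come out of the parser as strings, which is why sort_by_year converts release_year with int() before comparing. If a list of lists is preferred over a list of dictionaries, as_list_of_lists gives the same data with the header row first.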


