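python - HTML parsing and sorting
Problem description
I have an HTML file that contains one large table. As you can see below, the first row holds the headers, and the remaining rows hold the information for each movie.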
<tr class="row0">
<td class="column0 style0 s">show_id</td>
<td class="column1 style0 s">type</td>
<td class="column2 style0 s">title</td>
<td class="column3 style0 s">director</td>
<td class="column4 style0 s">cast</td>
<td class="column5 style0 s">country</td>
<td class="column6 style0 s">date_added</td>
<td class="column7 style0 s">release_year</td>
<td class="column8 style0 s">rating</td>
<td class="column9 style0 s">duration</td>
<td class="column10 style0 s">listed_in</td>
<td class="column11 style0 s">description</td>
</tr>
<tr class="row1">
<td class="column0 style0 n">81145628</td>
<td class="column1 style0 s">Movie</td>
<td class="column2 style0 s">Norm of the North: King Sized Adventure</td>
<td class="column3 style0 s">Richard Finn, Tim Maltby</td>
<td class="column4 style0 s">Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson</td>
<td class="column5 style0 s">United States, India, South Korea, China</td>
<td class="column6 style0 s">September 9, 2019</td>
<td class="column7 style0 n">2019</td>
<td class="column8 style0 s">TV-PG</td>
<td class="column9 style0 s">90 min</td>
<td class="column10 style0 s">Children & Family Movies, Comedies</td>
<td class="column11 style0 s">Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from an evil archaeologist first.</td>
#... continue to row100
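I am trying to write a function that returns a list of lists or a list of dictionaries, so that I can answer some questions about the data. I know about the get_text() function for extracting text, but I am not sure how to implement the rest. I am new to Python, so any help is greatly appreciated.
Solution
You can use: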
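from bs4 import BeautifulSoup as bs

# Read the whole HTML file (requires the beautifulsoup4 and html5lib packages)
with open("nf_shows.html", encoding="utf-8") as f:
    html = f.read()

soup = bs(html, "html5lib")
# All rows of the table; html5lib adds a <tbody> element when the markup omits it
table = soup.find("tbody").find_all("tr")
# The first row holds the column names
headers = [x.text.strip() for x in table[0].find_all("td")]

tv_shows = []
for tv_show in table[1:]:
    vals = [x.text.strip() for x in tv_show.find_all("td")]
    # Pair each header with the cell in the same column
    tv_dict = dict(zip(headers, vals))
    tv_shows.append(tv_dict)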
tv_shows is then a list of dictionaries, one per row:
{'show_id': '81145628', 'type': 'Movie', 'title': 'Norm of the North: King Sized Adventure', 'director': 'Richard Finn, Tim Maltby', 'cast': 'Alan Marriott, Andrew Toth, Brian Dobson, Cole Howard, Jennifer Cameron, Jonathan Holmes, Lee Tockar, Lisa Durupt, Maya Kay, Michael Dobson', 'country': 'United States, India, South Korea, China', 'date_added': 'September 9, 2019', 'release_year': '2019', 'rating': 'TV-PG', 'duration': '90 min', 'listed_in': 'Children & Family Movies, Comedies', 'description': 'Before planning an awesome wedding for his grandfather, a polar bear king must take back a stolen artifact from an evil archaeologist first.'},
...
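Once the rows are in a list of dictionaries, questions about the data mostly come down to sorting and filtering that list. The following is a minimal sketch, not part of the original answer. The helper names sort_by_year and filter_by_type are hypothetical, and the second sample record is made up so the example runs without the HTML file. Every value is a string after parsing, so release_year is converted to int before comparing.

```python
# Sample data shaped like the tv_shows list built above.
# The first record comes from the question; the second is invented for illustration.
tv_shows = [
    {"show_id": "81145628", "type": "Movie",
     "title": "Norm of the North: King Sized Adventure",
     "release_year": "2019", "duration": "90 min"},
    {"show_id": "1", "type": "TV Show",
     "title": "Hypothetical Show",
     "release_year": "2016", "duration": "1 Season"},
]


def sort_by_year(shows, newest_first=False):
    """Return a new list sorted by release_year, compared as integers."""
    return sorted(shows, key=lambda s: int(s["release_year"]), reverse=newest_first)


def filter_by_type(shows, wanted):
    """Return only the rows whose 'type' equals wanted (e.g. 'Movie')."""
    return [s for s in shows if s["type"] == wanted]


oldest_first = sort_by_year(tv_shows)
movies = filter_by_type(tv_shows, "Movie")

print([s["title"] for s in oldest_first])
print(len(movies))
```

sorted() returns a new list and leaves tv_shows untouched, so the same parsed data can be reused for several questions. If some rows had an empty release_year, int() would raise ValueError, so real data may need a guard.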