python - How to convert an HTML table to a Python dictionary
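Problem description
I have the following HTML excerpt, stored in a Python list, that I would like to convert into a dictionary. It is a schedule of opening hours for each day of the week.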
[u'
<table class="hours table">\n
<tbody>\n
<tr>\n
<th scope="row">Mon</th>\n
<td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
<br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
<td class="extra">\n </td>\n </tr>\n\n
<tr>\n
<th scope="row">Tue</th>\n
<td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
<br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
<td class="extra">\n </td>\n </tr>\n\n
<tr>\n
<th scope="row">Wed</th>\n
<td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
<br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
<td class="extra">\n <span class="nowrap open">Open now</span>\n </td>\n </tr>\n\n
<tr>\n
<th scope="row">Thu</th>\n
<td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
<br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
<td class="extra">\n </td>\n </tr>\n\n
<tr>\n
<th scope="row">Fri</th>\n
<td>\n <span class="nowrap">2:00 pm</span> - <span class="nowrap">3:00 pm</span>
<br><span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
<td class="extra">\n </td>\n </tr>\n\n
<tr>\n
<th scope="row">Sat</th>\n
<td>\n <span class="nowrap">5:00 pm</span> - <span class="nowrap">10:00 pm</span>\n </td>\n
<td class="extra">\n </td>\n </tr>\n\n
<tr>\n
<th scope="row">Sun</th>\n
<td>\n Closed\n </td>\n
<td class="extra">\n </td>\n </tr>\n\n </tbody>\n </table>']
The desired output is:
{
'Mon': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'],
'Tue': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'],
'Wed': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'],
'Thu': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'],
'Fri': ['2:00pm - 3:00pm', '5:00pm - 10:00pm'],
'Sat': '5:00pm - 10:00pm',
'Sun': 'Closed'
}
How would you achieve this in Python 3.x? I don't mind if the values for the 'Sat' and 'Sun' keys are lists as well, if that helps. Thanks in advance for your ideas.
Solution
Here is a solution that first reads the table into a pandas DataFrame and then converts it to a dictionary in the format you want:
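from io import StringIO
import pandas as pd
# html_list is an assumed name for the one-element list shown in the question
html_string = html_list[0]
# Wrapping the string in StringIO avoids the literal-HTML deprecation warning in recent pandas
dfs = pd.read_html(StringIO(html_string))
df = dfs[0] # pd.read_html reads in all tables and returns a list of DataFrames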
This gives:
     0                                      1         2
0  Mon  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
1  Tue  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
2  Wed  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm  Open now
3  Thu  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
4  Fri  2:00 pm - 3:00 pm  5:00 pm - 10:00 pm       NaN
5  Sat                     5:00 pm - 10:00 pm       NaN
6  Sun                                 Closed       NaN
Then use groupby and a dictionary comprehension, splitting each cell on two spaces:
summary = {k: v.iloc[0, 1].split('  ') for k, v in df.groupby(0)}
This gives:
{'Fri': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
'Mon': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
'Sat': ['5:00 pm - 10:00 pm'],
'Sun': ['Closed'],
'Thu': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
'Tue': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
'Wed': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm']}
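The question asked for times written as '2:00pm' and for plain strings on days with a single entry. As a minimal sketch, assuming summary holds exactly the dictionary shown above, a small post-processing step could bring it closer to that format. The names tidy and day_order are made up for illustration and are not part of the original answer.

```python
import re

# Result of the dictionary comprehension, copied from the output shown above.
summary = {
    'Fri': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
    'Mon': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
    'Sat': ['5:00 pm - 10:00 pm'],
    'Sun': ['Closed'],
    'Thu': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
    'Tue': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
    'Wed': ['2:00 pm - 3:00 pm', '5:00 pm - 10:00 pm'],
}


def tidy(ranges):
    """Drop the space before am/pm and unwrap single-item lists."""
    cleaned = [re.sub(r'\s+(?=[ap]m\b)', '', r) for r in ranges]
    return cleaned[0] if len(cleaned) == 1 else cleaned


# groupby sorts its keys alphabetically, so restore weekday order explicitly.
day_order = ['Mon', 'Tue', 'Wed', 'Thu', 'Fri', 'Sat', 'Sun']
result = {day: tidy(summary[day]) for day in day_order}
print(result)
```

Keeping every value as a list, as the answer does, is usually easier to work with downstream. The unwrapping step is only there to match the output requested in the question.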
If splitting on two spaces does not always suit the format of your opening-hours data, you may need to adjust this slightly.
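One possible adjustment, sketched here under the assumption that every range looks like 'H:MM am/pm - H:MM am/pm', is to pull the ranges out with a regular expression instead of relying on the exact number of spaces between them. The names TIME_RANGE and split_hours are hypothetical helpers, not part of pandas or of the original answer.

```python
import re

# Matches one "H:MM am/pm - H:MM am/pm" range, tolerant of extra whitespace.
TIME_RANGE = re.compile(
    r'\d{1,2}:\d{2}\s*[ap]m\s*-\s*\d{1,2}:\d{2}\s*[ap]m',
    re.IGNORECASE,
)


def split_hours(cell):
    """Return every time range found in a cell, or the stripped cell text itself."""
    ranges = TIME_RANGE.findall(cell)
    return ranges if ranges else [cell.strip()]


print(split_hours('2:00 pm - 3:00 pm 5:00 pm - 10:00 pm'))
print(split_hours('Closed'))
```

In the dictionary comprehension, calling split_hours(v.iloc[0, 1]) in place of the .split(...) call should then behave the same whether the ranges are separated by one space or several.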