Create a dictionary or DataFrame from HTML that has no class attributes

Problem description

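Every time I make a transaction, my bank sends me a transaction email in HTML format. I want to extract certain pieces of information from the HTML content, such as the confirmation number, date, and amount.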

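I tried regular-expression extraction and BeautifulSoup, but the result was ugly and clumsy. The HTML has no useful attributes, so calling find() with an attribute filter is not easy. See the HTML snippet below: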

<table style="border: 1px solid black; border-collapse: collapse">
    <tbody>
        <tr>
            <td colspan="2" style="border:1px solid black;padding:3px">
                <center>
                    <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                        <b>
                            Transfer Money Details
                        </b>
                    </font>
                </center>
            </td>
        </tr>
        <tr>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Confirmation Number
                </font>
            </td>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    1594379907846
                </font>
            </td>
        </tr>
        <tr>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Transaction Date and Time
                </font>
            </td>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Friday, Jul 10 2020; 07:18:54 PM (GMT +8)
                </font>
            </td>
        </tr>
        <tr>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Transfer From
                </font>
            </td>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    XXXX-XXX-247 (PESO SAVINGS)
                </font>
            </td>
        </tr>
        <tr>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Transfer To
                </font>
            </td>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    XXXX-XXX-545
                </font>
            </td>
        </tr>
        <tr>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Amount
                </font>
            </td>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    PHP 1,200.00
                </font>
            </td>
        </tr>
        <tr>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Service Fee
                </font>
            </td>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    PHP 0.00
                </font>
            </td>
        </tr>
        <tr>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Total Amount
                </font>
            </td>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    PHP 1,200.00
                </font>
            </td>
        </tr>
        <tr>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Notes
                </font>
            </td>
            <td style="border: 1px solid black; padding: 3px">
                <font color="#000000" face="arial" style="FONT-SIZE:10pt">
                    Mask filters
                </font>
            </td>
        </tr>
    </tbody>
</table>

I would like to end up with a DataFrame or a dictionary like this:

{
'Confirmation Number': '1594379907846',
'Transaction Date and Time': 'Friday, Jul 10 2020; 07:18:54 PM (GMT +8)',
'Transfer From': 'XXXX-XXX-247 (PESO SAVINGS)'
    ... and so on
}

My code:

from bs4 import BeautifulSoup

def get_content(html_content):
  soup = BeautifulSoup(html_content, 'html.parser')
  rows = soup.find_all('tr')

  content_ls = []
  trans_details = {}
  for row in rows:
    cells = row.findChildren('td')
    for cell in cells:
      content_ls.append(cell.getText())

  trans_details['Confirmation Number'] = content_ls[2]
  trans_details['Date_Time'] = content_ls[4]
  trans_details['From'] = content_ls[6]
  trans_details['To'] = content_ls[8]
  trans_details['Amount'] = content_ls[10]
  # index 12 is the Service Fee value; the Notes value is at index 16
  trans_details['Notes'] = content_ls[12]

  return trans_details

This produces the following dictionary:

{'Amount': 'PHP 1,200.00',
 'Confirmation Number': '1594379907846',
 'Date_Time': 'Friday, Jul 10 2020; 07:18:54 PM (GMT +8)',
 'From': 'XXXX-XXX-247 (PESO SAVINGS)',
 'Notes': 'PHP 0.00',
 'To': 'XXXX-XXX-545'}

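Is there a more elegant and Pythonic way to do this?

Ultimately, I want to generate a DataFrame with columns such as "Confirmation Number", "Transaction Date and Time", and so on.

Thanks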

Tags: python, beautifulsoup

Solution


One option is the lxml library, which lets you locate elements with XPath. Here is one way to extract the information from the HTML you provided.

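from lxml import etree


def parse(html):
    # The HTML parser tolerates email markup that is not well-formed XML
    root = etree.fromstring(html, parser=etree.HTMLParser())
    result = dict()
    for tr in root.xpath("//tr"):
        fonts = tr.xpath(".//font")
        # The title row has a single <font>; skip rows without a label/value pair
        if len(fonts) < 2:
            continue
        key = "".join(fonts[0].itertext()).strip()
        value = "".join(fonts[1].itertext()).strip()
        result[key] = value

    return result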
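
Since the question already uses BeautifulSoup and ultimately wants a DataFrame, the same row-by-row idea can be sketched without lxml. This is a minimal sketch, not a definitive implementation: it assumes every label/value row has exactly two td cells, as in the snippet above, and that pandas is installed. The function name rows_to_dict and the shortened sample string are made up for illustration.

```python
from bs4 import BeautifulSoup
import pandas as pd


def rows_to_dict(html_content):
    """Map each two-cell table row to a label -> value pair."""
    soup = BeautifulSoup(html_content, "html.parser")
    details = {}
    for row in soup.find_all("tr"):
        cells = row.find_all("td")
        # Rows with a single cell (such as the title row) carry no pair
        if len(cells) != 2:
            continue
        label, value = (cell.get_text(strip=True) for cell in cells)
        details[label] = value
    return details


# Shortened, hypothetical sample in the same shape as the email table
sample = """
<table><tbody>
<tr><td colspan="2"><font><b>Transfer Money Details</b></font></td></tr>
<tr><td><font> Confirmation Number </font></td><td><font> 1594379907846 </font></td></tr>
<tr><td><font> Amount </font></td><td><font> PHP 1,200.00 </font></td></tr>
</tbody></table>
"""

details = rows_to_dict(sample)
# One row per email; the labels become the column names
df = pd.DataFrame([details])
print(df.columns.tolist())
```

Because the keys come from the first cell of each row, nothing depends on fixed list positions, so an extra row such as Service Fee cannot shift the values. For several emails, passing a list with one dictionary per email to pd.DataFrame should give one row per transaction.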
