How do I scrape specific words from a table row?

Problem description

I want to scrape only the codes from the table below, using Python.


I only want to scrape CPT, CTC, PTC, STC, SPT, HTC, P5TC, P1A, P2A, P3A, P1E, P2E and P3E. These codes may change from time to time; for example, P4E might be added or P1E removed.

The HTML for the table is:

<table class="list">
   <tbody>
      <tr>
         <td>
            <p>PRODUCT<br>DESCRIPTION</p>
         </td>
         <td>
            <p><strong>Time Charter:</strong> CPT, CTC, PTC, STC, SPT, HTC, P5TC<br><strong>Time Charter Trip:</strong> P1A, P2A, P3A,<br>P1E, P2E, P3E</p>
         </td>
         <td><strong>Voyage: </strong>C3E, C4E, C5E, C7E</td>
      </tr>
      <tr>
         <td>
            <p>CONTRACT SIZE</p>
            <p></p>
         </td>
         <td>
            <p>1 day</p>
         </td>
         <td>
            <p>1,000 metric tons</p>
         </td>
      </tr>
      <tr>
         <td>
            <p>MINIMUM TICK</p>
            <p></p>
         </td>
         <td>
            <p>US$ 25</p>
         </td>
         <td>
            <p>US$ 0.01</p>
         </td>
      </tr>
      <tr>
         <td>
            <p>FINAL SETTLEMENT PRICE</p>
            <p></p>
         </td>
         <td colspan="2" rowspan="1">
            <p>The floating price will be the end-of-day price as supplied by the Baltic Exchange.</p>
            <p><br><strong>All products:</strong> Final settlement price will be the mean of the daily Baltic Exchange spot price assessments for every trading day in the expiry month.</p>
            <p><br><strong>Exception for P1A, P2A, P3A:</strong> Final settlement price will be the mean of the last 7 Baltic Exchange spot price assessments in the expiry month.</p>
         </td>
      </tr>
      <tr>
         <td>
            <p>CONTRACT SERIES</p>
         </td>
         <td colspan="2" rowspan="1">
            <p><strong><strong>CTC, CPT, PTC, STC, SPT, HTC, P5TC</strong>:</strong> Months, quarters and calendar years out to a maximum of 72 months</p>
            <p><strong>C3E, C4E, C5E, C7E, P1A, P2A, P3A, P1E, P2E, P3E:</strong> Months, quarters and calendar years out to a maximum of 36 months</p>
         </td>
      </tr>
      <tr>
         <td>
            <p>SETTLEMENT</p>
         </td>
         <td colspan="2" rowspan="1">
            <p>At 13:00 hours (UK time) on the last business day of each month within the contract series</p>
         </td>
      </tr>
   </tbody>
</table>

You can see the table on the following page:

https://www.eex.com/en/products/global-commodities/freight

Tags: python, selenium, xpath, beautifulsoup, css-selectors

Solution


If the variable txt contains the HTML from your question, this script extracts all the required codes:

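import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')
# :-soup-contains replaces the deprecated :contains selector (soupsieve 2.1+)
td = soup.select_one('td:-soup-contains("Time Charter:")')
# Join text nodes with a space so that "P5TC" and the following
# "Time Charter Trip:" label do not run together across the <br>
text = td.get_text(' ')
# Match whole codes of three or four characters (e.g. HTC, P5TC)
codes = re.findall(r'\b[A-Z][A-Z\d]{2,3}\b', text)

print(codes)

Prints:

['CPT', 'CTC', 'PTC', 'STC', 'SPT', 'HTC', 'P5TC', 'P1A', 'P2A', 'P3A', 'P1E', 'P2E', 'P3E']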
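
The codes in the question are not all the same length: P5TC has four characters, so a pattern fixed at three characters cuts it to P5T. The short sketch below compares a fixed-width pattern with a word-bounded one on a made-up sample string. It assumes every code is three or four characters long and starts with an uppercase letter, which holds for the codes listed in the question but may not hold for codes added later.

```python
import re

# Made-up sample text; P5TC is the only four-character code in it.
sample = "Time Charter: HTC, P5TC Time Charter Trip: P1A"

# Exactly three characters: stops in the middle of P5TC.
fixed_width = re.findall(r'[A-Z\d]{3}', sample)

# Three or four characters between word boundaries: keeps P5TC whole.
whole_codes = re.findall(r'\b[A-Z][A-Z\d]{2,3}\b', sample)

print(fixed_width)  # ['HTC', 'P5T', 'P1A']
print(whole_codes)  # ['HTC', 'P5TC', 'P1A']
```

The word boundaries only help if each code is separated from the neighbouring text, for example by a space or a comma.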

Edit: To get the codes from all tables, you can use this script:

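import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(txt, 'html.parser')
all_codes = []
# Collect codes from every matching cell, across all tables in the page
for td in soup.select('td:-soup-contains("Time Charter:")'):
    # get_text(' ') keeps the text on either side of a <br> apart
    all_codes.extend(re.findall(r'\b[A-Z][A-Z\d]{2,3}\b', td.get_text(' ')))
print(all_codes)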
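
Since the question mentions that codes may be added or removed over time, it can help to key the result by the bold label in front of each list instead of relying on a fixed set of names. The following is only a sketch under some assumptions: the product row is the first row of table.list, each group starts with a strong label, and codes are three or four characters long. The names sample_html and codes_by_label are hypothetical and used only for illustration.

```python
import re
from bs4 import BeautifulSoup, Tag

# Assumed sample: the first row of the table from the question.
sample_html = """
<table class="list"><tbody><tr>
<td><p>PRODUCT<br>DESCRIPTION</p></td>
<td><p><strong>Time Charter:</strong> CPT, CTC, PTC, STC, SPT, HTC, P5TC<br><strong>Time Charter Trip:</strong> P1A, P2A, P3A,<br>P1E, P2E, P3E</p></td>
<td><strong>Voyage: </strong>C3E, C4E, C5E, C7E</td>
</tr></tbody></table>
"""

# Assumption: a code is 3-4 characters, starting with an uppercase letter.
CODE_RE = re.compile(r'\b[A-Z][A-Z\d]{2,3}\b')


def codes_by_label(html):
    """Map each bold label in the first table row to the codes after it."""
    soup = BeautifulSoup(html, 'html.parser')
    row = soup.select_one('table.list tr')  # first row only
    groups = {}
    for strong in row.find_all('strong'):
        label = strong.get_text(strip=True).rstrip(':')
        parts = []
        for sibling in strong.next_siblings:
            if isinstance(sibling, Tag):
                if sibling.name == 'strong':
                    break      # the next label starts a new group
                continue       # skip other tags such as <br>
            parts.append(str(sibling))  # plain text node
        groups[label] = CODE_RE.findall(' '.join(parts))
    return groups


for label, codes in codes_by_label(sample_html).items():
    print(label, codes)
```

With this layout, the Time Charter and Time Charter Trip lists can be combined to get the codes asked for, and a newly added code such as P4E would be picked up without changing the script.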
