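python - How do I scrape specific words from a table row?
Question
I want to scrape only the codes from the table below, using Python.
As shown, I only want to scrape CPT, CTC, PTC, STC, SPT, HTC, P5TC, P1A, P2A, P3A, P1E, P2E, P3E. These codes may change from time to time, for example P4E may be added or P1E removed.
The HTML code for the table above is: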
<table class="list">
<tbody>
<tr>
<td>
<p>PRODUCT<br>DESCRIPTION</p>
</td>
<td>
<p><strong>Time Charter:</strong> CPT, CTC, PTC, STC, SPT, HTC, P5TC<br><strong>Time Charter Trip:</strong> P1A, P2A, P3A,<br>P1E, P2E, P3E</p>
</td>
<td><strong>Voyage: </strong>C3E, C4E, C5E, C7E</td>
</tr>
<tr>
<td>
<p>CONTRACT SIZE</p>
<p></p>
</td>
<td>
<p>1 day</p>
</td>
<td>
<p>1,000 metric tons</p>
</td>
</tr>
<tr>
<td>
<p>MINIMUM TICK</p>
<p></p>
</td>
<td>
<p>US$ 25</p>
</td>
<td>
<p>US$ 0.01</p>
</td>
</tr>
<tr>
<td>
<p>FINAL SETTLEMENT PRICE</p>
<p></p>
</td>
<td colspan="2" rowspan="1">
<p>The floating price will be the end-of-day price as supplied by the Baltic Exchange.</p>
<p><br><strong>All products:</strong> Final settlement price will be the mean of the daily Baltic Exchange spot price assessments for every trading day in the expiry month.</p>
<p><br><strong>Exception for P1A, P2A, P3A:</strong> Final settlement price will be the mean of the last 7 Baltic Exchange spot price assessments in the expiry month.</p>
</td>
</tr>
<tr>
<td>
<p>CONTRACT SERIES</p>
</td>
<td colspan="2" rowspan="1">
<p><strong><strong>CTC, CPT, PTC, STC, SPT, HTC, P5TC</strong>:</strong> Months, quarters and calendar years out to a maximum of 72 months</p>
<p><strong>C3E, C4E, C5E, C7E, P1A, P2A, P3A, P1E, P2E, P3E:</strong> Months, quarters and calendar years out to a maximum of 36 months</p>
</td>
</tr>
<tr>
<td>
<p>SETTLEMENT</p>
</td>
<td colspan="2" rowspan="1">
<p>At 13:00 hours (UK time) on the last business day of each month within the contract series</p>
</td>
</tr>
</tbody>
</table>
You can view the code at the following website link
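Answer
If the variable txt contains the HTML from your question, this script extracts all the required codes:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
# :-soup-contains replaces the deprecated :contains selector (soupsieve 2.1+; older versions need :contains)
# get_text(' ') puts a space where each <br> was, so "P5TC" does not run into "Time Charter Trip:"
text = soup.select_one('td:-soup-contains("Time Charter:")').get_text(' ')
# a code is 3 or 4 characters: one uppercase letter, then uppercase letters or digits
codes = re.findall(r'\b[A-Z][A-Z\d]{2,3}\b', text)
print(codes)
Prints:
['CPT', 'CTC', 'PTC', 'STC', 'SPT', 'HTC', 'P5TC', 'P1A', 'P2A', 'P3A', 'P1E', 'P2E', 'P3E']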
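One detail worth checking in the pattern: the table contains the four-character code P5TC, so a fixed-width `[A-Z\d]{3}` returns P5T for it. The sketch below compares a fixed-width pattern with a word-bounded one using only `re`. The `text` string is an assumption: a hand-copied version of the cell text with a space where each `<br>` was.

```python
import re

# Assumption: hand-copied text of the "Time Charter" cell,
# with a space in place of each <br>.
text = ("Time Charter: CPT, CTC, PTC, STC, SPT, HTC, P5TC "
        "Time Charter Trip: P1A, P2A, P3A, P1E, P2E, P3E")

# Fixed width: always takes exactly three characters.
fixed_width = re.findall(r'[A-Z\d]{3}', text)

# Word-bounded: one uppercase letter, then 2 or 3 uppercase letters or digits.
bounded = re.findall(r'\b[A-Z][A-Z\d]{2,3}\b', text)

print(fixed_width[6])  # P5T
print(bounded[6])      # P5TC
```

Because the word-bounded pattern describes the shape of a code and not a fixed list, a newly added code such as P4E would be picked up without changing the script.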
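Edit: To get the codes from all tables, you can use this script:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(txt, 'html.parser')
all_codes = []
# :-soup-contains replaces the deprecated :contains selector (soupsieve 2.1+)
for td in soup.select('td:-soup-contains("Time Charter:")'):
    # get_text(' ') keeps codes on either side of a <br> apart
    all_codes.extend(re.findall(r'\b[A-Z][A-Z\d]{2,3}\b', td.get_text(' ')))
print(all_codes)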
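If the codes need to stay grouped by their label, a possible variation is to walk from each `<strong>` label to the text that follows it, which also keeps the Voyage codes separate. This is only a sketch: the helper name `codes_by_label` and the trimmed `html` string are illustrative assumptions, and it is written against the first table row from the question only.

```python
import re
from bs4 import BeautifulSoup, NavigableString, Tag

# Assumption: trimmed copy of the first table row from the question.
html = """
<table class="list"><tbody><tr>
<td><p>PRODUCT<br>DESCRIPTION</p></td>
<td><p><strong>Time Charter:</strong> CPT, CTC, PTC, STC, SPT, HTC, P5TC<br><strong>Time Charter Trip:</strong> P1A, P2A, P3A,<br>P1E, P2E, P3E</p></td>
<td><strong>Voyage: </strong>C3E, C4E, C5E, C7E</td>
</tr></tbody></table>
"""

CODE = re.compile(r'\b[A-Z][A-Z\d]{2,3}\b')

def codes_by_label(soup):
    """Map each <strong> label in a table cell to the codes that follow it."""
    groups = {}
    for strong in soup.select('td strong'):
        label = strong.get_text(strip=True).rstrip(':')
        found = []
        for sib in strong.next_siblings:
            if isinstance(sib, Tag) and sib.name == 'strong':
                break  # the next label starts here
            if isinstance(sib, NavigableString):
                found.extend(CODE.findall(str(sib)))
        groups[label] = found
    return groups

groups = codes_by_label(BeautifulSoup(html, 'html.parser'))
wanted = groups['Time Charter'] + groups['Time Charter Trip']
print(wanted)
```

Selecting by label name means the script still depends on the wording "Time Charter" and "Time Charter Trip" staying the same on the page, so it trades one assumption for another.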