Parsing dates with BS4

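Question

I have the HTML fragment below and want to extract the date information (for example, 31-Dec-18). I would appreciate it if someone could share guidance on doing this with BS4.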

<th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-19</time></th><th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-18</time></th><th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-17</time></th><th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-16</time></th><th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-15</time></th>

When I use bs4 to find the 'time' tags, the text of every entry (for example, 31-Dec-15) is missing. Does anyone know why?

import requests
from bs4 import BeautifulSoup

page = requests.get("https://www.reuters.com/companies/MBBM.KL/financials")
soup = BeautifulSoup(page.content, 'html.parser')
soup.find_all('time')

[<time class="TextLabel__text-label___3oCVw TextLabel__gray___1V4fk TextLabel__regular___2X0ym"></time>, <time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg"></time>, <time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg"></time>, <time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg"></time>, <time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg"></time>, <time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg"></time>]
>>>

Tags: html, beautifulsoup

Answer


Try this:

from bs4 import BeautifulSoup

# The HTML fragment from the question (a string of markup, not a URL).
html = '<th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-19</time></th><th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-18</time></th><th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-17</time></th><th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-16</time></th><th scope="column"><time class="TextLabel__text-label___3oCVw TextLabel__black___2FN-Z TextLabel__medium___t9PWg">31-Dec-15</time></th>'

soup = BeautifulSoup(html, "html.parser")

# Collect the text inside every <time> tag.
times = [tag.get_text() for tag in soup.select('time')]
for text in times:
    print(text)

This prints:

31-Dec-19
31-Dec-18
31-Dec-17
31-Dec-16
31-Dec-15
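
If the goal is real date values rather than strings, the extracted text can be converted with the standard library. The following is only a minimal sketch: the helper name `parse_report_date` is made up for illustration, and it assumes every string follows the day-month-two-digit-year pattern shown above. Note that `%b` matches month abbreviations in the current locale, so "Dec" is expected to parse under an English or C locale.

```python
from datetime import datetime


def parse_report_date(text):
    """Convert a string such as '31-Dec-18' into a datetime.date.

    Hypothetical helper; assumes the 'DD-Mon-YY' layout seen in the question.
    """
    return datetime.strptime(text.strip(), "%d-%b-%y").date()


# The strings printed by the BeautifulSoup example above.
raw = ["31-Dec-19", "31-Dec-18", "31-Dec-17", "31-Dec-16", "31-Dec-15"]
dates = [parse_report_date(s) for s in raw]

for d in dates:
    print(d.isoformat())
```

Once converted, the values sort and compare chronologically, which the raw strings do not.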

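Edit: the tags most likely come back empty with requests because the site fills in the dates with JavaScript after the page loads, so the raw HTML never contains them. To get the times from the live site, use selenium, which drives a real browser: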

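from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.firefox.service import Service

# Written for Selenium 4; older releases used executable_path= and
# find_elements_by_css_selector, which have since been removed.
driver = webdriver.Firefox(service=Service(executable_path='c:/program/geckodriver.exe'))

driver.get('https://www.reuters.com/companies/MBBM.KL/financials')
driver.implicitly_wait(5)
times = driver.find_elements(By.CSS_SELECTOR, 'time')

# The first <time> element is skipped; the rest are the column dates.
for time in times[1:]:
    print(time.text)
driver.quit()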

Output:

31-Dec-19
31-Dec-18
31-Dec-17
31-Dec-16
31-Dec-15
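
The `times[1:]` slice relies on the unwanted element always coming first. Judging by the question's output, the first `time` tag carries a different class from the five date headers, and the question's fragment shows the dates sitting inside `th scope="column"` cells. A selector that targets those cells may be less fragile. Here is a rough sketch with BeautifulSoup on static markup; the function name `header_dates` and the sample string are invented for illustration, and the real page structure is an assumption based only on the fragment in the question.

```python
from bs4 import BeautifulSoup


def header_dates(html):
    """Return the text of every <time> nested inside a <th scope="column">.

    Hypothetical helper; the selector is inferred from the question's fragment.
    """
    soup = BeautifulSoup(html, "html.parser")
    return [t.get_text(strip=True) for t in soup.select('th[scope="column"] time')]


# Made-up sample: one stray <time> outside the header cells, two inside.
sample = (
    '<span><time class="gray">Annual</time></span>'
    '<table><tr>'
    '<th scope="column"><time>31-Dec-19</time></th>'
    '<th scope="column"><time>31-Dec-18</time></th>'
    '</tr></table>'
)

print(header_dates(sample))
```

The same selector string, `th[scope="column"] time`, could also be tried with selenium's CSS selector lookup in place of the plain `time` selector.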

Note that you need both selenium and geckodriver; in this case I loaded geckodriver from c:/program/geckodriver.exe.

