Unable to get text from select_one with BeautifulSoup

Problem description

I am trying to parse the time from the HTML below, but I cannot use get_text with select_one to extract the data-published-date or datetime value from <time class="published-date relative-date" ...></time>.

<div class="content">
       <header>
        <h3 class="article-name">
         Curious Kids: Why is the Moon Called the Moon?
        </h3>
        <p class="byline">
         <span class="by-author">
          By
          <span style="white-space:nowrap">
           Toby Brown
          </span>
         </span>
         <time class="published-date relative-date" data-published-date="2019-12-13T12:00:28Z" datetime="2019-12-13T12:00:28Z">
         </time>
        </p>
       </header>

Using:

import requests
from bs4 import BeautifulSoup
url = 'https://www.space.com/news'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')

contents = soup.select('.content')
headlines = []
for item in contents:
  h_line = item.select_one('.article-name').get_text()
  author = item.select_one('.byline > span:nth-of-type(1) > span:nth-of-type(1)').get_text().strip()
  synopsis = item.select_one('.synopsis').get_text().strip() 
  date = item.select_one('.byline > time').get_text() 
  newsline = {'Headline': h_line, 'Author': author, 'Synopsis': synopsis, 'Date': date}
  headlines.append(newsline) 

for line in headlines:   
  print(line)  

This produces a traceback error saying the object is 'NoneType'. In addition, the answer must parse with BeautifulSoup only, not RegEx.
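
The 'NoneType' in the traceback most likely comes from select_one returning None for a .content block that lacks one of the selected elements; calling .get_text() on None then raises AttributeError. A minimal sketch of a guard, assuming a hypothetical helper named text_or_none and a shortened sample document (neither is from the original post):

```python
from bs4 import BeautifulSoup

# Shortened, made-up sample: the second block has a headline but no synopsis.
html = """
<div class="content">
  <h3 class="article-name">First headline</h3>
  <p class="synopsis">First synopsis</p>
</div>
<div class="content">
  <h3 class="article-name">Second headline</h3>
</div>
"""


def text_or_none(parent, selector):
    """Return the stripped text of the first match, or None when nothing matches."""
    node = parent.select_one(selector)  # select_one returns None on no match
    return node.get_text(strip=True) if node is not None else None


soup = BeautifulSoup(html, 'html.parser')
rows = [
    {
        'Headline': text_or_none(item, '.article-name'),
        'Synopsis': text_or_none(item, '.synopsis'),
    }
    for item in soup.select('.content')
]
print(rows)
```

With a guard like this, a missing element shows up as None in the result instead of aborting the whole loop.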

***Update: I adapted the answer so that it works inside my for loop (so that I can iterate over the source of every headline)

import requests
from bs4 import BeautifulSoup
url = 'https://www.space.com/news'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')

contents = soup.select('.content')
headlines = []
for item in contents:
  h_line = item.select_one('.article-name').get_text()
  author = item.select_one('.byline > span:nth-of-type(1) > span:nth-of-type(1)').get_text().strip()
  synopsis = item.select_one('.synopsis').get_text().strip() 
  dates = item.select_one('time').get('data-published-date')
  newsline = {'Headline': h_line, 'Author': author, 'Synopsis': synopsis, 'Date & Time Published': dates}
  headlines.append(newsline) 

for line in headlines:   
  print(line)   

Tags: python, html, python-3.x, parsing, beautifulsoup

Solution


from bs4 import BeautifulSoup
data = """
<div class="content">
       <header>
        <h3 class="article-name">
         Curious Kids: Why is the Moon Called the Moon?
        </h3>
        <p class="byline">
         <span class="by-author">
          By
          <span style="white-space:nowrap">
           Toby Brown
          </span>
         </span>
         <time class="published-date relative-date" data-published-date="2019-12-13T12:00:28Z" datetime="2019-12-13T12:00:28Z">
         </time>
        </p>
       </header>
"""


soup = BeautifulSoup(data, 'html.parser')

for item in soup.find_all('time', {'class': 'published-date relative-date'}):
    print(item.get('data-published-date'))

Output:

2019-12-13T12:00:28Z
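
get_text() is of little use on this element because the <time> tag contains only whitespace; the timestamp lives in its attributes. A small sketch of reading the attribute through select_one and converting it to a datetime, assuming the value always follows the 2019-12-13T12:00:28Z pattern from the sample HTML (the strptime format string encodes that assumption):

```python
from datetime import datetime, timezone

from bs4 import BeautifulSoup

html = (
    '<p class="byline"><time class="published-date relative-date" '
    'data-published-date="2019-12-13T12:00:28Z" '
    'datetime="2019-12-13T12:00:28Z">\n</time></p>'
)

soup = BeautifulSoup(html, 'html.parser')
tag = soup.select_one('.byline > time')

text = tag.get_text(strip=True)       # empty string: the tag has no text
raw = tag.get('data-published-date')  # attribute value, or None if absent
same = tag['datetime']                # raises KeyError if the attribute is absent

# Assumed format: UTC timestamp with a literal trailing "Z".
published = datetime.strptime(raw, '%Y-%m-%dT%H:%M:%SZ').replace(tzinfo=timezone.utc)
print(repr(text), raw, published.isoformat())
```

tag.get(...) is the more forgiving of the two lookups, since it returns None rather than raising when the attribute is missing.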

Extended version:

import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.space.com/news')
soup = BeautifulSoup(r.text, 'html.parser')

headline = []
author = []
syn = []
time = []
for item in soup.find_all('h3', {'class': 'article-name'}):
    headline.append(item.text)
for item in soup.find_all('span', {'style': 'white-space:nowrap'}):
    author.append(item.get_text(strip=True))
for item in soup.find_all('p', {'class': 'synopsis'}):
    syn.append(item.get_text(strip=True))
for item in soup.find_all('time', {'class': 'published-date relative-date'}):
    time.append(item.get('data-published-date'))

for item in zip(headline, author, syn, time):
    print(item)
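
One caveat with collecting four separate lists and zipping them: zip pairs entries purely by position, so if any article on the page lacks, say, a synopsis, the later rows shift out of alignment. A sketch of a per-article alternative, using a shortened inline sample rather than the live page (the sample markup and its second timestamp are made up for illustration):

```python
from bs4 import BeautifulSoup

# Made-up sample: the second article has no synopsis.
html = """
<div class="content">
  <h3 class="article-name">A</h3>
  <p class="synopsis">About A</p>
  <time class="published-date relative-date" data-published-date="2019-12-13T12:00:28Z"></time>
</div>
<div class="content">
  <h3 class="article-name">B</h3>
  <time class="published-date relative-date" data-published-date="2019-12-14T09:30:00Z"></time>
</div>
"""

soup = BeautifulSoup(html, 'html.parser')

records = []
for item in soup.select('.content'):
    # Look up every field inside the same article block so fields stay paired.
    name = item.select_one('.article-name')
    syn = item.select_one('.synopsis')
    stamp = item.select_one('time.published-date')
    records.append({
        'Headline': name.get_text(strip=True) if name is not None else None,
        'Synopsis': syn.get_text(strip=True) if syn is not None else None,
        'Published': stamp.get('data-published-date') if stamp is not None else None,
    })

for record in records:
    print(record)
```

Because each record is built from a single .content block, a missing field stays None for that article only and never shifts the fields of its neighbours.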
