python - Unable to get text from select_one using BeautifulSoup
Problem description
I'm trying to parse the time from the HTML below, but I can't use get_text with select_one to extract data-published-date or datetime from <time class="published-date relative-date" ...></time>.
<div class="content">
<header>
<h3 class="article-name">
Curious Kids: Why is the Moon Called the Moon?
</h3>
<p class="byline">
<span class="by-author">
By
<span style="white-space:nowrap">
Toby Brown
</span>
</span>
<time class="published-date relative-date" data-published-date="2019-12-13T12:00:28Z" datetime="2019-12-13T12:00:28Z">
</time>
</p>
</header>
Using:
import requests
from bs4 import BeautifulSoup

url = 'https://www.space.com/news'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
contents = soup.select('.content')
headlines = []
for item in contents:
    h_line = item.select_one('.article-name').get_text()
    author = item.select_one('.byline > span:nth-of-type(1) > span:nth-of-type(1)').get_text().strip()
    synopsis = item.select_one('.synopsis').get_text().strip()
    # get_text() reads the text between the tags, not the attributes
    date = item.select_one('.byline > time').get_text()
    newsline = {'Headline': h_line, 'Author': author, 'Synopsis': synopsis, 'Date': date}
    headlines.append(newsline)
for line in headlines:
    print(line)
This produces a traceback saying the object is a 'NoneType'. Also, answers should parse with BeautifulSoup only, not RegEx.
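A 'NoneType' traceback here usually means that select_one found no match and returned None, so the chained .get_text() call failed. The sketch below reproduces that on a trimmed copy of the HTML from the question. It assumes (this is not confirmed by the question) that at least one .content block on the live page lacks one of the selected elements. It also shows that the <time> element holds no text of its own, so get_text gives an empty string even when the element is found.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the HTML in the question; it has no ".synopsis" element.
html = """
<div class="content">
  <p class="byline">
    <time class="published-date relative-date"
          data-published-date="2019-12-13T12:00:28Z"
          datetime="2019-12-13T12:00:28Z">
    </time>
  </p>
</div>
"""

item = BeautifulSoup(html, "html.parser").select_one(".content")

# Nothing matches, so select_one returns None; calling .get_text() on it
# would raise AttributeError: 'NoneType' object has no attribute 'get_text'.
missing = item.select_one(".synopsis")

# The <time> element is found, but there is only whitespace between its tags.
time_tag = item.select_one(".byline > time")
time_text = time_tag.get_text(strip=True)

# The timestamp is stored in an attribute, not in the element's text.
published = time_tag.get("data-published-date")

print(missing)
print(repr(time_text))
print(published)
```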
***Update: I adapted the answer so that it works inside my for loop (so I can iterate over the source of every headline):
import requests
from bs4 import BeautifulSoup

url = 'https://www.space.com/news'
response = requests.get(url)
data = response.text
soup = BeautifulSoup(data, 'html.parser')
contents = soup.select('.content')
headlines = []
for item in contents:
    h_line = item.select_one('.article-name').get_text()
    author = item.select_one('.byline > span:nth-of-type(1) > span:nth-of-type(1)').get_text().strip()
    synopsis = item.select_one('.synopsis').get_text().strip()
    dates = item.select_one('time').get('data-published-date')
    newsline = {'Headline': h_line, 'Author': author, 'Synopsis': synopsis, 'Date & Time Published': dates}
    headlines.append(newsline)
for line in headlines:
    print(line)
Solution
from bs4 import BeautifulSoup
data = """
<div class="content">
<header>
<h3 class="article-name">
Curious Kids: Why is the Moon Called the Moon?
</h3>
<p class="byline">
<span class="by-author">
By
<span style="white-space:nowrap">
Toby Brown
</span>
</span>
<time class="published-date relative-date" data-published-date="2019-12-13T12:00:28Z" datetime="2019-12-13T12:00:28Z">
</time>
</p>
</header>
"""
soup = BeautifulSoup(data, 'html.parser')
for item in soup.find_all('time', {'class': 'published-date relative-date'}):
    print(item.get('data-published-date'))
Output:
2019-12-13T12:00:28Z
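For reference, here is a small sketch of the ways a tag's attributes can be read in BeautifulSoup, using only the <time> element from the question. The attribute name "missing-attr" is hypothetical and is there only to show what happens when an attribute is absent.

```python
from bs4 import BeautifulSoup

html = (
    '<time class="published-date relative-date" '
    'data-published-date="2019-12-13T12:00:28Z" '
    'datetime="2019-12-13T12:00:28Z"></time>'
)
time_tag = BeautifulSoup(html, "html.parser").select_one("time")

# .get() returns None when the attribute does not exist.
published = time_tag.get("data-published-date")
absent = time_tag.get("missing-attr")  # hypothetical attribute name

# Dictionary-style access returns the same value but raises KeyError if absent.
dt_value = time_tag["datetime"]
try:
    time_tag["missing-attr"]
    raised = False
except KeyError:
    raised = True

# "class" is multi-valued, so BeautifulSoup exposes it as a list.
classes = time_tag["class"]

print(published, absent, dt_value, raised, classes)
```

Using .get() is the safer choice inside a scraping loop, since a missing attribute then yields None rather than stopping the run.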
In-depth version:
import requests
from bs4 import BeautifulSoup

r = requests.get('https://www.space.com/news')
soup = BeautifulSoup(r.text, 'html.parser')
headline = []
author = []
syn = []
time = []
for item in soup.find_all('h3', {'class': 'article-name'}):
    headline.append(item.text)
for item in soup.find_all('span', {'style': 'white-space:nowrap'}):
    author.append(item.get_text(strip=True))
for item in soup.find_all('p', {'class': 'synopsis'}):
    syn.append(item.get_text(strip=True))
for item in soup.find_all('time', {'class': 'published-date relative-date'}):
    time.append(item.get('data-published-date'))
for item in zip(headline, author, syn, time):
    print(item)
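One caveat with collecting four separate lists and zipping them: if any article lacks a synopsis or a <time> element, the lists fall out of step and zip pairs values from different articles. A possible alternative, sketched below, is to extract per .content block and tolerate missing pieces. The helpers text_of and attr_of are hypothetical names introduced for this example, and the second article and the synopsis text in the sample HTML are made-up data, not taken from the site.

```python
from bs4 import BeautifulSoup

# Sample data: the first block follows the question's HTML; the synopsis text
# and the whole second block are invented for illustration.
html = """
<div class="content">
  <h3 class="article-name">Curious Kids: Why is the Moon Called the Moon?</h3>
  <p class="byline"><span class="by-author">By
    <span style="white-space:nowrap">Toby Brown</span></span>
    <time class="published-date relative-date"
          data-published-date="2019-12-13T12:00:28Z"
          datetime="2019-12-13T12:00:28Z"></time></p>
  <p class="synopsis">A short summary.</p>
</div>
<div class="content">
  <h3 class="article-name">Second headline</h3>
  <p class="byline"><span class="by-author">By
    <span style="white-space:nowrap">Jane Doe</span></span></p>
</div>
"""


def text_of(parent, selector):
    """Return the stripped text of the first match, or None if no match."""
    tag = parent.select_one(selector)
    return tag.get_text(strip=True) if tag is not None else None


def attr_of(parent, selector, name):
    """Return attribute `name` of the first match, or None if no match."""
    tag = parent.select_one(selector)
    return tag.get(name) if tag is not None else None


soup = BeautifulSoup(html, "html.parser")
rows = []
for item in soup.select(".content"):
    rows.append({
        "Headline": text_of(item, ".article-name"),
        "Author": text_of(item, ".byline span span"),
        "Synopsis": text_of(item, ".synopsis"),
        "Published": attr_of(item, "time", "data-published-date"),
    })

for row in rows:
    print(row)
```

Each row stays tied to a single article, and a missing element shows up as None in that row only.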