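python - How to get the text inside p tags with bs4
Problem description
I am trying to scrape data from a news website, and right now I need the text inside the p tags.
I have searched a lot, but every solution I found either returns None or raises this error: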
Traceback (most recent call last):
  File "E:/Python/News Uploader to Google Driver/venv/Scripts/main.py", line 41, in <module>
    contents = parse(text)
  File "E:/Python/News Uploader to Google Driver/venv/Scripts/main.py", line 28, in parse
    article = soup.find("div", {"class": "content_text row description"}).findAll('p')
AttributeError: 'NoneType' object has no attribute 'findAll'
def parse(url):
    html = requests.get(url)
    #array_of_paragraphs = [""]
    soup = BeautifulSoup(html.content, 'html5lib')
    text = []
    text = soup.find("div", {"class": "content_text row description"}).findAll('p')
    for t in text:
        text = ''.join(element.findAll(text=True))
    return text
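You can use this for testing purposes.
Apart from the None message or the error, nothing is printed to the console.
Solution
Select the child p elements through the parent that is identified by its class: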
import requests
from bs4 import BeautifulSoup as bs
headers = {'User-Agent':'Mozilla/5.0'}
r = requests.get('https://gadgets.ndtv.com/mobiles/news/samsung-galaxy-a-series-56-percent-q2-smartphone-sales-share-counterpoint-2112319', headers = headers)
soup = bs(r.content, 'lxml')
print('\n'.join([i.text for i in soup.select('.description p')]))
import requests
from bs4 import BeautifulSoup as bs

def parse(url):
    headers = {'User-Agent':'Mozilla/5.0'}
    r = requests.get(url, headers = headers)
    soup = bs(r.content, 'lxml')
    text = '\n'.join([i.text for i in soup.select('.description p')])
    return text

parse('https://gadgets.ndtv.com/mobiles/news/samsung-galaxy-a-series-56-percent-q2-smartphone-sales-share-counterpoint-2112319')
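
To see why the selector-based approach avoids the AttributeError, here is a minimal offline sketch. It assumes a small hand-written HTML fragment (SAMPLE_HTML is hypothetical and only imitates the class names from the question; the real page may be structured differently), and the helper name extract_paragraphs is made up for illustration. It uses the built-in 'html.parser' so that lxml is not required.

```python
from bs4 import BeautifulSoup

# Hypothetical markup standing in for the article page; the real page may differ.
SAMPLE_HTML = """
<div class="content_text row description">
  <p>First paragraph.</p>
  <p>Second <b>paragraph</b>.</p>
</div>
"""


def extract_paragraphs(html):
    """Return the text of every <p> nested under an element with class 'description'."""
    soup = BeautifulSoup(html, "html.parser")
    # select() returns an empty list when nothing matches, whereas find()
    # returns None, which is what makes a chained .findAll() call fail.
    paragraphs = soup.select(".description p")
    return [p.get_text().strip() for p in paragraphs]


if __name__ == "__main__":
    print("\n".join(extract_paragraphs(SAMPLE_HTML)))
```

If the selector matches nothing, the function returns an empty list instead of raising, so the caller can check for an empty result and then inspect the downloaded HTML to see what the server actually sent.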