How to get the text inside p tags with bs4

Problem description

I'm trying to scrape data from a news website, and now I need the text inside the p tags.

I've searched a lot, but every solution either returns None or raises this error:

Traceback (most recent call last):
  File "E:/Python/News Uploader to Google Driver/venv/Scripts/main.py", line 41, in <module>
    contents = parse(text)
  File "E:/Python/News Uploader to Google Driver/venv/Scripts/main.py", line 28, in parse
    article = soup.find("div", {"class": "content_text row description"}).findAll('p')
AttributeError: 'NoneType' object has no attribute 'findAll'

My parse function looks like this:
import requests
from bs4 import BeautifulSoup

def parse(url):
    html = requests.get(url)
    #array_of_paragraphs = [""]
    soup = BeautifulSoup(html.content, 'html5lib')
    text = []
    text = soup.find("div", {"class": "content_text row description"}).findAll('p')
    for t in text:
       text = ''.join(t.findAll(text=True))
    return text

The URL at the moment is: https://gadgets.ndtv.com/mobiles/news/samsung-galaxy-a-series-56-percent-q2-smartphone-sales-share-counterpoint-2112319

You can use it for testing purposes.

Nothing shows up on the console except the None message or the error.

Tags: python, web-scraping, beautifulsoup

Solution


Select the child p tags within the parent identified by its class, and send a User-Agent header with the request:

import requests
from bs4 import BeautifulSoup as bs

headers = {'User-Agent': 'Mozilla/5.0'}  # identify the request as coming from a browser
r = requests.get('https://gadgets.ndtv.com/mobiles/news/samsung-galaxy-a-series-56-percent-q2-smartphone-sales-share-counterpoint-2112319', headers=headers)
soup = bs(r.content, 'lxml')
# CSS selector: every <p> inside an element whose class list contains "description"
print('\n'.join([i.text for i in soup.select('.description p')]))
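select('.description p') is a CSS descendant selector, so it matches every p element nested anywhere inside an element whose class list contains description; you do not have to spell out the full content_text row description class string. The User-Agent header is what makes the difference: without it the server appears to return a page that does not contain that div, which is why your find() call returned None and the chained findAll() raised the AttributeError. The same approach wrapped in a parse() function: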

import requests
from bs4 import BeautifulSoup as bs

def parse(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=headers)
    soup = bs(r.content, 'lxml')
    text = '\n'.join([i.text for i in soup.select('.description p')])
    return text

print(parse('https://gadgets.ndtv.com/mobiles/news/samsung-galaxy-a-series-56-percent-q2-smartphone-sales-share-counterpoint-2112319'))
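If you would rather stay close to your original find() approach, here is a minimal sketch with an explicit guard for the None case. It assumes the target container still carries the description class and keeps your html5lib parser:

import requests
from bs4 import BeautifulSoup

def parse(url):
    headers = {'User-Agent': 'Mozilla/5.0'}
    r = requests.get(url, headers=headers)
    soup = BeautifulSoup(r.content, 'html5lib')
    # find() returns None when the div is missing, so check before calling find_all()
    container = soup.find('div', class_='description')  # matches any div whose classes include "description" (assumption)
    if container is None:
        return []
    # get_text() on each <p> replaces the ''.join(t.findAll(text=True)) pattern
    return [p.get_text(strip=True) for p in container.find_all('p')]

paragraphs = parse('https://gadgets.ndtv.com/mobiles/news/samsung-galaxy-a-series-56-percent-q2-smartphone-sales-share-counterpoint-2112319')
print('\n'.join(paragraphs))

Returning a list of paragraph strings also avoids the bug in the original loop, where text was overwritten on every iteration instead of being accumulated.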
