Python BeautifulSoup: paragraph text only

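Question

I'm new to anything related to web scraping, and as far as I understand, Requests and BeautifulSoup are one way to do it. I want to write a program that sends me just one paragraph from a given link every few hours (I'm trying a new way to read blogs throughout the day). Take this particular link, 'https://fs.blog/mental-models/', which has one paragraph for each model.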

from bs4 import BeautifulSoup
import re
import requests


url = 'https://fs.blog/mental-models/'

r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

Now soup contains a wall of markup before the paragraph text begins: <p> this is what I want to read </p>

soup.title.string works fine, but I don't know how to move on from here... any pointers?

Thanks

Tags: python, beautifulsoup

Answer

Loop over soup.find_all('p') to find every p tag, then use .text to get each tag's text. Since you don't want the footer paragraphs, do this only inside the div with class rte:

from bs4 import BeautifulSoup
import requests

url = 'https://fs.blog/mental-models/'
r = requests.get(url)
soup = BeautifulSoup(r.content, 'html.parser')

# Restrict the search to div.rte so that footer paragraphs are skipped
for div in soup.find_all("div", {"class": "rte"}):
    p_tags = div.find_all('p')
    # Use a distinct loop variable so the outer one is not shadowed
    for p in p_tags[:-2]:  # trim the last two irrelevant-looking paragraphs
        print(p.text)

Output

Mental models are how we understand the world. Not only do they shape what we think and how we understand but they shape the connections and opportunities that we see.
.
.
.
5. Mutually Assured Destruction
Somewhat paradoxically, the stronger two opponents become, the less likely they may be to destroy one another. This process of mutually assured destruction occurs not just in warfare, as with the development of global nuclear warheads, but also in business, as with the avoidance of destructive price wars between competitors. However, in a fat-tailed world, it is also possible that mutually assured destruction scenarios simply make destruction more severe in the event of a mistake (pushing destruction into the “tails” of the distribution).
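
To get from "all paragraphs" to the "one paragraph every few hours" goal in the question, something like the following might work. It is only a sketch under stated assumptions: SAMPLE_HTML is made-up markup standing in for the real page (whose structure may differ or change), and extract_paragraphs, pick_paragraph, and the interval_hours parameter are hypothetical names, not part of BeautifulSoup. Fetching the page, scheduling, and delivery (email, notification, and so on) are left out.

```python
import time

from bs4 import BeautifulSoup

# Made-up markup for illustration only; the real page may look different.
SAMPLE_HTML = """
<html><head><title>Demo</title></head><body>
<div class="rte">
  <p>First model.</p>
  <p>  Second model.  </p>
  <p></p>
</div>
<footer><p>Footer text.</p></footer>
</body></html>
"""


def extract_paragraphs(html, container_class="rte"):
    """Return the non-empty paragraph texts inside div.<container_class>."""
    soup = BeautifulSoup(html, "html.parser")
    paragraphs = []
    # CSS descendant selector: every <p> nested under the container div
    for p in soup.select(f"div.{container_class} p"):
        text = p.get_text().strip()
        if text:  # skip empty paragraphs
            paragraphs.append(text)
    return paragraphs


def pick_paragraph(paragraphs, now=None, interval_hours=3):
    """Pick one paragraph per time slot, cycling through the list in order."""
    if not paragraphs:
        raise ValueError("no paragraphs to choose from")
    if now is None:
        now = time.time()
    # Number of whole intervals elapsed since the Unix epoch
    slot = int(now // (interval_hours * 3600))
    return paragraphs[slot % len(paragraphs)]


paragraphs = extract_paragraphs(SAMPLE_HTML)
print(paragraphs)                                # ['First model.', 'Second model.']
print(pick_paragraph(paragraphs, now=0))         # First model.
print(pick_paragraph(paragraphs, now=3 * 3600))  # Second model.
```

Deriving the index from the clock means the script keeps no state between runs, so it could be started by any scheduler. The trade-off is that the paragraph order shifts if the page gains or loses paragraphs.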

 

