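python - How do I extract only the paragraph text from a web page, excluding the other links on it?
Problem description
I am trying to extract sentences from a web page, but I cannot exclude the other links and sidebar icons that appear on the page.
I tried to find every occurrence of "p" (that is, every paragraph) in the page, but I also get other results that I do not want.
My code: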
import re
from nltk import word_tokenize, sent_tokenize, ngrams
from collections import Counter
from urllib import request
from bs4 import BeautifulSoup
url = "https://www.usatoday.com/story/sports/nba/rockets/2019/01/25/james-harden-30-points-22-consecutive-games-rockets-edge-raptors/2684160002/"
html = request.urlopen(url).read().decode('utf8')
raw = BeautifulSoup(html,"lxml")
partags = raw.find_all('p') #to extract only paragraphs
print(partags)
I get the following output (posted as an image, because a copy-paste does not look as neat):
https://i.stack.imgur.com/rGC1P.png
But I want to extract only this kind of sentence from the link. Is there an additional filter I can apply?
https://i.stack.imgur.com/MlPUV.png
Code after Valery's feedback.
partags = raw.get_text()
print(partags)
The output I get (it also contains links in JSON format and other links).
This is just a sample from the full output:
James Harden extends 30-point streak, makes key defensive stop
{
"@context": "http://schema.org",
"@type": "NewsArticle",
"headline": "James Harden extends 30-point streak, makes key defensive stop to help Rockets edge Raptors",
"description": "James Harden scored 35 points for his 22nd consecutive game with at least 30, and forced Kawhi Leonard into a missed 3 at buzzer for 121-119 win.",
"url": "https://www.usatoday.com/story/sports/nba/rockets/2019/01/25/james-harden-30-points-22-consecutive-games-rockets-edge-raptors/2684160002/?utm_source=google&utm_medium=amp&utm_campaign=speakable",
"mainEntityOfPage": {
"@type": "WebPage",
"@id": "https://www.usatoday.com/story/sports/nba/rockets/2019/01/25/james-harden-30-points-22-consecutive-games-rockets-edge-raptors/2684160002/"
},
Solution
See the bs4 documentation on get_text(): BeautifulSoup/bs4/doc/#get-text
import requests
from bs4 import BeautifulSoup as bs
response = requests.get("https://www.usatoday.com/story/sports/nba/rockets/2019/01/25/james-harden-30-points-22-consecutive-games-rockets-edge-raptors/2684160002/")
html = response.text
raw = bs(html, "html")
for partag in raw.find_all('p'):
    print(partag.get_text())
Here is a link to the result.
So calling get_text() on each paragraph tag (partag) yields clean text without the noise.
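To try the same idea without fetching the live page, here is a minimal sketch that runs against a small inline HTML string. The sample markup and the helper name `paragraph_texts` are made up for illustration and are not taken from the USA Today page. The sketch also assumes you want to drop paragraphs that contain only whitespace, which the original answer does not do.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: a JSON-LD script, a navigation link,
# and three paragraphs (one of them empty).
SAMPLE_HTML = """
<html><head>
<script type="application/ld+json">{"@type": "NewsArticle"}</script>
</head><body>
<nav><a href="/sports">Sports</a></nav>
<p>James Harden scored <a href="/stats">35 points</a>.</p>
<p>   </p>
<p>The Rockets won 121-119.</p>
</body></html>
"""


def paragraph_texts(html):
    """Return the text of every non-empty <p> element in the given HTML."""
    soup = BeautifulSoup(html, "html.parser")
    texts = []
    for partag in soup.find_all("p"):
        # get_text() drops the markup (including <a> tags) and keeps the text.
        text = partag.get_text().strip()
        if text:  # skip paragraphs that contain only whitespace
            texts.append(text)
    return texts


paragraphs = paragraph_texts(SAMPLE_HTML)
for paragraph in paragraphs:
    print(paragraph)
```

Because only the `p` elements are visited, the text of the navigation link and the JSON inside the `script` tag never reach the output, while the link text inside a paragraph is kept as part of the sentence.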