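python - Comments are visible on the web page, but the HTML object returned by BeautifulSoup does not contain the comment section
Question
I am trying to extract the text of the comments from a web page, given its URL, and I am scraping it with BeautifulSoup. When I open the URL in a browser, the comments are visible on the page, but the HTML object returned by BeautifulSoup does not contain those tags or their text.
I am using BeautifulSoup with 'html.parser' for the scraping. I successfully extracted the number of likes, views, and comments for the video on the page, but the returned HTML does not contain the content of the comment section. My browser is Chrome and my system is Ubuntu 18.04.1 LTS.
Here is the code I am using (in Python):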
from urllib.error import HTTPError
from urllib.request import urlopen
from bs4 import BeautifulSoup

webpage_link = "https://www.airvuz.com/video/Majestic-Beast-Nanuk?id=59b2a56141ab4823e61ea901"
try:
    page = urlopen(webpage_link)
except HTTPError as err:  # the server returned an error status, e.g. page not found
    print("ERROR! %s (%s)" % (webpage_link, err))
else:
    # Parses only the HTML the server sent; no JavaScript is executed here
    soup = BeautifulSoup(page, 'html.parser')
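I expected the soup object to contain everything visible on the web page, in particular the text of the comments (for example, "Not being there I enjoyed a lot seeing the life style of white bear. Thanks to the provider for such documentary." and "WOOOW... amazing..."). However, I cannot find the corresponding nodes in the soup object. Any help would be appreciated!
Answer
The comments are generated by JavaScript through an Ajax request, so they are not part of the HTML that urlopen downloads. You can send the same request yourself and read the comments from the JSON response. You can find the request in the Network tab of the browser's inspection tool.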
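Comparing the two URLs in this thread, the comments endpoint appears to reuse the id query parameter of the video page. As a rough sketch under that assumption, the endpoint URL could be derived from the page URL instead of being copied by hand. The helper name build_comments_url and the constant API_BASE are made up for illustration; the path and the page/limit parameters are taken from the request seen in the Network tab and are not a documented API.

```python
from urllib.parse import parse_qs, urlencode, urlparse

# Assumption: endpoint path copied from the request shown in the Network tab.
API_BASE = "https://www.airvuz.com/api/comments/video/"


def build_comments_url(video_page_url, page=1, limit=20):
    """Hypothetical helper: map a video page URL to its comments API URL."""
    query = parse_qs(urlparse(video_page_url).query)
    ids = query.get("id")
    if not ids:
        raise ValueError("no 'id' query parameter in %r" % video_page_url)
    # Assumption: the API accepts the same id that the page URL carries.
    return API_BASE + ids[0] + "?" + urlencode({"page": page, "limit": limit})


if __name__ == "__main__":
    print(build_comments_url(
        "https://www.airvuz.com/video/Majestic-Beast-Nanuk?id=59b2a56141ab4823e61ea901"
    ))
```

For the video in the question this yields the same URL that the code below requests directly.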
import json
from urllib.request import urlopen

# Comments endpoint found in the browser's Network tab
webpage_link = "https://www.airvuz.com/api/comments/video/59b2a56141ab4823e61ea901?page=1&limit=20"
page = urlopen(webpage_link).read()
comments_json = json.loads(page)
# Each entry in 'data' describes one comment; print its text
for comment_info in comments_json['data']:
    print(comment_info['comment'].strip())
Output
Not being there I enjoyed a lot seeing the life style of white bear. Thanks to the provider for such documentary.
WOOOW... amazing...
I've been photographing polar bears for years, but to see this footage from a drones perspective was epic! Well done and congratz on the Nominee! Well deserved.
You are da man Florian!
Absolutely outstanding!
This is incredible
jaw dropping
This is wow amazing, love it.
So cool! Did the bears react to the drone at all?
Congratulations! It's awesome! I am watching in tears....
Awesome!
perfect video awesome
It is very, very beautiful !!! Sincere congratulations
Made my day, exquisite, thank you
Wow
Super!
Marvelous!
Man this is incredible!
Material is good, but edi is bad. This history about beer's family...
Muy bueno!
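The request above asks for page=1 with limit=20, so it returns at most the first 20 comments. If the video has more, one possible approach is to keep requesting pages until one comes back empty. The sketch below assumes that an out-of-range page returns an empty 'data' list, which I have not verified against the site, so max_pages caps the loop. The names parse_comments, fetch_page, and all_comments are illustrative only.

```python
import json
from urllib.request import urlopen

# Assumption: same endpoint shape as the single-page request above.
API_URL = "https://www.airvuz.com/api/comments/video/{video_id}?page={page}&limit={limit}"


def parse_comments(raw):
    """Decode one JSON response body and return the stripped comment texts."""
    payload = json.loads(raw)
    return [item["comment"].strip() for item in payload.get("data", [])]


def fetch_page(video_id, page, limit):
    """Download one page of comments (performs a real HTTP request)."""
    url = API_URL.format(video_id=video_id, page=page, limit=limit)
    with urlopen(url) as response:
        return response.read()


def all_comments(video_id, limit=20, max_pages=50, fetch=fetch_page):
    """Collect comments page by page until a page yields no comments.

    Assumption: a page past the end returns an empty 'data' list.
    max_pages bounds the loop in case that assumption does not hold.
    """
    comments = []
    for page in range(1, max_pages + 1):
        batch = parse_comments(fetch(video_id, page, limit))
        if not batch:
            break
        comments.extend(batch)
    return comments
```

Calling all_comments("59b2a56141ab4823e61ea901") would issue the real requests. The fetch parameter exists so that the paging logic can be tried with a stand-in function that returns canned JSON, without touching the network.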