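python-3.x - Unable to scrape data from the following HTML in Python
Question
I am trying to scrape data from MouthShut.com user reviews. When I inspect a review in DevTools, the text I need sits inside the following tag, a div with the class "more reviewdata":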
<div class="more reviewdata"> Ipohone 11 Pro X : Looks alike a minion having Three Eyes. yes its Seems as An Alien, But Technically Iphone is Copying features and Function of Androids and Having Custom Os Phones.Triple Camera is Great! for Wide Angle Photography.But The looks of Iphone 11 pro X isn't Good.If ...<a style="cursor:pointer" onclick="bindreviewcontent('2958778',this,false,'I found this review of Apple iPhone 11 Pro Max 512GB pretty useful',925993570,'.png','I found this review of Apple iPhone 11 Pro Max 512GB pretty useful %23WriteShareWin','https://www.mouthshut.com/review/Apple-iPhone-11-Pro-Max-512GB-review-omnstsstqun','Apple iPhone 11 Pro Max 512GB',' 1/5','omnstsstqun');">Read More</a></div>
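I only want to extract the text content of the review. Can anyone help with how to extract it, given that it has no unique delimiter?
This is the code I have so far: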
from requests import get
from bs4 import BeautifulSoup

url = 'https://www.mouthshut.com/mobile-phones/Apple-iPhone-11-Pro-Max-reviews-925993567'
response = get(url)  # the original defined bse_url but requested url
print(response.text[:100])

html_soup = BeautifulSoup(response.text, 'html.parser')
type(html_soup)

# Collect every review container on the page
reviews = html_soup.find_all('div', class_ = 'more reviewdata')
print(type(reviews))
print(len(reviews))

first_review = reviews[2]
first_review.div
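Answer
To scrape all reviews from the page, you can use this example. Some of the longer reviews are truncated on the page and have to be fetched separately with a POST request: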
import re
import requests
from textwrap import wrap
from bs4 import BeautifulSoup

base_url = 'https://www.mouthshut.com/mobile-phones/Apple-iPhone-11-Pro-Max-reviews-925993567'

# Form data for the POST request that returns a full review
data = {
    'type': 'review',
    'reviewid': -1,
    'corp': 'false',
    'catname': ''
}
more_url = 'https://www.mouthshut.com/review/CorporateResponse.ashx'

output = []
with requests.session() as s:
    soup = BeautifulSoup(s.get(base_url).text, 'html.parser')
    for review in soup.select('.reviewdata'):
        # Truncated reviews carry a "Read More" link with the review id in onclick
        a = review.select_one('a[onclick^="bindreviewcontent"]')
        if a:
            data['reviewid'] = re.findall(r"bindreviewcontent\('(\d+)", a['onclick'])[0]
            comment = BeautifulSoup(s.post(more_url, data=data).text, 'html.parser')
            # Drop the non-review elements before reading the text
            comment.div.extract()
            comment.ul.extract()
            output.append(comment.get_text(separator=' ', strip=True))
        else:
            # Short reviews are already complete on the page
            review.div.extract()
            output.append(review.get_text(separator=' ', strip=True))

for i, review in enumerate(output, 1):
    print('--- Review no.{} ---'.format(i))
    print(*wrap(review), sep='\n')
    print()
Prints:
--- Review no.1 ---
As you all know Apple products are too expensive this one is damn one
but who needs to sell his kidney to buy its look is not that much ease
than expected. For me it's 2 star phone
--- Review no.2 ---
Very disappointing product.nothing has changed in operating system,
only camera look has changed which is very odd looking.Device weight
is not light and dont fit in one hand.
--- Review no.3 ---
Ipohone 11 Pro X : Looks alike a minion having Three Eyes. yes its
Seems as An Alien, But Technically Iphone is Copying features and
Function of Androids and Having Custom Os Phones. Triple Camera is
Great! for Wide Angle Photography. But The looks of Iphone 11 pro X
isn't Good. If You Have 3 Kidneys, Then You Can Waste one of them to
... and so on.
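The step that picks the review id out of the "Read More" link can be tried on its own, without any network access. The sketch below runs the same regular expression against the onclick value from the question, shortened here to its first few arguments. The helper name extract_review_id is made up for this illustration and is not part of the answer's code.

```python
import re

# onclick value from the question's HTML, shortened for readability.
# Only the first argument (the review id) matters for this step.
onclick = (
    "bindreviewcontent('2958778',this,false,"
    "'I found this review of Apple iPhone 11 Pro Max 512GB pretty useful',"
    "925993570,'.png');"
)


def extract_review_id(onclick_value):
    """Hypothetical helper: return the leading numeric argument of
    bindreviewcontent(...) as a string, or None if the pattern is absent."""
    match = re.search(r"bindreviewcontent\('(\d+)", onclick_value)
    return match.group(1) if match else None


print(extract_review_id(onclick))
```

Returning None instead of indexing the result with [0] avoids an IndexError if a link ever lacks the expected pattern. Whether that case occurs on the live site is not something the original post states.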