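python-3.x - Trouble scraping Trustpilot review dates with BS4
Problem description
Given my code below, I can't get the ratings and their corresponding dates.
I can get the rating element, but I can't use .text on it. It returns the whole tag: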
</div>, <div class="star-rating star-rating--medium">
<img alt="5 stars: Excellent" src="//cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg"/>
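That means I have some cleaning up to do, but I'm sure it's possible to get only "5 stars: Excellent". I just don't know how.
As for the date, my line date = star.find("div", attrs={"class":"tooltip-container-1"}) only gives me None values, and I don't know why.
Please see my code and the HTML for the rating and the date below.
My code: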
import time

import requests
from bs4 import BeautifulSoup

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
#def get_total_items(url):
#soup = BeautifulSoup(requests.get(url, format(0),headers).text, 'lxml')
stars = []
dates = []
with requests.Session() as s:
    for num in range(1,2):
        url = "https://www.trustpilot.com/review/www.boozt.com?page={}".format(num)
        r = s.get(url, headers = headers)
        soup = BeautifulSoup(r.content, 'lxml')
        for star in soup.find_all("section", attrs={"class":"review__content"}):
            rating = star.find("div", attrs={"class":"star-rating star-rating--medium"})
            date = star.find("div", attrs={"class":"tooltip-container-1"})
            #print(rating)
            stars.append(rating)
            dates.append(date)
            #data = {"Rating": stars, "Dates": dates}
        time.sleep(2)
print(dates)
Rating HTML from Trustpilot:
<div class="star-rating star-rating--medium">
<img src="//cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg" alt="5 stars: Excellent">
</div>
Date HTML from Trustpilot:
<div class="v-popover">
<span aria-describedby="popover_o7e1fd7whi" class="trigger" style="display: inline-block;">
<time datetime="2020-01-20T10:09:54.000Z" title="Monday, January 20, 2020, 11:09:54 AM" class="review-date--tooltip-target">Jan 20, 2020</time>
<div class="tooltip-container-1"></div> <!----></span> </div>
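Solution
First, to get the rating value, e.g. "5 stars: Excellent", you just need to read the alt attribute of the img inside the div with class star-rating star-rating--medium.
Getting the date value is a bit trickier, because the date element you are targeting is loaded by JavaScript, so it is not in the HTML that requests receives. But you can get the date from the script tag above it, like this: star.find('script')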
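The two lookups described above can be tried offline before hitting the site. The following is a minimal sketch, not the page's actual markup: SAMPLE_HTML is a hypothetical fragment shaped like one review block, the script tag's type attribute and JSON payload are assumptions based on the publishedDate key used in this answer, and extract_review is an illustrative helper name.

```python
import json

from bs4 import BeautifulSoup

# Hypothetical fragment shaped like a single review block. The real page's
# markup may differ and can change at any time.
SAMPLE_HTML = """
<section class="review__content">
  <script type="application/json">
    {"publishedDate": "2020-01-28T05:37:13Z"}
  </script>
  <div class="star-rating star-rating--medium">
    <img alt="5 stars: Excellent" src="//cdn.trustpilot.net/brand-assets/4.1.0/stars/stars-5.svg"/>
  </div>
</section>
"""


def extract_review(section):
    """Return the rating text and published date for one review section."""
    # The rating is the alt text of the img inside the star-rating div.
    img = section.find("div", {"class": "star-rating star-rating--medium"}).find("img")
    rating = img.get("alt")
    # The date is read from the JSON embedded in the section's script tag.
    payload = json.loads(section.find("script").string)
    return {"Rating": rating, "Date": payload["publishedDate"]}


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
reviews = [
    extract_review(section)
    for section in soup.find_all("section", {"class": "review__content"})
]
print(reviews)
```

The sketch uses the built-in html.parser so that it does not depend on lxml; either parser should behave the same on a fragment this simple.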
I made a few updates to your code snippet. Here it is:
Code:
import requests
from bs4 import BeautifulSoup
import time
import json

headers = {"User-Agent":"Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:47.0) Gecko/20100101 Firefox/47.0"}
#def get_total_items(url):
#soup = BeautifulSoup(requests.get(url, format(0),headers).text, 'lxml')
stars = []
dates = []
results = []
with requests.Session() as s:
    for num in range(1,2):
        url = "https://www.trustpilot.com/review/www.boozt.com?page={}".format(num)
        r = s.get(url, headers = headers)
        soup = BeautifulSoup(r.content, 'lxml')
        for star in soup.find_all("section", {"class":"review__content"}):
            # Get rating value
            rating = star.find("div", {"class":"star-rating star-rating--medium"}).find('img').get('alt')
            # Get date value
            date_json = json.loads(star.find('script').text)
            date = date_json['publishedDate']
            stars.append(rating)
            dates.append(date)
            data = {"Rating": rating, "Date": date}
            results.append(data)
        time.sleep(2)
print(results)
Result:
[{'Date': '2020-01-28T05:37:13Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-28T00:00:48Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T23:22:58Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T21:20:32Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T21:06:42Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T19:37:16Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T19:27:38Z', 'Rating': '2 stars: Poor'},
{'Date': '2020-01-27T18:20:48Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T17:18:42Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T16:15:17Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T15:58:49Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T15:46:29Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T15:39:23Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T15:32:43Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T15:29:21Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T15:27:30Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T14:35:29Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T13:43:40Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T13:37:53Z', 'Rating': '5 stars: Excellent'},
{'Date': '2020-01-27T12:58:58Z', 'Rating': '5 stars: Excellent'}]
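If the dates and ratings are needed as values rather than strings, the result can be post-processed with the standard library. This is only a sketch under the assumption that every date follows the format shown above and every rating starts with the star count. parse_published_date and the Stars key are illustrative names, not part of the original answer, and the two rows are copied from the output above.

```python
from datetime import datetime, timezone


def parse_published_date(value):
    """Parse a timestamp such as '2020-01-28T05:37:13Z' into an aware UTC datetime."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


# Two rows copied from the result above.
results = [
    {"Date": "2020-01-28T05:37:13Z", "Rating": "5 stars: Excellent"},
    {"Date": "2020-01-27T19:27:38Z", "Rating": "2 stars: Poor"},
]

for row in results:
    row["Date"] = parse_published_date(row["Date"])
    # The alt text starts with the star count, e.g. "5 stars: Excellent".
    row["Stars"] = int(row["Rating"].split()[0])

print(results)
```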