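python - Scraping user ratings for The Hunger Games with Selenium and BeautifulSoup
Question
I am trying to collect all the user ratings (out of 5) for the first book of The Hunger Games trilogy on goodreads.com. The main challenge is that the reviews span several pages, but the URL does not change when another page of reviews is displayed. That is why I use Selenium to navigate whenever I need a new set of ratings.
Below you can see my code: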
# initiating the chromedriver
path_to_chromedriver = r'./chromedriver.exe'
#launch url
url = "https://www.goodreads.com/book/show/2767052-the-hunger-games"
# create a new Chrome session
driver = webdriver.Chrome(executable_path=path_to_chromedriver)
driver.implicitly_wait(30)
driver.get(url)
# initiating the beautifulsoup
soup_1=BeautifulSoup(driver.page_source, 'lxml')
# finding the table that includes all the book reviews
user = soup_1.find('div', {'id': 'bookReviews'})
# finding all the individual ratings from that table
user = user.find_all('div',{'class':'friendReviews elementListBrown'})
# locating the next button on the page which is indicated with 'next »'
elm = driver.find_element_by_partial_link_text('next »')
for i in range(9): # since there are 10 pages of reviews
    for row in user: # finding for each separate rating
        rating = {}
        try: # try and except is needed because not all the users have a rating
            rating['name'] = row.find('a',{'class': 'user'}).text # grabbing the username
            rating['rating'] = row.find('span',{'class':'staticStars'})['title'] # grabbing user rating out of 5
            ratings.append(rating)
        except:
            pass
    elm.click() # clicking on the next button to scrape the other page
df_rev = pd.DataFrame(ratings) # merging all the results to build a data frame
df_rev
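In the end, I expected to get every user who left a rating together with their rating. Instead, I ended up with a data frame that contains only the users and ratings from the first page, repeated over and over, running from the first user on that page to the last.
Result: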
name rating
0 Kiki liked it
1 Saniya it was amazing
2 Khanh it was amazing
3 Dija it was amazing
4 Nataliya really liked it
5 Jana did not like it
6 Cecily it was ok
7 Kiki liked it
8 Saniya it was amazing
9 Khanh it was amazing
10 Dija it was amazing
11 Nataliya really liked it
12 Jana did not like it
13 Cecily it was ok
14 Kiki liked it
15 Saniya it was amazing
16 Khanh it was amazing
17 Dija it was amazing
18 Nataliya really liked it
19 Jana did not like it
20 Cecily it was ok
21 Kiki liked it
22 Saniya it was amazing
23 Khanh it was amazing
24 Dija it was amazing
25 Nataliya really liked it
26 Jana did not like it
27 Cecily it was ok
...
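Answer
Well, as far as I can see, you never even initialize ratings.
However, I made a few small changes and it seems to work. There are some structural things I would change in your code. Quite a lot, actually. But I don't think your question calls for that.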
import os
import time

import pandas as pd
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

# Portable Chrome on the current drive; raw strings keep the backslashes literal
driveletter = os.getcwd().split(':')[0]
chrome_dir = driveletter + r":\PortableApps\GoogleChromePortable\App\Chrome-bin"
options = Options()
options.binary_location = chrome_dir + r"\chrome.exe"
options.add_argument('--headless')
# executable_path is the Selenium 3 keyword; Selenium 4 expects a Service object instead
driver = webdriver.Chrome(options=options, executable_path=chrome_dir + r"\chromedriver.exe")
#launch url
url = "https://www.goodreads.com/book/show/2767052-the-hunger-games"
# create a new Chrome session
driver.get(url)
ratings = list()
last_page_source = ''
while True:
    page_changed = False # It's useful to declare whether the page has changed or not
    attempts = 0
    while not page_changed:
        if last_page_source != driver.page_source:
            page_changed = True
        else:
            if attempts > 5: # Decide on some point when you want to give up.
                break
            else:
                time.sleep(3) # Give time to load new page. Interval could be shorter.
                attempts += 1
    if page_changed:
        # re-parse the CURRENT page source on every iteration
        soup_1 = BeautifulSoup(driver.page_source, 'lxml')
        user = soup_1.find('div', {'id': 'bookReviews'})
        user = user.find_all('div',{'class':'friendReviews elementListBrown'})
        for row in user: # finding for each separate rating
            rating = {}
            try:
                # try and except is needed because not all the users have a rating
                rating['name'] = row.find('a',{'class': 'user'}).text # grabbing the username
                rating['rating'] = row.find('span',{'class':'staticStars'})['title'] # grabbing user rating out of 5
                ratings.append(rating)
            except (AttributeError, TypeError, KeyError):
                # raised when the user link or the star span is missing
                pass
        last_page_source = driver.page_source
        next_page_element = driver.find_element(By.CLASS_NAME, 'next_page')
        driver.execute_script("arguments[0].click();", next_page_element) # clicking on the next button to scrape the other page
    else:
        # the page stopped changing: assume the last page was reached
        df_rev = pd.DataFrame(ratings) # merging all the results to build a data frame
        print(df_rev.drop_duplicates())
        break
Output:
name rating
0 Kiki liked it
1 Saniya it was amazing
2 Khanh, first of her name, mother of bunnies it was amazing
3 Dija it was amazing
4 Nataliya really liked it
5 Jana did not like it
6 Cecily it was ok
7 Meredith Holley it was amazing
8 Jayson really liked it
9 Chelsea Humphrey really liked it
10 Miranda Reads really liked it
11 ~Poppy~ really liked it
12 elissa it was amazing
13 Colleen Venable really liked it
14 Betsy it was amazing
15 Emily May really liked it
16 Lyndsey it was amazing
17 Morgan F it was amazing
18 Huda Yahya liked it
19 Nilesh Kashyap it was ok
20 Buggy it was amazing
21 Tessa liked it
22 Jamie it was amazing
23 Richard Derus did not like it
24 Maggie Stiefvater it was amazing
25 karen it was amazing
26 James it was amazing
27 Kai it was amazing
28 Brandi did not like it
29 Will Byrnes liked it
.. ... ...
263 shre ♡ it was amazing
264 Diane really liked it
265 Margaret Stohl it was amazing
266 Athena Shardbearer it was amazing
267 Ashley liked it
268 Geo Marcovici it was amazing
269 Pinky it was amazing
270 Mariel really liked it
271 Jim liked it
272 Frannie Pan it was amazing
273 Zanna really liked it
274 Χαρά Ζ. really liked it
275 Anzu The Great Destroyer really liked it
276 Beth it was amazing
277 Karla really liked it
278 Carla did not like it
279 Shawna it was amazing
280 Susane Colasanti it was amazing
281 Cherie really liked it
283 David Firmage liked it
284 Farith it was amazing
285 Tony DiTerlizzi it was amazing
286 Christy it was amazing
287 Emerald it was amazing
288 Sandra it was amazing
289 Chiara Pagliochini really liked it
290 Argona it was amazing
291 NZLisaM it was amazing
292 Vinaya it was amazing
293 Mac Ross it was amazing
[292 rows x 2 columns]
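The rating column holds the text of each star widget's title attribute rather than a number. Since the question asks for ratings out of 5, here is a minimal sketch of a lookup that converts those labels to star counts. The label-to-star mapping and the names RATING_LABELS and label_to_stars are my own assumptions for illustration; they are not taken from the code above, so check the labels against the live page before relying on them.

```python
# Assumed mapping from the hover label of a star widget to its star count.
RATING_LABELS = {
    "did not like it": 1,
    "it was ok": 2,
    "liked it": 3,
    "really liked it": 4,
    "it was amazing": 5,
}


def label_to_stars(label):
    """Return the star count for a rating label, or None if it is unknown."""
    return RATING_LABELS.get(label.strip().lower())
```

With the data frame from the answer, something like df_rev['stars'] = df_rev['rating'].map(label_to_stars) should add a numeric column, leaving missing values for any label the mapping does not know.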
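Explanation: you initialized BeautifulSoup from the page source of the initial URL and never parsed it again. Your clicks changed the page in the browser, but the loop kept iterating over the soup built from the first page.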
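The wait-and-retry step in the loop can also be pulled out into a small helper, which makes the "did the page change?" check easier to test on its own. This is only a sketch under my own assumptions: wait_for_change and its parameters are hypothetical names, and comparing the whole page source is a heuristic, since a page can change for reasons other than pagination.

```python
import time
from typing import Callable, Optional


def wait_for_change(get_source: Callable[[], str],
                    previous: str,
                    max_attempts: int = 5,
                    interval: float = 3.0) -> Optional[str]:
    """Poll get_source() until its result differs from previous.

    Returns the new source, or None if nothing changed after
    max_attempts polls (taken here to mean there is no further page).
    """
    for attempt in range(max_attempts):
        current = get_source()
        if current != previous:
            return current
        if attempt < max_attempts - 1:
            time.sleep(interval)  # give the next page time to load
    return None
```

In the scraping loop it would be called as wait_for_change(lambda: driver.page_source, last_page_source); a None result plays the role of the else branch that builds the data frame and stops.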
Edit: I had to make some changes because I made a mistake in my original reply.