python - Web Scraping: scraping multiple pages with Python
Question
from bs4 import BeautifulSoup
import requests

url = 'https://uk.trustpilot.com/review/thread.com'
for pg in range(1, 10):
    pg = url + '?page=' + str(pg)
    page = requests.get(pg)  # missing in the original attempt: fetch the page before parsing it
    soup = BeautifulSoup(page.content, 'lxml')
    for paragraph in soup.find_all('p'):
        print(paragraph.text)
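I want to scrape the rating, the review text, and the review date from https://uk.trustpilot.com/review/thread.com. However, I don't know how to scrape multiple pages and collect the scraped results into a DataFrame.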
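Answer
You need to send a request to each page and then process the response. Some items are not available directly as the text of a tag, so you can take them either from the embedded JavaScript data (I load the date with json this way) or from a class name (this is how I get the rating).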
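The two tricks can be tried in isolation, without any network access. This is only a sketch: it assumes the date element holds a JSON string with a `publishedDate` key and that the rating is the last `-`-separated part of the second class name, as the answer's code implies. The sample values and the helper names `parse_date` and `parse_rating` are made up for illustration.

```python
import json


def parse_date(json_text):
    """Return the YYYY-MM-DD part of the 'publishedDate' field."""
    data = json.loads(json_text)
    return data['publishedDate'].split('T')[0]


def parse_rating(class_list):
    """Return the number at the end of the second class name as an int."""
    return int(class_list[1].split('-')[-1])


# Hypothetical sample values, shaped like what the page is assumed to contain.
sample_json = '{"publishedDate": "2019-01-13T10:20:30Z", "updatedDate": null}'
sample_classes = ['star-rating', 'star-rating-1', 'star-rating--medium']

print(parse_date(sample_json))       # 2019-01-13
print(parse_rating(sample_classes))  # 1
```

Unlike the answer's code, `parse_rating` converts the rating to an int, which may be more convenient if the column is later averaged or filtered.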
from bs4 import BeautifulSoup
import json
import pandas as pd
import requests

final_list = []  # rows that will become the DataFrame
url = 'https://uk.trustpilot.com/review/thread.com'
for pg in range(1, 3):
    page_url = url + '?page=' + str(pg)
    r = requests.get(page_url)
    soup = BeautifulSoup(r.text, 'lxml')
    for review in soup.find_all('section', class_='review__content'):
        title = review.find('h2', class_='review-content__title').text.strip()
        content = review.find('p', class_='review-content__text').text.strip()
        # the dates are embedded as JSON text inside this div
        datedata = json.loads(review.find('div', class_='review-content-header__dates').text)
        date = datedata['publishedDate'].split('T')[0]
        # the rating is the last '-'-separated part of the second class name
        rating_class = review.find('div', class_='star-rating')['class']
        rating = rating_class[1].split('-')[-1]
        final_list.append([title, content, date, rating])
df = pd.DataFrame(final_list, columns=['Title', 'Content', 'Date', 'Rating'])
print(df)
Output
    Title                                              Content                                            Date        Rating
0   I ordered a jacket 2 weeks ago                     I ordered a jacket 2 weeks ago. Still hasn't ...   2019-01-13  1
1   I've used this service for many years…             I've used this service for many years and get ...  2018-12-31  4
2   Great website                                      Great website, tailored recommendations, and e...  2018-12-19  5
3   I was excited by the prospect offered…             I was excited by the prospect offered by threa...  2018-12-18  1
4   Thread set the benchmark for customer service      Firstly, their customer service is second to n...  2018-12-12  5
5   It's a good idea                                   It's a good idea. I am in between sizes and d...   2018-12-02  3
6   Great experience so far                            Great experience so far. Big choice of clothes...  2018-10-31  5
7   Absolutely love using Thread.com                   Absolutely love using Thread.com. As a man wh...   2018-10-31  5
8   I'd like to give Thread a one star…                I'd like to give Thread a one star review, but...  2018-10-30  2
9   Really enjoying the shopping experience…           Really enjoying the shopping experience on thi...  2018-10-22  5
10  The only way I buy clothes                         I absolutely love Thread. I've been surviving ...  2018-10-15  5
11  Excellent Service                                  Excellent ServiceQuick delivery, nice items th...  2018-07-27  5
12  Convenient way to order clothes online             Convenient way to order clothes online, and gr...  2018-07-05  5
13  Superb - would thoroughly recommend                Recommendations have been brilliant - no more ...  2018-06-24  5
14  First time ordering from Thread                    First time ordering from Thread - Very slow de...  2018-06-22  1
15  Some of these criticisms are just madness          I absolutely love thread.com, and I can't reco...  2018-05-28  5
16  Top service!                                       Great idea and fantastic service. I just recei...  2018-05-17  5
17  Great service                                      Great service. Great clothes which come well p...  2018-05-05  5
18  Thumbs up                                          Easy, straightforward and very good costumer s...  2018-04-17  5
19  Good idea, ruined by slow delivery                 I really love the concept and the ordering pro...  2018-04-08  3
20  I love Thread                                      I have been using thread for over a year. It i...  2018-03-12  5
21  Clever simple idea but.. low quality clothing      Clever simple idea but.. low quality clothingL...  2018-03-12  2
22  Initially I was impressed....                      Initially I was impressed with the Thread shop...  2018-02-07  2
23  Happy new customer                                 Joined the site a few weeks ago, took a short ...  2018-02-06  5
24  Style tips for mature men                          I'm a man of mature age, let's say a "baby boo...  2018-01-31  5
25  Every shop, every item and in one place            Simple, intuitive and makes online shopping a ...  2018-01-28  5
26  Fantastic experience all round                     Fantastic experience all round. Quick to regi...   2018-01-28  5
27  Superb "all in one" shopping experience …          Superb "all in one" shopping experience that i...  2018-01-25  5
28  Great for time poor people who aren’t fond of ...  Rally love this company. Super useful for thos...  2018-01-22  5
29  Really is worth trying!                            Quite cautious at first, however, love the way...  2018-01-10  4
30  14 days for returns is very poor given …           14 days for returns is very poor given most co...  2017-12-20  3
31  A great intro to online clothes …                  A great intro to online clothes shopping. Usef...  2017-12-15  5
32  I was skeptical at first                           I was skeptical at first, but the service is s...  2017-11-16  5
33  seems good to me as i hate to shop in …            seems good to me as i hate to shop in stores, ...  2017-10-23  5
34  Great concept and service                          Great concept and service. This service has be...  2017-10-17  5
35  Slow dispatch                                      My Order Dispatch was extremely slow compared ...  2017-10-07  1
36  This company sends me clothes in boxes             This company sends me clothes in boxes! I find...  2017-08-28  5
37  I've been using Thread for the past six …          I've been using Thread for the past six months...  2017-08-03  5
38  Thread                                             Thread, this site right here is literally the ...  2017-06-22  5
39  good concept                                       The website is a good concept in helping buyer...  2017-06-14  3
Note: although I was able to "hack" my way to results for this site, it is better to use selenium for scraping dynamic pages.
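For reference, a minimal sketch of what the selenium route might look like. It assumes the `selenium` package is installed and that Chrome with a matching driver is available on the machine. The function name `fetch_rendered_html` is hypothetical, and nothing here has been checked against the site itself.

```python
def fetch_rendered_html(url):
    """Load url in a real browser and return the HTML after scripts have run."""
    # Imported inside the function so it can be defined even where selenium is absent.
    from selenium import webdriver

    driver = webdriver.Chrome()  # assumption: Chrome and its driver are installed
    try:
        driver.get(url)
        return driver.page_source
    finally:
        # Always close the browser, even if loading the page fails.
        driver.quit()
```

The returned string can be passed to `BeautifulSoup(html, 'lxml')` in place of `r.text`, so the parsing code stays the same.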
Edit: code that works out the number of pages automatically
from bs4 import BeautifulSoup
import json
import math
import pandas as pd
import requests

final_list = []  # rows that will become the DataFrame
url = 'https://uk.trustpilot.com/review/thread.com'
# make one request first to read the total number of reviews
r = requests.get(url)
soup = BeautifulSoup(r.text, 'lxml')
review_count_h2 = soup.find('h2', class_='header--inline').text
review_count = int(review_count_h2.strip().split(' ')[0].strip())
# there are 20 reviews per page, so the number of pages is
pages = int(math.ceil(review_count / 20))
# loop over pages 1 to pages inclusive
for pg in range(1, pages + 1):
    page_url = url + '?page=' + str(pg)
    r = requests.get(page_url)
    soup = BeautifulSoup(r.text, 'lxml')
    for review in soup.find_all('section', class_='review__content'):
        try:
            title = review.find('h2', class_='review-content__title').text.strip()
            content = review.find('p', class_='review-content__text').text.strip()
            # the dates are embedded as JSON text inside this div
            datedata = json.loads(review.find('div', class_='review-content-header__dates').text)
            date = datedata['publishedDate'].split('T')[0]
            # the rating is the last '-'-separated part of the second class name
            rating_class = review.find('div', class_='star-rating')['class']
            rating = rating_class[1].split('-')[-1]
            final_list.append([title, content, date, rating])
        except AttributeError:
            # skip reviews that lack one of the expected elements
            pass
df = pd.DataFrame(final_list, columns=['Title', 'Content', 'Date', 'Rating'])
print(df)
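The page arithmetic can be checked on its own, without hitting the site. The sketch below assumes 20 reviews per page, as stated in the code comments above, and assumes the header text starts with the count (possibly with thousands separators, which a plain `int()` would reject). The helper names are hypothetical.

```python
import math


def parse_review_count(header_text):
    """Take the first token of text like '1,234 reviews' and return it as an int."""
    first_token = header_text.strip().split(' ')[0]
    return int(first_token.replace(',', ''))


def page_count(review_count, per_page=20):
    """Number of pages needed to show review_count reviews."""
    return math.ceil(review_count / per_page)


def page_urls(base_url, pages):
    """Build the '?page=N' URLs for pages 1..pages inclusive."""
    return [base_url + '?page=' + str(n) for n in range(1, pages + 1)]


print(parse_review_count(' 1,234 reviews '))  # 1234
print(page_count(40))                         # 2
print(page_count(41))                         # 3
print(page_urls('https://uk.trustpilot.com/review/thread.com', 2))
```

Rounding up with `math.ceil` matters because a partly filled last page still has to be requested: 41 reviews need 3 pages, not 2.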