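python - JSONDecodeError: Expecting value: line 1 column 1 (char 0) with json.loads(fragment)
Problem description
I am a beginner practicing web scraping, working from the book "Practical Web Scraping for Data Science". When I run the code below, I get "JSONDecodeError: Expecting value: line 1 column 1 (char 0)", and I have been stuck on it from the start. Any help would be much appreciated.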
# Required packages
import requests
import json
import re
from bs4 import BeautifulSoup as bs
import dataset

# Creating Dataset into Mongodb / SQLite
db = dataset.connect('sqlite:/// reviews.db')

review_url = 'https://www.amazon.com/ss/customer-reviews/ajax/reviews/get/'
product_id = '1449355730'

session = requests.Session()
session.headers.update({
    'User-Agent' : 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36' +
                   '(KHTML, like Gecko) Chrome/ 62.0.0.3202.62 Safari/537.36'
})
session.get('https://www.amazon.com/product-reviews/{}/'.format(product_id))

def parse_reviews(reply):
    reviews = []
    for fragment in reply.split('&&&'):
        if not fragment.strip():
            continue
        json_fragment = json.loads(fragment)
        if json_fragment[0] != 'append':
            continue
        html_soup = bs(json_fragment[2], 'html.parser')
        div = html_soup.find('div', class_='review')
        if not div:
            continue
        review_id = div.get('id')
        # find & clean the rating :
        review_classes = ' '.join(html_soup.find(class_ = 'review-rating').get('class'))
        rating = re.search('a-star-(\d+)', review_classes).group(1)
        title = html_soup.find(class_='review-title').get_text(strip = True)
        review = html_soup.find(class_='review-text').get_text(strip = True)
        review.append({'review_id' : review_id,
                       'rating' : rating,
                       'title' : title,
                       'review' : review})
    return reviews

def get_reviews(product_id, page):
    data = {
        'sortBy' : '',
        'reveiwerType' : 'all_reviews',
        'formatType' : '',
        'mediaType' : '',
        'filterByStar' : 'all_stars',
        'pageNumber' : page,
        'filterByKeyword' : '',
        'shouldAppend' : 'undefined',
        'deviceType' : 'desktop',
        'reftag' : 'cm_cr_getr_d_paging_btm_{}'.format(page),
        'pageSize' : 15,
        'asin' : product_id,
        'scope' : 'reviewsAjax1'
    }
    r = session.post(review_url + 'ref=' + data['reftag'], data = data)
    reviews = parse_reviews(r.text)
    return reviews

page = 1
while True:
    print("Scraping page", page)
    reviews = get_reviews(product_id, page)
    if not reviews:
        break
    for review in reviews:
        print(' -', review['rating'], review['title'])
        db['reviews'].upsert(review, ['review_id'])
    page += 1
It gives me the following error message:
**JSONDecodeError** Traceback (most recent call last)
<ipython-input-5-75cef79b98a4> in <module>
60 while True:
61 print("Scraping page", page)
---> 62 reviews = get_reviews(product_id, page)
63 if not reviews:
64 break
<ipython-input-5-75cef79b98a4> in get_reviews(product_id, page)
54 }
55 r = session.post(review_url + 'ref=' + data['reftag'], data = data)
---> 56 reviews = parse_reviews(r.text)
57 return reviews
58
<ipython-input-5-75cef79b98a4> in parse_reviews(reply)
17 if not fragment.strip():
18 continue
---> 19 json_fragment = json.loads(fragment)
20 if json_fragment[0] != 'append':
21 continue
**JSONDecodeError:** Expecting value: line 1 column 1 (char 0)
Please help me solve this. I have tried everything I could think of and I am still stuck. Thanks in advance.
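Solution
As mentioned, fragment is probably not valid JSON (when I checked, it was not). json.loads raises "Expecting value: line 1 column 1 (char 0)" when the very first character of its input cannot start a JSON value, which is what happens when the server returns HTML or an error page instead of JSON. I suspect the book is a few years out of date, so the examples and code it uses may no longer work. After playing around with it a bit, it looks like Amazon has indeed changed a few things.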
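One way to check what the server actually sent back is to decode each fragment defensively and keep the pieces that fail, rather than letting the first bad fragment abort the run. The following is only a minimal sketch: the helper name load_fragments and both sample replies are made up for illustration and are not real Amazon output.

```python
import json

def load_fragments(reply, delimiter='&&&'):
    """Split a reply on the delimiter and decode each piece as JSON.

    Returns (decoded, rejected): decoded holds the parsed values,
    rejected holds the first 80 characters of each piece that failed.
    """
    decoded, rejected = [], []
    for fragment in reply.split(delimiter):
        fragment = fragment.strip()
        if not fragment:
            continue
        try:
            decoded.append(json.loads(fragment))
        except json.JSONDecodeError:
            # Keep a short preview so the unexpected content can be inspected.
            rejected.append(fragment[:80])
    return decoded, rejected

# Hypothetical replies, invented for this example.
good_reply = '["append", "#list", "<div>review</div>"]\n&&&\n["update", "#x", ""]\n&&&\n'
bad_reply = '<!DOCTYPE html><html><body>Sorry</body></html>'

for name, reply in (('good', good_reply), ('bad', bad_reply)):
    decoded, rejected = load_fragments(reply)
    print(name, 'decoded:', len(decoded), 'rejected:', rejected)
```

If every fragment ends up in rejected and the previews look like HTML, the request itself is the problem (wrong URL, wrong form data, or a blocked session), not the JSON parsing.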
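The version below did work for me. I have marked the small changes so you can compare. I also commented out the database (dataset) code, since this is more of a web scraping question; I don't know whether that part will give you any errors: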
# Required packages
import requests
import json
import re
from bs4 import BeautifulSoup as bs
#import dataset

# Creating Dataset into Mongodb / SQLite
#db = dataset.connect('sqlite:/// reviews.db')

review_url = 'https://www.amazon.com/hz/reviews-render/ajax/reviews/get/' #<-- slight change
product_id = '1449355730'

session = requests.Session()
session.headers.update({
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.190 Safari/537.36'
})
url = 'https://www.amazon.com/product-reviews/{}/'.format(product_id)
session.get(url)

def parse_reviews(reply):
    reviews = []
    for fragment in reply.split('&&&'):
        if not fragment.strip():
            continue
        json_fragment = json.loads(fragment)
        if json_fragment[0] != 'append':
            continue
        html_soup = bs(json_fragment[2], 'html.parser')
        div = html_soup.find('div', {'data-hook':'review'}) #<-- changed
        if not div:
            continue
        review_id = div.get('id')
        # find & clean the rating (raw string avoids an invalid-escape warning)
        review_classes = ' '.join(html_soup.find(class_ = 'review-rating').get('class'))
        rating = re.search(r'a-star-(\d+)', review_classes).group(1)
        title = html_soup.find(class_='review-title').get_text(strip = True)
        review = html_soup.find(class_='review-text').get_text(strip = True)
        reviews.append({'review_id' : review_id, #<-- typo fixed: append to reviews, not review
                        'rating' : rating,
                        'title' : title,
                        'review' : review})
    return reviews

def get_reviews(product_id, page):
    data = {
        'sortBy' : '',
        'reveiwerType' : 'all_reviews',
        'formatType' : '',
        'mediaType' : '',
        'filterByStar' : 'all_stars',
        'pageNumber' : page,
        'filterByKeyword' : '',
        'shouldAppend' : 'undefined',
        'deviceType' : 'desktop',
        'reftag' : 'cm_cr_getr_d_paging_btm_{}'.format(page),
        'pageSize' : 15,
        'asin' : product_id,
        'scope' : 'reviewsAjax2' #<-- changed
    }
    r = session.post(review_url + 'ref=' + data['reftag'], data = data)
    reviews = parse_reviews(r.text)
    return reviews

page = 1
while True:
    print("Scrapping page", page)
    reviews = get_reviews(product_id, page)
    if not reviews:
        break
    for review in reviews:
        print(' -', review['rating'], review['title'])
        #db['reviews'].upsert(review, ['review_id'])
    page += 1
Output:
Scrapping page 1
- 5 Best Python book for a beginner
- 2 Thorough but bloated
- 5 let me try to explain why this 1600 page book may actually end up saving you a lot of time and making you a better Python progra
- 3 Very dense. Too much apology for being dense. Very detailed, yet inefficient.
- 5 The book is long because it's thorough, and it's a quality book
- 4 The Python Bible - not for beginners
- 1 Making Python, and programming, the most boring experience you can think of
- 4 Not great for learning, good object oriented chapters
- 5 Perfect for ... in-between noob and professional, and wanting a deep understanding
- 3 I think there might be an excellent 300-page book somewhere in these 1500 pages
- 5 A Mark Lutz Trifecta of Python Winners
- 5 Perfect for self-learners of Python
- 5 Excellent Reference (Probably not for beginners)
- 3 I'm glad it's here but it needs to be two books.
- 4 From Noob to Expert
Scrapping page 2
- 5 This is the real deal. The full Python experience
- 1 Incredibly verbose and repetitve.
- 5 Very good Python beginner to intermediate book for an experienced programmer
- 1 Bloated and not very useful
- 5 Yeah it's that long for a reason
- 3 Not bad, but not recommended, especially not for beginners.
- 2 Too much fluff
- 5 This is most comprehensive for beginner to build solid foundation for python programming! Must buy! Believe me!
- 3 Broad, but occasionally confusing and unfocused
- 4 Really Good Overall, But Long-Winded
- 5 Book is up-to-date despite publication date
- 5 This is the BEST book on the Python programming language I have found.
- 5 Highly recommend for the new user (avoid being put off by the length of the text)
- 5 Terrific book
- 5 Great start, and written for the novice
Scrapping page 3
- 4 Great Book but, geez, 8-point type?
- 5 Incredibly detailed, thorough, but not a quick read
- 2 Very wordy beginning programming with Python.
- 5 A great tool for achieving Python programming expertise
- 3 Brief and honest review
....