python - How to parse JavaScript code returned from a GET request in Python
Problem description
I am sending the following GET request
<a href="#" onclick="new Ajax.Request('/book/reviews/4981?authenticity_token=vxZvklgqILI3SBwtJLDN5DicJKt93LiOWxYwFa%2BrWDdsJxTTAs46WvPN3L1PKNW3qpmacr%2BnWYXexhR%2BfoB3Cw%3D%3D&amp;hide_last_page=true&amp;language_code=en&amp;page=4', {asynchronous:true, evalScripts:true, method:'get', parameters:'authenticity_token=' + encodeURIComponent('FUvf1v6N9TgtBKVmo5I3YLm3yVwb//WU9zZDdj1oWd3GeqSXpGnv0OmBZfbICi8zK7J3hdmEFJ9y5mcd7EN24Q==')}); return false;">4</a>
In Python, I wrote it as
import urllib.parse
import requests
URL = 'https://www.goodreads.com/book/reviews/4981?authenticity_token=vxZvklgqILI3SBwtJLDN5DicJKt93LiOWxYwFa%2BrWDdsJxTTAs46WvPN3L1PKNW3qpmacr%2BnWYXexhR%2BfoB3Cw%3D%3D&amp;hide_last_page=true&amp;language_code=en&amp;page=4'
s = 'FUvf1v6N9TgtBKVmo5I3YLm3yVwb//WU9zZDdj1oWd3GeqSXpGnv0OmBZfbICi8zK7J3hdmEFJ9y5mcd7EN24Q=='
PARAMS = {'asynchronous':True,
'evalScripts':True,
'method':'get',
'parameters':'authenticity_token=' + urllib.parse.quote(s.encode("utf-8"))
}
r = requests.get(url = URL, params = PARAMS)
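I'm new to this, but the response seems to be encoded as text that doesn't look like plain ASCII. The returned code also contains HTML, which is exactly what I want. Here is part of what was returned: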
b'Element.update("reviews", "\\n\\u003cdiv class=\\"bookReviewsPaginationCount\\"\\u003e\\n
\\u003cspan class=\\"smallText\\"\\u003e\\nShowing 91-120\\n\\u003c/span\\u003e\\n\\n\\u003c/div\\u003e\\n\\n\\n\\u003cdiv id=\\"reviewControls\\"\\n class=\\"reviewControls u-defaultType clearFix\\"\\u003e\\n \\u003cdiv class=\\"reviewControls--left\\"\\u003e\\n
\\u003cspan class=\\"stars staticStars notranslate\\"\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p10\\"\\u003e\\u003c/span\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p10\\"\\u003e\\u003c/span\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p10\\"\\u003e\\u003c/span\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p10\\"\\u003e\\u003c/span\\u003e\\u003cspan size=\\"12x12\\" class=\\"staticStar p3\\"\\u003e\\u003c/span\\u003e\\u003c/span\\u003e\\n
\\u003cspan class=\\"u-visuallyHidden\\"\\u003eAverage rating\\u003c/span\\u003e\\n 4.07\\n \\u003cspan class=\\"greyText\\"\\u003e\\u0026nbsp;\\u0026middot;\\u0026nbsp;\\u003c/span\\u003e\\n \\u003c/div\\u003e\\n \\u003cdiv class=\\"reviewControls__ratingDetails reviewControls--left rating_graph\\"\\u003e\\n \\u003cspan id=\\"reviewControls__ratingDetailsMiniGraph\\"\\u003e\\n \\u003cscript type=\\"text/javascript\\"\\u003e\\n
//\\u003c![CDATA[\\n $j(document).ready(function() {\\n var vis = renderRatingGraph(\\n [436969, 351497, 175037, 52003, 27985],\\n \\"reviewControls__ratingDetailsMiniGraph\\");\\n $j(\\"#reviewControls__ratingDetailsMiniGraph\\").prependTo(\\"#rating_details_tip\\");\\n });\\n
Is there a way to parse this code? I tried:
BeautifulSoup scraping from a JavaScript (encoded) variable
but it doesn't work on the code I get back.
Thanks
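Solution
The returned string looks like JavaScript code that generates HTML elements from a string literal. You can take that string literal with a slice, r.text[27:-2], and then apply encode().decode('unicode_escape') to it to get a string that BeautifulSoup can parse.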
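To see what the two steps do without sending a request, here is a minimal sketch on a shortened sample. The `sample` string is an abbreviated stand-in for `r.text`, and it assumes the response ends with `")`; the real response may end differently, in which case the `-2` would need adjusting.

```python
# Shortened stand-in for r.text (assumption: the call ends with '")').
sample = 'Element.update("reviews", "\\n\\u003cspan class=\\"smallText\\"\\u003e\\nShowing 91-120\\n\\u003c/span\\u003e")'

# 'Element.update("reviews", "' is 27 characters long, so [27:-2]
# keeps only the contents of the JavaScript string literal.
literal = sample[27:-2]

# Turn the escape sequences (\n, \u003c, \") into real characters.
html_str = literal.encode().decode('unicode_escape')
print(html_str)
```

After decoding, `html_str` holds ordinary HTML markup and can be handed to BeautifulSoup.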
import urllib.parse
import requests
from bs4 import BeautifulSoup as Soup
# The '&amp;' entities copied from the HTML attribute are written as plain '&' here.
URL = 'https://www.goodreads.com/book/reviews/4981?authenticity_token=vxZvklgqILI3SBwtJLDN5DicJKt93LiOWxYwFa%2BrWDdsJxTTAs46WvPN3L1PKNW3qpmacr%2BnWYXexhR%2BfoB3Cw%3D%3D&hide_last_page=true&language_code=en&page=4'
s = 'FUvf1v6N9TgtBKVmo5I3YLm3yVwb//WU9zZDdj1oWd3GeqSXpGnv0OmBZfbICi8zK7J3hdmEFJ9y5mcd7EN24Q=='
PARAMS = {'asynchronous':True,
'evalScripts':True,
'method':'get',
'parameters':'authenticity_token=' + urllib.parse.quote(s.encode("utf-8"))
}
r = requests.get(url = URL, params = PARAMS)
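# Skip the leading 'Element.update("reviews", "' (27 characters) and the last 2 characters,
# then turn the escape sequences (\n, \u003c, \") into real characters.
html_str = r.text[27:-2].encode().decode('unicode_escape')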
soup = Soup(html_str, "html.parser")
print(soup)
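The fixed slice [27:-2] depends on the exact prefix and suffix of the response. As a possible alternative, here is a minimal sketch that finds the string literal with a regular expression and decodes it with json.loads. It assumes the literal uses only JSON-compatible escapes (\n, \", \uXXXX), as in the excerpt shown in the question. The helper name `extract_update_html` and the `sample` string are hypothetical and used only for illustration.

```python
import json
import re

# Matches Element.update("<id>", "<string literal>") and captures the
# second argument, including its surrounding double quotes.
UPDATE_CALL = re.compile(
    r'Element\.update\(\s*"[^"]*"\s*,\s*("(?:[^"\\]|\\.)*")\s*\)',
    re.DOTALL,
)

def extract_update_html(js_text):
    """Return the decoded HTML passed as the second argument of Element.update."""
    match = UPDATE_CALL.search(js_text)
    if match is None:
        raise ValueError("no Element.update(...) call found")
    # strict=False also tolerates raw control characters inside the literal.
    return json.loads(match.group(1), strict=False)

# Shortened stand-in for r.text.
sample = 'Element.update("reviews", "\\n\\u003cdiv class=\\"bookReviewsPaginationCount\\"\\u003e\\n\\u003cspan class=\\"smallText\\"\\u003e\\nShowing 91-120\\n\\u003c/span\\u003e\\n\\u003c/div\\u003e");'
html = extract_update_html(sample)
print(html)
```

The resulting `html` can be passed to Soup(html, "html.parser") in the same way as html_str above. One difference from the unicode_escape approach: unicode_escape decodes non-escaped bytes as Latin-1, so non-ASCII characters in the response can come out garbled, while json.loads works on the text directly.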