python - Getting only part of the JSON from a website I am trying to scrape with Python, BeautifulSoup, and Requests: 20 of 62 results returned
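Problem description
I am trying to scrape job openings from this website:
I looked at the developer tools and saw that the page sends an XHR request to the site to retrieve the job openings as a JSON object:
So I thought, "Great! I can parse this in two seconds with a Python program like this":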
'''
from bs4 import BeautifulSoup
import json
import requests

def crawl():
    union = requests.get('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults').content
    soup = BeautifulSoup(union, 'html.parser')
    newDict = json.loads(str(soup))
    for job in newDict['opportunities']:
        print(job['Title'])

crawl()
'''
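It turned out that this request returned only 20 of the 62 job openings. So I went back to the page and loaded the full list (by clicking "View more opportunities").
The developer tools showed that the page sent another XHR request to the same link, but when I inspected it, it still showed only 20 records.
How can I scrape all of the records from this page? It would be great if someone could explain what is happening behind the scenes. I am fairly new to web scraping, so any insight is appreciated.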
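Solution
You don't need to scrape the HTML. As you said, the API that returns all of the JSON is the link https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults, but you need to send it a POST request with these parameters set in the request body: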
import requests
headers = {
    'Content-Type': 'application/json'
}
data = '{\n "opportunitySearch": {\n "Top": 62,\n "Skip": 0,\n "QueryString": "",\n "OrderBy": [\n {\n "Value": "postedDateDesc",\n "PropertyName": "PostedDate",\n "Ascending": false\n }\n ],\n "Filters": [\n {\n "t": "TermsSearchFilterDto",\n "fieldName": 4,\n "extra": null,\n "values": [\n \n ]\n },\n {\n "t": "TermsSearchFilterDto",\n "fieldName": 5,\n "extra": null,\n "values": [\n \n ]\n },\n {\n "t": "TermsSearchFilterDto",\n "fieldName": 6,\n "extra": null,\n "values": [\n \n ]\n }\n ]\n },\n "matchCriteria": {\n "PreferredJobs": [\n \n ],\n "Educations": [\n \n ],\n "LicenseAndCertifications": [\n \n ],\n "Skills": [\n \n ],\n "hasNoLicenses": false,\n "SkippedSkills": [\n \n ]\n }\n}'
response = requests.post('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults', headers=headers, data=data)
print(response.text)
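The hand-escaped JSON string above is awkward to edit. As a minimal sketch, assuming the endpoint accepts exactly the body shown in the answer, the same payload could be built as a Python dict and serialized with `json.dumps`. The name `build_payload` is a hypothetical helper introduced here for illustration, not part of the site's API.

```python
import json


def build_payload(top, skip=0):
    """Return the search body from the answer as a dict (hypothetical helper)."""
    # The answer sends three empty term filters, for fieldName 4, 5 and 6.
    empty_filters = [
        {"t": "TermsSearchFilterDto", "fieldName": field, "extra": None, "values": []}
        for field in (4, 5, 6)
    ]
    return {
        "opportunitySearch": {
            "Top": top,      # how many records to return
            "Skip": skip,    # how many records to skip first
            "QueryString": "",
            "OrderBy": [
                {
                    "Value": "postedDateDesc",
                    "PropertyName": "PostedDate",
                    "Ascending": False,
                }
            ],
            "Filters": empty_filters,
        },
        "matchCriteria": {
            "PreferredJobs": [],
            "Educations": [],
            "LicenseAndCertifications": [],
            "Skills": [],
            "hasNoLicenses": False,
            "SkippedSkills": [],
        },
    }


# Serialize to the JSON text that would go in the request body.
body = json.dumps(build_payload(top=62))
```

With Requests, passing the dict directly as `requests.post(url, json=build_payload(62))` should serialize it and set the `Content-Type: application/json` header, so the separate `headers` dict would not be needed.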
Here is the same request using pandas (pip install pandas):
import requests
import pandas as pd
pd.set_option('display.width', 1000)
headers = {
    'Content-Type': 'application/json'
}
data = '{\n "opportunitySearch": {\n "Top": 62,\n "Skip": 0,\n "QueryString": "",\n "OrderBy": [\n {\n "Value": "postedDateDesc",\n "PropertyName": "PostedDate",\n "Ascending": false\n }\n ],\n "Filters": [\n {\n "t": "TermsSearchFilterDto",\n "fieldName": 4,\n "extra": null,\n "values": [\n \n ]\n },\n {\n "t": "TermsSearchFilterDto",\n "fieldName": 5,\n "extra": null,\n "values": [\n \n ]\n },\n {\n "t": "TermsSearchFilterDto",\n "fieldName": 6,\n "extra": null,\n "values": [\n \n ]\n }\n ]\n },\n "matchCriteria": {\n "PreferredJobs": [\n \n ],\n "Educations": [\n \n ],\n "LicenseAndCertifications": [\n \n ],\n "Skills": [\n \n ],\n "hasNoLicenses": false,\n "SkippedSkills": [\n \n ]\n }\n}'
response = requests.post('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults', headers=headers, data=data)
data = response.json()
df = pd.DataFrame.from_dict(data['opportunities'])
df = df[['Id', 'Title', 'RequisitionNumber', 'JobCategoryName', 'PostedDate']]
print(df.head(5))
The "Top": 62 value in the request body is what limits the number of results you get back:
{
  "opportunitySearch": {
    "Top": 62,
    "Skip": 0,
    "QueryString": "",
    "OrderBy": [
      {
        "Value": "postedDateDesc",
        "PropertyName": "PostedDate",
        "Ascending": false
      }
    ],
    "Filters": [
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 4,
        "extra": null,
        "values": []
      },
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 5,
        "extra": null,
        "values": []
      },
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 6,
        "extra": null,
        "values": []
      }
    ]
  },
  "matchCriteria": {
    "PreferredJobs": [],
    "Educations": [],
    "LicenseAndCertifications": [],
    "Skills": [],
    "hasNoLicenses": false,
    "SkippedSkills": []
  }
}
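Because `Top` and `Skip` look like a page size and an offset, it may also be possible to page through the results instead of hard-coding 62. The sketch below assumes the server honors `Skip` that way, which the answer does not confirm. `fetch_all` and `fetch_page` are hypothetical names, and the endpoint is replaced by a stand-in list so the loop can be run without network access.

```python
def fetch_all(fetch_page, page_size=20):
    """Collect every record by advancing Skip until a short page comes back.

    fetch_page(top, skip) is a caller-supplied function (hypothetical here)
    that POSTs the search body with those Top/Skip values and returns the
    'opportunities' list from the JSON response.
    """
    results = []
    skip = 0
    while True:
        page = fetch_page(page_size, skip)
        results.extend(page)
        # A page shorter than page_size means there is nothing left to fetch.
        if len(page) < page_size:
            break
        skip += page_size
    return results


# Stand-in for the real endpoint: 62 fake records, sliced the way
# Top/Skip are assumed to behave on the server.
fake_jobs = [{"Title": f"Job {i}"} for i in range(62)]


def fake_fetch(top, skip):
    return fake_jobs[skip:skip + top]


all_jobs = fetch_all(fake_fetch)
print(len(all_jobs))  # 62
```

In real use, `fake_fetch` would be swapped for a function that sends the POST request from the answer with the given `Top` and `Skip` and returns `response.json()['opportunities']`. The loop stops on the first short page, so it does not need to know the total count in advance.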