Getting only part of the JSON from a website I'm trying to scrape with Python, BeautifulSoup, and Requests: 20 results out of 62

Problem description

I am trying to scrape job openings from this website:

https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/?q=&o=postedDateDesc&w=&wc=&we=&wpst=

I looked at the dev tools and saw that the page makes an XHR request to the following URL to retrieve the job openings as a JSON object:

https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults

So I was like, "Great! I can parse this in two seconds with a Python program like this":

from bs4 import BeautifulSoup
import json
import requests

def crawl():
    # Plain GET to the XHR endpoint, with no request body
    union = requests.get('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults').content
    soup = BeautifulSoup(union, 'html.parser')
    newDict = json.loads(str(soup))
    for job in newDict['opportunities']:
        print(job['Title'])

crawl()

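It turns out that this request only returns 20 of the 62 job openings. So I went back to the page and loaded the whole list (by clicking "View more opportunities").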

The dev tools show that another XHR request was sent to the same URL, but when I open it, it still shows only 20 records.

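How can I scrape all the records from this page? It would be great if someone could explain what is happening behind the scenes. I'm fairly new to web scraping, so any insight is appreciated.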

Tags: python, json, web-scraping, beautifulsoup, xmlhttprequest

Solution


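You don't need to scrape any HTML. As you said, the API that returns the JSON is https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults, but you need to send it a POST request with these parameters in the request body: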

import requests

headers = {
    'Content-Type': 'application/json'
}

data = '{\n  "opportunitySearch": {\n    "Top": 62,\n    "Skip": 0,\n    "QueryString": "",\n    "OrderBy": [\n      {\n        "Value": "postedDateDesc",\n        "PropertyName": "PostedDate",\n        "Ascending": false\n      }\n    ],\n    "Filters": [\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 4,\n        "extra": null,\n        "values": [\n          \n        ]\n      },\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 5,\n        "extra": null,\n        "values": [\n          \n        ]\n      },\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 6,\n        "extra": null,\n        "values": [\n          \n        ]\n      }\n    ]\n  },\n  "matchCriteria": {\n    "PreferredJobs": [\n      \n    ],\n    "Educations": [\n      \n    ],\n    "LicenseAndCertifications": [\n      \n    ],\n    "Skills": [\n      \n    ],\n    "hasNoLicenses": false,\n    "SkippedSkills": [\n      \n    ]\n  }\n}'

response = requests.post('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults', headers=headers, data=data)
print(response.text)
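The same request body can also be built as a Python dict and passed through the `json=` argument of `requests.post`, which serializes it and sets the `Content-Type: application/json` header for you. This is only a sketch: the names `SEARCH_URL`, `build_search_payload`, and `fetch_opportunities` are made up for illustration, and the field values are copied from the body above.

```python
SEARCH_URL = (
    "https://recruiting.ultipro.com/UNI1029UNION/JobBoard/"
    "74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults"
)


def build_search_payload(top, skip=0):
    """Return the request body as a dict (hypothetical helper).

    'Top' is how many records to ask for; 'Skip' is copied from the
    original body, where it is 0.
    """
    return {
        "opportunitySearch": {
            "Top": top,
            "Skip": skip,
            "QueryString": "",
            "OrderBy": [
                {
                    "Value": "postedDateDesc",
                    "PropertyName": "PostedDate",
                    "Ascending": False,
                }
            ],
            # Three empty term filters, as in the original body
            "Filters": [
                {"t": "TermsSearchFilterDto", "fieldName": n,
                 "extra": None, "values": []}
                for n in (4, 5, 6)
            ],
        },
        "matchCriteria": {
            "PreferredJobs": [],
            "Educations": [],
            "LicenseAndCertifications": [],
            "Skills": [],
            "hasNoLicenses": False,
            "SkippedSkills": [],
        },
    }


def fetch_opportunities(top, skip=0):
    """POST the search body and return the 'opportunities' list."""
    import requests  # third-party; imported here so the helper above stays dependency-free

    response = requests.post(
        SEARCH_URL, json=build_search_payload(top, skip), timeout=30
    )
    response.raise_for_status()  # fail loudly on HTTP errors
    return response.json()["opportunities"]
```

Calling `fetch_opportunities(62)` would then send the same body as the string version, with `Top` set to 62.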

Here is the same request using pandas (pip install pandas):

import requests
import pandas as pd
pd.set_option('display.width', 1000)

headers = {
    'Content-Type': 'application/json'
}

data = '{\n  "opportunitySearch": {\n    "Top": 62,\n    "Skip": 0,\n    "QueryString": "",\n    "OrderBy": [\n      {\n        "Value": "postedDateDesc",\n        "PropertyName": "PostedDate",\n        "Ascending": false\n      }\n    ],\n    "Filters": [\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 4,\n        "extra": null,\n        "values": [\n          \n        ]\n      },\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 5,\n        "extra": null,\n        "values": [\n          \n        ]\n      },\n      {\n        "t": "TermsSearchFilterDto",\n        "fieldName": 6,\n        "extra": null,\n        "values": [\n          \n        ]\n      }\n    ]\n  },\n  "matchCriteria": {\n    "PreferredJobs": [\n      \n    ],\n    "Educations": [\n      \n    ],\n    "LicenseAndCertifications": [\n      \n    ],\n    "Skills": [\n      \n    ],\n    "hasNoLicenses": false,\n    "SkippedSkills": [\n      \n    ]\n  }\n}'

response = requests.post('https://recruiting.ultipro.com/UNI1029UNION/JobBoard/74c2a308-3bf1-4fb1-8a83-f92fa61499d3/JobBoardView/LoadSearchResults', headers=headers, data=data)
data = response.json()
# Build a DataFrame from the list of job records
df = pd.DataFrame.from_dict(data['opportunities'])
# Keep only the columns of interest
df = df[['Id', 'Title', 'RequisitionNumber', 'JobCategoryName', 'PostedDate']]
print(df.head(5))

The "Top": 62 entry in the request data is what limits the number of results:

{
  "opportunitySearch": {
    "Top": 62,
    "Skip": 0,
    "QueryString": "",
    "OrderBy": [
      {
        "Value": "postedDateDesc",
        "PropertyName": "PostedDate",
        "Ascending": false
      }
    ],
    "Filters": [
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 4,
        "extra": null,
        "values": [

        ]
      },
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 5,
        "extra": null,
        "values": [

        ]
      },
      {
        "t": "TermsSearchFilterDto",
        "fieldName": 6,
        "extra": null,
        "values": [

        ]
      }
    ]
  },
  "matchCriteria": {
    "PreferredJobs": [

    ],
    "Educations": [

    ],
    "LicenseAndCertifications": [

    ],
    "Skills": [

    ],
    "hasNoLicenses": false,
    "SkippedSkills": [

    ]
  }
}
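If the total number of openings is not known in advance, one option is to request fixed-size pages and move "Skip" forward until a page comes back short. This is a sketch under an assumption the answer does not confirm: that "Skip" works as an offset into the result list. `fetch_all` and `fetch_page` are hypothetical names; `fetch_page(top, skip)` stands for any callable that POSTs the body above with those two values and returns the `opportunities` list.

```python
def fetch_all(fetch_page, page_size=20, max_pages=100):
    """Collect records page by page until a short page signals the end.

    fetch_page(top, skip) must return a list of records.
    max_pages is a safety cap in case the server ignores 'Skip'
    and keeps returning full pages.
    """
    results = []
    for page_number in range(max_pages):
        page = fetch_page(page_size, page_number * page_size)
        results.extend(page)
        if len(page) < page_size:
            # Fewer records than requested: this was the last page
            break
    return results
```

With 62 openings and a page size of 20, this makes four requests (20, 20, 20, and 2 records). When the total is an exact multiple of the page size, it makes one extra request that returns an empty list and then stops.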
