首页 > 解决方案 > 无法从网页中抓取所有公司名称

问题描述

我正在尝试解析此网页中的所有公司名称。那里有周围的2431公司。但是,我在下面尝试的方式可以让我1000得到结果。

这是我在通过开发工具时可以看到的关于响应结果的数量:

hitsPerPage: 1000
index: "YCCompany_production"
nbHits: 2431      <------------------------       
nbPages: 1
page: 0

如何使用请求获得其余结果?

到目前为止我已经尝试过:

import requests

url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries?'

params = {
    'x-algolia-agent': 'Algolia for JavaScript (3.35.1); Browser; JS Helper (3.1.0)',
    'x-algolia-application-id': '45BWZJ1SGC',
    'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJlc3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUyMiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}
payload = {"requests":[{"indexName":"YCCompany_production","params":"hitsPerPage=1000&query=&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="}]}

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
    r = s.post(url,params=params,json=payload)
    print(len(r.json()['results'][0]['hits']))

标签: pythonpython-3.xweb-scrapingpython-requests

解决方案


作为一种解决方法,您可以使用字母作为搜索模式来模拟搜索。使用下面的代码,您将获得所有 2431 家公司作为字典,其中 ID 作为键,完整的公司数据字典作为值。

import requests
import string

params = {
    'x-algolia-agent': 'Algolia for JavaScript (3.35.1); Browser; JS Helper (3.1.0)',
    'x-algolia-application-id': '45BWZJ1SGC',
    'x-algolia-api-key': 'NDYzYmNmMTRjYzU4MDE0ZWY0MTVmMTNiYzcwYzMyODFlMjQxMWI5YmZkMjEwMDAxMzE0OTZhZGZkNDNkYWZjMHJl'
                         'c3RyaWN0SW5kaWNlcz0lNUIlMjJZQ0NvbXBhbnlfcHJvZHVjdGlvbiUyMiU1RCZ0YWdGaWx0ZXJzPSU1QiUyMiUy'
                         'MiU1RCZhbmFseXRpY3NUYWdzPSU1QiUyMnljZGMlMjIlNUQ='
}

url = 'https://45bwzj1sgc-dsn.algolia.net/1/indexes/*/queries'
result = dict()
for letter in string.ascii_lowercase:
    print(letter)

    payload = {
        "requests": [{
            "indexName": "YCCompany_production",
            "params": "hitsPerPage=1000&query=" + letter + "&page=0&facets=%5B%22top100%22%2C%22isHiring%22%2C%22nonprofit%22%2C%22batch%22%2C%22industries%22%2C%22subindustry%22%2C%22status%22%2C%22regions%22%5D&tagFilters="
        }]
    }

    r = requests.post(url, params=params, json=payload)
    result.update({h['id']: h for h in r.json()['results'][0]['hits']})

print(len(result))

推荐阅读