Scraping web page data that is loaded from a script

Problem description

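I am trying to use Python and BeautifulSoup to extract the proportions of the languages spoken at a company.

However, the information seems to come from a script rather than from the HTML, and I am having some trouble.

For example, on the following page, when I try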

import requests
from bs4 import BeautifulSoup

webpage = "https://www.zippia.com/amazon-com-careers-487/"
page = requests.get(webpage)
soup = BeautifulSoup(page.content, 'lxml')

# Print the text of every div with the target class, split into lines
for links in soup.find_all('div', {'class': 'companyEducationDegrees'}):
    raw_text = links.get_text()
    lines = raw_text.split('\n')
    print(lines)
    print('-------------------')

I get no results, whereas the ideal result would be Spanish 61.1%, French 9.7%, etc.

Tags: python, python-3.x

Solution


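As you have already found out, the data is put into the page via JS. However, you can still get that data, because the full data on the company is always loaded together with the page. You can access this data with requests + json + re (+ BeautifulSoup):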

import json
import re

import requests
from bs4 import BeautifulSoup

webpage = "https://www.zippia.com/amazon-com-careers-487/"
page = requests.get(webpage)
soup = BeautifulSoup(page.content, 'lxml')

for script in soup.find_all('script', {'type': 'text/javascript'}):
    if 'getCompanyInfo' in script.text:
        # Grab the first "{...}" run on a single line; this assumes the JSON object is not split across lines
        match = re.search(r"\{[^\n]*\}", script.text)
        if match is None:
            continue
        data = json.loads(match.group())
        print(data["companyDiversity"]["languages"])

        # Optional: write the data to a file in a readable format (useful for finding the path to an entry)
        with open("test.json", "w") as f:
            json.dump(data, f, indent=2)
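
The regular expression above only works while the whole JSON object sits on one line of the script. If that ever stops being the case, one possible alternative is to let the json module find the end of the object itself. The sketch below is a minimal illustration using only the standard library. The helper names extract_first_json_object and format_languages, the sample script text, and the "name" / "percent" keys are all hypothetical; the real structure of data["companyDiversity"]["languages"] has to be checked in the dumped test.json.

```python
import json


def extract_first_json_object(script_text):
    """Return the first JSON object embedded in script_text, or None."""
    decoder = json.JSONDecoder()
    start = script_text.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value beginning at index `start`
            # and ignores whatever text follows it.
            obj, _end = decoder.raw_decode(script_text, start)
            return obj
        except json.JSONDecodeError:
            # This brace did not open valid JSON (e.g. a JS function body);
            # try the next one.
            start = script_text.find("{", start + 1)
    return None


def format_languages(languages):
    """Join entries as 'Name 12.3%'. The keys used here are assumptions."""
    return ", ".join(f"{lang['name']} {lang['percent']}%" for lang in languages)


# Hypothetical script body; the real page's layout and field names may differ.
sample_script = """
function getCompanyInfo() {
    return {"companyDiversity": {"languages": [
        {"name": "Spanish", "percent": 61.1},
        {"name": "French", "percent": 9.7}
    ]}};
}
"""

data = extract_first_json_object(sample_script)
summary = format_languages(data["companyDiversity"]["languages"])
print(summary)
```

In the loop above, extract_first_json_object(script.text) could stand in for the re.search and json.loads pair. The trade-off is that it tries every opening brace in turn, which may be slower on very large scripts.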
