python - Indeed Job Scraper only for posts with external links
Problem Description
I'm currently using the Python scraper below to extract the job title, company, salary, and description. I'm looking for a way to go one step further and filter only the results whose application link is a company-website URL, rather than "Easily apply" posts where the application is submitted through Indeed. Is there a way to do this?
import requests
from bs4 import BeautifulSoup
import pandas as pd

def extract(page):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
    url = f'https://www.indeed.com/jobs?q=Software%20Engineer&l=Austin%2C%20TX&ts=1630951951455&rq=1&rsIdx=1&fromage=last&newcount=6&vjk=c8f4815c6ecfa793&start={page}'
    r = requests.get(url, headers=headers)  # headers must be passed by keyword; 200 is OK, 404 is page not found
    soup = BeautifulSoup(r.content, 'html.parser')
    return soup

# <span title="API Developer"> API Developer </span>
def transform(soup):
    divs = soup.find_all('div', class_='slider_container')
    for item in divs:
        if item.find(class_='label'):
            continue  # need to fix: if a job has a 'new' span before the title span, the job is skipped completely
        title = item.find('span').text.strip()
        company = item.find('span', class_='companyName').text.strip()
        description = item.find('div', class_='job-snippet').text.strip().replace('\n', '')
        try:
            salary = item.find('span', class_='salary-snippet').text.strip()
        except AttributeError:
            salary = ''
        job = {
            'title': title,
            'company': company,
            'salary': salary,
            'description': description
        }
        jobList.append(job)
        # print("Seeking a: "+title+" to join: "+company+" paying: "+salary+". Job description: "+description)
    return

jobList = []

# go through multiple pages
for i in range(0, 100, 10):  # 0-90, stepping in 10's
    print(f'Getting page, {i}')
    c = extract(i)
    transform(c)

print(len(jobList))

df = pd.DataFrame(jobList)
print(df.head())
df.to_csv('jobs.csv')
Solution
My approach is as follows: find the href from each job card's <a> tag on the initial page, then send a request to each of those links and fetch the external job link from there (if the "Apply on Company Site" button is available).
Code snippet -
# function which gets external job links
def get_external_link(url):
    headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36'}
    r = requests.get(url, headers=headers)  # headers must be passed by keyword
    soup = BeautifulSoup(r.content, 'html.parser')
    # if the Apply On Company Site button is available, fetch the link
    if soup.find('a', attrs={'referrerpolicy': 'origin'}) is not None:
        external_job_link = soup.find('a', attrs={'referrerpolicy': 'origin'})
        print(external_job_link['href'])

# add this piece of code to the transform function
def transform(soup):
    cards = soup.find('div', class_='mosaic-provider-jobcards')
    links = cards.find_all('a', class_=lambda value: value and value.startswith('tapItem'))
    # for each job link in the page, call get_external_link
    for link in links:
        get_external_link('https://www.indeed.com' + link['href'])
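The snippet above needs bs4 and a live request, but the matching logic itself (an anchor whose referrerpolicy is "origin", which is an assumption about Indeed's markup at the time and may change) can be sketched with only the standard library:

```python
from html.parser import HTMLParser

class ExternalLinkFinder(HTMLParser):
    """Collects hrefs of <a> tags whose referrerpolicy attribute is
    'origin' -- the pattern the 'Apply on Company Site' button used
    on Indeed job pages at the time (an assumption, not guaranteed)."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == 'a' and a.get('referrerpolicy') == 'origin' and 'href' in a:
            self.links.append(a['href'])

def find_external_link(page_html):
    """Return the first external apply link in the HTML, or None
    for 'Easily apply' pages that have no such anchor."""
    parser = ExternalLinkFinder()
    parser.feed(page_html)
    return parser.links[0] if parser.links else None
```

Because it returns None for pages without the button, the result can be used directly as a filter condition.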
Note - you can also use the page source from this new request to fetch the data you previously scraped from the main page, such as title, company, salary, and description.
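Following that note, one way to wire this into the original pipeline is to have get_external_link return the URL (or None) instead of printing it, then keep only jobs that have one. A small illustrative sketch of that filtering step (merge_and_filter and the 'apply_url' field name are my own, not part of the original code):

```python
def merge_and_filter(jobs, external_links):
    """jobs: dicts as produced by transform(); external_links: a parallel
    list holding the company-site URL for each job, or None for
    'Easily apply' posts that only accept applications through Indeed."""
    kept = []
    for job, link in zip(jobs, external_links):
        if link:  # drop postings with no external application link
            kept.append({**job, 'apply_url': link})
    return kept
```

The filtered list can then be fed to pd.DataFrame exactly as before, with the extra 'apply_url' column.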